Grimson's Simultaneous Clustering Method Example

Grimson's method

 Description: Grimson's method may be used to detect clustering of labeled time intervals in several time series simultaneously. The test is also described in Chapter 6 for clustering in a single time series. The time intervals possess borders, and an 'adjacency' is said to exist when two time intervals share a common border. The test statistic, A, is the number of pairs of labeled time intervals that are adjacent. The adjacency criteria are chosen to reflect the kind of clustering under investigation. Grimson's test is sensitive to a high number of adjacencies among labeled time intervals.

 Notation:

• x : the total number of time intervals (both labeled and not labeled).

• n : the number of labeled time intervals.

• y : the average number of borders per time interval.

• Var(y): the variance of the number of borders per time interval.

• A : the number of pairs of labeled intervals that are adjacent.

 Null hypothesis: The objects have been labeled at random. Under this hypothesis the number of adjacencies among the labeled cells is expected to be:

$$E(A) = \frac{y\,n(n-1)}{2(x-1)}$$

 The variance of A has two components, the regularity component (RC) and the variability component (VC). The regularity component is:

$$RC = E(A)\left[1 + \frac{2(y-1)(n-2)}{x-2} + \frac{(xy-4y+2)(n-2)(n-3)}{2(x-2)(x-3)} - E(A)\right]$$

The variability component is:

$$VC = \frac{(n)_3\,(x-n)}{(x-1)_3}\,\mathrm{Var}(y)$$

 The notation (a)_k indicates a falling factorial such that (a)_k = a(a-1)...(a-k+1). For example, (4)_3 is 24, obtained as 4 × 3 × 2. The variance of A is

Var(A)=RC+VC
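
 The moments above can be sketched in a few lines of Python. This is a minimal illustration, not Stat's own code: the function names are hypothetical, and the RC and VC expressions are the ones obtained by working out the variance of A under random labeling, with Var(y) taken as the population variance of the number of borders per interval. Treat them as an assumption to be checked against Grimson (1991).

```python
from math import prod


def falling(a, k):
    """Falling factorial (a)_k = a(a-1)...(a-k+1); (a)_0 is 1."""
    return prod(a - i for i in range(k))


def grimson_moments(x, n, y, var_y):
    """Return E(A), RC, VC and Var(A) under random labeling.

    x: total number of intervals, n: number of labeled intervals,
    y: mean number of borders per interval, var_y: variance of that number.
    The RC and VC forms are assumed (see the lead-in); x must exceed 3.
    """
    e_a = y * falling(n, 2) / (2 * (x - 1))
    rc = e_a * (
        1
        + 2 * (y - 1) * (n - 2) / (x - 2)
        + (x * y - 4 * y + 2) * (n - 2) * (n - 3) / (2 * (x - 2) * (x - 3))
        - e_a
    )
    vc = var_y * falling(n, 3) * (x - n) / falling(x - 1, 3)
    return e_a, rc, vc, rc + vc


# Values from the measles example in this section.
e_a, rc, vc, var_a = grimson_moments(x=40, n=11, y=7, var_y=0.0)
print(f"E(A) = {e_a:.4f}, RC = {rc:.4f}, VC = {vc:.4f}, Var(A) = {var_a:.4f}")
```

 With Var(y)=0 the variability component vanishes and Var(A) equals RC.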

 Data screen: Recall the measles data used earlier in this chapter. The EMM test is based on m1, the maximum number of cases in each time series. It tests for an overall departure from a random allocation of cases within each time series. Having found significant clustering under the EMM method, we now ask whether the clusters in the individual series tend to occur at the same time. It appears that time interval 2 (1984) had many measles cases. Did measles outbreaks in different counties tend to cluster in the same year? To answer this question we focus on the 8 counties that had 2 or more measles cases in at least one year:

  1983  1984  1985  1986  1987  County
     0    18    11     0     0  Washtenaw
     0     0     0     0     2  Lenawee
     0     0     0     2     0  Jackson
     5     1     0     0     0  Kalamazoo
     0     6     0     1     0  Eaton
     0     2     0     1     0  Ingham
     0    35     1     0     0  Livingston
     1    28    38     0     2  Oakland

Recall the four questions that must be answered to use this method:

• What are the data (the objects)?

• How are the objects labeled?

• What kind of clustering do we want to detect?

• How is adjacency defined?

 The data are counts of measles cases in counties by year, so each object is a county-year. County-years with 2 or more measles cases are labeled high-risk. We wish to detect a clustering of high-risk counties in any one year, so time intervals from different counties but in the same year are considered adjacent. In other words, we ask whether measles outbreaks tend to concentrate in certain years, where an outbreak is defined as 2 or more cases.

 We now determine x, the number of time intervals; n, the number of high-risk time intervals; y, the average number of adjacencies per time interval; and Var(y), the variance in the number of adjacencies. x is 8 counties times 5 years, or 40; n is the number of county-years with measles outbreaks (11). The average number of adjacencies is y=7, because a given time cell (county-year) is adjacent to the 7 other time cells that occurred in that year. Every cell is adjacent to exactly 7 other cells, so the variance is 0, Var(y)=0.
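
 As a quick check of these four quantities, the following Python sketch recomputes them from the measles table. The variable names and the dictionary layout are illustrative only and are not part of Stat; the outbreak threshold of 2 cases and the same-year adjacency rule are taken from the text above.

```python
# Measles counts per county for the five consecutive years in the table.
counts = {
    "Washtenaw":  [0, 18, 11, 0, 0],
    "Lenawee":    [0, 0, 0, 0, 2],
    "Jackson":    [0, 0, 0, 2, 0],
    "Kalamazoo":  [5, 1, 0, 0, 0],
    "Eaton":      [0, 6, 0, 1, 0],
    "Ingham":     [0, 2, 0, 1, 0],
    "Livingston": [0, 35, 1, 0, 0],
    "Oakland":    [1, 28, 38, 0, 2],
}
THRESHOLD = 2  # a county-year with 2 or more cases is labeled high-risk

n_counties = len(counts)
n_years = len(next(iter(counts.values())))

x = n_counties * n_years  # total number of county-year cells
n = sum(c >= THRESHOLD for row in counts.values() for c in row)

# Each cell borders the cells of the other counties in the same year.
borders = [n_counties - 1] * x
y = sum(borders) / x
var_y = sum((b - y) ** 2 for b in borders) / x

print(x, n, y, var_y)
```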

 The number of pairs of adjacent high-risk county-years, A, is 12 and is obtained as follows. In 1983 only Kalamazoo county was high risk, and contributes nothing to A. In 1984, 5 counties were high risk (Washtenaw, Eaton, Ingham, Livingston and Oakland), and there are 10 ways to pair these 5 cells. In 1985, 2 counties were high risk, and there is 1 way to pair 2 cells. In 1986 only Jackson county was high risk, and contributes nothing. In 1987, 2 high-risk counties contribute another adjacency, for A=10+1+1=12.
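
 The same count can be reproduced programmatically: with k high-risk counties in a year, that year contributes k(k-1)/2 adjacent pairs. The sketch below is illustrative only (the names are hypothetical) and assumes the table columns are the years 1983 to 1987, as the walk-through above implies.

```python
from math import comb

years = [1983, 1984, 1985, 1986, 1987]
table = [
    [0, 18, 11, 0, 0],  # Washtenaw
    [0, 0, 0, 0, 2],    # Lenawee
    [0, 0, 0, 2, 0],    # Jackson
    [5, 1, 0, 0, 0],    # Kalamazoo
    [0, 6, 0, 1, 0],    # Eaton
    [0, 2, 0, 1, 0],    # Ingham
    [0, 35, 1, 0, 0],   # Livingston
    [1, 28, 38, 0, 2],  # Oakland
]

# Number of high-risk counties (2 or more cases) in each year.
high_risk = [sum(row[j] >= 2 for row in table) for j in range(len(years))]

# Pairs of high-risk cells within each year, then the test statistic A.
pairs = [comb(k, 2) for k in high_risk]
A = sum(pairs)

for year, k, p in zip(years, high_risk, pairs):
    print(year, k, p)
print("A =", A)
```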

 Now enter these data into Stat. Select 'Time' on Stat's horizontal methods menu, then 'Grimson' and then 'Data'. Then edit the Grimson input data screen so that it looks like this (next page). Then press F10 to exit the data entry screen and select 'Run' from the Action menu. The calculations will take just a moment and the run screen will be displayed.

 Run screen: Is A=12 large relative to the value expected when 11 of the 40 county-year cells are labeled high risk at random? The upper window shows the total number of cells (40), the mean number of adjacencies per cell (7), the number of high-risk cells (11), the variance in the number of adjacencies (0.0) and the value of the test statistic (12). E(A) and Var(A) are the expectation and variance of A under the null hypothesis. RC and VC are the regularity and variability components used to calculate Var(A) as RC+VC. The z-score is the difference between the observed and expected values of A, expressed in standard deviation units. The significance of A is evaluated using either the Poisson or the normal distribution. The first approach assumes A is sampled from a Poisson distribution with mean E(A). The second assumes the z-score is sampled from a normal distribution with mean 0 and variance 1. Both approaches yield a one-tailed test giving the probability, under the null hypothesis, of obtaining a test statistic as large as or larger than the one observed.

 Clustering of high-risk cells in any one year would cause an excess of adjacencies, and A would be larger than its expected value. Here A=12 is larger than its expected value of E(A)=9.8718, but not by much. The graph in the left window plots the significance of A against the value of A. The Poisson significance is shown by the dashed line. The solid line is the significance under the normal approach. The solid, vertical line is the observed A=12. The points where the Poisson and normal curves cross the vertical line are the P-values under the Poisson and normal assumptions. The P-values are relatively large, so there is no evidence of annual clustering.
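
 The two significance calculations described for the run screen can be approximated with the Python standard library. This is a hedged sketch rather than Stat's implementation: the function names are hypothetical, no continuity correction is applied, and Var(A) is computed from an assumed form of the regularity component (valid here because Var(y)=0, so the variability component is zero).

```python
from math import erfc, exp, sqrt


def poisson_upper_tail(a, mean):
    """P(X >= a) for X ~ Poisson(mean), with a a non-negative integer."""
    term = exp(-mean)  # P(X = 0)
    cdf = 0.0
    for k in range(a):
        cdf += term             # add P(X = k)
        term *= mean / (k + 1)  # step to P(X = k + 1)
    return 1.0 - cdf


def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))


x, n, y, a_obs = 40, 11, 7, 12
e_a = y * n * (n - 1) / (2 * (x - 1))

# Assumed regularity component; VC is 0 because Var(y) = 0 in this example.
var_a = e_a * (
    1
    + 2 * (y - 1) * (n - 2) / (x - 2)
    + (x * y - 4 * y + 2) * (n - 2) * (n - 3) / (2 * (x - 2) * (x - 3))
    - e_a
)

z = (a_obs - e_a) / sqrt(var_a)
p_poisson = poisson_upper_tail(a_obs, e_a)
p_normal = normal_upper_tail(z)

print(f"E(A) = {e_a:.4f}  z = {z:.3f}")
print(f"Poisson P = {p_poisson:.3f}  normal P = {p_normal:.3f}")
```

 Under these assumptions both one-tailed P-values come out well above 0.05, in line with the conclusion that there is no evidence of annual clustering.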

 Notes: See Chapter 6 for a description of Grimson's test applied to a single time series.

 References:

 Grimson, R. C. 1989. Assessing patterns of epidemiologic events in space-time. In Proceedings of the 1989 Public Health Conference on Records and Statistics. National Center for Health Statistics.

 Grimson, R. C. 1991. A versatile test for clustering and a proximity analysis of neurons. Methods of Information in Medicine 30:299-303.

