Larsen's Method Example

Larsen's method

Description: Larsen's test statistic K is sensitive to a unimodal clustering of occupied cells. A unimodal cluster occurs when occupied time cells tend to occur in a sequence, one right after the other, and arises when an initial case is quickly followed by the appearance of daughter cases. This may occur after an ephemeral exposure to an infectious agent or irritant, and with certain behavior-mediated phenomena such as copy-cat suicides. This module provides two tests for temporal clustering: within the individual spatial units (using a z-score as described later) and across all spatial units simultaneously (using an overall z-score). Larsen's test can therefore be used to answer two questions. The first, `Within an area, do occupied time cells tend to occur in a sequence?', is addressed using the z-score. The second, `Is there an unusual pattern over time which may not necessarily be the same over the individual areas?', is addressed using the overall z-score.

 Notation:

- t is the total number of time intervals.

- m is the number of time periods with at least one case.

- y(i) is the time assigned to the ith cell in which a case occurred.

- (r+1) is the index of the `central most' time cell that contained a case, where r=[m/2]. The operation [x] is the greatest integer less than or equal to x.

- K, the test statistic, measures the tendency of time periods with at least one case to form a single cluster in time. It is

A simple example clarifies how the center of a unimodal cluster is determined. Consider the time series below (the row labelled Cases). A `0' indicates a time interval with no cases, and a `1' indicates one or more cases. The row labelled Indices gives the index of each time interval. Is the sequence of cases from intervals 2 through 8 a unimodal cluster? The total number of time periods, t, is 17. The number of time periods with at least one case, m, is 7. Our first task is to find the time period at the center of all the occupied time periods. The row labelled y_i gives the time periods in which at least one case occurred; these are the y_i. Where is the center of the y_i? For these data r is the greatest integer less than or equal to 7/2, so r=3. Out of all the occupied time periods (the y_i), the index of the one in the middle is r+1=4. This corresponds to time period 5, since y_4=5.

Cases    0  1  1  1  1  0  1  1  0  0  0  0  0  0  1  0  0
Indices  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
y_i         2  3  4  5     7  8                   15
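The bookkeeping in this example can be sketched in a few lines of Python. The values of t, m, the y_i, r and the central time period follow directly from the text. The formula for K is not reproduced here, so the function `larsen_k` below is an assumption: it takes K to be the sum of the absolute distances between each occupied time period and the central one, which is how the notation reads. All function names are hypothetical and are not part of the software described.

```python
# Occupancy series from the example: 1 = at least one case, 0 = no cases.
cases = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]


def occupied_times(series):
    """Return the 1-based indices y_i of the occupied time intervals."""
    return [i for i, c in enumerate(series, start=1) if c > 0]


def central_time(y):
    """Return (r, y_{r+1}) with r = floor(m / 2)."""
    r = len(y) // 2
    return r, y[r]  # Python lists are 0-based, so y[r] is y_{r+1}


def larsen_k(y):
    """ASSUMED form of K (not given in the text): sum of |y_i - y_{r+1}|."""
    _, centre = central_time(y)
    return sum(abs(v - centre) for v in y)


t = len(cases)               # 17 time intervals
y = occupied_times(cases)    # [2, 3, 4, 5, 7, 8, 15]
m = len(y)                   # 7 occupied intervals
r, centre = central_time(y)  # r = 3, central time period = 5
k = larsen_k(y)              # 21 under the assumed definition
print(t, m, y, r, centre, k)
```

Under the assumed definition the isolated case in interval 15 contributes 10 of the 21 units, which illustrates how a single case far from the center inflates K.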

 Null hypothesis: Occupied cells occur at random across the time series.

 Significance: The expectation and variance of K under the null hypothesis are

Here d(m) is r-2 when m is odd, and d(m)=2r-1 when m is even. The test statistic K can be expressed as a z-score, z = (K - E(K)) / sqrt(Var(K)), which is expected, under the null hypothesis of a random allocation of occupied cells across the time series, to be approximately normally distributed with a mean of 0 and unit variance.
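For a short series the null mean and variance of K can also be checked by brute force, because the null hypothesis says that every placement of m occupied cells among t intervals is equally likely. The sketch below enumerates all placements. It is not the closed-form calculation used by the software, and it relies on the same assumed form of K (the sum of absolute distances from the central occupied period); the function names are hypothetical.

```python
from itertools import combinations


def larsen_k(y):
    """ASSUMED form of K: sum of |y_i - y_{r+1}| with r = floor(m / 2)."""
    centre = y[len(y) // 2]
    return sum(abs(v - centre) for v in y)


def exact_null_moments(t, m):
    """Mean and variance of K over all equally likely placements of m cells."""
    n = total = total_sq = 0
    for y in combinations(range(1, t + 1), m):  # each tuple is already sorted
        k = larsen_k(y)
        n += 1
        total += k
        total_sq += k * k
    mean = total / n
    return mean, total_sq / n - mean * mean


# The example series has t = 17 intervals and m = 7 occupied intervals.
mean_k, var_k = exact_null_moments(t=17, m=7)
print(mean_k, var_k)
```

Enumeration is only practical when the number of placements is small (19,448 for t=17 and m=7); for longer series the closed-form expectation and variance are the sensible route.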

A z-score of 0 is consistent with a random allocation of occupied cells across the time series. K will be smaller than its expectation when occupied time intervals form a unimodal cluster, and the z-score will be less than 0. A uniform distribution of occupied time intervals through time, such as `01010101', will cause K to be larger than its expectation, and the z-score will be greater than 0. K will also be large when occupied time intervals form several clusters, so Larsen's method cannot distinguish a uniform distribution from multiple clusters. Significance is therefore evaluated as a one-tailed test giving the probability, under the null hypothesis, of obtaining a K as small as or smaller than the one observed. This P-value is obtained by comparing the z-score to the percentiles of the standard normal distribution.
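The step from K to a one-tailed P-value can be sketched as follows. The z-score is the usual standardization of K by its null expectation and variance, and the P-value is the lower tail of the standard normal distribution. The function names are hypothetical; only the Python standard library is used.

```python
import math


def z_score(k, expected, variance):
    """Standardize K by its null expectation and variance."""
    return (k - expected) / math.sqrt(variance)


def lower_tail_p(z):
    """One-tailed P-value: P(Z <= z) for a standard normal Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# A z-score of 0 gives P = 0.5; strongly negative z-scores give small P-values.
print(lower_tail_p(0.0))
print(lower_tail_p(-2.39))
```

A z-score of -2.39 gives a lower-tail probability of about 0.0084, while a positive z-score gives a P-value above 0.5, which is why regular or multiply clustered series never appear significant under this test.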

Data screen: Select `Time' from the horizontal method menu, then `Larsen' and `data', to display the Larsen data entry screen. Enter the name of the time series file to be analyzed: type `bones.tim' and press function key F10 to exit.

Run screen: The calculations will take a moment to complete and the Larsen run screen will appear. The upper window displays the name of the file (`bones.tim'), the number of rows (time series) and the number of columns (time intervals) in the file. The lower left window is a plot of the observed statistic (K) against its expectation (E(K)). Each time series is one point on this graph. The dashed line is the function K=E(K), describing where observations would be plotted under the null hypothesis of a random distribution of occupied intervals across the time series. Unimodal clustering will cause K to be smaller than its expectation, and the observations would plot below the dashed line. Multiple clustering or uniformity (cluster avoidance) will cause K to be larger than its expectation, and observations would plot above the dashed line. The bone fracture time series plots slightly above the dashed line.

The window on the right displays a table of results. The first column in the table, `row', is the index of the time series in the input file. The first time series is row 1, the second row 2 and so on. `Bones.tim' contains one time series, so row=1. The second column, K, is Larsen's test statistic. This is followed by its expectation (E(K)), variance (Var(K)), z-score and P-value. The P-value is the probability under the null hypothesis of observing a K as small as or smaller than the one obtained from the data, and is one-tailed. For the bone fracture data K=52 is larger than its expectation E(K)=45. The z-score is positive, indicating a tendency towards regularity or perhaps multiple clusters, rather than a unimodal cluster. The P-value is 0.9225, and we conclude there is no evidence of unimodal clustering in the time series of bone fractures.

Simultaneous time series: When the data consist of several time series the K statistics from each time series can be combined into an overall z score as

This grand z score tests for an overall departure from the expected values across all time series simultaneously. The individual z scores test for unimodal clustering within each time series. You must examine the individual z scores before concluding whether a significant grand z score is due to unimodal clustering in all of the time series, or to some other combination of temporal pattern across time series.
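The formula for the grand z-score is not reproduced above. One plausible way to pool the per-series results, shown here purely as an assumption, is to sum the deviations K - E(K) over all time series and divide by the square root of the summed variances. The function name and the input layout (one `(K, E(K), Var(K))` tuple per time series) are hypothetical, and the numbers in the usage line are made up for illustration, not taken from any data file.

```python
import math


def grand_z(results):
    """ASSUMED pooling rule: sum of (K - E(K)) over sqrt of summed Var(K).

    `results` is a list of (K, E(K), Var(K)) tuples, one per time series.
    """
    deviation = sum(k - expected for k, expected, _ in results)
    variance = sum(var for _, _, var in results)
    return deviation / math.sqrt(variance)


# Two made-up series, both with K below expectation.
example = [(10, 12, 4), (8, 9, 5)]
print(grand_z(example))
```

Pooling in this way lets many small, individually non-significant negative deviations add up to a significant overall result, but a single strongly deviant series can also drive it, so the individual z-scores still need to be inspected.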

The data in leukemia.tim are the dates of diagnosis of 53 children with acute leukemia in the 18 census tracts comprising metropolitan Atlanta, Georgia. They are from Table 5 of Larsen et al., 1973. Attention was brought to these data because the leukemia cases within census tracts appear to cluster. Upon further scrutiny, single leukemia clusters were identified in 10 of the tracts. (Can you identify the clusters in the leukemia.tim data file? Each row represents a tract.)

Are these unimodal clusters, or is the apparent pattern consistent with the random occurrence of leukemia cases through time?

Run screen: To answer this question exit the data entry screen and select `Run' from the action menu. The calculations will take a moment to complete and the Larsen run screen will appear. The lower left window is a plot of the observed statistic (K) against its expectation (E(K)). Each time series is one point on this graph. The dashed line is the function K=E(K), describing where observations would be plotted under the null hypothesis of a random distribution of occupied intervals across the time series. Unimodal clustering will cause K to be smaller than its expectation, and observations would plot below the dashed line. Multiple clustering or uniformity (cluster avoidance) will cause K to be larger than its expectation, and observations would plot above the dashed line.

Some of the leukemia time series plot above the dashed line and some of them plot below the line. Overall, is there evidence of a unimodal clustering of leukemia cases? To answer this question press the `PgDn' key to scroll the results table. Then examine the results of the significance test for the grand z-score. The grand z-score was -2.39 and was highly significant with P=0.0085. We conclude there is statistically significant unimodal clustering when all of the census tracts are considered simultaneously.

Did the amount of clustering vary across census tracts? To answer this question we examine the individual z-scores. Recall negative z-scores suggest unimodal clustering, z-scores near zero are consistent with the null hypothesis, and positive z-scores arise under uniformity and multiple clusters. The window on the right displays a table of results. The first column in the table `row' is the index of the time series in the input file. The first time series is row 1, the second row 2 and so on. The second column, K, is Larsen's test statistic, followed by its expectation, variance, z score and P-value. The P-value is the probability under the null hypothesis of observing a K smaller than the one obtained from the data, and is one-tailed.

Initially we identified possible unimodal clusters in 10 of the census tracts, as shown by the boxes around clusters of leukemia cases in the Figure on the previous page. These 10 tracts correspond to rows 1, 5, 7, 8, 9, 11, 14, 15, 17 and 18 of the results table. They all had negative z-scores, but only rows 9 (census tract 31 in DeKalb county) and 11 (census tract 35 in DeKalb county), with 4 cases each, were significant. Several of the other tracts had only 2 cases each and, while clusters of 2 resulted in negative z-scores, they were not significant because of insufficient sample size. Combining results across tracts increases the sample size, and the negative grand z-score indicates significant unimodal clustering when the tracts are taken as a group.

 Notes:

The normality assumption used to evaluate significance doesn't hold when the time series is shorter than 10 intervals. Stat will analyze time series with fewer than 10 intervals, but Larsen's method won't be very powerful. For short time series consider using smaller time intervals or perhaps collecting data from additional time periods. As alternatives consider the empty cells test, or Dat's 0-1 matrix test when cases are numerous.

The method requires two or more of the time intervals to have cases. Time series with fewer than 2 occupied intervals are excluded from an analysis. Also, each time series must have at least 1 unoccupied cell. Counts are required and the method is biased by changes in population size through time. You cannot use Larsen's method with rates. The grand z-score is not biased by differences in population size across time series.
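The requirements in these notes can be turned into a simple pre-check on each time series before it is analyzed. The sketch below encodes only the rules stated above: at least 2 occupied intervals, at least 1 unoccupied interval, and a warning when the series has fewer than 10 intervals. The function name, return format and message wording are hypothetical.

```python
def check_series(series):
    """Return (usable, warnings) for one time series of case counts."""
    t = len(series)
    m = sum(1 for count in series if count > 0)
    if m < 2:
        return False, ["fewer than 2 occupied intervals: series is excluded"]
    if m == t:
        return False, ["no unoccupied interval: at least 1 is required"]
    warnings = []
    if t < 10:
        warnings.append("fewer than 10 intervals: normal approximation unreliable")
    return True, warnings


print(check_series([0, 0, 1, 0]))     # excluded: only one occupied interval
print(check_series([0, 1, 1, 0]))     # usable, but with a short-series warning
print(check_series([0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]))
```

The check works on counts rather than rates, in line with the requirement that Larsen's method be applied to counts.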

Reference:

Larsen, R. J., C. L. Holmes and C. W. Heath. 1973. A statistical test for measuring unimodal clustering: a description of the test and of its application to cases of acute leukemia in metropolitan Atlanta, Georgia. Biometrics 29:301-309.

Website maintained by Andy Long. Comments appreciated.