Cuzick and Edwards' method
Description: Cuzick and Edwards proposed a case-control test to detect spatial clustering. The test statistic, Tk, is the sum, over all cases, of the number of each case's k nearest neighbors that are also cases. When cases are clustered, the nearest neighbors of a case will tend to be other cases and Tk will be large. Tk will be zero when every one of the k nearest neighbors of every case is a control. Their approach is attractive because it accounts for geographic variation in population density and because it allows one to account for confounders, both known and unknown, through the judicious selection of controls.
Notation:
- k: number of nearest neighbors to consider
- N1: number of cases
- N2: number of controls
- N: sample population size (N = N1 + N2)
Define
- $\delta_i = 1$ if observation i is a case and 0 if it is a control
- $d_{ij} = 1$ if the jth nearest neighbor to i is a case, 0 otherwise
The test statistic is:
$$T_k = \sum_{i=1}^{N} \delta_i \sum_{j=1}^{k} d_{ij}$$
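The verbal definition of the statistic (for each case, count how many of its k nearest neighbors are cases, then sum over cases) can be sketched in Python as follows. This is a minimal brute-force illustration, not the program's own code: the function name `cuzick_edwards_tk` and the toy coordinates are hypothetical, and tied distances are simply broken by index order.

```python
import math

def cuzick_edwards_tk(points, is_case, k):
    """Sum, over all cases, of how many of each case's k nearest
    neighbors are also cases.

    points  : list of (x, y) coordinates, cases and controls together
    is_case : list of 0/1 flags (1 = case, 0 = control)
    Assumption: no tied distances (ties fall back to index order).
    """
    n = len(points)
    total = 0
    for i in range(n):
        if not is_case[i]:
            continue  # only cases contribute to the statistic
        # Distances from observation i to every other observation, nearest first.
        others = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        total += sum(is_case[j] for _, j in others[:k])
    return total

# Hypothetical toy data: three cases close together, three controls far away.
points = [(0, 0), (1, 0), (0, 1.5), (10, 10), (11, 10), (10, 11.5)]
is_case = [1, 1, 1, 0, 0, 0]
t1 = cuzick_edwards_tk(points, is_case, 1)  # every case's nearest neighbor is a case
t2 = cuzick_edwards_tk(points, is_case, 2)  # both nearest neighbors of every case are cases
```

In this toy layout the cases are perfectly clustered, so the statistic reaches its maximum of k times the number of cases.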
Null hypothesis: The null hypothesis states that case identity is assigned at random among the N locations comprising the sample population. This is equivalent to saying the cases and controls have a common spatial density function. The expected value of the test statistic under the null hypothesis is
$$E(T_k) = p k N$$
Here $p = N_1 (N_1 - 1) / [N (N - 1)]$, so that $E(T_k) = k N_1 (N_1 - 1) / (N - 1)$. The variance under the null hypothesis is a fairly complex expression and is given in Cuzick and Edwards (1990).
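As a quick arithmetic check, the expectation can be evaluated for the Humberside example (62 cases and 141 controls, so N = 203). The sketch below assumes p = N1(N1 - 1) / [N(N - 1)], the probability that two distinct randomly chosen observations are both cases; the function name `expected_tk` is hypothetical.

```python
def expected_tk(k, n_cases, n_total):
    """E(T_k) = p * k * N, assuming p = N1 (N1 - 1) / (N (N - 1))."""
    p = n_cases * (n_cases - 1) / (n_total * (n_total - 1))
    return p * k * n_total

# Humberside example: 62 cases + 141 controls = 203 observations.
for k in (1, 2, 10):
    print(k, round(expected_tk(k, 62, 203), 2))
# prints:
# 1 18.72
# 2 37.45
# 10 187.23
```

These values agree with the expected values reported for the Humberside data at k = 1, 2 and 10.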
Data screen: Locations of both cases and controls are needed to conduct the test. Case locations usually are places of residence. Controls are selected from the same population from which the cases were taken. For this example select `Space' on the horizontal methods menu, then `Cuzick and Edwards' and `Data' to access the data entry screen. Names of files containing the coordinates of the cases and controls are needed. Enter `Humberca.pnt' and `Humberco.pnt'. These data describe 62 cases of childhood leukemia diagnosed in North Humberside between 1974 and 1986. A total of 141 controls were selected from the birth register for these years. The locations are centroids of the postal codes of home addresses (from Table 6 of Cuzick and Edwards (1990)).
Run screen: Press `F10' to exit the data entry screen. Then select `Run' on the action menu and press `Enter'. The calculations may take a moment, after which the run screen is displayed. The map shows cases as `+' and controls as squares. The arrows are first nearest neighbor links between cases. The arrow's head identifies the nearest neighbor, while the arrow's tail is the case being evaluated. Two-headed arrows indicate reflexive nearest neighbors, that is, pairs of cases that are each other's nearest neighbor.
Examine the results table on the right side of the results screen. The first column is `k', the number of nearest neighbors. The second column, `T[k]', is the test statistic, Tk. The third column, `E[T]', is the expected value of Tk under the null hypothesis; column four is the variance of Tk under the null hypothesis. Column five is a z-score calculated as
$$z = \frac{T_k - E(T_k)}{\sqrt{\mathrm{Var}(T_k)}}$$
which is (column 2 - column 3) divided by the square root of column 4, and is approximately distributed as a standard normal deviate under the null hypothesis. Column six is the probability, under the null hypothesis, of observing a Tk as large as or larger than the one given in column 2. The P-values from each row are combined using both the Bonferroni and Simes corrections, as described elsewhere.
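The z-score and upper-tail P-value in columns five and six can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the `bonferroni` and `simes` helpers use the standard textbook definitions of those corrections, which may differ in detail from the procedure the program documents elsewhere.

```python
import math

def z_score(t_obs, t_exp, t_var):
    """(observed - expected) / sqrt(variance), i.e. (col 2 - col 3) / sqrt(col 4)."""
    return (t_obs - t_exp) / math.sqrt(t_var)

def upper_tail_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def bonferroni(p_values):
    """Standard Bonferroni combination: m times the smallest P-value, capped at 1."""
    m = len(p_values)
    return min(1.0, m * min(p_values))

def simes(p_values):
    """Standard Simes combination: min over i of m * p_(i) / i, capped at 1."""
    m = len(p_values)
    ordered = sorted(p_values)
    return min(1.0, min(m * p / (i + 1) for i, p in enumerate(ordered)))

# Humberside data, k = 1, lower bound: T = 24, E(T) = 18.72, Var(T) = 17.66.
z = z_score(24, 18.72, 17.66)
p = upper_tail_p(z)  # about 0.10
```

The Simes combination is never larger than the Bonferroni one, since its first term is the Bonferroni value itself.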
`Ties' sometimes arise when the centers of areas (e.g. census districts) are used instead of exact place of residence. In these instances Stat calculates upper and lower bounds on Tk following the algorithms given by Jacquez (1994). Consult his paper for details. The Humberside leukemia data used centers of postal codes instead of exact place of residence, resulting in six 2-way ties and two 3-way ties. Stat automatically calculates bounds on the test statistic when ties are present. The results in the right-hand window of the results table are for the upper bound. Press the `PgDn' key to view results for the lower bound.
The results are summarized in the table below. Upper bounds were calculated by taking cases to be nearest neighbors when the case being evaluated (ego) was equidistant from a case and a control. Lower bounds were calculated by taking controls to be nearest neighbors in the same situation. The test statistics for the lower and upper bounds are Tl and Tu, respectively. The variances and P-values for the upper and lower bounds are also in the table. The expectations of the upper and lower bounds are the same for a given number of nearest neighbors.
k  | Exp T  | Tl  | Var Tl | Pl   | Tu  | Var Tu | Pu
1  | 18.72  | 24  | 17.66  | 0.1  | 25  | 17.57  | 0.07
2  | 37.45  | 51  | 35.58  | 0.01 | 54  | 35.72  | 0
3  | 56.17  | 75  | 54.41  | 0.01 | 78  | 54.34  | 0
4  | 74.89  | 94  | 73.04  | 0.01 | 98  | 73.56  | 0
5  | 93.61  | 113 | 93.41  | 0.02 | 117 | 93.15  | 0.01
6  | 112.34 | 127 | 112.64 | 0.08 | 129 | 112.59 | 0.06
7  | 131.06 | 143 | 133.03 | 0.15 | 144 | 132.41 | 0.13
8  | 149.78 | 159 | 155.4  | 0.23 | 161 | 155.15 | 0.18
9  | 168.5  | 176 | 179.65 | 0.29 | 178 | 179.35 | 0.24
10 | 187.23 | 194 | 203.43 | 0.32 | 194 | 203.99 | 0.32
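The tie-breaking rule behind the lower and upper bounds can be illustrated with a simplified Python sketch. It is not the algorithm of Jacquez (1994), which should be consulted for the actual procedure; it only shows the idea of resolving exactly tied distances in favor of cases (upper bound) or controls (lower bound). The function names and the toy coordinates are hypothetical.

```python
import math

def tk_with_tie_rule(points, is_case, k, prefer_cases):
    """T_k with exactly tied distances resolved toward cases or toward controls."""
    n = len(points)
    total = 0
    for i in range(n):
        if not is_case[i]:
            continue  # only cases (egos) contribute

        def sort_key(j):
            d = math.dist(points[i], points[j])
            # Among equidistant neighbors, rank the preferred class first.
            tie_rank = (1 - is_case[j]) if prefer_cases else is_case[j]
            return (d, tie_rank)

        neighbors = sorted((j for j in range(n) if j != i), key=sort_key)
        total += sum(is_case[j] for j in neighbors[:k])
    return total

def tk_bounds(points, is_case, k):
    """Return (lower, upper) bounds on T_k under the two tie-breaking rules."""
    lower = tk_with_tie_rule(points, is_case, k, prefer_cases=False)
    upper = tk_with_tie_rule(points, is_case, k, prefer_cases=True)
    return lower, upper

# Hypothetical toy data: the case at the origin is equidistant from
# a case at (1, 0) and a control at (-1, 0).
points = [(0, 0), (1, 0), (-1, 0)]
is_case = [1, 1, 0]
bounds = tk_bounds(points, is_case, 1)  # (1, 2)
```

When no distances are tied, the two rules give the same ordering and the bounds coincide.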
How does one evaluate significance when faced with P-values for the upper and lower bounds (Pu and Pl)?
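- There is significant spatial clustering when both Pu and Pl are below the significance level $\alpha$ (that is, when $P_l < \alpha$, since Pl is the larger of the two).
- There is no significant spatial clustering when both Pu and Pl exceed $\alpha$ (that is, when $P_u > \alpha$).
- Judgment is reserved when $P_u < \alpha < P_l$.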
Based on these criteria the Humberside leukemia data show significant spatial clustering for k=2, 3, 4 and 5. We conclude there is a significant spatial clustering of leukemia cases relative to the spatial distribution of the controls.
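The decision rule can be sketched in Python and applied to the tabulated P-values. Two assumptions are made here that the text does not state explicitly: a significance level of 0.05, and the reading that clustering is significant when both bounds are significant, absent when neither is, and undecided otherwise. The function name `classify` is hypothetical.

```python
def classify(p_lower, p_upper, alpha=0.05):
    """Classify a result from the P-values of the lower and upper bounds.

    Assumption: alpha = 0.05 by default; the source does not state the level.
    """
    if p_lower <= alpha and p_upper <= alpha:
        return "significant"
    if p_lower > alpha and p_upper > alpha:
        return "not significant"
    return "reserved"

# P-values for k = 1..10, copied from the results table.
p_l = [0.1, 0.01, 0.01, 0.01, 0.02, 0.08, 0.15, 0.23, 0.29, 0.32]
p_u = [0.07, 0, 0, 0, 0.01, 0.06, 0.13, 0.18, 0.24, 0.32]

significant_k = [
    k for k, (pl, pu) in enumerate(zip(p_l, p_u), start=1)
    if classify(pl, pu) == "significant"
]
print(significant_k)  # prints [2, 3, 4, 5]
```

Under these assumptions no value of k falls in the reserved category for the Humberside data, because the two bounds never straddle 0.05.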
References: