Cuzick and Edwards Example Analysis

Cuzick and Edwards' method

Description: Cuzick and Edwards proposed a case-control test to detect spatial clustering. The test statistic, T_k, is the sum, over all cases, of the number of each case's k nearest neighbors that are also cases. When cases are clustered, the nearest neighbors of a case tend to be other cases and T_k is large. T_k is zero when every one of the k nearest neighbors of every case is a control. The approach is attractive because it accounts for geographic variation in population density and because it allows one to account for confounders, both known and unknown, through the judicious selection of controls.

Notation:

- k : Number of nearest neighbors to consider

- N₁ : Number of cases

- N₂ : Number of controls

- N : Sample population size (N = N₁ + N₂)

Define

- \delta_i = 1 if observation i is a case and 0 if it is a control

- d_{ij} = 1 if the jth nearest neighbor to i is a case, 0 otherwise (j = 1, ..., k)

The test statistic is:

T_k = \sum_{i=1}^{N} \delta_i \sum_{j=1}^{k} d_{ij}

Null hypothesis: The null hypothesis states that case identity is assigned at random among the N locations comprising the sample population. This is equivalent to saying the cases and controls have a common spatial density function. The expected value of the test statistic under the null hypothesis is

E(T_k)=pkN

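Here p = N₁(N₁ - 1) / (N(N - 1)), the probability that two observations drawn at random without replacement from the sample are both cases, so that E(T_k) = kN₁(N₁ - 1)/(N - 1). The variance under the null hypothesis is a fairly complex expression and is given in Cuzick and Edwards (1990).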
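The statistic and its null expectation can be sketched in a few lines of Python. This is a minimal illustration, not the code used by Stat: the function names `cuzick_edwards_tk` and `expected_tk` are hypothetical, distances are assumed to be Euclidean, tied distances are broken arbitrarily by sort order, and p is taken as N₁(N₁ - 1) / (N(N - 1)) following Cuzick and Edwards (1990).

```python
import numpy as np

def cuzick_edwards_tk(coords, is_case, k):
    """Sum, over all cases, of the number of cases among each case's k nearest neighbors."""
    coords = np.asarray(coords, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    # Brute-force pairwise Euclidean distances (fine for a few hundred points).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # An observation is never its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of each observation; ties are NOT handled specially.
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    case_neighbors = is_case[nearest].sum(axis=1)
    # Only rows belonging to cases contribute to T_k.
    return int(case_neighbors[is_case].sum())

def expected_tk(n_cases, n_total, k):
    """E(T_k) = p * k * N with p = N1 (N1 - 1) / (N (N - 1))."""
    p = n_cases * (n_cases - 1) / (n_total * (n_total - 1))
    return p * k * n_total

# Toy data (made up for illustration): three cases close together, three controls far away.
toy_coords = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0), (11.0, 0.0), (12.5, 0.0)]
toy_is_case = [True, True, True, False, False, False]
t1 = cuzick_edwards_tk(toy_coords, toy_is_case, k=1)
t2 = cuzick_edwards_tk(toy_coords, toy_is_case, k=2)

# Expectation for the Humberside sample sizes (62 cases, 141 controls).
humberside_e1 = expected_tk(62, 62 + 141, k=1)
```

For the Humberside sample sizes, `expected_tk` gives about 18.72 for k = 1, which agrees with the expected value reported for that row of the results.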

Data screen: Locations of both cases and controls are needed to conduct the test. Case locations usually are place of residence. Controls are selected from the same population from which the cases were taken. For this example select `Space' on the horizontal methods menu, then `Cuzick and Edwards' and `Data' to access the data entry screen. Names of files containing the coordinates of the cases and controls are needed. Enter `Humberca.pnt' and `Humberco.pnt'. These data describe 62 cases of childhood leukemia diagnosed in North Humberside between 1974 and 1986. 141 controls were selected from the birth register for each of these years. The locations are centroids of the postal codes of home addresses (from Table 6 of Cuzick and Edwards (1990)).

Run screen: Press `F10' to exit the data entry screen. Then select `Run' on the action menu and press `Enter'. The calculations may take a moment, after which the run screen is displayed. The map shows cases as `+' and controls as squares. The arrows are first nearest neighbor links between cases. The arrow's head identifies the nearest neighbor, while the arrow's tail is the case being evaluated. Two-headed arrows indicate reflexive nearest neighbors, that is, pairs of cases that are each other's nearest neighbor.

Examine the results table on the right side of the results screen. The first column is `k', the number of nearest neighbors. The second column, `T[k]', is the test statistic, T_k. The third column, `E[T]', is the expected value of T_k under the null hypothesis; column four is the variance of T_k under the null hypothesis. Column five is a z-score calculated as

z = \frac{T_k - E(T_k)}{\sqrt{Var(T_k)}}

which is (column 2 - column 3) divided by the square root of column 4, and is approximately distributed as a standard normal deviate under the null hypothesis. Column six is the probability, under the null hypothesis, of observing a T_k as large as or larger than the one given in column 2. The P-values from each row are combined using both the Bonferroni and Simes corrections, as described elsewhere.
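As a rough check on columns five and six, the z-score and its upper-tail probability can be recomputed from columns two to four. The sketch below assumes the P-value is the one-sided upper tail of a standard normal distribution; the helper name `z_and_p` is hypothetical, and the inputs are the rounded k = 2 lower-bound values from the Humberside results (T = 51, expectation 37.45, variance 35.58).

```python
import math

def z_and_p(t_obs, t_exp, t_var):
    """Return the z-score and the one-sided upper-tail normal probability."""
    z = (t_obs - t_exp) / math.sqrt(t_var)
    # P(Z >= z) for a standard normal Z, written with the complementary error function.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

# k = 2, lower bound, Humberside leukemia data (values as printed in the results).
z, p = z_and_p(51, 37.45, 35.58)
print(f"z = {z:.2f}, P = {p:.2f}")
```

Because the inputs are rounded to two decimals, the recomputed values can differ slightly from what the program displays.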

`Ties' sometimes arise when the centers of areas (e.g. census districts) are used instead of exact place of residence. In these instances Stat calculates upper and lower bounds on T_k following the algorithms given by Jacquez (1994). Consult his paper for details. The Humberside leukemia data used centers of postal codes instead of exact place of residence, resulting in six 2-way ties and two 3-way ties. Stat automatically calculates bounds on the test statistic when ties are present. The results in the right-hand window of the results table are for the upper bound. Press the `PgDn' key to view results for the lower bound.

The results are summarized in the table below. Upper bounds were calculated by taking the case to be the nearest neighbor whenever a case and a control were tied in their distance to the case being evaluated (ego). Lower bounds were calculated by taking the control to be the nearest neighbor in the same situation. The test statistics for the lower and upper bounds are T_l and T_u, respectively. The variances and P-values for the lower and upper bounds are also in the table. The expectations of the upper and lower bounds are the same for a given number of nearest neighbors.

k	Exp T	T_l	Var T_l	P_l	T_u	Var T_u	P_u
1	18.72	24	17.66	0.1	25	17.57	0.07
2	37.45	51	35.58	0.01	54	35.72	0
3	56.17	75	54.41	0.01	78	54.34	0
4	74.89	94	73.04	0.01	98	73.56	0
5	93.61	113	93.41	0.02	117	93.15	0.01
6	112.34	127	112.64	0.08	129	112.59	0.06
7	131.06	143	133.03	0.15	144	132.41	0.13
8	149.78	159	155.4	0.23	161	155.15	0.18
9	168.5	176	179.65	0.29	178	179.35	0.24
10	187.23	194	203.43	0.32	194	203.99	0.32
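The combination of the per-row P-values across k can be illustrated with the textbook forms of the two corrections. This is a sketch under stated assumptions: the exact procedure Stat applies is described elsewhere and is not reproduced here, the function names are hypothetical, and the inputs are the rounded P_l column of the table, so the outputs are only approximate.

```python
def bonferroni(p_values):
    """Bonferroni combined P-value: m times the smallest P, capped at 1."""
    m = len(p_values)
    return min(1.0, m * min(p_values))

def simes(p_values):
    """Simes combined P-value: min over ranks i of m * p_(i) / i, capped at 1."""
    m = len(p_values)
    ordered = sorted(p_values)
    return min(1.0, min(m * p / rank for rank, p in enumerate(ordered, start=1)))

# Lower-bound P-values for k = 1..10, as printed in the table.
p_lower = [0.10, 0.01, 0.01, 0.01, 0.02, 0.08, 0.15, 0.23, 0.29, 0.32]
print("Bonferroni:", bonferroni(p_lower))
print("Simes:", simes(p_lower))
```

The Simes value is never larger than the Bonferroni value, since the rank-1 term of the Simes minimum is the Bonferroni quantity itself.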

How does one evaluate significance when faced with P-values for the upper and lower bounds (P_u and P_l)?

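- There is significant spatial clustering when both P_l and P_u are less than or equal to the chosen significance level \alpha.

- There is no spatial clustering when both P_l and P_u are greater than \alpha.

- Judgment is reserved when only one of the two is less than or equal to \alpha (typically P_u \le \alpha < P_l).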

Based on these criteria the Humberside leukemia data show significant spatial clustering for k=2, 3, 4 and 5. We conclude there is a significant spatial clustering of leukemia cases relative to the spatial distribution of the controls.
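The evaluation of the two bounds can be written as a small decision rule. The sketch below assumes the rule compares both P-values with a significance level and that the level is 0.05; both the threshold and the function name `classify` are assumptions made for illustration. The P-values are taken from the table as printed.

```python
def classify(p_lower, p_upper, alpha=0.05):
    """Classify one row from the P-values of the lower and upper bounds."""
    lower_sig = p_lower <= alpha
    upper_sig = p_upper <= alpha
    if lower_sig and upper_sig:
        return "clustering"
    if not lower_sig and not upper_sig:
        return "no clustering"
    # The bounds disagree, so the ties prevent a firm conclusion.
    return "judgment reserved"

# (k, P_l, P_u) rows from the Humberside table.
rows = [
    (1, 0.10, 0.07), (2, 0.01, 0.00), (3, 0.01, 0.00), (4, 0.01, 0.00),
    (5, 0.02, 0.01), (6, 0.08, 0.06), (7, 0.15, 0.13), (8, 0.23, 0.18),
    (9, 0.29, 0.24), (10, 0.32, 0.32),
]
clustered_k = [k for k, p_l, p_u in rows if classify(p_l, p_u) == "clustering"]
print(clustered_k)
```

Under the assumed 0.05 level this reproduces the conclusion in the text: the rows for k = 2, 3, 4 and 5 are classified as clustering, and no row falls into the reserved category.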

References:

Cuzick, J. and R. Edwards. 1990. Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society Series B, 52:73-104.
Jacquez, G. M. 1994. Cuzick and Edwards' test when exact locations are unknown. American Journal of Epidemiology, 140:58-64.
