Cuzick and Edwards Example Analysis

Cuzick and Edwards' method

 Description: Cuzick and Edwards proposed a case-control test to detect spatial clustering. The test statistic, Tk, is the sum, over all cases, of the number of each case's k nearest neighbors that also are cases. When cases are clustered the nearest neighbor to a case will tend to be another case and Tk will be large. Tk will be zero when all of the cases have controls as nearest neighbors. Their approach is attractive in that it accounts for geographic variation in population density and because it allows one to account for confounders, both known and unknown, through the judicious selection of controls.

 Notation:

- k : Number of nearest neighbors to consider

- N1 : Number of cases

- N2 : Number of controls

- N : Sample population size (N = N1 + N2)

 Define

- $\delta_i = 1$ if observation i is a case and 0 if it is a control

- $d_i^j = 1$ if the jth nearest neighbor to i is a case, 0 otherwise

 The test statistic is:

$T_k = \sum_{i=1}^{N} \delta_i \sum_{j=1}^{k} d_i^j$
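As a minimal sketch of the definition above, the following Python function counts, for each case, how many of its k nearest neighbors are also cases. The function name `cuzick_edwards_tk`, the toy coordinates, and the index-order tie-breaking are illustrative assumptions; this is not the implementation used by Stat.

```python
import math

def cuzick_edwards_tk(points, is_case, k):
    """Sum, over all cases, of the number of each case's k nearest
    neighbors that are also cases (brute-force, O(N^2 log N))."""
    n = len(points)
    total = 0
    for i in range(n):
        if not is_case[i]:
            continue  # controls contribute nothing to the sum
        # Rank every other observation by Euclidean distance to i.
        # sorted() is stable, so exact ties fall back to index order,
        # which is an arbitrary choice made only for this sketch.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        total += sum(1 for j in others[:k] if is_case[j])
    return total

# Hypothetical toy data: two adjacent cases, one isolated case, three controls.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 0.0), (9.0, 0.3)]
is_case = [True, True, False, False, True, False]

print(cuzick_edwards_tk(points, is_case, k=1))
```

The brute-force ranking is adequate for a few hundred points such as the Humberside data; a spatial index would be a reasonable substitute for larger samples.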

 Null hypothesis: The null hypothesis states that case identity is assigned at random among the N locations comprising the sample population. This is equivalent to saying the cases and controls have a common spatial density function. The expected value of the test statistic under the null hypothesis is

$E(T_k) = p k N$

 Here $p = \dfrac{N_1 (N_1 - 1)}{N (N - 1)}$. The variance under the null hypothesis is a fairly complex expression and is given in Cuzick and Edwards (1990).
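The expectation can be checked numerically. The sketch below assumes p = N1(N1 - 1) / (N(N - 1)), that is, the probability that two distinct observations drawn at random are both cases, and evaluates E(Tk) for the Humberside sample sizes (62 cases, 141 controls). The function name `expected_tk` is illustrative.

```python
def expected_tk(n_cases, n_controls, k):
    """Expected value of T_k under random labelling of N = N1 + N2 locations."""
    n = n_cases + n_controls
    # Probability that two distinct observations are both cases.
    p = n_cases * (n_cases - 1) / (n * (n - 1))
    return p * k * n

# Humberside leukemia data: 62 cases and 141 controls.
for k in range(1, 11):
    print(k, round(expected_tk(62, 141, k), 2))
```

The printed values can be compared with the `E[T]` column of the results table.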

Data screen: Locations of both cases and controls are needed to conduct the test. Case locations are usually places of residence. Controls are selected from the same population from which the cases were taken. For this example select `Space' on the horizontal methods menu, then `Cuzick and Edwards' and `Data' to access the data entry screen. Names of files containing the coordinates of the cases and controls are needed. Enter `Humberca.pnt' and `Humberco.pnt'. These data describe 62 cases of childhood leukemia diagnosed in North Humberside between 1974 and 1986. The 141 controls were selected from the birth register for each of these years. The locations are centroids of the postal codes of home addresses (from Table 6 of Cuzick and Edwards (1990)).

Run screen: Press `F10' to exit the data entry screen. Then select `Run' on the action menu and press `Enter'. The calculations may take a moment, after which the run screen is displayed. The map shows cases as `+' and controls as squares. The arrows are first nearest neighbor links between cases. The arrow's head identifies the nearest neighbor, while the arrow's tail is the case being evaluated. Two-headed arrows indicate reflexive nearest neighbors.

Examine the results table on the right side of the results screen. The first column is `k', the number of nearest neighbors. The second column, `T[k]', is the test statistic, Tk. The third column, `E[T]', is the expected value of Tk under the null hypothesis; column four is the variance of Tk under the null hypothesis. Column five is a z-score calculated as

$z = \dfrac{T_k - E(T_k)}{\sqrt{\mathrm{Var}(T_k)}}$

 That is, the z-score is (column 2 - column 3) divided by the square root of column 4; under the null hypothesis it is approximately distributed as a standard normal deviate. Column six is the probability, under the null hypothesis, of observing a Tk as large as or larger than the one given in column 2. The P-values from each row are combined using both the Bonferroni and Simes corrections, as described elsewhere.
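The z-score and upper-tail P-value can be sketched in a few lines of Python. The block below uses the lower-bound values reported for the Humberside data (Exp T, Tl and Var Tl for k = 1 to 10) and the standard normal upper tail. The `bonferroni` and `simes` functions use the textbook definitions of those corrections, which is an assumption: Stat's own combination procedure is described elsewhere and may differ in detail. All function names are illustrative.

```python
import math

def z_score(t_k, expected, variance):
    """(T_k - E(T_k)) / sqrt(Var(T_k))."""
    return (t_k - expected) / math.sqrt(variance)

def upper_tail_p(z):
    """P(Z >= z) for a standard normal deviate Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def bonferroni(p_values):
    """Textbook Bonferroni combination: m times the smallest P-value, capped at 1."""
    return min(1.0, len(p_values) * min(p_values))

def simes(p_values):
    """Textbook Simes combination: min over ranks i of m * p_(i) / i, capped at 1."""
    m = len(p_values)
    ranked = sorted(p_values)
    return min(1.0, min(m * p / rank for rank, p in enumerate(ranked, start=1)))

# (k, Exp T, Tl, Var Tl): lower-bound results for the Humberside data.
lower_rows = [
    (1, 18.72, 24, 17.66), (2, 37.45, 51, 35.58), (3, 56.17, 75, 54.41),
    (4, 74.89, 94, 73.04), (5, 93.61, 113, 93.41), (6, 112.34, 127, 112.64),
    (7, 131.06, 143, 133.03), (8, 149.78, 159, 155.4), (9, 168.5, 176, 179.65),
    (10, 187.23, 194, 203.43),
]

p_lower = [upper_tail_p(z_score(t, e, v)) for _, e, t, v in lower_rows]
for (k, _, _, _), p in zip(lower_rows, p_lower):
    print(k, round(p, 2))

print("Bonferroni:", bonferroni(p_lower))
print("Simes:", simes(p_lower))
```

The per-row P-values printed here can be compared with the lower-bound P-values reported for this example.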

 `Ties' sometimes arise when the centers of areas (e.g. census districts) are used instead of exact place of residence. In these instances Stat calculates upper and lower bounds on Tk following the algorithms given by Jacquez (1994). Consult his paper for details. The Humberside leukemia data used centers of postal codes instead of exact place of residence, resulting in six 2-way ties and two 3-way ties. Stat automatically calculates bounds on the test statistic when ties are present. The results in the right-hand window of the results table are for the upper bound. Press the `PgDn' key to view results for the lower bound.

 The results are summarized in the table below. Upper bounds were calculated by taking the case to be the nearest neighbor whenever a case (ego) was tied between a case and a control. Lower bounds were calculated by taking the control to be the nearest neighbor in the same situation. The test statistics for the lower and upper bounds are Tl and Tu, respectively. The variances and P-values for the upper and lower bounds are also in the table. The expectations of the upper and lower bounds are the same for a given number of nearest neighbors.

k     Exp T     Tl     Var Tl    Pl      Tu     Var Tu    Pu
1     18.72     24     17.66     0.1     25     17.57     0.07
2     37.45     51     35.58     0.01    54     35.72     0
3     56.17     75     54.41     0.01    78     54.34     0
4     74.89     94     73.04     0.01    98     73.56     0
5     93.61     113    93.41     0.02    117    93.15     0.01
6     112.34    127    112.64    0.08    129    112.59    0.06
7     131.06    143    133.03    0.15    144    132.41    0.13
8     149.78    159    155.4     0.23    161    155.15    0.18
9     168.5     176    179.65    0.29    178    179.35    0.24
10    187.23    194    203.43    0.32    194    203.99    0.32

 How does one evaluate significance when faced with P-values for the upper and lower bounds (Pu and Pl)?

- There is significant spatial clustering when $P_u \le P_l \le \alpha$.

- There is no spatial clustering when $\alpha < P_u \le P_l$.

- Judgment is reserved when $P_u \le \alpha < P_l$.

 Here $\alpha$ is the chosen significance level.
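The three-way decision can be written as a small function. This sketch assumes the rule is: significant when both bounds give P-values at or below the significance level, no clustering when both exceed it, and judgment reserved when only one bound is significant. The function name `classify` and the default level of 0.05 are assumptions made for illustration. The P-values are those reported for the Humberside data.

```python
def classify(p_lower, p_upper, alpha=0.05):
    """Three-way decision from the P-values of the lower and upper bounds."""
    if p_lower <= alpha and p_upper <= alpha:
        return "significant clustering"
    if p_lower > alpha and p_upper > alpha:
        return "no clustering"
    return "judgment reserved"  # the two bounds disagree

# (k, Pl, Pu) as reported for the Humberside leukemia data.
bounds = [
    (1, 0.1, 0.07), (2, 0.01, 0), (3, 0.01, 0), (4, 0.01, 0), (5, 0.02, 0.01),
    (6, 0.08, 0.06), (7, 0.15, 0.13), (8, 0.23, 0.18), (9, 0.29, 0.24),
    (10, 0.32, 0.32),
]

decisions = {k: classify(pl, pu) for k, pl, pu in bounds}
for k, decision in decisions.items():
    print(k, decision)
```

Returning a label rather than a boolean keeps the reserved-judgment outcome explicit, which matters when ties make the two bounds disagree.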

 Based on these criteria the Humberside leukemia data show significant spatial clustering for k = 2, 3, 4 and 5, where both Pl and Pu are 0.02 or smaller. We conclude that leukemia cases are significantly clustered relative to the spatial distribution of the controls.

 References:


Website maintained by Andy Long. Comments appreciated.