k-Nearest Neighbor Example

k-Nearest Neighbor method

 Description: A k nearest neighbor test for space-time interaction using point data. The test statistic, Jk , is the count of the number of case pairs that are k nearest neighbors in both space and time. When space-time interaction exists Jk will be large, since nearest neighbors in space will also tend to be nearest neighbors in time.

 Notation:

Ø NN : Short hand for `Nearest Neighbor'

Ø k nearest neighbors: the set of cases as near or nearer to a case than the kth NN.

Ø sijk : Spatial NN measure, sijk =1 if case j is a k NN of case i in space, and 0 otherwise.

Ø tijk : Time NN measure, tijk =1 if case j is a k NN of case i in time, and 0 otherwise.

Ø Jk : Cumulative test statistic,

Ø DJk : k-specific test statistic, DJk=Jk-Jk-1

 The Jk are not independent because case pairs comprising Jk when k is small are included in higher order Jk. For example, J2 is the count of the number of pairs that are second nearest neighbors in both space and time, and includes all case pairs that were first space-time nearest neighbors. DJk is the number of space-time nearest neighbors added by increasing k by 1. DJk measures space-time interaction above and beyond that observed for the k-1 nearest neighbors. Jk , on the other hand, is a cumulative measure of space-time interaction where all nearest neighbor relationships from 1 up to k are included.

 Null hypothesis: The null hypothesis states that the probability of two cases being nearest neighbors in time is independent of their status as spatial nearest neighbors. In other words, there is no association between the space and time nearest neighbor relationships.

 Significance : Significance of Jk is evaluated using an approximate randomization of the Mantel product. Let Sk denote the matrix of spatial nearest neighbor measures (the sijk), and let Tk denote the matrix of time nearest neighbor measures. The Mantel product is written as tex2html_wrap_inline79 (this matrix notation is equivalent to ). P-values are calculated by comparing the observed Jk to the distribution of Jk obtained under randomization. The elements of Tk are shuffled by permuting its rows and columns and Jk is then calculated. This procedure is repeated a fixed number of times, resulting in a distribution of Jk under the null hypothesis of no association between the space and time nearest neighbor relationships.

Data screen: Select `Space-time' on the horizontal method menu, then `k-NN' and `Data' to display the data entry screen. Enter the names of the space (`Epidemic.geo') and time (Epidemic.tmp) distance files and the location file (`Epidemic.pnt') from which they were calculated (use `Distance methods', described earlier in this Chapter, to generate distance files). The locations in Epidemic.pnt will be used to construct a linkage map showing the k nearest neighbors. Significance is evaluated by randomization, and you must enter the `Number of runs' (249).

Run screen: Press `F10' to exit the data entry screen and select `Run'. The k-NN run screen will be displayed. The top window shows the point (`Epidemic.pnt'), distance files (`Epidemic.geo',`Epidemic.tmp') and the number of simulations used to evaluate significance (249).

A linkage map is shown in the left hand window. The geographic locations of cases are shown as `+'s. The title at the top of the window indicates the nearest neighbor relationships being plotted. `k=1 space-time links' means only cases that are first nearest neighbors in both space and time are connected. Press function key `F3' to display higher order links. Notice, for k=1, that only cases within the six different disease clusters are connected. This is clear evidence of space time interaction, since it means that cases within clusters must have occurred at about the same time. But is this pattern of nearest neighbor relationships statistically significant?

Examine the table in the right hand window to answer this question. The first column is k, the number of nearest neighbors being considered. Column two is the number of space-time k nearest neighbors (Jk). Column three is the significance of Jk. Column four, DJ(k), is the number of space-time nearest neighbors added when k increased from k-1 (this is DJk). Column five is the significance, DP(k), of DJk. Each P-value reports the probability, under the null hypothesis, of observing a statistic (either Jk or DJk) as larger or larger than the one reported in the results table. The null hypothesis assumes no association between the space and time nearest neighbor relationships, and the P-values are 1-tailed.

Combining P-values : Stat! combines the P-values from the k=10 tests in 3 ways: Bonferroni, Simes and Statistical Distance. The Bonferroni and Simes P values are reported at the bottom of the table. See the section `Combining P-values' in Chapter 5 for a description of the Bonferroni and Simes techniques. Statistical Distance P is third way of combining the P-values across the k=10 tests. Let J indicate a 1 x 10 vector of the test statistics (J1,...,J10). Recall that significance of the individual Jk is evaluated by permuting the rows and columns of Tk (the time nearest neighbor relationships) and recalculating tex2html_wrap_inline79. For each randomization we obtain a J vector, which can be represented as a location in 10-space. The results under randomization form a cloud of `Number of runs' points in this 10-space. The center of the cloud is called the centroid. Significance is evaluated by comparing the statistical distance from the centroid of the observed (not randomized) vector (J) to the statistical distances from the centroid of the J vectors under randomization. The statistical distance from each point to the centroid is

tex2html_wrap_inline81

 Here di is the distance from point i to the centroid, Jil is the value of the statistic, Jk, calculated for k=1 using the data from the first randomization. The symbol s1 denotes the standard deviation of the J1 under randomization, and tex2html_wrap_inline83 is the mean of the J1. The program then counts the number of distances to the centroid that are greater than or equal to the distance from the observed J to the centroid. Call this count `NGE'. The significance is then

 Here Nruns is the number of randomizations. This P-value is a one-tailed test describing the probability, under the null hypothesis, of observing a vector of Jk (or DJk) as or more extreme than the observed.

 Determining the space-time scale of interaction: The combined P-values for `Epidemic.pnt' are all smaller than 0.05. We conclude there is significant interaction such that pairs of cases that are nearest neighbors in space tend also to be nearest neighbors in time. Now inspect the DJk to determine the scale of the space-time interaction. The P-values are significant for k=1 through k=4 (0.008, 0.004, 0.004 and 0.004), and are not significant for k equal to or larger than k=5. This demonstrates the interaction occurs at a space-time scale affecting the first through fourth nearest neighbors.

 Notes: The choice of critical distances in the Knox test can be subjective, and Mantel's test is insensitive to nonlinear associations between the space and time distances that are expected under contagion. The k nearest neighbor approach avoids these problems and is designed to be sensitive to the arrangement of spatial and temporal nearest neighbors expected under a contagious process.


Website maintained by Andy Long. Comments appreciated.