k-Nearest Neighbor Example

k-Nearest Neighbor method

Description: A k nearest neighbor test for space-time interaction using point data. The test statistic, J_k , is the count of the number of case pairs that are k nearest neighbors in both space and time. When space-time interaction exists J_k will be large, since nearest neighbors in space will also tend to be nearest neighbors in time.

Notation:

Ø NN : Short hand for `Nearest Neighbor'

Ø k nearest neighbors: the set of cases as near or nearer to a case than the kth NN.

Ø s_ijk : Spatial NN measure, s_ijk =1 if case j is a k NN of case i in space, and 0 otherwise.

Ø t_ijk : Time NN measure, t_ijk =1 if case j is a k NN of case i in time, and 0 otherwise.

Ø J_k : Cumulative test statistic,

Ø DJ_k : k-specific test statistic, DJ_k=J_k-J_k-1

The J_k are not independent because case pairs comprising J_k when k is small are included in higher order J_k. For example, J₂ is the count of the number of pairs that are second nearest neighbors in both space and time, and includes all case pairs that were first space-time nearest neighbors. DJ_k is the number of space-time nearest neighbors added by increasing k by 1. DJ_k measures space-time interaction above and beyond that observed for the k-1 nearest neighbors. J_k , on the other hand, is a cumulative measure of space-time interaction where all nearest neighbor relationships from 1 up to k are included.

Null hypothesis: The null hypothesis states that the probability of two cases being nearest neighbors in time is independent of their status as spatial nearest neighbors. In other words, there is no association between the space and time nearest neighbor relationships.

Significance : Significance of J_k is evaluated using an approximate randomization of the Mantel product. Let Sk denote the matrix of spatial nearest neighbor measures (the s_ijk), and let Tk denote the matrix of time nearest neighbor measures. The Mantel product is written as (this matrix notation is equivalent to ). P-values are calculated by comparing the observed J_k to the distribution of J_k obtained under randomization. The elements of Tk are shuffled by permuting its rows and columns and J_k is then calculated. This procedure is repeated a fixed number of times, resulting in a distribution of J_k under the null hypothesis of no association between the space and time nearest neighbor relationships.

Data screen: Select `Space-time' on the horizontal method menu, then `k-NN' and `Data' to display the data entry screen. Enter the names of the space (`Epidemic.geo') and time (Epidemic.tmp) distance files and the location file (`Epidemic.pnt') from which they were calculated (use `Distance methods', described earlier in this Chapter, to generate distance files). The locations in Epidemic.pnt will be used to construct a linkage map showing the k nearest neighbors. Significance is evaluated by randomization, and you must enter the `Number of runs' (249).

Run screen: Press `F10' to exit the data entry screen and select `Run'. The k-NN run screen will be displayed. The top window shows the point (`Epidemic.pnt'), distance files (`Epidemic.geo',`Epidemic.tmp') and the number of simulations used to evaluate significance (249).

A linkage map is shown in the left hand window. The geographic locations of cases are shown as `+'s. The title at the top of the window indicates the nearest neighbor relationships being plotted. `k=1 space-time links' means only cases that are first nearest neighbors in both space and time are connected. Press function key `F3' to display higher order links. Notice, for k=1, that only cases within the six different disease clusters are connected. This is clear evidence of space time interaction, since it means that cases within clusters must have occurred at about the same time. But is this pattern of nearest neighbor relationships statistically significant?

Examine the table in the right hand window to answer this question. The first column is k, the number of nearest neighbors being considered. Column two is the number of space-time k nearest neighbors (J_k). Column three is the significance of J_k. Column four, DJ(k), is the number of space-time nearest neighbors added when k increased from k-1 (this is DJ_k). Column five is the significance, DP(k), of DJ_k. Each P-value reports the probability, under the null hypothesis, of observing a statistic (either J_k or DJ_k) as larger or larger than the one reported in the results table. The null hypothesis assumes no association between the space and time nearest neighbor relationships, and the P-values are 1-tailed.

Combining P-values : Stat! combines the P-values from the k=10 tests in 3 ways: Bonferroni, Simes and Statistical Distance. The Bonferroni and Simes P values are reported at the bottom of the table. See the section `Combining P-values' in Chapter 5 for a description of the Bonferroni and Simes techniques. Statistical Distance P is third way of combining the P-values across the k=10 tests. Let J indicate a 1 x 10 vector of the test statistics (J₁,...,J₁₀). Recall that significance of the individual J_k is evaluated by permuting the rows and columns of Tk (the time nearest neighbor relationships) and recalculating . For each randomization we obtain a J vector, which can be represented as a location in 10-space. The results under randomization form a cloud of `Number of runs' points in this 10-space. The center of the cloud is called the centroid. Significance is evaluated by comparing the statistical distance from the centroid of the observed (not randomized) vector (J) to the statistical distances from the centroid of the J vectors under randomization. The statistical distance from each point to the centroid is

tex2html_wrap_inline81

Here d_i is the distance from point i to the centroid, J_il is the value of the statistic, J_k, calculated for k=1 using the data from the first randomization. The symbol s₁ denotes the standard deviation of the J₁ under randomization, and is the mean of the J₁. The program then counts the number of distances to the centroid that are greater than or equal to the distance from the observed J to the centroid. Call this count `NGE'. The significance is then

Here Nruns is the number of randomizations. This P-value is a one-tailed test describing the probability, under the null hypothesis, of observing a vector of J_k (or DJ_k) as or more extreme than the observed.

Determining the space-time scale of interaction: The combined P-values for `Epidemic.pnt' are all smaller than 0.05. We conclude there is significant interaction such that pairs of cases that are nearest neighbors in space tend also to be nearest neighbors in time. Now inspect the DJ_k to determine the scale of the space-time interaction. The P-values are significant for k=1 through k=4 (0.008, 0.004, 0.004 and 0.004), and are not significant for k equal to or larger than k=5. This demonstrates the interaction occurs at a space-time scale affecting the first through fourth nearest neighbors.

Notes: The choice of critical distances in the Knox test can be subjective, and Mantel's test is insensitive to nonlinear associations between the space and time distances that are expected under contagion. The k nearest neighbor approach avoids these problems and is designed to be sensitive to the arrangement of spatial and temporal nearest neighbors expected under a contagious process.

Website maintained by Andy Long. Comments appreciated.