Moran's I Example Analysis

Moran's I method

Description: Moran's I is a weighted correlation coefficient used to detect departures from spatial randomness. Departures from randomness indicate spatial patterns such as clusters. Other kinds of pattern include geographic trend.

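The test statistic (Moran's I) is

$$I = \frac{N}{S_0} \, \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

where $x_i$ is the rate in area i, $\bar{x}$ is the mean rate over the N areas, $w_{ij}$ is the weight connecting areas i and j, and $S_0 = \sum_i \sum_j w_{ij}$ is the sum of all weights.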

Moran's I is very similar to the product-moment correlation coefficient, except for the addition of the weight terms (w_ij). The weights reflect how `connected' we think two areas are, and usually reflect geographic proximity. Moran's I is used to determine whether `connected' areas are more similar to one another than would be expected under spatial randomness. It asks: Do the rates in connected areas covary? Moran's I is greater than its expectation when connected areas have similar rates. Moran's I is smaller than its expectation when rates in connected areas are dissimilar. Weights quantify our hypothesis about how similar we think rates in the different areas ought to be.
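The comparison with the product-moment correlation can be made concrete with a few lines of code. The following is a minimal sketch in Python with NumPy, not Stat's own implementation; the function name `morans_i` and the four-area example are assumptions made for illustration.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I for rates x (length N) and an N x N weight matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                    # deviations from the mean rate
    s0 = w.sum()                        # sum of all weights
    cross = (w * np.outer(z, z)).sum()  # weighted cross-products of deviations
    return len(x) / s0 * cross / (z ** 2).sum()

# Hypothetical example: four areas in a row, each connected to its neighbours.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

print(morans_i([1, 2, 3, 4], w))  # connected areas have similar rates
print(morans_i([1, 4, 1, 4], w))  # connected areas have dissimilar rates
```

With four areas the expectation of I is -1/3, so the first set of rates gives a value above the expectation and the second a value below it.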

Null hypothesis: The observed rates are assigned at random among the N counties. This null hypothesis is also called `csr' for `complete spatial randomness'. The expectation of I under the null hypothesis is:

$$E(I) = -\frac{1}{N - 1}$$

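The expectation becomes close to 0 as N increases. The variance of I is determined under two null hypotheses or assumptions (Cliff and Ord 1981): normality (denoted N) or randomization (denoted R). Under assumption N the rates are sampled from a population whose distribution is normal. Under assumption R the rates are random samples from a population whose distribution is unknown. Assumption N is useful when we have good reason to believe the observations follow a normal distribution. Assumption R is less restrictive and, since we often don't know the theoretical distribution of disease rates, is appropriate for them. The variance under assumption N is

$$\mathrm{Var}_N(I) = \frac{N^2 S_1 - N S_2 + 3 S_0^2}{S_0^2 (N^2 - 1)} - E(I)^2$$

where $S_0 = \sum_i \sum_j w_{ij}$, $S_1 = \frac{1}{2} \sum_i \sum_j (w_{ij} + w_{ji})^2$ and $S_2 = \sum_i \left( \sum_j w_{ij} + \sum_j w_{ji} \right)^2$.

Under assumption R the variance is

$$\mathrm{Var}_R(I) = \frac{N \left[ (N^2 - 3N + 3) S_1 - N S_2 + 3 S_0^2 \right] - b_2 \left[ (N^2 - N) S_1 - 2N S_2 + 6 S_0^2 \right]}{(N - 1)(N - 2)(N - 3) S_0^2} - E(I)^2$$

where $b_2 = m_4 / m_2^2$ is the sample kurtosis of the rates, with $m_k = \frac{1}{N} \sum_i (x_i - \bar{x})^k$.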

(see this document for more details).
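As a rough illustration of the expectation and the two variances, the sketch below computes them from the rates and the weight matrix. It assumes the standard Cliff and Ord (1981) moment formulas and is not taken from Stat itself; the function name `moran_moments` and the example weights are hypothetical.

```python
import numpy as np

def moran_moments(x, w):
    """Return E(I), Var_N(I) and Var_R(I); requires more than 3 areas."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(x)
    z = x - x.mean()
    s0 = w.sum()
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=1) + w.sum(axis=0)) ** 2).sum()
    b2 = n * (z ** 4).sum() / (z ** 2).sum() ** 2  # sample kurtosis m4 / m2^2
    e_i = -1.0 / (n - 1)
    # Assumption N (normality)
    var_n = (n * n * s1 - n * s2 + 3 * s0 ** 2) / (s0 ** 2 * (n * n - 1)) - e_i ** 2
    # Assumption R (randomization)
    a = n * ((n * n - 3 * n + 3) * s1 - n * s2 + 3 * s0 ** 2)
    b = b2 * ((n * n - n) * s1 - 2 * n * s2 + 6 * s0 ** 2)
    var_r = (a - b) / ((n - 1) * (n - 2) * (n - 3) * s0 ** 2) - e_i ** 2
    return e_i, var_n, var_r

# Hypothetical example: four areas in a row, each connected to its neighbours.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(moran_moments([1, 2, 3, 4], w))
```

Under assumption R the variance depends on the observed rates through the kurtosis term, whereas under assumption N it depends on the weights alone.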

Significance: Stat evaluates the significance of Moran's I under assumptions R and N, and also by simulation. Under simulation the rates are repeatedly randomized over the areas and Moran's I is recalculated. This operation is repeated a set number of times to generate a distribution of Moran's I under simulation. Let Nruns be the number of runs in the simulation, and NGE the number of times Moran's I under simulation was greater than or equal to Moran's I obtained for the original (not randomized) data. Significance under simulation is then:

$$P_S = \frac{\mathrm{NGE} + 1}{\mathrm{Nruns} + 1}$$

This is a 1-tailed P-value for a spatial clustering of rates. Moran's I will in some instances be smaller than its expectation, and Stat therefore reports a 2-tailed P-value, calculated as 2P_S. For assumptions N and R Stat calculates two z-scores as:

$$z_N = \frac{I - E(I)}{\sqrt{\mathrm{Var}_N(I)}}$$

and

$$z_R = \frac{I - E(I)}{\sqrt{\mathrm{Var}_R(I)}}$$

where $\mathrm{Var}_N(I)$ and $\mathrm{Var}_R(I)$ are the variances of I under assumptions N and R. These z-scores express the difference between the observed and expected value of I in standard deviation units. The distribution of the z-scores is approximately normal with a mean of 0 and a variance of 1.0. Stat reports a two-tailed P-value because spatial pattern is of interest both when Moran's I is above its expectation (rates in connected areas are similar) and when it is below (rates in connected areas are dissimilar).
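The simulation procedure and the conversion of a z-score to a two-tailed P-value can be sketched as follows. This is an illustration under stated assumptions rather than Stat's code: the one-tailed value is taken to be (NGE + 1)/(Nruns + 1), which matches the P=0.02 reported for 99 runs in the Lyme example, and the names `simulation_p` and `two_tailed_p` and the `seed` argument are hypothetical.

```python
import math
import numpy as np

def morans_i(x, w):
    """Moran's I for rates x and an N x N weight matrix w."""
    z = np.asarray(x, dtype=float) - np.mean(x)
    w = np.asarray(w, dtype=float)
    return len(z) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

def simulation_p(x, w, n_runs=99, seed=0):
    """Randomize rates over areas n_runs times; return I, 1-tailed P, 2-tailed P."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, w)
    nge = 0
    for _ in range(n_runs):
        if morans_i(rng.permutation(x), w) >= observed:
            nge += 1                      # simulated I at least as large as observed
    p_one = (nge + 1) / (n_runs + 1)
    return observed, p_one, 2 * p_one     # 2-tailed value reported as 2 * P_S

def two_tailed_p(z):
    """Two-tailed P-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical example: four areas in a row, each connected to its neighbours.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(simulation_p([1, 2, 3, 4], w, n_runs=99))
print(two_tailed_p(1.96))
```

Because the one-tailed value can never fall below 1/(Nruns + 1), a larger number of runs is needed to report smaller P-values under simulation.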

Connection weights: You must supply both the rates in the different areas and the weights (the w_ij) between pairs of areas. Weights can be specified in many ways, but usually are based on geographic adjacency, distance, or some other criterion reflecting the alternative spatial model.

In the absence of more detailed knowledge, areas are often connected based on geographic adjacency. Two areas, i and j, are assigned a weight of 1 when they share a common border. Otherwise they are assigned a weight of 0. The alternative spatial model under this simple scheme states that rates in spatially contiguous areas are correlated. This is a reasonable model to use when you are interested in detecting contiguous areas with similar disease rates.

The weights themselves may be assigned values other than 0 or 1. A common scheme is to calculate weights based on geographic distance. Let d_ij be the geographic distance between areas i and j. The weight may then be calculated as an inverse distance function, such as

$$w_{ij} = \frac{1}{d_{ij}}$$

The alternative spatial model then states that rates in nearby, not necessarily contiguous, areas are correlated. This is a reasonable model to use when you think processes on a large geographic scale may be causing rates in distant, as well as adjacent, areas to covary.

Finally, one may specify weights based on some other criterion reflecting the alternative spatial model. For example, suppose you suspect a common municipal water supply is responsible for an outbreak of Giardia. You might then specify weights so that pairs of counties on the municipal water supply receive a weight of 1, and 0 otherwise. This reflects your alternative spatio-epidemiologic hypothesis stating that counties receiving municipal water will have similar Giardia rates. Obviously, there are many ways to specify the weights. The rule of thumb is to use a weighting scheme which reflects your alternative hypothesis.
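Two of the weighting schemes described above can be sketched in code. This is only an illustration: the simple reciprocal 1/d_ij is one possible inverse distance function, the use of area centroids to measure distance is an assumption, and the function names `inverse_distance_weights` and `membership_weights` are hypothetical.

```python
import numpy as np

def inverse_distance_weights(coords):
    """w_ij = 1 / d_ij from centroid coordinates; w_ii is left at 0."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    w = np.zeros_like(d)
    positive = d > 0
    w[positive] = 1.0 / d[positive]
    return w

def membership_weights(members, n):
    """w_ij = 1 when areas i and j (1-based) both belong to `members`, else 0."""
    w = np.zeros((n, n))
    idx = np.array(sorted(members)) - 1     # convert to 0-based indices
    w[np.ix_(idx, idx)] = 1.0
    np.fill_diagonal(w, 0.0)                # an area is not connected to itself
    return w

# Hypothetical centroids for three areas, and a shared water supply for areas 1 and 3.
print(inverse_distance_weights([(0, 0), (3, 4), (0, 1)]))
print(membership_weights({1, 3}, n=4))
```

Both functions return symmetric matrices, so either can be used wherever a symmetric set of weights is required.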

Input file formats: The Moran method requires two input files: a connection file and a Moran data file. The connection file describes the weights. Each row has the format:

i  w  C  j1  j2 ... jk

The i indicates the county under consideration; w is the value of the weight to be assigned; C is the number of counties that are to be connected to county i using weight w; j1 through jk are the numbers of the counties that are to be connected to county i using weight w, so that k = C.

For example, suppose you want to analyze Lyme disease rates in the counties comprising Georgia.

You take out the county map of Georgia and assign a unique number to each county. There are 159 counties, and you number them from 1 to 159. Be sure to assign consecutive integers, as these are used by Stat as indices into the matrix of the weights. You are interested in spatial clustering, and decide to assign pairs of adjacent counties (those sharing a common border) a weight of 1, and 0 otherwise. You then write down a list of the weights based on this adjacency criterion. This has already been accomplished for you and the weights are in the file `Lyme.con'. The first line looks like this:

    1    1    4    7    8   10   13

The first `1' identifies the first county in the pair of counties you are assigning weights to. The second `1' is the weight you are assigning. The `4' indicates you are assigning weights of 1 between county 1 and 4 other counties. The numbers `7 8 10 13' are the 4 other counties. This line tells us that county 1 has common borders with counties 7, 8, 10 and 13. Weights w_1,7, w_1,8, w_1,10 and w_1,13 have been set equal to 1. Stat assumes weights are 0 unless told otherwise, and you need to record only non-zero weights.

Stat reads the weights into a symmetric matrix, which reduces the number of weights that need to be stored from N^2 to N(N-1)/2. The symmetry condition means that w_ij = w_ji. For example, if area 1 has a border with area 2 then area 2 has a border with area 1. If for some reason your weights are not symmetric then symmetrize them before use, for example by replacing w_ij and w_ji with their average.
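To make the connection file format concrete, here is a minimal sketch of a reader that fills a symmetric weight matrix from rows of the form `i w C j1 ... jk`. How Stat itself parses the file is not documented here, so the function name `read_connections` and the error handling are assumptions.

```python
import numpy as np

def read_connections(lines, n):
    """Build an n x n symmetric weight matrix from connection-file rows."""
    w = np.zeros((n, n))                   # weights are 0 unless a row says otherwise
    for line in lines:
        fields = line.split()
        if not fields:
            continue                       # skip blank rows
        i = int(fields[0])
        weight = float(fields[1])
        count = int(fields[2])
        neighbours = [int(tok) for tok in fields[3:3 + count]]
        if len(neighbours) != count:
            raise ValueError(f"row for area {i}: expected {count} connected areas")
        for j in neighbours:
            w[i - 1, j - 1] = weight       # areas are numbered from 1
            w[j - 1, i - 1] = weight       # mirror the entry so w_ij = w_ji
    return w

# First row of the example connection file, for the 159 Georgia counties.
w = read_connections(["    1    1    4    7    8   10   13"], n=159)
print(w[0, 6], w[6, 0], w.sum())
```

To read a whole file, the open file object can be passed as `lines`, since iterating over it yields one row at a time.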

The Moran data file contains the disease rates, and has the format:

         
x₁  label₁
x₂  label₂
     .
     .
     .
x_N label_N

By row, x₁ is the rate in county 1, x₂ is the rate in county 2 and x_N is the rate in county N. Optional labels may be entered, but there must be one area per record, and the disease rate (x) must be the first value in each record. Lyme disease rates are in the file `Lyme.dat'. The first four rows, which have been assigned numbers 1 - 4, are:

      1.2506 Fannin
       1.717 Rabun
           0 Dade
      0.1714 Walker

It is essential that rates in the Moran data file register correctly with the numbering system in the connection file! The first row in the Moran data file is assigned the index 1, the second row index 2 and so on. Thus Fannin is county 1, and Rabun is county 2.
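The registration between row order and county index can be illustrated with a short reader for the Moran data file. This is a sketch only; the function name `read_rates` is hypothetical and the sample rows are the four shown above.

```python
def read_rates(lines):
    """Read rates and optional labels; row k (1-based) belongs to area k."""
    rates, labels = [], []
    for line in lines:
        fields = line.split(None, 1)       # rate first, then the rest as a label
        if not fields:
            continue                       # skip blank rows
        rates.append(float(fields[0]))
        labels.append(fields[1].strip() if len(fields) > 1 else "")
    return rates, labels

sample = [
    "      1.2506 Fannin",
    "       1.717 Rabun",
    "           0 Dade",
    "      0.1714 Walker",
]
rates, labels = read_rates(sample)
for index, (rate, label) in enumerate(zip(rates, labels), start=1):
    print(index, label, rate)              # index matches the connection file numbering
```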

Input screen: Access the Moran data screen by selecting `Space' from the horizontal method menu, then `Moran' and `Data'. Enter `Lyme.dat' in the `Data file:' field and `Lyme.con' in the `Connection file:' field. Enter `99' for `Number of runs'. This is the number of simulations that will be run to evaluate significance under simulation. Then press `F10' to exit.

In our example we're using Moran's I to evaluate a possible spatial cluster of Lyme disease in 159 counties in Georgia. Lyme disease is caused by a spirochete transmitted to humans by the deer tick Ixodes scapularis. In 1989, 715 cases of Lyme disease were reported in Georgia, 12 times the number of cases reported in 1988 (McKinley et al. 1990) and more than 10 times the rate observed in surrounding states. Are counties high in Lyme disease spatially clustered, or are county Lyme disease rates independent of the rates in adjoining counties? Spatial clustering may be caused by transmission of Lyme disease among a geographically widespread white-tailed deer population, in which case intervention and control strategies would be implemented in the affected counties simultaneously. On the other hand, spatial independence suggests disease transmission on a small spatial scale, and intervention would need to focus within counties.

Run screen: Select `Run' and press `Enter'. The Moran run screen will appear and there will be a delay as the 99 simulations are conducted. When the number of areas is greater than 25 the P-value obtained for assumption R closely approximates the P-value under simulation, and you can set the `Number of runs' to `1' to greatly reduce run time. We've set `Number of runs' to `99' to illustrate the use of Stat's simulation feature. A counter will display the number of the simulation being conducted.

The plot on the left of the display shows the distribution of Moran's I under simulation. The thin vertical line is Moran's I for the original (not randomized) data. It is I=0.178594, and is well outside the distribution. Notice, however, that the P-value under simulation (right hand window) is P=0.02. This is a necessary consequence of our selecting Nruns=99: none of the 99 simulated values reached the observed I, so NGE=0 and the 2-tailed P-value is 2(0+1)/(99+1) = 0.02. Under simulation, the number of runs determines the resolution of the P-value.
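The dependence of the P-value resolution on the number of runs can be checked with a one-line calculation. The sketch assumes the 2-tailed value is 2(NGE + 1)/(Nruns + 1), which reproduces the P=0.02 shown for 99 runs; the function name `smallest_two_tailed_p` is hypothetical.

```python
def smallest_two_tailed_p(n_runs):
    """Smallest 2-tailed P-value obtainable under simulation (NGE = 0)."""
    return 2 * (0 + 1) / (n_runs + 1)

for n_runs in (99, 999, 9999):
    print(n_runs, smallest_two_tailed_p(n_runs))
```

To report a simulated P-value below 0.02, more than 99 runs are therefore required, however extreme the observed I.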

The observed value of I=0.178594 is highly significant (P<<0.001) under assumption R. We conclude there is spatial clustering of Lyme disease in Georgia such that counties with common borders tend to have similar Lyme disease rates. It turns out the highest rates occur in the Piedmont mid-section, a part of Georgia which also has the highest density of white-tailed deer. The results suggest Lyme disease is transmitted among a geographically widespread herd. Intervention efforts designed to interrupt the life cycle of the deer tick would need to target all counties within the Piedmont.

Notes: Moran's I requires full enumeration of the connections among the observations, which may be a problem when the number of areas becomes large. When full enumeration isn't possible use Grimson's method, and estimate the Grimson input data from a sample of areas. Moran's I is biased by large differences in population size across areas. Use Ipop when population size data are available.

Website maintained by Andy Long. Comments appreciated.