Ipop Example Analysis

Moran's I adjusted for population size (Ipop)

Description: Ipop is used to detect departures from spatial randomness but, unlike Moran's I, it accounts for differences in population size across areas. Large differences in population size, if ignored, reduce the ability of Moran's I to detect departures from spatial randomness, which may cause one to miss a true cluster.

The test statistic is

Notice that Ipop is large under two clustering scenarios: first, when cases cluster within counties (because e_i - d_i, the difference between an area's share of the cases and its share of the population, becomes large), and second, when counties with many cases are adjacent. The range of Ipop depends on population size, and it is useful to standardize the statistic using

Null hypothesis: The null hypothesis assumes the probability of a case occurring in an area is given by the proportion of the total population in that area. In other words, geographic variation in the number of cases is expected to follow geographic variation in population size. This null hypothesis assumes the population is homogeneous through geographic space in all characteristics other than population size. Under this null hypothesis the expectation of Ipop is

E[Ipop] = -1/(X - 1), where X is the total population size; the expectation is therefore very close to zero. The variance of Ipop under the null hypothesis is

Stat also calculates an approximation for the variance which is

(see this document for more details).

Significance: Significance is evaluated using three approaches: by simulation, under the randomization assumption and by approximation.

Under simulation the cases are allocated at random among the counties using a multinomial probability distribution given by the proportion of the total population in each county, and Ipop is recalculated. This operation is performed `Number of runs' times, resulting in a distribution of Ipop under randomization. Let Nruns be the number of runs in the simulation, and NGE the number of times Ipop under simulation was greater than or equal to the Ipop obtained for the original (not randomized) data. Significance under simulation is then: P_S = (NGE + 1)/(Nruns + 1).

This is a 1-tailed P-value for a spatial clustering of rates. Some spatial patterns in rates will cause Ipop to be smaller than its expectation, and Stat therefore reports a 2-tailed P-value, calculated as 2P_S.
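The simulation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Stat's implementation: the function names are hypothetical, `toy_statistic` is a simple stand-in and is NOT Ipop, the P-value uses the common Monte Carlo convention (NGE + 1)/(Nruns + 1), and the two-tailed value is capped at 1, which is an added safeguard. The populations and case counts are the first four records of `LymeIpop.dat'.

```python
import random


def toy_statistic(counts, populations):
    # Hypothetical stand-in, NOT Ipop: the largest excess of an area's
    # share of the cases over its share of the population.
    n_cases, total_pop = sum(counts), sum(populations)
    return max(c / n_cases - p / total_pop
               for c, p in zip(counts, populations))


def simulation_p_value(observed, populations, total_cases, statistic,
                       n_runs=99, seed=0):
    """Monte Carlo P-values under the population-proportional null."""
    rng = random.Random(seed)
    areas = range(len(populations))
    n_ge = 0
    for _ in range(n_runs):
        # One multinomial draw: each case lands in an area with
        # probability proportional to that area's population.
        counts = [0] * len(populations)
        for area in rng.choices(areas, weights=populations, k=total_cases):
            counts[area] += 1
        if statistic(counts, populations) >= observed:
            n_ge += 1
    # Assumed convention: the observed data count as one of the runs.
    p_one_tailed = (n_ge + 1) / (n_runs + 1)
    # Two-tailed value is 2 * P_S; the cap at 1 is an added safeguard.
    p_two_tailed = min(1.0, 2 * p_one_tailed)
    return n_ge, p_one_tailed, p_two_tailed


populations = [15992, 11648, 13147, 58340]
cases = [2, 2, 0, 1]
observed = toy_statistic(cases, populations)
n_ge, p_one, p_two = simulation_p_value(
    observed, populations, sum(cases), toy_statistic, n_runs=99)
print(n_ge, p_one, p_two)
```

Passing the statistic as a function keeps the allocation and counting logic separate from the statistic itself, so the same loop could be reused with a real Ipop implementation.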

The null hypothesis is consistent with the randomization assumption described in the previous section for Moran's I. Under this assumption Stat calculates a z-score as: z = (Ipop - E[Ipop]) / sqrt(Var[Ipop])

This expresses the difference between the observed and expected value of Ipop in standard deviation units. The distribution of the z-score is approximately normal with a mean of 0 and a variance of 1.0. Stat reports a two-tailed P-value because spatial pattern is of interest both when Ipop is positive (rates in connected areas are similar) and when it is negative (rates in connected areas are dissimilar).
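As a rough illustration of the z-score approach, the sketch below converts an observed value, an expectation and a variance into a z-score and a two-tailed P-value from the standard normal distribution. The function name and the input numbers are hypothetical and chosen only to exercise the arithmetic; they are not Ipop results.

```python
import math


def z_score_p_value(observed, expected, variance):
    """Return (z, two-tailed P) assuming z is standard normal under the null."""
    z = (observed - expected) / math.sqrt(variance)
    # Two-tailed tail area: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed


# Hypothetical inputs, for illustration only.
z, p = z_score_p_value(observed=1.96, expected=0.0, variance=1.0)
print(z, p)
```

Using `math.erfc` avoids the loss of precision that comes from computing 1 minus a value very close to 1 when the z-score is large.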

Connection weights: The weights specify the alternative spatial model, that is, how `connected' the areas are. The greatest statistical power is obtained using the following weight schedule:

- w_ii = 2/d_i, cases are in the same area

- w_ij = 1/sqrt(d_i d_j), cases are from adjacent areas

- w_ij = 0, cases are from areas that are not adjacent

This scheme captures the geography of the areas and gives increased weight to areas with smaller populations. Stat uses these rules when reading weights from a file:

- When 2 cases are in the same area, it assigns w_ii = 2/d_i.

- Weights for two different areas, i and j, are set to w_ij = 0 unless otherwise specified in the connection file.

- Weights read from the connection file are divided by sqrt(d_i d_j).
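The rules above can be sketched as a small function that builds a weight matrix. Several things here are assumptions rather than documented behaviour: d_i is taken to be area i's share of the total population, connection-file weights are taken to be divided by sqrt(d_i d_j), the matrix is filled symmetrically, and the function name, the 0-based indices and the `connections` dictionary are hypothetical. The connection-file format itself is not parsed here.

```python
import math


def build_weights(populations, connections):
    """Build an m x m weight matrix.

    populations: at-risk population of each area, in file order.
    connections: {(i, j): raw_weight} for pairs of different areas,
                 0-based indices (hypothetical representation).
    """
    total = sum(populations)
    # Assumption: d_i is area i's proportion of the total population.
    d = [p / total for p in populations]
    m = len(d)
    # Pairs not listed in the connection data keep weight 0.
    w = [[0.0] * m for _ in range(m)]
    for i in range(m):
        # Same-area weight is fixed by rule, not read from the file.
        w[i][i] = 2.0 / d[i]
    for (i, j), raw in connections.items():
        if i == j:
            continue  # diagonal entries are never overridden
        # Assumption: file weights are divided by sqrt(d_i * d_j).
        scaled = raw / math.sqrt(d[i] * d[j])
        w[i][j] = scaled
        w[j][i] = scaled  # assumption: connections are symmetric
    return w


# First four populations from `LymeIpop.dat'; the adjacency is made up.
populations = [15992, 11648, 13147, 58340]
weights = build_weights(populations, {(0, 1): 1.0})
print(weights[0][0], weights[0][1], weights[0][2])
```

Because every weight is divided by a population share, areas with smaller populations receive larger weights, which matches the description of the scheme.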

Input file formats: Ipop uses the same connection file format as the Moran method. However, the Moran method assumes w_ii = 0, while Ipop assumes w_ii = 2/d_i. Also, Ipop divides the weights you enter for 2 different areas by sqrt(d_i d_j), and the Moran method does not. See the section on connection weights under `Moran's I' in this Chapter for a description of the file format.

The Ipop data file describes the number of cases and sizes of the at risk population by area, and has the format:

n₁  x₁  label₁
n₂  x₂  label₂
    .
    .
    .
n_m  x_m  label_m

Here n₁ is the population at risk in area 1, x₁ is the number of cases in area 1, n₂ and x₂ are the population size and number of cases in area 2, and so on. The label is optional, but there must be one area per record, and both n (population size) and x (number of cases) must be given for each area. Consider the first four lines of the file `LymeIpop.dat'.

       15992        2
       11648        2
       13147        0
       58340        1

The population size of county 1 is 15,992. Of these, there were 2 cases of Lyme disease reported. Stat assigns the first row in the data file the number 1, the second row the number 2 and so on. The order in which areas are entered into the file must correspond to the county numbering system used for the weights.
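A minimal sketch of reading this format is shown below. The function name and the returned dictionary keys are hypothetical, and skipping blank lines is an assumption; Stat's own parser may behave differently. Areas are numbered from 1 in file order, as described above.

```python
def parse_ipop_data(text):
    """Parse whitespace-separated records: population, cases, optional label."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) < 2:
            raise ValueError(
                f"line {line_number}: need population and case count")
        population, cases = int(fields[0]), int(fields[1])
        # Anything after the two numbers is treated as the optional label.
        label = " ".join(fields[2:]) or None
        records.append({
            "area": len(records) + 1,  # row 1 is area 1, row 2 is area 2, ...
            "population": population,
            "cases": cases,
            "label": label,
        })
    return records


sample = """\
       15992        2
       11648        2
       13147        0
       58340        1
"""
records = parse_ipop_data(sample)
total_population = sum(r["population"] for r in records)
total_cases = sum(r["cases"] for r in records)
print(records[0], total_population, total_cases)
```

Keeping the area number alongside each record makes it easier to check that the row order matches the numbering used in the connection file.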

While the number of cases may be fairly easy to determine (count the number of cases over a defined time span in each area), the size of the population at risk requires some thought. For Lyme disease we assumed all people within a county were equally at risk, and used the county population for the size of the population at risk. This would be incorrect for an illness such as childhood leukemia, where the population at risk would be the number of children in each county.

Input screen: Consider the Lyme disease data described earlier in the section on Moran's I. Enter `LymeIpop.dat' as the Data file, and `Lyme.con' as the Connection file. Set the number of Monte Carlo runs to 99.

Run Screen: Begin calculations. You may have to wait a while for the simulations to complete. In practice, when the number of areas is greater than 25, the P-value obtained under the randomization assumption closely approximates the P-value obtained under simulation. Georgia has 159 counties, so we could have set the `number of runs' to 0 to turn off the simulations. We set the number of runs to 99 to illustrate the use of Stat's simulation feature. Results for the significance and variance under assumption R are displayed whether or not simulations are run.

After the calculations complete, the distribution of Ipop is shown in the left window. The observed value, Ipop=3.0622060, is shown by the solid vertical line and lies well outside the distribution obtained under simulation. It is also highly significant under assumption R. This confirms the results obtained using Moran's I, and we conclude there is evidence of spatial clustering of Lyme disease cases among Georgia counties. We have increased confidence in this result because differences in county population size have been accounted for.

Notes: Ipop has greater statistical power than Moran's I. Use Moran's I when only rates are available, and Ipop when the numerator and denominator in each area (number of cases and size of the population at risk) are known.

Website maintained by Andy Long. Comments appreciated.