The following notice appeared in 1991:
The ASA Section on Statistical Graphics is sponsoring a special exposition entitled "Statistics in Public Health Surveillance," as a poster session at the August 1991 Joint Statistical Meetings in Atlanta. Its purpose is to provide a forum for ASA members to present innovative graphical and analytic techniques for addressing questions of importance to public health control and prevention efforts.
The Centers for Disease Control maintains a large database as part of its National Notifiable Diseases Surveillance System. Two subsets of data from this system are available: the first describes the number of cases of mystery disease 1 reported each month by state during the years 1953-1989. The second set of data presents recent individual cases of mystery disease 2, including information on date of report, state, age, race, and sex for each case. Public health literature has documented important relationships amongst these variables.
Work alone or put together a team of data analysts to look at this data! Although the description that accompanies the data includes several questions of interest, an exploratory analysis seeking and illustrating general relationships in the data is appropriate. The data sets are moderately large (about 50K numbers), although they can be summarized to make smaller problems suitable for class projects.
Your mission, should you choose to accept it, is to perform that exploratory data analysis, "seeking and illustrating general relationships in the data". You obviously cannot do much of an analysis in two hours: pick some aspect of the data, or some question suggested below, and attempt to shed a ray of light on the disease(s).
The following additional background information was also provided, including some questions which you might attempt to address with your analysis. Even if you can't accomplish the full analysis in two hours, ask yourself how you MIGHT go about performing it! In the evaluation below, I ask you to also provide information on where you were stymied - what skills you wish you had had to help you analyze the data.
Reporting of cases of communicable disease is necessary for planning and evaluation of disease prevention and control programs, in the assurance of appropriate medical therapy, and in the detection of outbreaks. Systematic reporting of diseases in the United States began in 1874 when the Massachusetts State Board of Health inaugurated weekly voluntary reporting of diseases by physicians. The authority to require notification of cases of disease now resides in the respective State legislatures, State epidemiologists, or boards of health. The Centers for Disease Control in partnership with the Council of State and Territorial Epidemiologists (CSTE) operates the National Notifiable Diseases Surveillance System (NNDSS) to provide weekly provisional information on the occurrence of diseases that are defined as "notifiable" by CSTE. The NNDSS data are based on reports by State epidemiologists, who themselves receive reports from a variety of sources, such as individual practitioners, hospitals, laboratories, and health departments. Reports are received from all States, Washington, D.C., New York City, and 5 United States territories (Puerto Rico, Virgin Islands, American Samoa, Guam, and the Commonwealth of the Northern Mariana Islands).
Tools for this surveillance system are continually improving. The National Electronic Telecommunications System for Surveillance (NETSS) is a computer-based system begun in 1984 for reporting disease surveillance information to CDC. The computerized system allows more case detail and analytic capability than previously, when only aggregate case counts were available by telephone; disease distribution can now be mapped by county, onset dates of disease can be examined, and comparative information on the distribution of age, race, and sex of case patients is available.
The usefulness of surveillance data varies with the disease, but generally such data are used to monitor trends, alert health professionals to important aberrations from historical patterns, estimate the effect of morbidity, portray natural history of disease, develop and test hypotheses, evaluate control measures, monitor changes in infectious agents, detect changes in health practices, and facilitate planning. Surveillance provides information on case patients for more detailed examination, thus facilitating at the local level epidemiologic research and the follow-up of individuals resulting in the initiation of appropriate therapy. Surveillance data also provide policymakers the basis for planning and implementing prevention and control programs.
Although physicians and other health care providers, by participating in the surveillance process, help ensure that public health resources are used effectively, completeness of reporting varies considerably by location and disease.
Reports are considered provisional and subject to updating when more specific information becomes available.
Reference: "Mandatory Reporting of Infectious Diseases by Clinicians", Journal of the American Medical Association, December 1, 1989.
Work in groups if you'd like, using the tools that seem appropriate. We've used a lot of different software this year, and now it's time to see if you know how to use it! Can you determine from this data what the diseases are?
I've subsetted the data in a variety of ways to assist you in analyzing it. You may find that you need to subset it in a different way for your analyses. There are files appropriate for use in ArcView, xgobi, Stat!, or Gamma. Obviously both space and time are involved here.
The data for md1 contains some of the following fields:
           STATE:     State (or other reporting area) name (no embedded
                      spaces).
           YEAR:      Year (last two digits)
           MONTH:     Month the case was reported to CDC (1-12, 13=unknown).
           COUNT:     Number of cases reported.
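If you end up working outside the packaged tools, the md1 fields above can be summarized with a few lines of Python. This is only a sketch: it assumes the records have already been read into (STATE, YEAR, MONTH, COUNT) tuples, the function names are my own, and the sample rows are invented purely to exercise the code. The 1900 offset relies on the stated 1953-1989 range of the data.

```python
from collections import defaultdict

UNKNOWN_MONTH = 13  # md1 codes an unknown report month as 13


def yearly_totals(records):
    """Sum COUNT per (STATE, four-digit year), including unknown-month cases."""
    totals = defaultdict(int)
    for state, yy, month, count in records:
        totals[(state, 1900 + yy)] += count  # md1 covers 1953-1989
    return dict(totals)


def monthly_series(records, state):
    """Return (year, month, count) tuples for one state in time order,
    leaving out rows whose month is unknown."""
    rows = [
        (1900 + yy, month, count)
        for s, yy, month, count in records
        if s == state and month != UNKNOWN_MONTH
    ]
    return sorted(rows)


# Hypothetical rows, invented only to exercise the functions.
sample = [
    ("MICHIGAN", 53, 2, 10),
    ("MICHIGAN", 53, 1, 7),
    ("MICHIGAN", 53, 13, 3),
    ("OHIO", 53, 1, 5),
]

print(yearly_totals(sample))
print(monthly_series(sample, "MICHIGAN"))
```

Keeping the month-13 rows in the yearly totals but out of the monthly series is one possible choice; you may prefer to drop them entirely or to spread them across the year.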
	For md1, the following files are available:
Questions of public health importance include:
The data for md2 contains the following fields, plus some that I've added:
           Each record includes 7 fields:
           FIPS:      State Federal Information Processing Standard (FIPS)
                      code (2 digits, leading 0's; see also dataset 7).
           YEAR:      Year the case was counted by CDC (last 2 digits).
           MONTH:     Month the case was counted by CDC (1-12).
           AGE:       Age in years (98=over 97, 99=unknown)
           SEX:       Sex (1=male, 2=female, 9=unknown)
           RACE:      Race (1=white, 2=black, 3=American Indian/Alaskan
                      Native, 4=Asian/Pacific Islander, 9=unknown)
           ETHNICITY: Ethnicity (1=Hispanic, 2=non-Hispanic, 9=unknown)
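The codes 98, 99, and 9 above are sentinels rather than real values, so they should be set aside before you compute things like a mean age. A minimal sketch of one way to decode a record follows; it assumes the seven fields have already been parsed, and the function name and the example record are hypothetical.

```python
SEX = {1: "male", 2: "female", 9: "unknown"}
RACE = {
    1: "white",
    2: "black",
    3: "American Indian/Alaskan Native",
    4: "Asian/Pacific Islander",
    9: "unknown",
}
ETHNICITY = {1: "Hispanic", 2: "non-Hispanic", 9: "unknown"}


def decode_case(fips, year, month, age, sex, race, ethnicity):
    """Turn one md2 record into a dict with sentinel codes made explicit."""
    return {
        "fips": f"{int(fips):02d}",          # keep the leading zero
        "year": year,                        # still the last two digits
        "month": month,
        "age": None if age == 99 else age,   # 99 = unknown
        "age_top_coded": age == 98,          # 98 = over 97
        "sex": SEX[sex],
        "race": RACE[race],
        "ethnicity": ETHNICITY[ethnicity],
    }


# Hypothetical record, invented for illustration.
case = decode_case("06", 89, 7, 99, 2, 4, 9)
print(case)
```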
I've added
           year.month: the year plus (month/12 - 1/24)
           longcent:   the longitude of the state's centroid
           latcent:    the latitude of the state's centroid
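The year.month formula places each case at the middle of its month on a yearly scale: month/12 marks the end of the month, and subtracting 1/24 steps back half a month. A small Python sketch, in case you need to rebuild the field for a different subset (the function name is my own):

```python
def year_month(year, month):
    """Return year + (month/12 - 1/24), the midpoint of the month
    expressed as a fraction of a year."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1-12")
    return year + (month / 12 - 1 / 24)


print(year_month(89, 1))   # mid-January of year 89
print(year_month(89, 12))  # mid-December of year 89
```

Consecutive months come out exactly 1/12 apart, which makes the field convenient as a time axis in plots.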
	For md2, the following files are available:
Questions of public health importance include:
For either problem you may want to use Stat! to test for spatial autocorrelation, in which case you may need the .con file for the United States. It is given in the same order as the data in this ESRI file of the US. If you need a different .con file, let me know and I can create one quickly for you.
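If you are curious what a spatial autocorrelation test is measuring, Moran's I is one common statistic of this kind. The sketch below is not Stat!'s implementation, and it does not read the .con file: it simply assumes the connectivity information is available as a dictionary mapping each area's index to the indices of its neighbors, with binary weights. The four-area chain is a hypothetical example.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    values    -- one number per area
    neighbors -- dict: area index -> list of neighboring area indices
                 (list each adjacency in both directions)
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("all values are identical")
    num = 0.0
    total_weight = 0
    for i, js in neighbors.items():
        for j in js:
            num += dev[i] * dev[j]
            total_weight += 1
    if total_weight == 0:
        raise ValueError("no neighbor pairs given")
    return (n / total_weight) * (num / denom)


# Hypothetical chain of four areas: 0-1-2-3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

print(morans_i([1, 2, 3, 4], chain))    # smooth trend: positive
print(morans_i([1, -1, 1, -1], chain))  # alternating: negative
```

Values near zero suggest no spatial pattern; judging significance would still require a reference distribution, for example from permuting the values across areas.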
When you've finished your mission, please evaluate this lab, lab 12. Were there analyses that you would have liked to carry out, but you lacked the tools?
While you're at it, evaluate lecture 12, too.
Page by Andy Long. Comments appreciated.
aelon@sph.umich.edu