# APPENDIX C: Statistical and Epidemiologic Approaches

A suspected cancer cluster investigation attempts to answer two questions: 1) is there an actual "excess" (that meets statistical and biological plausibility criteria) and 2) is this excess associated with an environmental contaminant? Addressing these questions begins by defining the study population and locating relevant cases and then determining the appropriate geographic boundaries and time period.

This section outlines the basic epidemiologic and statistical analysis methods recommended for investigating a cancer cluster. It focuses on the methods most relevant to, and most commonly used in, cancer cluster investigations: the standardized incidence ratio (SIR) and its confidence interval, mapping, and descriptive and spatial statistical and epidemiologic methods.

### Standardized Incidence Ratio and Confidence Interval

The measure typically used to assess whether there is an excess number of cancer cases is the standardized incidence ratio (SIR). This measure is explained in many epidemiologic textbooks (sometimes under standardized mortality ratio, which uses the same method but measures mortality rather than incidence) (*1–5*). Simply stated, the SIR is the ratio of the number of cancer cases observed in the study population to the number that would be expected if the study population experienced the same cancer rates as an underlying ("reference") population. An SIR greater than 1.0 indicates more cases than expected, and an SIR less than 1.0 indicates fewer. The reference population could be the surrounding census tracts, other counties in the state, or the state as a whole (not including the community under study).

The SIR can be adjusted for factors such as sex, race, and/or ethnicity, but it is most commonly used to adjust for differences in age between two populations. Various techniques can be used to account for these factors. For example, stratification, which is calculating an SIR by groups (e.g., by calendar year), is a commonly employed technique (*6*).
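The SIR calculation described above can be sketched as follows. This is a minimal illustration that assumes age is the only adjustment factor; the age groups, reference rates, person-years, and observed count are hypothetical numbers chosen for the example, not data from any investigation.

```python
def expected_cases(reference_rates, study_person_years):
    """Expected count: sum over strata of (reference rate x study person-years)."""
    return sum(reference_rates[stratum] * study_person_years[stratum]
               for stratum in study_person_years)


def sir(observed, expected):
    """Standardized incidence ratio: observed cases divided by expected cases."""
    return observed / expected


# Hypothetical reference-population rates (cases per person-year) by age group.
reference_rates = {"0-39": 20 / 100_000, "40-64": 150 / 100_000, "65+": 600 / 100_000}
# Hypothetical person-years at risk in the study community over the study period.
study_person_years = {"0-39": 30_000, "40-64": 20_000, "65+": 5_000}
observed_cases = 90  # hypothetical observed count

expected = expected_cases(reference_rates, study_person_years)
print(f"expected = {expected:.1f}, SIR = {sir(observed_cases, expected):.2f}")
```

With these illustrative inputs, about 66 cases are expected, so 90 observed cases give an SIR of about 1.36.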

### Confidence Interval

A confidence interval is calculated to assess the precision of the SIR estimate and its statistical significance. If the confidence interval includes 1.0, the SIR is not statistically significant. The narrower the confidence interval, the more precise the SIR estimate. One difficulty in cancer cluster investigations is that the population under study is generally a community or part of a community, which typically results in a small denominator. Small denominators frequently yield wide confidence intervals, meaning that the SIR is less precise than desired (*1*).
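One way to compute such an interval is sketched below. It assumes the observed count follows a Poisson distribution and uses Byar's approximation to the exact Poisson limits; this is one common choice, not a method prescribed by this report, and the function name and example counts are hypothetical.

```python
import math


def sir_confidence_interval(observed, expected, z=1.96):
    """Approximate CI for the SIR using Byar's approximation (z=1.96 for 95%)."""
    if observed == 0:
        lower_count = 0.0
    else:
        lower_count = observed * (
            1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
        ) ** 3
    o1 = observed + 1
    upper_count = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3
    # Dividing the limits for the count by the expected count gives SIR limits.
    return lower_count / expected, upper_count / expected


# Hypothetical small-community example: 10 observed cases, 5 expected (SIR = 2.0).
low, high = sir_confidence_interval(10, 5.0)
print(f"SIR = {10 / 5.0:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

In this illustrative case the interval runs from roughly 0.96 to 3.68. Although the point estimate is 2.0, the interval includes 1.0, which shows how a small number of cases can leave a doubled rate short of statistical significance.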

### Considering Alpha and Beta Level Values

Alpha (α) is the probability of rejecting the null hypothesis (no difference in cancer rates between the study population and the reference population) when the null hypothesis is true, that is, the probability of a false-positive result. Although there are no absolute cut-points, responders often use an alpha value of 0.05 (equivalently, a 95% confidence interval).

Selecting an alpha value larger than 0.05 (e.g., 0.10, corresponding to a 90% confidence interval) increases the risk of false-positive results. A smaller alpha value (e.g., 0.01, corresponding to a 99% confidence interval) may be considered when many SIRs are computed, because the number of SIRs that are statistically significant by chance alone increases with the number computed (in other words, with a 95% confidence interval, one expects about five of 100 results to be statistically significant by chance even when no true excess exists).
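The arithmetic behind the "five in 100" statement can be sketched as follows. The second function additionally assumes that the tests are independent, which is a simplification that rarely holds exactly for SIRs computed on overlapping areas or related cancer types; both function names are hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of significant results when every null hypothesis is true."""
    return alpha * n_tests


def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of one or more false positives, ASSUMING independent tests."""
    return 1 - (1 - alpha) ** n_tests


for alpha in (0.10, 0.05, 0.01):
    count = expected_false_positives(alpha, 100)
    chance = prob_at_least_one_false_positive(alpha, 100)
    print(f"alpha={alpha:.2f}: expected false positives={count:.0f}, "
          f"P(at least one)={chance:.3f}")
```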

Beta and power are related to each other, and both depend on the sample size of the study: the larger the sample size, the greater the power. Beta (β) is the probability of failing to reject the null hypothesis when it is actually false, and power, or 1 − β, is the probability of rejecting the null hypothesis when it is actually false. Like alpha, beta has no absolute cut-points; however, responders often use a beta value of 0.20 or less (equivalently, a power of 0.8 or more) (*1*).

### Power Analysis

Power analysis is useful in determining the minimum number of people (sample size) needed in a study to test the hypothesis and detect a possible association. In most suspected cancer cluster investigations, however, the cases and study population are defined before the analysis. In that setting, a power analysis can be used to determine whether the number of cases in the investigation is sufficient to detect an excess of a given size, with a power of 0.8 or greater usually considered adequate (*3*).
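A power calculation of this kind can be sketched as below. The sketch assumes a one-sided exact Poisson test of SIR = 1 against a specified true SIR; the one-sided framing, the function names, and the example values are assumptions made for illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed on the log scale for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


def sir_power(expected, true_sir, alpha=0.05):
    """Return (critical count, power) for a one-sided test of SIR = 1."""
    critical = 0
    # Smallest observed count whose upper-tail probability under the null <= alpha.
    while poisson_upper_tail(critical, expected) > alpha:
        critical += 1
    # Power: probability of reaching that count if the true mean is expected * SIR.
    return critical, poisson_upper_tail(critical, expected * true_sir)


critical, power = sir_power(expected=5.0, true_sir=2.0)
print(f"reject at >= {critical} cases; power = {power:.2f}")
```

With 5 expected cases, at least 10 observed cases are needed to reject at alpha = 0.05, and the power to detect a true doubling of the rate is only about 0.54, well below the conventional 0.8.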

### Mapping the Cancer Cluster

When considering the geographic distribution of cases, responders have various methods they can use. For example, they might develop a visual representation showing the location of each case superimposed on the underlying population density to get an approximation of the distribution of the relative rates of cancer.

It also can be useful to plot the location of suspected environmental risk factors on the map to make a crude assessment of their proximity to the cases. However, to avoid the "Texas Sharpshooter fallacy" (i.e., a situation in which cases are noticed first and then the "affected" area is selected around them, thus making there appear to be a geographical relationship, similar to an instance in which the sharpshooter shoots the side of the barn first and then draws the bull's-eye around the bullet holes), responders must first outline their definitions, assumptions, and methods (*7*). Often, a few different spatial scales (e.g., census block, census tract, zip code, municipality, or county) or temporal scales (e.g., week, month, year, or several years) can be mapped to look for possible patterns related to specific space and/or time units that merit more careful investigation. This process is systematic, and procedures are outlined a priori. The patterns in such maps often differ dramatically, and they might suggest specific exposures that warrant further consideration. This practice is more useful when longer periods of time and larger numbers of cases (e.g., >10 cases) are under study.

Cancer registries and state health agencies typically have criteria related to release of data for small geographic areas. Because of privacy concerns, some data cannot be released to the public unless those concerns are addressed. For example, a pin-point map of a small geographic area that identifies the residence of a cancer patient should not be made public (*8*). Similarly, many health agencies are prohibited from publicly releasing a table for a small geographic area with a small population, because each table cell might contain only a few cases.

### Descriptive and Spatial Statistical and Epidemiologic Methods

Frequencies, rates, and descriptive statistics are useful first steps in evaluating the suspected cancer cluster. Confidence intervals can also be calculated for rates. Epidemiologic references can explain these methods (*9*). Other statistical approaches include Poisson regression. Often, the number of cases is small, which limits the types of analysis that are feasible. If an investigation progresses to a case-control study, the odds ratio can be calculated. These study designs have been discussed in detail elsewhere (*1,3,4*).
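For the case-control situation mentioned above, the odds ratio from a 2×2 table can be sketched as follows. The confidence interval uses the standard log-based (Woolf) approximation, which is an assumption of this sketch and is unreliable when any cell is very small; the counts are hypothetical.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio with an approximate (Woolf, log-based) confidence interval."""
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR); requires every cell count to be greater than 0.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical study: 12 of 20 cases exposed, 20 of 60 controls exposed.
or_estimate, or_low, or_high = odds_ratio_ci(12, 8, 20, 40)
print(f"OR = {or_estimate:.2f}, 95% CI = ({or_low:.2f}, {or_high:.2f})")
```

These illustrative counts give an odds ratio of 3.0 with an interval of roughly 1.06 to 8.52, which shows how wide the interval remains with only 20 cases.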

Since the publication of the 1990 Guidelines, the field of spatial epidemiology has grown, especially in environmental health. This growth is influenced by the increased availability of geocoded data and statistical software. Space/time cluster analysis methods are often used to provide evidence about the existence of a suspected cluster and to define more precisely the extent of the suspected cluster in space and time.

As with any other epidemiologic analysis, there might be methodological issues with the use of clustering tools. Many of these concerns (e.g., limitations associated with small populations, environmental data quality, disease latency periods, and population migration) have been described in this report. Census data can provide the denominators for this type of analysis, and all the limitations associated with rapidly changing populations and intercensal year estimates also apply to these spatial/time cluster methods. In addition, when exposure or outcome analysis uses aggregate data and not data collected on an individual level, responders must use caution when interpreting this type of analysis, because the association with a particular environmental contaminant might not be true for individual cases, especially if there is heterogeneous distribution of the exposure over the geographic area. The related bias is known as ecological inference fallacy. Detailed information regarding methodological issues has been published previously (*10*).

Many methods have been developed to facilitate what is termed "space/time cluster analysis." These methods assess whether cases are closer to one another than would be observed if the cases had been distributed at random. The concept of "close" might mean closer geographically, closer in time, or closer both geographically and in time. The numeric value of "close" is determined by the responder.

For a responder to make a determination of clustering, the space-time distances have to be summarized and then evaluated with any of a variety of statistical techniques. This task can be performed by summarizing where and when each case occurred, typically using the individuals' residence and the reported date of incidence. Some of the simplest methods merely compare the average distances between nearby cases to the average distances between cases and nearby noncases (or controls). If, on average, the cases are sufficiently closer to other cases (in space, time, or both space and time) than they are to noncases, the situation may be described as a cluster.

Clusters can be detected by use of spatial autocorrelation techniques. Global clustering statistics, such as Geary's *C* (*11*), detect spatial clustering that occurs anywhere in a study area. They do not identify where the cluster(s) occur, nor do they identify differences in spatial patterns within the area. Local clustering statistics, such as Local Indicators of Spatial Autocorrelation (LISA) (*12*), identify potential clustering within smaller areas inside a study area. Often, global techniques are used first to identify potential clustering; then, local methods are used to pinpoint the clusters in the sample area. Many global statistics have local counterparts. For example, global Moran's *I* is proportional to the sum of the local Moran's *I* statistics (*13*). Clusters reported to health agencies most often are local.

It is beyond the scope of this report to describe more than a few of the most commonly used methods, and even then, these methods are described only briefly.
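The link between global and local Moran's *I* can be sketched as follows. The sketch assumes a simple binary adjacency weight matrix and the scaling of the local statistic used in Anselin's LISA formulation, under which the global statistic equals the sum of the local values divided by the total weight; the four-area example and its rates are hypothetical, and no significance testing is included.

```python
def morans_i(values, weights):
    """Return (global Moran's I, list of local Moran's I) for area values.

    weights[i][j] is the spatial weight between areas i and j (0 on the diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]          # deviations from the mean
    m2 = sum(d * d for d in dev) / n          # second moment of the deviations
    total_weight = sum(sum(row) for row in weights)
    local = [
        dev[i] / m2 * sum(weights[i][j] * dev[j] for j in range(n))
        for i in range(n)
    ]
    # Under this scaling, the local values sum to total_weight * global I.
    return sum(local) / total_weight, local


# Hypothetical rates for four areas arranged in a row; neighbors share a border.
rates = [1.0, 2.0, 3.0, 4.0]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
global_i, local_i = morans_i(rates, adjacency)
print(f"global I = {global_i:.3f}, local I = {[round(v, 2) for v in local_i]}")
```

A positive global value, as in this example, indicates that neighboring areas tend to have similar rates; the local values show which areas contribute most to that pattern.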

A useful summary of these techniques has been published recently (*14*). One of the most popular techniques for detecting clusters is called the spatial scan statistic. Its most commonly used implementation is the SaTScan software (*15*) (available at http://www.satscan.org). The underlying concept for this approach is the scan statistic, which considers both spatial areas and time intervals (*16*). Other approaches include the nearest neighbor test (*17*) and the Small Area Health Statistical Unit (SAHSU)'s "Rapid Inquiry Facility" (RIF) (*18*). Additional statistical cluster methods have been discussed elsewhere (*19*). All of these methods have strengths and weaknesses. In choosing a statistical cluster method, it might be useful to consider several criteria, such as ease of use and availability, the clarity and transparency of the method, its statistical power to detect the cluster of interest, and the method's ability to produce the desired output (*20*). Comparisons and reviews have been published (*21*). In addition, the Appendix of the 1990 Guidelines describes additional spatial statistical methods.
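The idea behind the spatial scan statistic can be sketched as follows. This is a heavily simplified, purely spatial illustration under a Poisson model: it grows circular windows around each area centroid and keeps the window with the largest log-likelihood ratio. It is not the SaTScan implementation, it omits the Monte Carlo step needed to assess statistical significance, and the area coordinates, populations, and case counts are hypothetical.

```python
import math


def poisson_llr(cases, expected, total_cases):
    """Log-likelihood ratio for a window; 0 unless cases exceed expectation."""
    if cases <= expected:
        return 0.0
    inside = cases * math.log(cases / expected)
    remaining = total_cases - cases
    outside = (remaining * math.log(remaining / (total_cases - expected))
               if remaining > 0 else 0.0)
    return inside + outside


def most_likely_cluster(areas, max_pop_fraction=0.5):
    """areas: list of (x, y, population, cases). Returns (llr, member indices)."""
    total_pop = sum(a[2] for a in areas)
    total_cases = sum(a[3] for a in areas)
    best = (0.0, [])
    for cx, cy, _, _ in areas:
        # Order areas by distance from this centroid and grow the window outward.
        order = sorted(range(len(areas)),
                       key=lambda i: math.hypot(areas[i][0] - cx, areas[i][1] - cy))
        pop = cases = 0
        members = []
        for i in order:
            pop += areas[i][2]
            cases += areas[i][3]
            members.append(i)
            if pop > max_pop_fraction * total_pop:
                break  # window too large to be a meaningful local cluster
            expected = total_cases * pop / total_pop
            llr = poisson_llr(cases, expected, total_cases)
            if llr > best[0]:
                best = (llr, sorted(members))
    return best


# Hypothetical areas along a line: (x, y, population, cases).
areas = [(0.0, 0, 1000, 1), (1.0, 0, 1000, 1), (2.0, 0, 1000, 8),
         (2.8, 0, 1000, 7), (4.0, 0, 1000, 1)]
llr, members = most_likely_cluster(areas)
print(f"most likely cluster: areas {members}, LLR = {llr:.2f}")
```

In practice the observed maximum would be compared with maxima from many randomly generated data sets to obtain a p-value, which is the step that accounts for scanning many candidate windows.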

### References

1. Kelsey JL. Methods in observational epidemiology. New York, NY: Oxford University Press; 1996.
2. Khurshid A. Statistics in epidemiology: methods, techniques, and applications: CRC; 1996.
3. Selvin S. Statistical analysis of epidemiologic data. New York, NY: Oxford University Press; 1996.
4. Breslow NE, Day NE. Statistical methods in cancer research. In: International Agency for Research on Cancer. The design and analysis of cohort studies. Lyon, France: International Agency for Research on Cancer Scientific Publications; 1980.
5. Rothman KJ, Greenland S, Lash TL. Modern epidemiology. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
6. New Jersey Department of Health & Senior Services Cancer Epidemiology Services. Fact sheet: explanation of standardized incidence ratios. Available at http://www.state.nj.us/health/eohs/passaic/pompton_lakes/pompton_lakes_fs_sir.pdf.
7. Gawande A. The cancer-cluster myth. The New Yorker 1999;8:34–7.
8. CDC. National Programs of Cancer Registry United States Cancer Statistics technical notes: statistical methods: suppression of rates and counts. Available at http://www.cdc.gov/cancer/npcr/uscs/technical_notes/stat_methods/suppression.htm.
9. Gordis L. Epidemiology. Philadelphia, PA: Saunders Elsevier; 2009.
10. Beale L, Abellan JJ, Hodgson S, Jarup L. Methodologic issues and approaches to spatial epidemiology. Environ Health Perspect 2008;116:1105–10.
11. Geary RC. The contiguity ratio and statistical mapping. The Incorporated Statistician 1954;5:115–46.
12. Anselin L. Local indicators of spatial association: LISA. Geographical Analysis 1995;27:93–115.
13. Moran P. Notes on continuous stochastic phenomena. Biometrika 1950;37:17–23.
14. Tango T. Statistical methods for disease clustering. New York, NY: Springer; 2010.
15. Kulldorff M. A spatial scan statistic. Commun Stat-Theor M 1997;26:1481–96.
16. Naus J. The distribution of the size of maximum cluster of points on the line. J Am Stat Assoc 1965;60:532–8.
17. Cuzick J, Edwards R. Spatial clustering for inhomogeneous populations. J Roy Stat Soc B Met 1990;52:73–104.
18. Aylin P, Maheswaran R, Wakefield J, et al. A national facility for small area disease mapping and rapid initial assessment of apparent disease clusters around a point source: the UK Small Area Health Statistics Unit. J Public Health Med 1999;21:289–98.
19. Rogerson PA. Statistical methods for geography: a student's guide. Thousand Oaks, CA: Sage Publications Ltd; 2010.
20. Robertson C, Nelson TA, MacNab YC, Lawson AB. Review of methods for space-time disease surveillance. Spatial and Spatio-temporal Epidemiology 2010;1:105–16.
21. Waller LA, Gotway CA. Applied spatial statistics for public health data. New York: John Wiley and Sons; 2004.

Use of trade names and commercial sources is for identification only and does not imply endorsement by the U.S. Department of
Health and Human Services.

References to non-CDC sites on the Internet are
provided as a service to *MMWR* readers and do not constitute or imply
endorsement of these organizations or their programs by CDC or the U.S.
Department of Health and Human Services. CDC is not responsible for the content
of pages found at these sites. URL addresses listed in *MMWR* were current as of
the date of publication.
