Appendix B: Mapping and Spatiotemporal Methods
Geospatial Visualization and Analysis
Geographic information systems (GIS) can be useful for all stages of evaluating unusual patterns of cancer. GIS may be used as part of proactive evaluation of cancer registry data and during Phase 2 assessments. Spatiotemporal regression and advanced spatial statistical methods are particularly useful for identifying and quantifying the relationships between risk factors and cancer cases during epidemiologic investigations (Phase 3). These processes are summarized in Figure 1.
Figure 1 shows geospatial activities suggested for different stages of the examination of unusual patterns of cancer and environmental concerns. Four stages are presented. These stages are proactive evaluation and routine monitoring, further assessment, epidemiologic studies, and communication and presentation of results.
Proactive evaluation and routine monitoring includes:
- Geocoding cancer data
- Visualizing standardized incidence ratios (SIR)
Further assessment, which is associated with Phase 2, includes:
- Mapping known/suspected environmental hazards
- Mapping risk factors and populations at risk
- Visualizing the geographic area of interest
Epidemiologic studies, which are associated with Phase 3, include:
- Spatial analysis of cancer case data
- Assessment of trends through spatiotemporal analysis
- Evaluation of risk factors (spatial regression)
Communication and presentation of results includes:
- Augmenting the map(s) with any additional data gathered
- Complementing the communication plan with visualizations (internal)
- Using maps to communicate results at any stage (external)
Visualization, or mapping, can be used as an important communication tool to both internal and external stakeholders. As part of the routine evaluation of cancer data, maps can be shared with partners in programs such as comprehensive cancer control and environmental health for decision making. Additionally, maps can convey important information about the distribution of cases and potential environmental risk factors when engaging with the community.
Often, a first step in visualization and spatial analysis is translating addresses collected as text in cancer registry data into coordinates that can be mapped. This process is known as geocoding, and the quality of the result is crucial because it is the basis for all subsequent visualizations and analyses. Resources are available that provide detailed descriptions of this process (55–57), and a geocoding tool is available at no cost to health departments from the National Cancer Institute and the North American Association of Central Cancer Registries (NAACCR).
Cancer registries and state health agencies typically have criteria related to release of data for small geographic areas. Because of confidentiality and privacy concerns, some data cannot be released to the public, unless these concerns are addressed. For example, a map of a small geographic area that identifies the residences of cancer patients as points should not be made public (43). Similarly, many health agencies are prohibited from publicly releasing a table for a small geographic area with a small population, since each table cell might have only a few cases and could be used to identify individuals.
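The table-release concern above can be illustrated with a minimal sketch of small-cell suppression. The threshold `MIN_COUNT`, the function name, and the example counts are hypothetical and are not taken from any agency's rules; an actual release policy may also require complementary suppression so that hidden cells cannot be recovered from row or column totals.

```python
# Hypothetical threshold; substitute the agency's actual data-release rule.
MIN_COUNT = 5

def suppress_small_cells(table, min_count=MIN_COUNT):
    """Return a copy of the table with any count below min_count replaced by None."""
    return {
        area: {group: (n if n >= min_count else None) for group, n in row.items()}
        for area, row in table.items()
    }

# Made-up case counts by area and age group, for illustration only.
counts = {
    "Tract A": {"0-49": 2, "50+": 14},
    "Tract B": {"0-49": 7, "50+": 21},
}
released = suppress_small_cells(counts)
```

In this sketch, a suppressed cell is stored as `None` so that it can be rendered as "suppressed" rather than as a number in any public table or map.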
Once data are geocoded, they can be mapped along with other geographic data, such as suspected environmental risk factors, for crude assessments of their proximity to the cases. Different spatial (e.g., census block, census tract, ZIP Code tabulation area, municipality, or county) or temporal scales (e.g., week, month, year, or several years) can be mapped to look for possible patterns. This practice is more useful when longer periods of time are under study, as well as when there are larger numbers of cases (e.g., >10 cases). Mapping and analyzing the data over space and time can help reveal whether changes in incidence or mortality statistics are observed and may suggest risk factors that warrant further consideration.
Varying the geographic scale or the geographic unit of aggregation can produce different patterns and results. This is known as the modifiable areal unit problem (58,59), which has also been identified relative to temporal aggregations (60). Multiple methods have been proposed to account for these issues (61–63); a common solution is to use differing scales. In doing so, variations in results can be identified if they exist. Further discussion of GIS visualization techniques and methods for the analysis of cancer data are available in the published literature (57,64,65).
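The effect of the aggregation unit can be seen in a small sketch. The eight areas, their case counts, and the two zoning schemes below are invented for illustration; the point is only that the same underlying data can show two elevated zones at a fine scale and nearly uniform rates at a coarser one.

```python
# Hypothetical data: eight small areas in a row, each with 1,000 residents.
cases = [1, 9, 1, 1, 8, 1, 1, 1]
population = [1000] * 8

def rates_per_1000(zones):
    """Aggregate small areas into zones and return cases per 1,000 residents per zone."""
    rates = []
    for zone in zones:
        zone_cases = sum(cases[i] for i in zone)
        zone_pop = sum(population[i] for i in zone)
        rates.append(1000 * zone_cases / zone_pop)
    return rates

# Two aggregation scales applied to the same underlying areas.
fine_zones = [(0, 1), (2, 3), (4, 5), (6, 7)]
coarse_zones = [(0, 1, 2, 3), (4, 5, 6, 7)]

fine_rates = rates_per_1000(fine_zones)      # two zones stand out from the rest
coarse_rates = rates_per_1000(coarse_zones)  # the contrast largely disappears
```

Running the same analysis at more than one scale, as suggested above, is what makes this kind of discrepancy visible.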
The following section provides basic information on the principles of clustering methods and statistical considerations when working with spatially structured data. The National Cancer Institute (NCI) provides a variety of methods and tools for analysis of cancer statistics. Some of the tools are more statistical in nature, such as tools to calculate incidence and death rates and trends (available in SEER*Stat), while other visualization and analysis tools are geospatially focused. Links to the tools and information for further exploration are available at https://surveillance.cancer.gov/tools/. In addition to these freely available tools, the sections below detail other spatial and statistical methods and software that can be used for these analyses.
Spatial and Temporal Clusters
Spatial and temporal clusters can be detected by a variety of techniques that evaluate whether similar features, values, or observations are in close proximity to one another. These techniques can be divided into global, local, and focused methods. A table with methods and associated applications for those categories is available in the Supplemental Information for Appendix B. The table is not meant to be a comprehensive review of applications but rather to provide initial guidance.
Global clustering statistics can be used to determine if there are patterns of clustering anywhere in the study area. Once clustering is deemed likely from global statistics, local clustering methods, including scan statistics, can help to identify clusters within the area of interest. It is worth noting that it is possible to detect statistically significant global clustering without evidence of local clustering and vice versa (66). In cases where there is a known point-source, focused tests can be considered. Regression analysis can then be used to understand the association between potential environmental risk factors and cases or to adjust for confounding factors such as latency in cases, mobility, and demographic variables (such as age and race). These methods are further described below, along with example use cases and locations of available software for analysis.
Global Clustering Methods
Global clustering statistics detect patterns of spatial clustering that occur anywhere in a study area. They do not identify where clusters occur, nor do they identify differences in spatial patterns within the area. One measure of global clustering is spatial autocorrelation, which is the degree of similarity of nearby features. Positive spatial autocorrelation means that features nearby one another have similar values, while negative autocorrelation signifies nearby features that have dissimilar values.
Commonly used methods for testing global clustering are Geary’s C (67), Moran’s I (68), and Oden’s Ipop (69), which adjusts Moran’s I for differences in population. Global clustering can also be assessed using Ripley’s K-function when point-level data are available (43,70). GeoDa† and R† packages are publicly available, and several global statistics are available within proprietary software packages, such as ClusterSeer® † (BioMedware, Ann Arbor, MI); see the Supplemental Information for Appendix B for more information.
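As a rough illustration of how a global statistic summarizes spatial autocorrelation, the sketch below computes Moran’s I from its standard definition using only the Python standard library. The function name, the four-area example, and the binary adjacency weights are assumptions made for illustration; significance testing (commonly done by permutation) is omitted, and the packages named above should be preferred for real analyses.

```python
def morans_i(values, weights):
    """Global Moran's I for area values and a square spatial weights matrix.

    weights[i][j] > 0 when areas i and j are neighbors; the diagonal is zero.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Hypothetical example: four areas in a row, each adjacent to the next.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_gradient = morans_i([1, 2, 3, 4], adjacency)     # similar neighbors: positive
i_alternating = morans_i([1, 4, 1, 4], adjacency)  # dissimilar neighbors: negative
```

A smooth gradient yields a positive value and an alternating pattern a negative one, which matches the description of positive and negative spatial autocorrelation above.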
Local Clustering Methods
Local clustering statistics, such as local indicators of spatial autocorrelation (LISA) (71), identify the locations of clusters or spatial outliers. Some global clustering statistics have local counterparts, such as the global and local Moran’s I statistics and the Besag-Newell R (72). Another statistic, Getis-Ord Gi*, identifies hot and cold spots, which are locations where features with high (hot) or low (cold) values are in close proximity to one another. The Getis-Ord Gi* provides estimates of statistical significance while identifying the locations of hot and cold spots that are not confined to a specific shape.
Local versions of Moran’s I and Geary’s C are available for free within R† packages such as usdm and spdep (73). Other programs, such as ArcGIS™ † tools (Esri, Redlands, CA), ClusterSeer® †, and GeoDa† (74), also provide Moran’s I statistics and the Getis-Ord Gi* statistic.
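To show what a local statistic produces, the following sketch computes a Getis-Ord Gi* z-score for each area using the commonly published form of the statistic. The function name, the six-area example, and the choice of binary weights that include each area itself are assumptions for illustration; the sketch does not correct for multiple testing and assumes the values are not all identical and that no area neighbors every other area.

```python
import math

def getis_ord_gi_star(values, weights):
    """Return one Gi* z-score per area; weights[i][i] must be nonzero (area counts itself)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    z_scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(wij * wij for wij in w)
        numerator = sum(w[j] * values[j] for j in range(n)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores

# Hypothetical example: six areas in a row; each area's neighborhood is itself
# plus the areas immediately adjacent to it.
rates = [10, 9, 8, 1, 2, 1]
neighborhood = [[1 if abs(i - j) <= 1 else 0 for j in range(6)] for i in range(6)]
z = getis_ord_gi_star(rates, neighborhood)
```

Large positive z-scores mark candidate hot spots and large negative z-scores mark candidate cold spots; here the high values at one end of the row produce positive scores and the low values at the other end produce negative ones.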
Spatial scan statistics can be used to scan a study region using a series of moving windows with increasing radii to identify areas where the observed cases included inside the window are greater than expected. This method can be expanded to incorporate time as an added dimension, allowing a scan for spatiotemporal clusters. To properly interpret the results of the spatial scan statistic, it is extremely important to identify the appropriate radius for the spatial scan window to avoid clusters that are too large or too small. Normally, the upper limit of the circle should not include more than 50 percent of the dataset or the study area (43,75). Additional parameter selection, such as time range and spatial scale within spatial scan statistics, can impact results.
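The moving-window idea can be sketched as follows for a purely spatial Poisson model. This is a simplified illustration and not the SaTScan implementation: the function names, the tuple layout `(x, y, cases, population)`, and the nine example areas are assumptions; windows are grown around each area centroid by adding the next-nearest area; distance ties are not handled specially; and the Monte Carlo step normally used to obtain a p-value is omitted.

```python
import math

def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one window; zero unless observed exceeds expected."""
    if cases_in <= expected_in:
        return 0.0
    llr = cases_in * math.log(cases_in / expected_in)
    cases_out = total_cases - cases_in
    if cases_out > 0:
        llr += cases_out * math.log(cases_out / (total_cases - expected_in))
    return llr

def most_likely_cluster(areas, max_pop_fraction=0.5):
    """areas: list of (x, y, cases, population). Returns (best LLR, member indices)."""
    total_cases = sum(a[2] for a in areas)
    total_pop = sum(a[3] for a in areas)
    best_llr, best_members = 0.0, []
    for cx, cy, _, _ in areas:
        # Order all areas by distance from the current window center.
        nearest_first = sorted(
            range(len(areas)),
            key=lambda j: math.hypot(areas[j][0] - cx, areas[j][1] - cy),
        )
        cases_in = pop_in = 0
        members = []
        for j in nearest_first:
            cases_in += areas[j][2]
            pop_in += areas[j][3]
            if pop_in > max_pop_fraction * total_pop:
                break  # window would exceed the upper size limit
            members.append(j)
            expected_in = total_cases * pop_in / total_pop
            llr = poisson_llr(cases_in, expected_in, total_cases)
            if llr > best_llr:
                best_llr, best_members = llr, sorted(members)
    return best_llr, best_members

# Hypothetical centroids with case counts and populations.
areas = [
    (0.0, 0.0, 2, 1000), (1.0, 0.0, 2, 1000), (2.0, 0.1, 2, 1000),
    (0.0, 1.0, 2, 1000), (1.1, 1.0, 2, 1000), (2.0, 1.2, 9, 1000),
    (0.1, 2.0, 2, 1000), (1.0, 2.1, 2, 1000), (2.0, 1.9, 10, 1000),
]
llr, members = most_likely_cluster(areas)
```

The `max_pop_fraction` parameter corresponds to the 50 percent upper limit discussed above; changing it, or the spatial scale of the input areas, can change which window is reported as the most likely cluster.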
One of the most popular scan statistics is Kulldorff’s scan statistic for spatial, temporal, and space-time analysis (76,77), freely available within SaTScan™ † software (78). The SaTScan™ † software includes analyses for different data types, including case counts (77), rates (79), case/control data (77), and even survival data (80). However, spatial and temporal clusters can appear in irregular shapes, which prompted Tango and Takahashi (81) to develop a flexible space-time scan statistic, implemented in the FleXScan† software. The FleXScan methodology is also available as an R† package, rflexscan. This package implements both Kulldorff’s and Tango and Takahashi’s scan statistics.
An alternative scan statistic is the one proposed by Besag and Newell (82), which is useful for regional data with small population sizes. It is available in the ClusterSeer® † software. This test gives results for both global and local clustering.
Focused Clustering Tests
Interest has grown in recent years in detecting clusters around a specific point-source, such as a single identifiable source of air, water, thermal, noise, or light pollution (82). These focused tests are usually designed to identify a particular spatial pattern of clustering around the point-source or specific geographic location. The location of the point-source of interest needs to be identified prior to the assessment, recognizing that different factors (meteorological, topographical, and others) can influence the spatial pattern of potential exposures from the point-source (83). The size, shape, and scale of the analysis can also influence the results. For example, below are five focused cluster shapes and corresponding fitted models that can be considered (83):
- Distance Decline (DD): A model where risk declines symmetrically in all directions with distance from the point-source
- Peaked Distance Decline (PDD): A model where risk peaks closest to the point-source and then declines with distance
- Direction (D): A model characterizing increased risk in a specific angle/direction from the point-source
- Distance Decline combined with Directional effect (DDIR)
- Peaked Distance Decline combined with Directional effect (PDDIR)
Widely known focused cluster tests have been found to have higher relative power for different models:
- Lawson-Waller Score Test: This test provides robust results across different models. It is powerful against small deviations from the null in the direction of a specific alternative (84,85).
- Bithell’s Linear Risk Score (LRS) Test: The distance version is used for DD and PDD; the direction version is used for D, DDIR, and PDDIR (86).
- Cuzick and Edwards’ Test: This test performs well for large sample sizes (N>500) and is also often used for PDD, DDIR, and PDDIR (87).
- Stone’s Maximum Likelihood Test: This is used for DD (88).
- Tango’s Focused Test: This is used for DD (89).
- Besag and Newell’s Test: This is used for PDD (82).
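A focused score test of the kind listed above can be sketched as follows. This is a simplified illustration in the spirit of a score test for a distance-decline alternative, not a reference implementation of any cited test: the function name, the inverse-distance exposure surrogate, and the example counts are assumptions, the expected counts are assumed to sum to the observed total, and the p-value relies on a normal approximation.

```python
import math

def focused_score_test(observed, expected, distances):
    """One-sided score test for excess cases near a point-source.

    observed, expected: case counts per area (expected sums to the observed total).
    distances: distance of each area from the point-source (all > 0).
    """
    exposure = [1.0 / d for d in distances]  # assumed exposure surrogate: inverse distance
    score = sum(g * (o - e) for g, o, e in zip(exposure, observed, expected))
    total_expected = sum(expected)
    # Variance of the score under the null, conditioning on the total number of cases.
    variance = (
        sum(g * g * e for g, e in zip(exposure, expected))
        - sum(g * e for g, e in zip(exposure, expected)) ** 2 / total_expected
    )
    z = score / math.sqrt(variance)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

# Hypothetical example: five areas at increasing distance from a point-source.
observed = [8, 6, 4, 3, 3]
expected = [4.8, 4.8, 4.8, 4.8, 4.8]
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
z, p_value = focused_score_test(observed, expected, distances)
```

The exposure surrogate encodes the alternative being tested; replacing inverse distance with a directional or peaked function would target the other cluster shapes described above.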
Regression methods provide analyses and a set of tools that are complementary to cluster detection analysis. Regression analyses are commonly used in public health for two main reasons: 1) to predict an outcome, and 2) to understand the association between at least two variables (90). For example, after identifying spatial clusters of cancer cases, it may be useful to understand the relationship between potential environmental exposures and the cancer of interest while controlling for demographic and behavioral factors associated with increased risk (91,92). Alternatively, it may be of interest to predict the risk of a specific cancer across a wide geographic region if there are data on known environmental exposures (92).
Special considerations must be made when applying regression techniques to spatially structured data. Spatially structured data violate a key assumption of independence among observations due to inherent autocorrelation, where a given value is to some degree predicted by the values of its neighbors (43,93,94). General steps to overcome issues of spatial autocorrelation in regression, drawing primarily from Waller & Gotway (43) and Fotheringham & Rogerson (95), can be found in the Supplemental Information for Appendix B.
Comparison of Methods
The choice of a statistical cluster detection method should take into consideration the strengths and weaknesses of the methods. Several criteria can be considered, such as the type of data (e.g., point-level data or areal data), the ease of use and availability of data or software, the transparency of the methods employed in a particular software package, the statistical power of the method to detect the cluster of interest, and the desired output (88). Multiple comparisons of methods and reviews of techniques have been published over the years (67,88,96–104), and additional details and discussion can be found in the Supplemental Information for Appendix B.
Cluster detection and other advanced spatial analysis methods are available via proprietary and free applications. Such analysis often requires specialized knowledge about the data, appropriate use of the methods, and careful interpretation of the results. Specifically, the choice of which models to use depends on the type of data, the underlying assumptions based on the distribution of the data, and geospatial considerations such as the size of the study area, spatial scale of the data, aggregation, and masking. For example, results of analysis may greatly differ when implemented at the county or the census tract level, and different models should be implemented if both case and area-level risk factors are evaluated. Specific methods may also require additional considerations such as the type and size of the spatial scan window.
The results of the GIS and spatiotemporal analysis can be used internally for decision making, can inform actions, and can be used to communicate with the public. Therefore, collaborations with GIS professionals and spatial statisticians equipped with specialized skills can help to ensure proper methods are employed and interpretations of results are appropriate. If these experts are not available within the health agency, consultation and technical assistance from CDC/ATSDR’s Geospatial Research Analysis and Services Program (GRASP) can be requested by emailing CCGuidelines@cdc.gov.
† The software packages noted are examples that are available freely or for purchase; their mention does not represent an endorsement of any specific product by the Centers for Disease Control and Prevention or the Agency for Toxic Substances and Disease Registry.
Additional Contributing Authors:
Liora Sahar, Marissa Grossman, Brian Lewis
ATSDR; Geospatial Research, Analysis, and Services Program