Small Area Estimation of Cancer Risk Factors and Screening Behaviors in US Counties by Combining Two Large National Health Surveys

Background: National health surveys, such as the National Health Interview Survey (NHIS) and the Behavioral Risk Factor Surveillance System (BRFSS), collect data on cancer screening and smoking-related measures in the US noninstitutionalized population. These surveys are designed to produce reliable estimates at the national and state levels. However, county-level data are often needed for cancer surveillance and related research.
Methods: To use the large sample sizes of BRFSS and the high response rates and better coverage of NHIS, we applied multilevel models that combined information from both surveys. We also used relevant sources such as census and administrative records. By using these methods, we generated estimates for several cancer risk factors and screening behaviors that are more precise than design-based estimates.
Results: We produced reliable, modeled estimates for 11 outcomes related to smoking and to screening for female breast cancer, cervical cancer, and colorectal cancer. The estimates were produced for 3,112 counties in the United States for the data period from 2008 through 2010.
Conclusion: The modeled estimates corrected for potential noncoverage bias and nonresponse bias in the BRFSS and reduced the variability in NHIS estimates that is attributable to small sample size. The small area estimates produced in this study can serve as a useful resource to the cancer surveillance community.


Introduction
Cancer screening and risk factor data at the state and county levels are useful to cancer control planners, policy makers, and researchers for local health planning, decision making, and resource allocation. However, accurate screening and risk factor data are difficult to obtain. Cost and resource constraints make it impossible to conduct a new study for every new problem of interest.
NHIS, conducted by the Centers for Disease Control and Prevention's (CDC's) National Center for Health Statistics (NCHS) since 1957, is designed to provide estimates by nation, region, and selected states with large populations. BRFSS, conducted by CDC since 1984, is designed to provide estimates for states and substate areas, such as metropolitan and micropolitan statistical areas. However, BRFSS's sample size is not large enough to provide reliable estimates for relatively small geographic areas (eg, county, state legislative district). Because stand-alone surveys may not support reliable estimation at these lower geographic levels, model-based small area estimates (SAEs) that combine information from multiple sources could be an effective alternative approach.
SAE techniques have a long history (1). Research studies using advanced methods and SAE techniques have been reported in the public health literature (2-5). Many of those studies used a single survey (either NHIS or BRFSS, but not both) as the data source for outcomes. In our study, we aimed to harness the strengths of both NHIS and BRFSS. We developed a novel statistical method to combine the 2 surveys to produce SAEs for smoking prevalence and cancer screening rates for years 1997-1999 and 2000-2003 (6). We extended this approach (6) to incorporate a cellular telephone-only component for data collected in years 2004-2010. We grouped multiple years of data into 2 periods, 2004-2007 and 2008-2010, so that each data period includes 1 or 2 years of NHIS data. This grouping enlarges sample sizes for counties with very small or no samples and, by keeping each period short, avoids smoothing out significant temporal changes in the outcomes of interest. We describe the methodology used for data period 2008-2010, the most recent data period for which final estimates were calculated.

Data sources
We used 2 major data sources, NHIS 2008-2010 and BRFSS 2008-2010, to obtain estimates for 11 outcomes of interest: 1) current smokers overall, 2) current male smokers, 3) current female smokers, 4) overall ever smokers, 5) male ever smokers, 6) female ever smokers, 7) mammography screening, 8) Papanicolaou test, 9) home fecal occult blood test (HFOBT), 10) colorectal endoscopy, and 11) colorectal cancer screening (combination of HFOBT and colorectal endoscopy). NHIS is a national survey that uses a multistage area probability design that permits representative sampling of households and noninstitutional group homes for face-to-face interview regardless of household telephone status. NHIS interviews a sample of approximately 30,000 to 40,000 US adults per year and achieves an annual response rate of approximately 80% of eligible households in the sample. Approximately three-quarters of US counties have no sample by survey design.
BRFSS is a state-based system of health surveys administered by telephone. Since 2005, more than 350,000 adult interviews were completed each year, making BRFSS the largest telephone health survey in the world. The annual median state overall response rate for BRFSS is below 55%, which is typical of telephone surveys.
We used 2 external data sources to extract county-level ecological covariates for use in our small-area modeling. One was USA County Stats (7), which the US Census Bureau compiled from the 2000 and 2010 Census of Population and Housing (8), the American Community Survey (9), the Current Population Survey (10), the National Vital Statistics of the NCHS (11), and other administrative data sources. (12). We used the National Cancer Institute's (NCI's) SEER*Stat database (13) to extract county-level cancer mortality and incidence data (2008-2015) for external validation purposes.

Outcomes
Outcomes were self-reported smoking and screening for female breast cancer, cervical cancer, and colorectal cancer, derived from responses to the survey questions. To be included as an outcome, the survey questions had to be consistent between NHIS and BRFSS. We defined each outcome as follows:
Current smoking: Whether a person aged 18 or older reported currently smoking cigarettes some days or every day and having smoked at least 100 cigarettes in his or her lifetime by the time of interview. This category consists of 3 outcomes: current smokers overall, female current smokers, and male current smokers.
Ever smoking: Whether a person aged 18 or older smoked at least 100 cigarettes in his or her lifetime by the time of interview. This category includes 3 outcomes: ever smokers overall, female ever smokers, and male ever smokers.
Mammography screening for breast cancer: Whether a woman aged 40 or older had a mammogram within the 2 years preceding the interview.
Papanicolaou screening for cervical cancer: Whether a woman aged 18 or older had a Papanicolaou test within the 3 years preceding the interview.
HFOBT screening for colorectal cancer (CRC): Whether a person aged 50 or older had an HFOBT within the 2 years preceding the interview.
Colorectal endoscopy screening for CRC: Whether a person aged 50 or older had at least 1 colorectal endoscopy (proctoscopy, sigmoidoscopy, or colonoscopy) at any time preceding the interview.
Ever had a CRC screening: Whether a person aged 50 or older had at least 1 HFOBT in the 2 years preceding the interview or at least 1 colorectal endoscopy at any time preceding the interview.

Small area model and implementation
We considered area-level SAE models that first required us to compute county-level direct estimates of our outcomes of interest. Extending the models used in Raghunathan et al (6), we developed a hierarchical multilevel mixed effects model at the county level for each outcome.
NHIS samples were grouped into 3 exclusive groups based on household telephone status (households with a landline telephone, households with a cellular telephone only, and households without a telephone). For each outcome, national-level and county-level direct survey estimates of prevalence (eg, current smoking prevalence) were produced by using responses from BRFSS and NHIS, respectively. To incorporate the survey weights and complex sample design, we used the SAS PROC SURVEY package (SAS Institute) to produce these direct estimates for counties with responses from NHIS or BRFSS.
The first level of the small area model assumed an asymptotic distribution for the direct estimate vector of the 3 NHIS estimates by telephone status and BRFSS estimates. Arcsin-square-root transformations were applied to the direct estimates to stabilize the sampling variance (14). We used an unknown adjustment factor like the one used by Raghunathan et al (6) to measure the proportionate bias in BRFSS estimates relative to NHIS estimates.
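As an illustration of the variance-stabilizing step, the following minimal Python sketch applies the arcsine square-root transformation to a prevalence estimate and computes the standard delta-method approximation of its variance, 1/(4n). Using the effective sample size in place of n is an assumption made here for illustration, and the function names are hypothetical; the sketch is not the SAS or GAUSS code used in the study.

```python
import math


def arcsin_sqrt(p):
    """Arcsine square-root transform of a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


def approx_variance(n_eff):
    """Delta-method variance of the transformed estimate.

    Assumption: n_eff is an effective sample size (sample size divided
    by the design effect); the result does not depend on p.
    """
    return 1.0 / (4.0 * n_eff)


def back_transform(z):
    """Map a transformed value back to the proportion scale."""
    return math.sin(z) ** 2


# Hypothetical county: direct estimate 0.25, effective sample size 100.
z = arcsin_sqrt(0.25)
var_z = approx_variance(100)
p_again = back_transform(z)
```

Because the approximate variance depends only on the sample size and not on the prevalence itself, the sampling variances can be treated as known at the first level of a model.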
The second level of the model incorporated a set of covariates and introduced random effects at the county level, which enabled borrowing of information among counties and induced smoothing. The covariates integrated from multiple alternative sources are given in Table A1 of the Appendix. Diffuse but proper prior assumptions were used for the hyperparameters. The Markov chain Monte Carlo technique of Gibbs sampling (15) was adopted and implemented by using GAUSS programming software (Aptech Systems Inc) (Appendix).
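To illustrate how Gibbs sampling borrows information among counties, the following is a minimal Python sketch for a much simpler model than the one used in this study: a univariate area-level model with no covariates, known sampling variances, a flat prior on the overall mean, and a diffuse but proper inverse-gamma prior on the between-county variance. The function name, prior parameters, and input values are hypothetical and are not taken from the GAUSS implementation.

```python
import random


def gibbs_area_level(y, v, n_iter=3000, burn_in=1000, a=0.001, b=0.001, seed=0):
    """Posterior means of county effects theta_i under the sketch model

    y_i ~ N(theta_i, v_i), theta_i ~ N(mu, tau2),
    mu ~ flat prior, tau2 ~ Inverse-Gamma(a, b)  (assumed priors).
    """
    rng = random.Random(seed)
    m = len(y)
    theta = list(y)
    mu = sum(y) / m
    tau2 = 0.01
    sums = [0.0] * m
    kept = 0
    for it in range(n_iter):
        # Draw each theta_i from its normal full conditional, a
        # precision-weighted compromise between y_i and mu.
        for i in range(m):
            prec = 1.0 / v[i] + 1.0 / tau2
            mean = (y[i] / v[i] + mu / tau2) / prec
            theta[i] = rng.gauss(mean, (1.0 / prec) ** 0.5)
        # Draw mu given the current county effects.
        mu = rng.gauss(sum(theta) / m, (tau2 / m) ** 0.5)
        # Draw tau2 from its inverse-gamma full conditional
        # (gammavariate takes shape and scale, so the rate is inverted).
        ss = sum((t - mu) ** 2 for t in theta)
        tau2 = 1.0 / rng.gammavariate(a + m / 2.0, 1.0 / (b + ss / 2.0))
        if it >= burn_in:
            kept += 1
            for i in range(m):
                sums[i] += theta[i]
    return [s / kept for s in sums]


# Hypothetical direct estimates: four precise counties and one noisy county.
direct = [0.50, 0.52, 0.48, 0.51, 0.80]
variances = [0.0001, 0.0001, 0.0001, 0.0001, 0.04]
smoothed = gibbs_area_level(direct, variances)
```

In this sketch the noisy county's estimate is pulled toward the values of the precise counties, whereas the precise counties stay close to their direct estimates; that is the smoothing behavior that county-level random effects are intended to produce.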

Model validation
The small area models assumed that, for all outcomes of interest, BRFSS direct estimates (after dividing by the unknown adjustment factors) were unbiased estimates of the population means for the households with landline telephones. Therefore, one model validation check was to examine the ratios of the BRFSS direct estimates (after adjusting for the difference between BRFSS and NHIS) to the model-based estimates for households with landline telephones. These ratios were expected to converge to 1 as the BRFSS county-level sample size increased.
We also computed the summary statistics (mean, standard deviation, minimum, and maximum) of the direct estimates and the modeled estimates by household telephone status across all of the counties, to detect outliers. In addition, we aggregated the county-level modeled estimates to the national level and compared those with the corresponding national-level direct estimates from both NHIS and BRFSS.
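The aggregation check can be sketched as a population-weighted average of county estimates. The following minimal Python example assumes that each county is weighted by the size of its relevant population (eg, adults aged 18 or older for smoking outcomes); the weighting variable, function name, and input values are hypothetical, because the aggregation weights are not specified here.

```python
def aggregate_to_national(estimates, populations):
    """Population-weighted mean of county-level prevalence estimates."""
    if len(estimates) != len(populations):
        raise ValueError("estimates and populations must have equal length")
    total = sum(populations)
    if total <= 0:
        raise ValueError("total population must be positive")
    return sum(e * p for e, p in zip(estimates, populations)) / total


# Hypothetical example: two counties with prevalence 20% and 30%
# and eligible populations of 100 and 300.
national = aggregate_to_national([0.20, 0.30], [100, 300])
```

A national value computed this way can then be set beside the national direct estimate to check that the county-level estimates are coherent in aggregate.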
An external validation was also performed by linking the county-level smoking prevalence estimates to the most recent 5-year lung cancer mortality rate data (2011-2015), extracted from NCI's SEER*Stat database (13), and by examining the relationship. We also linked cancer screening estimates with their corresponding cancer incidence (or mortality) rates and examined the relationship.

Results
We created county-level model-based estimates for the 11 outcomes for 3,112 counties in the United States. The remaining counties were excluded because some ecological covariates were not available. The final model-based SAEs of the outcomes were posted on NCI's Small Area Estimates for Cancer-Related Measures website (https://sae.cancer.gov/nhis-brfss/) and included in NCI's Surveillance, Epidemiology, and End Results (SEER) county attributes database (https://seer.cancer.gov/seerstat/variables/countyattribs/). The 2008-2010 estimates were also released via the State Cancer Profiles website (https://statecancerprofiles.cancer.gov/), which cancer control planners visit frequently.
At the national level, from 2008 through 2010, 75.0% of households had landline telephones, 23.2% of households had cellular telephones only, and 1.8% of households had no telephone (Table 1). Although households without telephones accounted for only a small percentage (1.8%) nationally, the percentage varied significantly across counties. For example, the county-level model-based estimates of the percentage of households without telephones varied from 0.2% to 18.1%, with a county mean of 2.0% across the 3,112 US counties included in this study. The county-level modeled estimates of the percentage of cellular telephone-only households varied from 3.4% to 58.3%, with a county mean of 21.8%.
NHIS direct estimates of the 11 outcomes varied by telephone status across households. The cellular telephone-only households and the households with no telephones typically had higher smoking rates and lower screening rates than the households with landline telephones. For example, the current smoking prevalence estimated from the NHIS was 17.7% in households with landline telephones, 27.3% in cellular telephone-only households, and 30.9% in households without telephones. The mammography screening rate among women aged 40 or older was 69.1% in households with landline telephones, 56.6% in cellular telephone-only households, and 42.1% in households without telephones. One exception was Papanicolaou screening, where the cellular telephone-only households had the highest screening rates among the 3 household groups.
PREVENTING CHRONIC DISEASE www.cdc.gov/pcd/issues/2019/19_0013.htm • Centers for Disease Control and Prevention
Comparing the NHIS and BRFSS estimates for the 11 outcomes, we noted that for prevalence of current smokers and ever smokers, BRFSS direct estimates (17.0%) and the NHIS direct estimates (17.7%) for households with landline telephones were almost identical; however, the BRFSS estimates were up to 2.7% lower compared with the NHIS overall estimates for some of the smoking outcomes. For cancer screening rates, BRFSS estimates were significantly higher than NHIS estimates for households with landline telephones (eg, 76.0% vs 69.1% for breast cancer screening). This is consistent with findings in the literature comparing the 2 surveys (6,16,17).
Table 2 provides the summary statistics (minimum, 25th percentile, median, 75th percentile, maximum, mean, and standard deviation) of the county-level modeled estimates for the 11 outcomes across the 3,112 counties. The estimates for all 11 outcomes varied across the counties. County-level current smoking prevalence in 2008-2010 varied from 6.8% (95% confidence interval [CI], 2.7%-11.0%) to 43.0% (95% CI, 26.6%-59.5%), with an average of 25.1% across the 3,112 counties. The prevalence of breast cancer screening within the last 2 years varied from 30.7% (95% CI, 17.9%-43.5%) to 94.7% (95% CI, 87.1%-100%). The prevalence of cervical cancer screening within the past 3 years varied from 42.8% (95% CI, 29.0%-56.5%) to 96.5% (95% CI, 91.5%-100%). The prevalence of ever having a colorectal endoscopy test or an HFOBT within the past 2 years varied from 27.8% (95% CI, 15.5%-40.0%) to 88.8% (95% CI, 79.0%-98.6%) across the 3,112 counties. The modeled estimates had a narrower range than the county-level NHIS and BRFSS direct estimates, with a mean that was closer to the NHIS estimate.
The aggregated national modeled estimates for all 11 outcomes are similar to the corresponding NHIS national direct estimates, which is consistent with what we expected (Table 1).
We plotted the ratios of the model-based estimates for households with landline telephones to the BRFSS direct estimates (after adjusting for the difference between the NHIS and BRFSS) against the BRFSS effective sample size (sample size divided by estimated design effect) on a log scale (Figure 1). The funnel shape indicates that the modeled estimates and the BRFSS difference-adjusted direct estimates match very well for large counties, as expected.
The weighted correlation coefficient between the 2008-2010 county-level modeled current smoking prevalence and the 2011-2015 age-adjusted county-level lung cancer mortality rate is 0.741 (P < .001), using the inverse variance of the lung cancer mortality rate as the weight. We plotted lung cancer mortality rates (2011-2015) against current smoking prevalence (2008-2010) in a bubble scatter plot, where the size of the bubble represents the inverse variance of the lung cancer mortality rate (Figure 2). Both the correlation coefficient and the scatter plot demonstrate a strong linear relationship between the modeled county-level current smoking prevalence and the county-level lung cancer mortality rates, even though they are only a few years apart. The correlation coefficient between county-level cancer screening and corresponding cancer mortality or incidence (eg, mammography and breast cancer) varies from cancer to cancer, but all are significant. This external validation is evidence that the SAE models perform well.
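The two quantities used in these validation checks can be sketched in a few lines of Python. The example below computes an effective sample size (sample size divided by design effect) and a weighted Pearson correlation coefficient in which the weights would be the inverse variances of the mortality rates. The function names and input values are hypothetical; this is a sketch of the general technique, not the code used to obtain the reported coefficient.

```python
import math


def effective_sample_size(n, design_effect):
    """Sample size divided by the estimated design effect."""
    return n / design_effect


def weighted_correlation(x, y, w):
    """Weighted Pearson correlation of x and y with nonnegative weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    var_x = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    var_y = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs: smoking prevalence (%), mortality rate per 100,000,
# and the variance of each mortality rate.
prevalence = [15.0, 20.0, 25.0, 30.0]
mortality = [35.0, 48.0, 52.0, 70.0]
rate_variance = [4.0, 1.0, 2.0, 8.0]
weights = [1.0 / v for v in rate_variance]

r = weighted_correlation(prevalence, mortality, weights)
n_eff = effective_sample_size(400, 2.0)
```

Weighting by inverse variance gives counties with more stable mortality rates, typically the more populous ones, greater influence on the coefficient.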

Discussion
We generated county-level model-based estimates for 11 cancer risk factors and screening behaviors by combining information from NHIS, BRFSS, and auxiliary variables obtained from other data sources through novel statistical models for the data period 2008-2010. The same methods were used to produce SAEs for the data period 2004-2007. Our results revealed a large disparity in smoking prevalence and cancer screening rates among households by telephone status.
Our models have several strengths: 1) they use data from 2 large-scale national surveys, taking advantage of the large sample size from one (BRFSS) and the higher response rates and better coverage of all household types from the other (NHIS); 2) they incorporate cellular telephone-only households, a status that emerged rapidly during the study periods, as 1 dimension in the multivariate model structure, enabling better estimation; 3) they are built with county-level data, so survey weights and the major complex design features are incorporated before constructing the models; and 4) they include a large number of potential covariates, improving the predictive ability of the estimates. A limitation of the proposed methods is that we modeled the 11 outcomes separately and, to avoid further complicating the modeling process, did not consider the option of modeling all outcomes simultaneously. That approach may be worthy of exploration in future research. An additional limitation is that multicollinearity may exist among the covariates, possibly biasing the estimates of the regression coefficients. However, our main purpose was prediction, not interpretation of the relationship between the outcomes and the covariates.
Cancer screening is an important element of early detection and prevention (18). The US Preventive Services Task Force (USPSTF) makes recommendations on different types of cancer screening (19). Cancer screening metrics are included in the Healthy People 2020 goals (20), and the National Colorectal Cancer Roundtable aims to increase CRC screening prevalence to 80% by 2018 (21). However, cancer screening estimates for all US counties are not available elsewhere. Our SAEs are therefore an important and useful data resource for cancer control planners and researchers (22,23). Work has been initiated by the organizations responsible for these surveys, NCHS for NHIS and CDC for BRFSS, along with NCI, to analyze data from 2011 forward, in which a modified model will be developed to incorporate further changes in the BRFSS design, which now includes cellular telephone-only households and an improved weighting methodology. In addition, we encourage others to examine our methodology and develop other methodologies, to further examine the robustness of our results.
In defining the screening outcomes, we had to make some compromises between the latest USPSTF screening guidelines and the ability to code these outcomes consistently across time. For example, the addition of human papillomavirus (HPV) testing and immunization has changed the landscape of cervical cancer screening recommendations. In colorectal cancer screening, sigmoidoscopy is now rarely used in the United States, and newer technologies have been developed (eg, CT colonography, fecal DNA tests). We chose outcomes that, while not entirely current, could be coded consistently. These screening measures could serve as independent variables in other analyses or to judge areas of need. Although their estimates might not accurately reflect the newest screening technologies and guidelines, they are likely highly correlated and would likely maintain their rank order in counties across a state. In research using more recent data from 2011 forward, we tried to add estimates for cancer screening outcomes that align with the most recent USPSTF recommendations.
Moss JL, Liu B, Feuer EJ. Urban/rural differences in breast and cervical cancer incidence: the mediating roles of socioeconomic status and provider density. Womens Health Issues 2017;27 (6)