Volume 5: No. 1, January 2008
Applying the Small-Area Estimation Method to Estimate a Population Eligible for Breast Cancer Detection Services
Kirsten Knutson, MPH, Weihong Zhang, MS, Farzaneh Tabnak, MS, PhD
Suggested citation for this article: Knutson K, Zhang W, Tabnak F. Applying the small-area estimation method to estimate a population eligible for breast cancer detection services. Prev Chronic Dis 2008;5(1). http://www.cdc.gov/pcd/issues/2008/jan/06_0144.htm. Accessed [date].
Populations eligible for public health programs are often narrowly defined and, therefore, difficult to describe quantitatively, particularly at the local level, because of lack of data. This information, however, is vital for program planning and evaluation. We demonstrate the application of a statistical method using multiple sources of data to generate county estimates of women eligible for
free breast cancer screening and diagnostic services through California’s Cancer Detection Programs: Every Woman Counts.
We used the small-area estimation method to determine the proportion of eligible women by county and racial/ethnic group. To do so, we included individual and community data in a generalized linear mixed-effects model.
Our method yielded widely varied estimated proportions of service-eligible women at the county level. In all counties, the estimated proportion of eligible women was higher for Hispanics than for whites, blacks, Asian/Pacific Islanders, or American Indians/Alaska Natives. Across counties, the estimated proportions of eligible Hispanic women varied more than did those of women of other racial/ethnic groups.
The small-area estimation method is a powerful tool for approximating narrowly defined eligible or target populations that are not represented fully in any one data source. The variability and reliability of the estimates are measurable and meaningful. Public health programs can use this method to estimate the size of local populations eligible for, or in need of, preventive health
services and interventions.
At a time when more than 16% of the population of the United States (more than 47 million people) lack insurance coverage for basic medical services, an important function of public health is to provide the underserved and people disproportionately
affected by disease with access to preventive health services (1). To reach these people
effectively, public health programs are best implemented
locally, in counties or cities (2). Estimating the population eligible or targeted for specific services is often
difficult at the local level, however, because of lack of data (3-7).
Because public health programs or interventions are usually tailored to improve the health of specific underserved or high-risk groups, an individual may have to meet particular criteria (e.g., be a woman aged 40 years or older with no health insurance) to be eligible for program services or may have to belong to a target group characterized by the intervention (e.g., women with low personal
income at risk for pregnancy). Decennial census and intercensal population projections provide summary counts of local populations by various demographic characteristics, but these sources rarely contain data corresponding to the narrowly defined criteria that usually describe eligibility for public health programs. Public health surveys collect a wide range of information, but they, too,
may not contain the necessary data for generating reliable estimates of local populations (3,4,6-10). In fact, many surveys conducted statewide have so few respondents that even state estimates of small eligible or target populations are unreliable.
Because an epidemiologic description of the eligible or target population is essential to developing and operating a public health program or intervention (11), the problem of insufficient data must be addressed. Reliably defining the service-eligible population tells program planners how many people are eligible for services, who these people are, and where they live; it is also central to such activities as projecting costs, preparing budget proposals, and justifying funding requests. Equally important, reliable estimates help
planners determine the portion of the eligible population that the program cannot serve, given available resources. If funding is insufficient to reach all eligible people, estimates of subgroups of the eligible population, as defined by various demographic,
geographic, or high-risk characteristics, enable a program to identify priority target groups, establish realistic enrollment goals, and request appropriate funding (12). Reliable estimates also provide evidence for program growth and infrastructure development and provide essential data for decision making about program policy and resource allocation and for monitoring and evaluating a
program’s effectiveness (13,14).
We demonstrate how the small-area estimation method was used by the California Department of Public Health to estimate the size of local populations that are eligible for free breast cancer screening and diagnostic services through the state’s Cancer Detection Programs: Every Woman Counts (CDP:EWC). Although this method has many uses, including adjusting for census undercounts and estimating populations in political districts, we present it here as a reliable way for public health programs to estimate local populations eligible for preventive health services and interventions when no single data source is adequate for the task (4,8,15).
We used two data sources in our analysis. The primary source was the California Women’s Health Survey (CWHS), an annual population-based telephone survey that is coordinated and conducted by the California Department of Public Health and funded in collaboration with the California Department of Mental Health, the California Department of Alcohol and Drug Programs, Lumetra (formerly
California Medical Review, Inc), the California Department of Social Services, and the Public Health Institute. The survey data are intended to provide state estimates on women’s health behaviors and attitudes.
CWHS employs a screened random-digit–dialed sampling method to select households to be called. Women aged 18 years or older residing in a contacted household are eligible to participate in the survey. From 1997, when the survey began, through 2003, CWHS conducted an annual average of 4147 interviews statewide, with an annual average (upper-bound) response rate of 72.9%. Additional
information on methods used by CWHS is available from the California Department of Public Health, Survey Research Group (16).
To obtain a sample size appropriate for stratifying by small geographic area, we aggregated CWHS data from 1998 through 2003. Our initial sample
consisted of 14,284 women aged 40 years or older who were interviewed by CWHS during this 6-year period. We excluded from analysis 445 respondents (3.1%) who did not complete the interview; 636 (4.5%) who responded “don’t know” to, or
refused to answer, questions necessary to determine health insurance and poverty status; and 23 (0.2%) who responded “don’t know” to, or refused to answer, questions used to determine racial/ethnic group, marital status, education level, or county of residence. Our final sample size was 13,180.
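The exclusion arithmetic above can be checked with a short sketch. The counts are the ones reported in the text; the dictionary labels are shorthand introduced here for illustration and are not CWHS variable names.

```python
# Counts reported in the text for CWHS respondents aged 40 or older, 1998-2003.
initial_sample = 14_284

# Labels are illustrative shorthand, not CWHS variable names.
exclusions = {
    "did not complete the interview": 445,
    "insurance or poverty status unknown": 636,
    "demographics or county unknown": 23,
}

# Final analytic sample after removing all excluded respondents.
final_sample = initial_sample - sum(exclusions.values())

for reason, count in exclusions.items():
    share = 100 * count / initial_sample
    print(f"{reason}: {count} ({share:.1f}%)")
print(f"final sample: {final_sample}")
```

Each percentage is computed against the initial sample of 14,284, which is consistent with the rounded values given in the text.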
The second data source was Census 2000, Summary File 3 (SF 3) (17), which contains socioeconomic and housing information collected from a sample (about 1 in 6 households) of the approximately 19 million housing units nationwide that received the Census 2000 long-form questionnaire. For each of the 58 California counties, we extracted data from SF 3 that corresponded with the
socioeconomic characteristics identified as possibly associated with eligibility for CDP:EWC (15).
The dependent measure for this analysis was a binary variable representing eligibility for CDP:EWC services. Using CWHS data, we derived eligibility status for each respondent in our final sample of women aged 40 years or older, according to self-reported poverty and health insurance status. Respondents were considered eligible if they reported having an annual household income at or below
200% of the federal poverty level and having neither Medicaid nor Medicare. All other women were categorized as ineligible for services.
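As a minimal sketch, the eligibility rule described above can be written as a single predicate. The function and argument names are hypothetical, and the sketch assumes that household income has already been converted to a percentage of the federal poverty level and that the respondent is already in the sample of women aged 40 years or older.

```python
def is_eligible(income_pct_fpl: float, has_medicaid: bool, has_medicare: bool) -> bool:
    """Return True when a respondent meets the eligibility rule described in the text.

    income_pct_fpl: annual household income as a percentage of the federal
    poverty level (a hypothetical derived field, not a CWHS variable name).
    """
    low_income = income_pct_fpl <= 200  # "at or below 200%" includes exactly 200
    return low_income and not has_medicaid and not has_medicare


# Example: income at 150% of the poverty level and no public coverage.
print(is_eligible(150, has_medicaid=False, has_medicare=False))
```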
From CWHS data, we extracted information on each woman’s county of residence and derived each woman’s racial/ethnic group, education level, and marital status.
CWHS questions about racial/ethnic group varied over the study period. To determine ethnicity, CWHS asked women in some years if they were Hispanic and in other years if they were Latina. The same ethnicity information was therefore collected each year, even though the wording of the question changed over time in accordance with federal guidelines for the collection of these data (18). From
1998 through 2000, women chose a single race from seven racial groups that were read to them. From 2001 through 2003, women were asked to identify their race in the same way, but they could choose one or multiple racial groups. Women who reported being of multiple races were then asked to choose the group
with which they most identified. We categorized as Hispanic (an ethnicity) all respondents
who identified themselves as either Hispanic or Latina, regardless of their racial group. We categorized non-Hispanic respondents by their reported racial group, with respondents giving multiple races in the 2001 through 2003 surveys being categorized according to the racial group
with which they most identified.
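The categorization rule described above can be sketched as follows. The function name, argument names, and group labels are illustrative assumptions; the survey's actual response codes are not given in the text.

```python
from typing import Optional, Sequence


def racial_ethnic_group(is_hispanic_or_latina: bool,
                        races: Sequence[str],
                        most_identified: Optional[str] = None) -> str:
    """Assign one analytic group per respondent, following the rule in the text."""
    # Hispanic ethnicity takes precedence over any reported racial group.
    if is_hispanic_or_latina:
        return "Hispanic"
    if len(races) == 0:
        raise ValueError("no racial group reported")
    # Single-race respondents (all respondents in 1998-2000) keep that race.
    if len(races) == 1:
        return races[0]
    # Multiracial respondents (2001-2003) use the group they most identify with.
    if most_identified is None:
        raise ValueError("multiracial respondents need the group they most identify with")
    return most_identified


print(racial_ethnic_group(False, ["white", "black"], most_identified="black"))
```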
We divided education status into two categories: high school or less for respondents who reported no more education than completing high school or obtaining a GED (General Educational Development) certificate, and college or more for those who reported any amount of college or technical school.
We also divided marital status into two categories: married/partnered for respondents who reported being married or separated, and unmarried/unpartnered for those who reported being a member of an unmarried couple, divorced, widowed, or never married.
We extracted county data on per capita income from Census 2000, SF 3, Table P82 and median household income from SF 3, Table P53. Both variables represented county residents of all ages.
We defined the county unemployment rate as the proportion of women in the labor force aged 35 to 64 years (available age group) who were unemployed at the time of census and derived this information from SF 3, Table P35. The denominator comprised all women in a county in this age group.
We derived the percentage of women living in poverty in each county from data in SF 3, Table PCT49, by dividing the number of women aged 35 to 64
years (available age group) who were living below the federal poverty level by
the total number of women in a county in this age group.
All county variables were continuous.
We used the small-area estimation method (4,8,19,20) to generate regression-based estimates of the proportion of women eligible for CDP:EWC services. To demonstrate the usefulness of this method for estimating local and sparse target populations, we calculated estimates of service-eligible women by county and by racial/ethnic group within each county. We performed the regression analysis using SAS Version 8 and the corresponding GLIMMIX macro (SAS Institute Inc, Cary, North Carolina) (21).
To obtain the parameter estimates, we fitted the model using the restricted/residual pseudolikelihood method (22). All county variables were standardized using their means and standard deviations. We included individual and county variables as covariates in a generalized linear mixed-effects model with eligibility status as the outcome variable. To account for the
variation not explained by the regression variables, a county random effect, $a_i$, was included in the model:
$$\operatorname{logit}\left[P(y_{ij} = 1 \mid a_i)\right] = X_{ij}\beta + a_i .$$
In the model, $X_{ij}$ is the covariate vector for the $j$th observation in county $i$, covering racial/ethnic group, educational level, marital status, unemployment rate, percentage of women living in poverty, median household income, and the interaction terms between these variables; $\beta$ is the vector of fixed-effect coefficients; $y_{ij}$ is a Bernoulli random-response variable with probability $p_{ij}$; and
$a_i$ is assumed to be normally distributed with a mean of zero and a variance equal to $\sigma^2$.
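To make the model above concrete, the following sketch shows how a standardized covariate vector, a coefficient vector, and a county random effect combine into an eligibility probability through the inverse logit. It is an illustration only: the function names are hypothetical, the numeric inputs are invented, and the use of the population standard deviation for standardizing is an assumption, because the text does not say which form was used.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Center values at their mean and scale by their standard deviation.

    Assumption: population standard deviation; the text does not specify.
    """
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def inverse_logit(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def eligibility_probability(x, beta, county_effect):
    """Compute p_ij from logit(p_ij) = x . beta + a_i."""
    eta = sum(xk * bk for xk, bk in zip(x, beta)) + county_effect
    return inverse_logit(eta)


# Invented numbers for illustration; these are not fitted coefficients.
x = [1.0, 0.5, -0.2]      # intercept term plus two standardized covariates
beta = [-1.8, 0.6, 0.3]   # hypothetical fixed-effect coefficients
print(round(eligibility_probability(x, beta, county_effect=0.1), 3))
```

A positive county effect raises the predicted probability for every woman in that county relative to a county with the same covariates and an effect of zero, which is how the model captures variation that the regression variables do not explain.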
During preliminary analysis, we compared the Akaike Information Criterion (AIC) values of the model variables to assess their relative contribution to the model. Education level and marital status, which had the lowest AIC values and did not contribute to the model selection, were not included as variables. The racial/ethnic and county variables and the interaction terms were maintained
in the preliminary model.
Next, we used backward selection (23) to determine which variables and interactions to retain in the final model. To increase predictability, we set the selection criterion for the model at α = 0.30, rather than at a lower level (24). The variables representing unemployment rate, percentage of women living in poverty, median household income, each racial/ethnic group, and significant interaction terms remained in the final model (Table 1).
We used the Monte Carlo method (25) to estimate the proportion of eligible women in each racial/ethnic group in each county and the bootstrap method (26) to calculate the standard error of the estimated proportions in each racial/ethnic group and in all races combined. We calculated a 95% confidence interval for each estimated proportion from its standard error. We computed the coefficient of variation (CV) to assess the reliability of the point estimates (4,27) and considered proportions with a CV greater than 0.23 unreliable. All county estimates were found to be reliable.
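The reliability check described above can be sketched in a few lines. This is a simplified illustration, not the procedure used in the analysis: it bootstraps a directly observed proportion rather than a model-based estimate, and the normal-approximation confidence interval is an assumption because the text does not state how the intervals were formed. The function names and the toy data are hypothetical; only the 0.23 threshold comes from the text.

```python
import random
from statistics import mean, stdev


def bootstrap_se(outcomes, n_resamples=1000, seed=0):
    """Bootstrap standard error of a proportion from 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    proportions = []
    for _ in range(n_resamples):
        # Resample respondents with replacement and recompute the proportion.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        proportions.append(mean(resample))
    return stdev(proportions)


def coefficient_of_variation(estimate, se):
    """CV = standard error divided by the point estimate."""
    return se / estimate


def is_reliable(estimate, se, threshold=0.23):
    """Estimates with a CV greater than the threshold are treated as unreliable."""
    return coefficient_of_variation(estimate, se) <= threshold


def normal_ci(estimate, se, z=1.96):
    """Approximate 95% interval (assumed normal approximation)."""
    return estimate - z * se, estimate + z * se


# Toy data: 20 eligible respondents out of 100 (invented for illustration).
outcomes = [1] * 20 + [0] * 80
estimate = mean(outcomes)
se = bootstrap_se(outcomes)
print(estimate, is_reliable(estimate, se), normal_ci(estimate, se))
```

Because the CV scales the standard error by the estimate itself, small proportions estimated from few respondents are the most likely to exceed the threshold.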
The estimated county percentages of eligible women varied from a minimum of 5.5% (Marin County) to a maximum of 35.3% (Imperial County) (Table 2). The estimated percentage at the 25th percentile was 11.1%; at the 50th percentile (median), 13.6%; and at the 75th percentile, 15.9%. The mean of the estimates was 13.9%. The estimated proportions were not normally distributed but were skewed to the right.
The small-area estimation method yielded a wide range and considerable variability in the estimated proportions across counties (Figure). The estimated proportions of eligible Hispanic women varied more than did those of women of other races. Even so, the range of estimates across counties in each racial group was more than 10 percentage points. Imperial County, one of the outliers in the figure, had the highest proportion of eligible women of all races combined. In the second outlier, Del Norte County, an estimated 23.2% of black women aged 40 years or older were eligible for CDP:EWC services.
Figure. Estimated proportions of women aged 40 years or older in California counties eligible for breast cancer screening and diagnostic services through Cancer Detection Programs: Every Woman Counts, by racial/ethnic group, 1998–2003. [A tabular version of this figure is also available.]
Asterisk (*) indicates value suspected as outlier; plus (+), mean of county proportions.
a Imperial County.
b Del Norte County.
Note. The bottom line of each box represents the 25th percentile of the
estimated proportions; the middle line, the 50th percentile (median), and the top line,
the 75th percentile. The endpoints of the whiskers are the most extreme values not identified as suspected outliers. We identified as suspected outliers county proportions exceeding the 75th percentile plus 1.5 times the interquartile range, or falling short
of the 25th percentile minus 1.5 times the interquartile range.
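The outlier rule in the note can be sketched as follows. The function name and the sample values are hypothetical, and the quartile definition is an assumption: Python's `statistics.quantiles` with `method="inclusive"` is used here, and other software may compute percentiles slightly differently.

```python
from statistics import quantiles


def suspected_outliers(proportions):
    """Return values beyond 1.5 times the interquartile range from the quartiles."""
    # quantiles(..., n=4) returns the 25th, 50th, and 75th percentiles.
    q1, _median, q3 = quantiles(proportions, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [p for p in proportions if p < lower_fence or p > upper_fence]


# Invented county percentages for illustration only.
example = [10, 11, 12, 13, 14, 15, 16, 17, 40]
print(suspected_outliers(example))
```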
Estimated proportions of eligible women showed considerable variability by race within counties
(data not displayed graphically). In every county, the estimated proportion of women aged 40 years or older who were eligible for CDP:EWC services was higher for Hispanics than for whites, blacks, Asian/Pacific Islanders, and American Indians/Alaska Natives. In Imperial County, an estimated
45.2% of Hispanic women aged 40 years or older were eligible for program services, compared with 18.6% of black women, 15.8% of American Indian/Alaska Native women, 12.1% of white women, and 10.2% of Asian/Pacific Islander women. In Los Angeles County, which is the most populous of the state’s counties and had the fifth highest estimated proportion of eligible women (20.0%), an estimated
40.2% of Hispanic women, 19.2% of American Indian/Alaska Native women, 15.7% of black women, 13.3% of Asian/Pacific Islander women, and 8.4% of white women were eligible. These proportions were all reliable. Some of the estimated proportions of eligible Asian/Pacific Islander and American Indian/Alaska Native women were not reliable, however, because of the small sample sizes in the CWHS for
women in these racial/ethnic groups.
When calculating reliable estimates directly from survey or population data is not possible, the ability to combine multiple sources of data, each with different facets of the necessary information, is a strength of
the small-area estimation method. In our example, available survey data contained information corresponding to CDP:EWC eligibility criteria, but they were appropriate only for
state estimates. With the small-area estimation method, we were able to supplement statewide survey data with community census data by means of statistical modeling and produce reliable estimates for each California county.
Although the term small-area estimation suggests that this method is used to estimate populations living in small geographic areas,
this method is also useful in identifying sparse target populations. Describing the distribution of a narrowly defined characteristic in racial/ethnic groups is a common problem because of the small number of people in some of these groups. With small-area
estimation, however, we were able to calculate reliable estimates of service-eligible women in five racial/ethnic groups for most California counties.
Public health professionals have synthetically calculated local estimates when data with an adequate sample size to directly calculate local estimates are unavailable (3,4,9,10). In our demonstration, for example, we could have calculated a direct estimate of the proportion of eligible women in California from CWHS data and then multiplied each county’s census population by
this proportion to estimate the local numbers of eligible women. The resulting estimates would be based on the assumption that the demographic characteristics that define program eligibility are present in every county in the same proportion as they are in the state (4,8). This would be a poor assumption, however, because the synthetic method would estimate that 16.3% of women aged 40
years or older in each county were eligible for CDP:EWC services, whereas the small-area
estimation method that we used yielded widely varying estimated proportions by county.
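The synthetic calculation described above amounts to applying one statewide proportion to every county. A minimal sketch follows; the 16.3% figure is the direct statewide estimate reported in the text, while the function name, county names, and populations are invented for illustration.

```python
STATE_PROPORTION = 0.163  # direct statewide estimate reported in the text


def synthetic_estimates(county_populations, state_proportion=STATE_PROPORTION):
    """Apply the same statewide proportion to each county's population."""
    return {
        county: population * state_proportion
        for county, population in county_populations.items()
    }


# Hypothetical populations of women aged 40 or older.
county_populations = {"County A": 10_000, "County B": 50_000}

for county, eligible in synthetic_estimates(county_populations).items():
    print(f"{county}: about {eligible:.0f} eligible women")
```

Every county receives the same proportion by construction, so the synthetic method cannot reproduce the county-to-county variation that the model-based estimates show.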
Another benefit of the small-area estimation method is that variability and reliability can be measured, and these statistics are informative. Although the standard error and confidence interval can be calculated for each synthetically generated point estimate (i.e., proportion of eligible women), these measures are not meaningful because the estimates themselves are limited by the flawed
assumption we have described.
As with any means of estimation, however, obtaining statistically reliable results depends on factors such as sample size. When generating local estimates in the absence of sufficient local data,
the small-area estimation method allows the researcher to borrow strength from available data (9,20). For some sparse local populations, however, no amount of supplemental
information can compensate for the small number of survey respondents sampled, and model-based estimates for these populations will be unreliable.
A major limitation of small-area estimation statistics is that diagnostics for checking nonlinear models are few and not well-developed (8). Even so, comparing model-based with directly calculated survey-based estimates of the target population in the large area (i.e., the aggregate of local areas) can provide some indication of the performance of a model (28). For example, our method
estimated that 15.3% of California women aged 40 years or older were service-eligible, whereas the direct method yielded an estimate of 16.3%. For practical purposes, the two estimates are similar, and without a gold standard, observing similar values resulting from two different methods can be a qualitative confirmation of methods and analysis. Although the statistical method
that we used has been validated (4,8), a model-based overall estimate that was vastly different from the survey-based direct estimate would be a signal to the researcher to reassess the analysis.
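The comparison described above can be sketched as a simple aggregate check. Population weighting of the county estimates is an assumption made here, because the text does not state how the statewide model-based figure was computed; the function names are hypothetical, and only the 15.3% and 16.3% figures come from the text.

```python
def aggregate_proportion(county_proportions, county_populations):
    """Population-weighted average of county proportions (assumed aggregation)."""
    total = sum(county_populations.values())
    weighted = sum(
        county_proportions[county] * county_populations[county]
        for county in county_populations
    )
    return weighted / total


def gap_in_percentage_points(model_based, direct):
    """Absolute difference between two proportions, in percentage points."""
    return 100 * abs(model_based - direct)


# Statewide figures reported in the text.
print(round(gap_in_percentage_points(0.153, 0.163), 1))
```

What counts as a gap large enough to prompt reassessment is a judgment for the analyst; the text gives no numeric cutoff.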
California’s CDP:EWC program has benefited from knowing how widely the numbers and percentages of eligible women vary across the state’s counties. For instance, the county estimates inform decisions about distributing resources and funds to the community partnerships that assist the program with public education, outreach, and clinical quality assurance
measures. The estimates by racial/ethnic group are useful in developing culturally appropriate messages and educational materials and in improving access to high quality screening services.
Other public health programs that have difficulty describing the distribution of their target populations because of a general lack of local data on health insurance status may also benefit from applying the method we have described. For example, other states that participate in the National Breast and Cervical Cancer Early Detection Program (NBCCEDP) (http://www.cdc.gov/cancer/nbccedp/)
have eligibility criteria similar to those of California’s CDP:EWC program and could produce meaningful estimates of eligible local populations by racial/ethnic and age groups by applying
the small-area estimation method using a state survey or the Current Population Survey (a national survey that contains health insurance information [www.census.gov/cps/]) and census data (29).
WISEWOMAN (Well-Integrated Screening and Evaluation for Women Across the Nation [www.cdc.gov/wisewoman/]), a state-based program offering NBCCEDP-enrolled women free or low-cost risk-factor screening, lifestyle interventions, and referral services aimed at preventing cardiovascular and other chronic diseases (30), could use this method to determine local estimates of the eligible
population by demographic group to help identify provider sites and to determine the number of potential WISEWOMAN recruits.
One might think that in this age of information, data to describe any population of interest would be easy to obtain. This is not always the case, however, particularly when
a population is narrowly defined, either by residence in a small geographic area or by specific characteristics. Small-area estimation statistics, as applied in our example, give public health programs a means of
obtaining reliable estimates of their local or sparse target populations, even when no data seem to be available.
A portion of the data for these analyses was provided by the California Women’s Health Survey Group. Analyses, findings, and conclusions described in this report are not necessarily endorsed by the CWHS Group.
We acknowledge Dr. Georjean Stoodt for her helpful comments and support of this project. We also thank Lawrence Portigal for his thorough review and editorial recommendations.
Corresponding Author: Kirsten Knutson, MPH, California Department of Public Health, CDIC/Cancer Detection Section, MS 7203, PO Box 997413, Sacramento, CA 95899-7413. Telephone: 916-449-5305. E-mail: Kirsten.Knutson@cdph.ca.gov.
Author Affiliations: Weihong Zhang, Farzaneh Tabnak, California Department of Public Health, Cancer Detection Section, Sacramento, California.
1. DeNavas-Walt C, Proctor BD, Lee CH. Income, poverty, and health insurance coverage in the United States: 2005. In: U.S. Census Bureau. Current population reports. Washington (DC): US Government Printing Office; 2006. p. 60-231.
2. Promising practices in chronic disease prevention and control: a public health framework for action. Atlanta (GA): U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2003.
3. MacKenzie EJ, Shapiro S, Yaffe R. The utility of synthetic and regression estimation. Techniques for local health planning. Med Care 1985;23(1):1-13.
4. Jia H, Muennig P, Borawski E. Comparison of small-area analysis techniques for estimating county-level outcomes. Am J Prev Med 2004;26(5):453-60.
5. Elston JM, Koch GG, Weissert WG. Regression-adjusted small area estimates of functional dependency in the noninstitutionalized American population age 65 and over. Am J Public Health 1991;81(3):335-43.
6. Ponce N, Teleki S, Brown ER. California’s uninsured children: a closer look at the local level. Berkeley (CA): University of California Berkeley School of Public Health, Center for Health and Public Policy Studies, Health Insurance Policy Program; 2000. http://chpps.berkeley.edu/publications/HIPP%20Policy%20Alert%2003_00.pdf. Accessed March 30, 2004.
7. Malec D, Davis WW, Cao X. Model-based small area estimates of overweight prevalence using sample selection adjustment. Stat Med 1999;18(23):3189-200.
8. Ghosh M, Rao JNK. Small area estimation: an appraisal. Stat Sci 1994;9:55-93.
9. Spasoff RA, Strike CJ, Nair RC, Dunkley GC, Boulet JR. Small group estimation for public health. Can J Public Health 1996;87(2):130-4.
10. Lafata JE, Koch GG, Weissert WG. Estimating activity limitation in the noninstitutionalized population: a method for small areas. Am J Public Health 1994;84(11):1813-7.
11. Rossi P, Freeman H. Evaluation: a systematic approach. Newbury Park (CA): Sage Publications; 1993.
12. Porter EJ. Defining the eligible, accessible population for a phenomenological study. West J Nurs Res 1999;21(6):796-804.
13. Bartholomew LK, Parcel G, Kok G, Gottlieb N. Intervention mapping: designing theory- and evidence-based health promotion programs. Mountain View (CA): Mayfield Publishing Company; 2001.
14. Bitler M, Currie J, Scholz J. WIC eligibility and participation. J Hum Resources 2003;38(S):1139-79.
15. Brown ER, Meng Y, Mendez CA, Yu H. Uninsured Californians in assembly and senate districts, 2000. Los Angeles (CA): UCLA Center for Health Policy Research; 2001.
16. California Women’s Health Survey SAS dataset documentation and technical report 2003. Sacramento: California Department of Health Services, Cancer Surveillance Section, Survey Research Group; 2004.
17. Census 2000 Summary File 3 — California. Washington (DC): U.S. Census Bureau; 2002. http://factfinder.census.gov/servlet/DatasetMainPageServlet?_program=DEC&_lang=en&_ts. Accessed April 25, 2005.
18. Provisional guidance on the implementation of the 1997 standards for federal data on race and ethnicity, 2000. Washington (DC): Office of Management and Budget; 2000. http://www.whitehouse.gov/omb/inforeg/re_guidance2000update.pdf. Accessed March 16, 2005.
19. Borawski E, Jia H. State and county estimates of severe work disability among Missouri adults, aged 18–64, 1993–1996 BRFSS. Jefferson City: Missouri Department of Health and Senior Services; 1998.
20. Andrews HF, Kerner JF, Zauber AG, Mandelblatt J, Pittman J, Struening E. Using census and mortality data to target small areas for breast, colorectal, and cervical cancer screening. Am J Public Health 1994;84(1):56-61.
21. Wolfinger R, O’Connell M. Generalized linear mixed models: a pseudo-likelihood approach. J Stat Computation Simulation 1993;48:233-43.
22. Ericksen EP. A regression method for estimating population changes of local areas. J Am Stat Assoc 1974;69(348):867-75.
23. Rothman KJ, Greenland S. Modern epidemiology. 2nd ed. Philadelphia (PA): Lippincott Williams & Wilkins; 1998.
24. Shtatland ES, Cain E, Barton MB. The perils of stepwise logistic regression and how to escape them using information criteria and the output delivery system. Proceedings from the 26th Annual SAS Users Group International Conference. 2001 Apr 22-25; Long Beach, CA.
25. Thisted RA. The elements of statistical computing: numerical computation. New York (NY): Chapman and Hall/CRC; 1988.
26. Shao J, Tu D. The jackknife and bootstrap. New York (NY): Springer; 1996.
27. Tabnak F, Tholandi M, Kuniholm M. A spatial study of AIDS surveillance data by demographic subgroups in California. Sacramento: California Department of Health Services, Office of AIDS; 2001.
28. Brugal MT, Domingo-Salvany A, Maguire A, Cayla JA, Villalbi JR, Hartnoll R. A small area analysis estimating the prevalence of addiction to opioids in Barcelona, 1993. J Epidemiol Community Health 1999;53(8):488-94.
29. Tangka FK, Dalaker J, Chattopadhyay SK, Gardner JG, Royalty J, Hall IJ, et al. Meeting the mammography screening needs of underserved women: the performance of the National Breast and Cervical Cancer Early Detection Program in 2002–2003 (United States). Cancer Causes Control 2006;17(9):1145-54.
30. WISEWOMAN — Well-Integrated Screening and Evaluation for Women Across the Nation. Atlanta (GA): US Department of Health and Human Services, Centers for Disease Control and Prevention. http://www.cdc.gov/wisewoman/. Accessed December 18, 2006.