Estimating Prevalence of Overweight or Obese Children and Adolescents in Small Geographic Areas Using Publicly Available Data

Introduction Interventions for pediatric obesity can be geographically targeted if high-risk populations can be identified. We developed an approach to estimate the percentage of overweight or obese children aged 2 to 17 years in small geographic areas using publicly available data. We piloted our approach for Georgia. Methods We created a logistic regression model to estimate the individual probability of high body mass index (BMI), given data on the characteristics of the survey participants. We combined the regression model with a simulation to sample subpopulations and obtain prevalence estimates. The models used information from the 2001–2010 National Health and Nutrition Examination Survey, the 2010 Census, and the 2010 American Community Survey. We validated our results by comparing 1) estimates for adults in Georgia produced by using our approach with estimates from the Centers for Disease Control and Prevention (CDC) and 2) estimates for children in Arkansas produced by using our approach with school examination data. We generated prevalence estimates for census tracts in Georgia and prioritized areas for interventions. Results In DeKalb County, the mean prevalence among census tracts varied from 27% to 40%. For adults, the median difference between our estimates and CDC estimates was 1.3 percentage points; for Arkansas children, the median difference between our estimates and examination-based estimates data was 1.7 percentage points. Conclusion Prevalence estimates for census tracts can be different from estimates for the county, so small-area estimates are crucial for designing effective interventions. Our approach validates well against external data, and it can be a relevant aid for planning local interventions for children.


Introduction
Obesity is considered an urgent health challenge and a winnable battle by the Centers for Disease Control and Prevention (CDC) (1). Because overweight or obese children are at a higher risk than normal-weight children for health problems, they are a target for intervention (2). There is evidence of disparities in pediatric obesity; Bethell et al (3) studied differences in obesity rates by race/ethnicity, insurance, and income and found within-and across-state disparities. Each of these factors can vary significantly across a city or county, so identifying small geographic areas with children at greatest risk for high body mass index (BMI) can be helpful in delivering cost-effective interventions.
BMI data are obtained through direct measurement or self-reported survey data. Direct measurement results in more accurate data but is a more challenging and costly method; self-reporting results in inaccuracies and is generally biased among children younger than 12 years (4,5). Some cities or states began initiatives to measure height and weight in schools, but these systemic efforts are practiced in only a few places in the United States, such as Arkansas (6) and New York City (7).
Approaches to estimating the prevalence of health conditions in small geographic areas are not new. Some researchers address uncertainty by using Bayesian approaches, which assume knowledge of the behavior being estimated (eg, psychiatry [8], hip and knee replacement [9]). Choy et al stressed the relevance of using publicly available information and presented an estimation method for an infectious disease but did not estimate variability (10). Methods for estimating the prevalence of adult obesity in small areas (11)(12)(13)(14) cannot be easily applied for estimating prevalence among children. Using nonpublic data sets, Zhang et al (15) estimated the prevalence of obesity among American youths aged 10 to 17; they did not estimate the prevalence among younger children or the prevalence of overweight and obesity combined. Finally, to our knowledge, none of the previous studies described the prevalence of obesity among populations younger than 10 years or validated their estimates by comparing them with external measurement data.
Our objective was to describe a method that can be used to provide baseline estimates of the prevalence of children and adolescents with high BMI (either overweight and obese or obese only) at the census-tract level. We used US Census data and direct measurement data from the National Health and Nutrition Examination Survey (NHANES), and we piloted the method in Georgia's 159 counties. The development of our approach originally responded to the need of a large health care provider to geographically target a large-scale campaign to reduce high BMI among children in Georgia. The same method can be applied to generate baseline estimates at other geographic levels, and using publicly available data makes our method easy and cost-effective to replicate.

Methods
Our study used data on children and adolescents aged 2 to 17 years who were either overweight (BMI at or above the 85th percentile and lower than the 95th percentile for children of the same age and sex) or obese (BMI at or above the 95th percentile for children of the same age and sex) (16,17). We developed a model for predicting the probability of an individual child having high BMI using data from continuous NHANES surveys (2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010). We used R statistical software (18-20) to develop the model, generate samples, and map results. We created a simulation using C++ software (Intel Corporation) and data from the 2010 US Census to obtain prevalence estimates.

Logistic regression model
In fitting the logistic model, we estimated Pr(Y = 1|X), where Y is the binary response and X is the vector of covariates.

Derivation of the dependent variable Y
We calculated BMI as a person's weight in kilograms divided by the square of the person's height in meters. The BMI-for-age charts adopted by the CDC in 2000 define children's population percentiles by sex (16). We used charts for each age; 1 study (21) found that the BMI-for-age metric had less predictive power for children aged 3 to 5 years than it had for older children but did not invalidate BMI as a metric for obesity in very young children. We defined high BMI as a BMI in the 85th percentile or more (overweight or obese), which agrees with previous literature and CDC guidelines. We applied similar approaches for BMI values at or above the 95th percentile (obese).

Model Covariates: X
On the basis of previous findings (3,22), we used covariates related to socioeconomic and demographic status and potentially related to high BMI that are used in NHANES, the 2010 Census, and the American Community Survey (ACS). The variables considered were sex, race/ethnicity, age in months, education level of the household representative (level 1, <9th grade; 2, 9th-11th grade; 3, high school graduation or equivalent; 4, some college; 5, college graduate or above), household size (2 to ≥7 people), and family income or family poverty level (using the thresholds from ACS tables). Race/ethnicity and sex were treated as categorical variables.

Implementation of the logistic regression model
The main logistic regression model included 6 variables: 3 binary variables for race/ethnicity (Hispanic, non-Hispanic black, non-Hispanic white, and other non-Hispanic), age of child in months, household size, and education level of household representative. The reference group was non-Hispanic white, a household size of 2 people, a child aged 2 years, and less than a 9th grade education.
We linearly scaled all variables into a [0,1] interval for numerical stability and comparison across covariates. We selected the covariates using backward stepwise variable elimination. The fitted logistic regression provided estimates for the conditional distribution of Y|X. We used bootstrap resampling to obtain realizations of the empirical distribution of the regression coefficients.

Generating virtual populations in geographic areas
We obtained demographic and socioeconomic data on census tracts from the US Census Bureau. Data on the distribution of the covariates were used to generate a virtual population of a geographic area. To more precisely characterize the population in a geographic area, we considered the interdependence of some population characteristics. We used multiple tables provided by the Census Bureau, each stratified by race and ethnicity (PCT20 and QTP2 from the 2010 Census for household size and age, respectively; B15001 from the 2010 ACS 5-year estimates for education).

Linking the individual high-BMI regression model to small-area-level data
For each virtual individual j we estimated Pr(Y = 1|X = Y j * ) by sampling from the empirical distribution of the regression coefficients and evaluating the logistic probability function with the characteristics of the virtual individual. We use this probability to simulate Y j * , the binary weight status of the individual. The prevalence estimate of high BMI is then where B is the population count in the geographic area. We repeated the simulation 1,000 times and obtained the standard deviation of the estimate. This simulation model allowed for variations due to model estimation and individual randomness. It took less than an hour to produce the estimates for Georgia.

Identification of priority areas
When limited resources are assigned to improve an overall system indicator, a common approach is to allocate most resources to where the largest overall benefit is obtained. This idea is known as the Pareto principle (23). In our context, 2 indicators were relevant for classifying small areas by priority. The first indicator was the estimated baseline prevalence for the area. The second was the estimated number of children with a high BMI (the estimated prevalence × the population of children) in each area. We used the Pareto principle to select priority areas for intervention -ie, areas with the largest number of children with high BMI. We assigned priority to the counties that accounted for approximately 77% of the total population of children with high BMI.

Model validation
We developed 3 analyses to validate our modeling approach. One, we modeled the population aged 10 to 17 years in Georgia and compared our state-level outcomes to the state-level prevalence estimates of the 2007 National Survey of Children's Health (24). Two, we modeled the population of adults and compared our county-level results with CDC's 2007 county-level obesity estimates for Georgia (25). Three, we modeled obesity among children aged 5 to 17 years in Arkansas by county and compared our data with the 2010-2011 school measurements in that state (6). For the adult validation model, additional variables were added for a better fit, and for the Arkansas validation model, income was added to capture variation in demographics in that state.

Logistic regression model
In Georgia, non-Hispanic black children and Hispanic children were more likely to have high BMI than non-Hispanic white children, and other non-Hispanics were less likely ( Table 1). The probability of high BMI increased with age. The probability of high BMI decreased as the education level of the household representative increased. The probability of a high BMI also decreased with household size. For predicting obesity only, the model variables were the same, but the coefficients were different (Table 1).

Simulation model
Among a sample of census tracts in Georgia (Table 2), the prevalence of high BMI was significantly lower in census tract no. 20300 (27.4%) than in neighboring census tract no. 20600 (36.4%) in DeKalb County. One explanation is that the tracts had different socioeconomic and demographic characteristics, except for household size ( Table 3).
The estimated prevalence of high BMI at the census-tract level in Georgia varied from 26% to 42%. Counties with the greatest variability in census-tract prevalence estimates also tended to be the counties with the largest populations (Supplementary Figure [Ap pendix]). In Fulton County, for example, the prevalence of a high BMI by census tract ranged from 26% to 41%. The prevalence by county varied from 31% to 40% (Supplemental Table 4 [Appendix]). According to the Census Bureau, census tracts are generally defined according to observable characteristics and features, whereas counties are usually larger and may include areas with more diverse characteristics (26); these differences may explain the differences in prevalence ranges. The prevalence of high BMI was low in the northern part of Atlanta, whereas prevalence was higher in some areas of the eastern, western, and southern parts of the city (Figure 1). The correlation at the census-tract level between the prevalence estimates for obesity and prevalence estimates for overweight or obesity was 0.972.

Identification of priority areas
Approximately 77% of children with high BMI resided in 39 counties (25% of counties) in Georgia ( Figure 2). Areas of high BMI included densely populated areas, such as metropolitan Atlanta, smaller cities such as Augusta, Macon, Savannah, and Rome, as well as rural areas.

Model validation
Our baseline prevalence estimate of high BMI among children aged 10 to 17 years in Georgia was 37.5%. The 2007 state-level estimate in the National Survey of Children's Health was 37.3% (95% confidence interval, 31.7%-42.9%) (24).
When we compared our county-level estimates for adults in Georgia with CDC's estimates (25), we obtained a 0.92 spatial correlation; the median difference between counties in the 2 sets of estim-PREVENTING CHRONIC DISEASE ates was less than 1.3 percentage points. When we compared our county-level estimates of overweight or obesity among children aged 5 to 17 years in Arkansas with the 2010-2011 county measurements (6), we obtained a spatial correlation of 0.77; the median difference between counties was 1.7 percentage points. Supplementary Table 5 and Table 6 (Appendix) provide details on the adult and Arkansas models.

Discussion
Results are not surprising for the race/ethnicity variables in the model predicting an individual's probability of high BMI, because they are consistent with prior research on the subject (27). The variable of income was not selected in the best-fitting models for Georgia, but alternative models were possible with that variable. There is no obvious explanation for why household size affects the probability of high BMI, but this question is worth future study. Other factors associated with high BMI in adults are cardiovascular diseases and smoking (28); for children, we considered alternative models with a variable to indicate whether anyone in the child's home smokes, but we found no significant improvement in the regression model; this factor may have been captured indirectly by other variables.
The selection of independent variables for a model depends on the estimate or region being studied. For our analysis of Georgia, the great majority of the population comprised 3 main racial/ethnic groups: black, Hispanic, and white. In Arkansas, race/ethnicity was more homogeneous than in Georgia, so our model there included income. If our approach is to be used in any other state, the selection of the model variables should match the composition of that state's population.
Validations indicated that our modeling methodology can provide reasonable estimates, with high correlation with reference values and good accuracy. However, the model-based estimates had smaller ranges than the validation data. External data were not available to validate prevalence estimates for children younger than 5 years.
Data showing differences in prevalence estimates among census tracts in a single county support the importance of generating estimates for small areas. Because the populations of census tracts and counties in our study were heterogeneous in several respects, small-area estimations provided better information than county-based measures. Small-area estimations can help to target interventions aimed at high BMI and can also be used for targeting interventions for other public health problems. The findings described here, including the strategy for identifying priority areas, were used by a health provider in Georgia to prioritize interventions for children with high BMI throughout the state.
Small-area estimates are useful for informing intervention strategies, but they are more difficult to use for evaluating interventions. One reason for the difficulty is that data may not change quickly enough to drive new estimates. This is especially true for a condition like obesity, the data for which change slowly. On the other hand, small-area estimates for diseases such as human immunodeficiency virus may be more dynamic, especially if the estimates are able to incorporate local information.
Our study has several limitations. Because we included individuallevel variables only and not local context (13), our results could be over-smoothed and could underestimate geographic variations in a geographic unit such as a county. Many factors related to high BMI among children may be specific to a geographic area, and data on these factors cannot be entirely captured without local sampling (29). Our model can capture precisely only those interactions among population variables that are publicly available in the US Census, potentially introducing bias to the estimates when assuming partial independence among some of the input variables. Additionally, complete information is not available on some small groups in census tracts, adding levels of approximation to our estimates in smaller areas. Our model is limited to the validity of the data used. For example, Eto et al (30) found that BMI had low sensitivity but high specificity for predicting obesity in children aged 3 to 5 years. The ACS provides yearly estimates for every variable used in our model; these estimates can be updated annually. The NHANES data used to develop our model were from 2001 through 2010, which ignores the temporal trends in pediatric obesity; however, a recent study found childhood obesity has not considerably changed during the past decade (27).
We presented a cost-effective and sound method for estimating the prevalence of obesity in small geographic areas. The method is based on publicly available data and can be used in the absence of local surveillance data; it can be used to inform interventions for children with high BMI. The prevalence estimates generated by our model served to build maps of baseline estimation of obesity prevalence in Georgia; used with appropriate caution, the model can help build baseline estimations for other states or diseases without publicly available small-area estimates. To the best of our knowledge, we are the first to generate obesity estimates for children younger than 10 years and the first to validate the accuracy of small-area estimates of childhood obesity prevalence with extern-PREVENTING CHRONIC DISEASE

Appendix. Supplementary Figure and Tables
Supplementary Figure. Prevalence-estimate ranges across census tracts for the 50 counties in Georgia with the largest ranges. The vertical line in the middle of each box indicates the median; the left and right borders of the box represent the 25th and 75th percentiles, respectively; the right and left whiskers mark the minimum and maximum, respectively; the dots represent outliers as determined by 1.5 times the difference between the 25th and 75th percentiles.