Models for Count Data With an Application to Healthy Days Measures: Are You Driving in Screws With a Hammer?

Introduction Count data are often collected in chronic disease research, and sometimes these data have a skewed distribution. The number of unhealthy days reported in the Behavioral Risk Factor Surveillance System (BRFSS) is an example of such data: most respondents report zero days. Studies have either categorized the Healthy Days measure or used linear regression models. We used alternative regression models for these count data and examined the effect on statistical inference.
Methods Using responses from participants aged 35 years or older from 12 states that included a homeownership question in their 2009 BRFSS, we compared 5 multivariate regression models (logistic, linear, Poisson, negative binomial, and zero-inflated negative binomial) with respect to 1) how well the modeled data fit the observed data and 2) how model selections affect inferences.
Results Most respondents (66.8%) reported zero mentally unhealthy days. The distribution was highly skewed (variance = 58.7, mean = 3.3 d). Zero-inflated negative binomial regression provided the best-fitting model, followed by negative binomial regression. A significant independent association between homeownership and number of mentally unhealthy days was not found in the logistic, linear, or Poisson regression model but was found in the negative binomial model. The zero-inflated negative binomial model showed that homeowners were 24% more likely than nonowners to have excess zero mentally unhealthy days (adjusted odds ratio, 1.24; 95% confidence interval, 1.08–1.43), but it did not show an association between homeownership and the number of unhealthy days.
Conclusion Our comparison of regression models indicates the importance of examining data distribution and selecting models with appropriate assumptions. Otherwise, statistical inferences might be misleading.

Medscape, LLC designates this Journal-based CME activity for a maximum of 1 AMA PRA Category 1 Credit(s)™. Physicians should claim only the credit commensurate with the extent of their participation in the activity.
All other clinicians completing this activity will be issued a certificate of participation. To participate in this journal CME activity: (1) review the learning objectives and author disclosures; (2) study the education content; (3) take the post-test with a 75% minimum passing score and complete the evaluation at www.medscape.org/journal/pcd ; (4) view/print certificate.

Learning Objectives
Upon completion of this activity, participants will be able to:
• Distinguish characteristics of different tools for data analysis
• Analyze how data regarding self-reported health can be skewed in the Behavioral Risk Factor Surveillance System (BRFSS) survey
• Evaluate results of different evaluation tools on count data from the BRFSS survey

Introduction
Researchers of chronic disease often gather data that are measured on a continuum rather than as a "present-absent" or "yes-no" dichotomy. Examples include the following: episodes of a symptom; number of sick days, cigarettes smoked, or alcoholic drinks consumed; measures of health care use, such as number of doctor visits or days of hospitalization; and costs incurred (in dollars). Such measures are referred to as "count" data; that is, the observations can have only nonnegative integer values (0, 1, 2, 3, . . . ). Such data are most often gathered during a specified period of time (eg, the past month or year). For some of these measures, most study participants may have a zero count (eg, no episode of a symptom, no cigarettes smoked, no use of health care services). These data are typically not normally distributed, and the positive skew in their distribution cannot be resolved by data transformation. The Centers for Disease Control and Prevention's (CDC's) health-related quality of life (HRQOL) Healthy Days measure (1) is an example of such count data.
The Behavioral Risk Factor Surveillance System (BRFSS) questionnaire includes an HRQOL section composed of 3 questions related to respondents' healthy days. These questions ask respondents to report the number of days in the previous 30 days when 1) their physical health was not good, 2) their mental health was not good, and 3) poor physical or mental health kept them from doing their usual activities (2). Responses to the Healthy Days questions are count data because the response must be an integer. For each of the Healthy Days questions, most respondents report zero days (2), and most of the nonzero responses are concentrated in the left side of the distribution, producing a skewed distribution with large variance.
Two simple and familiar methods have often been used to analyze Healthy Days data. The first categorizes the data into 2 (eg, ≥14 vs <14 d) (3)(4)(5)(6) or more (eg, 0 d, 1-13 d, and ≥14 d) categories (7). Although categorizing these data may simplify the statistical analyses, there may be drawbacks (8)(9)(10)(11)(12), including the loss of information and power (8,10,11). Categorization does not make use of within-category information, and all participants above or below a particular cut point are treated equally even though the outcome among participants within a particular category may vary significantly: for example, 1 bad mental health day in the previous 30 days is quite different from 12 bad days, even though 1 and 12 are both in the category of less than 14 days. In addition, the selection of cut points is often arbitrary, making it difficult to compare results among studies and hampering meta-analysis. Furthermore, categorizing a continuous variable may bias results (9,12).
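The loss of within-category information described above can be illustrated with a minimal sketch. The responses below are hypothetical values chosen for illustration, not BRFSS data, and the helper name `dichotomize` is an assumption introduced here.

```python
# Hypothetical responses: mentally unhealthy days in the previous 30 days.
responses = [0, 1, 12, 14, 30]

CUT_POINT = 14  # the ">=14 d vs <14 d" cut point mentioned in the text


def dichotomize(days, cut_point=CUT_POINT):
    """Return 1 when days >= cut_point, otherwise 0."""
    return 1 if days >= cut_point else 0


categories = [dichotomize(d) for d in responses]
for days, category in zip(responses, categories):
    print(f"{days:2d} days -> category {category}")
```

After dichotomization, respondents reporting 0, 1, and 12 days are indistinguishable, which is the information loss the paragraph describes.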
The second common method of analyzing the association between various risk factors and the number of reported physically and mentally unhealthy days uses linear regression models and keeps the outcome in its original scale of 0 to 30 days (13)(14)(15). This approach often violates the assumption of normally distributed errors, which can distort true relationships and render significance tests invalid (16,17). Several regression models are appropriate for analyzing count data, including Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial regression (18); however, they have not been used widely in analyzing Healthy Days data (19).
This study used data from the 12 states that included a question on homeownership in their 2009 BRFSS to examine the independent relationship between homeownership and number of mentally unhealthy days. Studies have shown that homeownership is associated with several health outcomes (20,21), but we are not aware of any study that has examined the relationship between homeownership and HRQOL. Our objective was to determine whether using different analytic methods produced different findings. We compared 5 multivariate regression models (logistic, linear, Poisson, negative binomial, and zero-inflated negative binomial) with respect to 1) how well the modeled data fit the observed data and 2) how model selections affect inferences.

Data source
BRFSS is a state-based system of annual health surveys (22). Data are collected monthly in all 50 states, the District of Columbia, Puerto Rico, the Virgin Islands, and Guam. More than 300,000 interviews are completed each year. The survey uses a multistage design based on random-digit-dialing methods to gather a representative sample from each state's noninstitutionalized civilian resident population aged 18 years or older. The BRFSS questionnaire consists of core component questions asked in all states and optional questions (modules) asked at the discretion of the states. In 2009, a social context module including a homeownership question was asked in 12 states: Alabama, Arkansas, California, Hawaii, Illinois, Kansas, Louisiana, Nebraska, New Mexico, Oklahoma, South Carolina, and Wisconsin. Response rates for the 12 states included in this analysis had a median of 59% and ranged from 43% to 67%.
The independent variable for this study was homeownership, based on the following question in the BRFSS: "Do you own or rent your home?" The response options are own, rent, or other arrangement (such as group home or staying with friends or family without paying rent). We classified respondents who rented a home or lived by other arrangement as nonhomeowners. The outcome measure was the number of days reported by respondents to the question: "Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?" Covariates included age, sex, race/ethnicity, education, household income, marital status, household size, and employment status. The 2009 BRFSS questionnaire is available at www.cdc.gov/brfss/questionnaires/pdf-ques/2009brfss.pdf.

Data analysis
There were 68,258 adults aged 18 or older who responded to both the homeownership and mentally unhealthy days questions in the 12 states. We limited the analysis to the 60,113 people aged 35 or older, because those younger than 35 were unlikely to own a home. We excluded 550 (0.9%) people who had missing data for any of these covariates: education, marital status, household size, and employment status. People with missing data on household income (n = 6,582, 7.5%) were classified as a separate category ("unknown") and were not excluded from the analysis. The analyzed sample included 59,563 adults (22,568 men and 36,995 women).
We first examined the distribution of mentally unhealthy days, including the frequency of zero, mean, median, skew, and variance. We then examined the associations between homeownership and number of mentally unhealthy days by using 5 models:
Model 1: Logistic regression. This model has been used in previous HRQOL studies (3,5). As was done in previous studies (3)(4)(5), we dichotomized the data into 2 categories of mentally unhealthy days (≥14 d vs <14 d).
Model 2: Ordinary least-squares (OLS) linear regression. This model also has been used in previous HRQOL studies (13)(14)(15). This is not a primary model for count data because standard OLS regression makes key assumptions about the data, such as the linearity of the relationship between the predictors and the outcome variable and normality of errors (residuals) (23).
Model 3: Poisson regression. This regression model is popular and also the simplest regression model for count data. It assumes a Poisson distribution, characterized by a positive skew and a variance that equals the mean (18).
Model 4: Negative binomial regression. This model is used when count data are overdispersed (ie, when the variance exceeds the mean). Overdispersion, caused by heterogeneity or an excess number of zeros (or both), is to some degree inherent to most Poisson data (18). We tested alpha (α), the overdispersion parameter in the negative binomial model, and also used the likelihood ratio test to determine a preference between the Poisson regression and the negative binomial regression.
Model 5: Zero-inflated negative binomial regression. This model provides a way of modeling the excess number of zeros (with respect to a Poisson distribution or negative binomial distribution) in addition to allowing for count data that are skewed and overdispersed. It is a 2-component model, which combines the logistic regression model and the negative binomial model. The first component of the model, logistic regression for excess zeros, predicts the probability of having excess zero unhealthy days. The second component, negative binomial regression for the full range of counts, including random zeros, predicts the frequency of the unhealthy day count (18). We used the Vuong test, a likelihood-ratio-based test, to compare the zero-inflated negative binomial model with an ordinary negative binomial regression model (24). A significant z-test indicates that the zero-inflated model is preferred.
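The count distributions behind Models 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the fitted models from this study: it assumes the common NB2 parameterization, in which the variance is mu + alpha * mu^2, and it uses a single mean rather than covariate-dependent means. The function names and all parameter values (`pi_zero`, `mu`, `alpha`) are hypothetical.

```python
import math


def nb2_variance(mu, alpha):
    """Variance under the assumed NB2 form; alpha = 0 gives the Poisson case."""
    return mu + alpha * mu ** 2


def nb2_pmf(k, mu, alpha):
    """P(Y = k) for a negative binomial with mean mu and overdispersion alpha."""
    r = 1.0 / alpha      # shape parameter
    p = r / (r + mu)     # "success" probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def zinb_pmf(k, pi_zero, mu, alpha):
    """Zero-inflated negative binomial: a mixture of excess zeros and NB counts.

    pi_zero is the probability of an excess zero (the logistic component);
    the remaining mass follows the negative binomial, which can also
    produce zeros (the "random zeros" in the text).
    """
    count_part = (1.0 - pi_zero) * nb2_pmf(k, mu, alpha)
    return pi_zero + count_part if k == 0 else count_part


# Hypothetical parameter values, chosen only to show the mechanics.
pi_zero, mu, alpha = 0.5, 4.0, 1.5
print(f"NB2 variance:       {nb2_variance(mu, alpha):.1f}")
print(f"P(0) under NB only: {nb2_pmf(0, mu, alpha):.3f}")
print(f"P(0) under ZINB:    {zinb_pmf(0, pi_zero, mu, alpha):.3f}")
```

In this sketch the zero-inflated model assigns more probability to zero than the negative binomial alone, because zeros can arise from either component.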
For each model, we plotted the sample (observed) percentage distribution of the number of unhealthy days (from 0 to 30) against the distribution predicted by the model. If the percentage distribution predicted by a model closely matched the observed distribution in the plot, the model was considered a good fit to the data.
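The comparison of observed and predicted percentage distributions might be tabulated along the following lines. This is a simplified sketch: the sample is hypothetical, survey weights are ignored, and the prediction comes from a single-mean Poisson distribution rather than from a fitted regression model. The function names are assumptions introduced for illustration.

```python
import math
from collections import Counter


def observed_percentages(days, max_days=30):
    """Percentage of respondents reporting each count from 0 to max_days."""
    tally = Counter(days)
    n = len(days)
    return [100.0 * tally.get(k, 0) / n for k in range(max_days + 1)]


def poisson_percentages(mu, max_days=30):
    """Percentage predicted for each count by a Poisson distribution with mean mu."""
    return [100.0 * math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
            for k in range(max_days + 1)]


# Hypothetical sample of 10 respondents (not BRFSS data).
sample = [0] * 7 + [1, 2, 30]
observed = observed_percentages(sample)
predicted = poisson_percentages(sum(sample) / len(sample))

for k in (0, 1, 2, 3, 30):
    print(f"{k:2d} days: observed {observed[k]:5.1f}%  predicted {predicted[k]:5.1f}%")
```

A large gap between the two columns at zero days is the kind of misfit the plots are meant to reveal.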
In the modeling, we simultaneously adjusted for age (35-44, 45-54, 55-64, and ≥65 y), sex, race and ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and all others), education level (less than high school, high school graduate to <4 y of college, and ≥4 y of college), household income (<$25,000, $25,000 to <$50,000, ≥$50,000, and unknown), marital status (married, divorced/widowed/separated, and never married), household size (1 or 2, 3 or 4, 5 or 6, and ≥7), and employment status (employed, unemployed, homemaker, retired, and unable to work). In the univariate analyses, all of these covariates were significantly associated with homeownership and significantly associated with the number of mentally unhealthy days. We considered these covariates as confounders in the relation between homeownership and number of unhealthy days and therefore included them in our multivariate models.
We used Stata version 12 (StataCorp LP, College Station, Texas) to perform all statistical analyses and take into account the complex sampling design of the survey.

Results
Among adults aged 35 years or older, about four-fifths (79.3%) owned a home (Table 1). The mean number of mentally unhealthy days was 3.3 days and the median was 0 days, indicating a positive skew. An exact Poisson distribution having a mean of 3.3 days predicted that about 4% of the participants would have zero unhealthy days during the 30-day time frame. However, about two-thirds of individuals (66.8%) reported no mentally unhealthy days, indicating an excess of zeros. The variance was 58.7, which is much greater than the mean (3.3 d).
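The arithmetic behind these comparisons can be checked directly. The sketch below uses only the summary values reported above; the variable names are introduced for illustration, and the variance-to-mean ratio is a crude descriptive indicator rather than a formal test.

```python
import math

mean_days = 3.3           # reported mean
variance_days = 58.7      # reported variance
observed_zero_pct = 66.8  # reported percentage with zero days

# Under a Poisson distribution, P(Y = 0) = exp(-mean).
poisson_zero_pct = 100.0 * math.exp(-mean_days)

# A Poisson distribution has variance equal to its mean, so a ratio
# far above 1 points to overdispersion.
dispersion_ratio = variance_days / mean_days

print(f"Poisson-predicted zeros: {poisson_zero_pct:.1f}%")  # 3.7%
print(f"Observed zeros:          {observed_zero_pct:.1f}%")
print(f"Variance-to-mean ratio:  {dispersion_ratio:.1f}")   # 17.8
```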
The logistic regression analysis found no significant association (P = 0.22) between homeownership and having 14 or more mentally unhealthy days in the previous month (Table 2). The parameter estimate (regression coefficient) of homeownership was −0.139 (adjusted odds ratio = 0.87, 95% confidence interval [CI], 0.70-1.09).
Both linear and Poisson regression models underestimated the percentage of nonoccurrence (0 days) and overestimated the percentage in the category 1 to 9 days (Figure 1). The parameter estimates (regression coefficients) of homeownership in these 2 models were not significantly different from zero (Table 2), indicating homeownership was not significantly associated with the number of mentally unhealthy days in either model.
Negative binomial regression resulted in a better fit of the data than did either linear or Poisson regression (Figure 2). The overdispersion parameter (α) in the negative binomial model was 7.2, which is significantly greater than zero (P < .001), indicating that the data were overdispersed. The likelihood-ratio test statistic was 430,000 (P < .001), suggesting that negative binomial regression is preferred over Poisson regression. The parameter estimate of homeownership was −0.137 in the negative binomial model, a significant association indicating fewer mentally unhealthy days among homeowners than among nonhomeowners.
The zero-inflated negative binomial regression provided a better fit of the data than did negative binomial regression (Figure 2). The z value of the Vuong test was 42.5 (P < .001), confirming that the zero-inflated model fit the data better than the non-zero-inflated model. The parameter estimate of homeownership in the logistic component of the model was 0.216 (P = .003), corresponding to an adjusted odds ratio of 1.24 (95% CI, 1.08–1.43) for having excess zero mentally unhealthy days; the model did not show an association between homeownership and the number of unhealthy days.
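The parameter estimates reported in this section are on a log scale, so exponentiating them gives ratios. The following sketch assumes the usual logit link for the logistic models (giving odds ratios) and log link for the negative binomial model (giving a ratio of expected counts); the dictionary keys are descriptive labels introduced here.

```python
import math

# Homeownership coefficients as reported in the Results.
coefficients = {
    "logistic (>=14 d vs <14 d)": -0.139,
    "negative binomial (count)": -0.137,
    "zero-inflated, logistic component": 0.216,
}

# exp(beta) is an odds ratio for the logistic components and, under the
# assumed log link, a ratio of expected counts for the count model.
ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in ratios.items():
    print(f"{name}: exp(beta) = {ratio:.2f}")
```

The first and third values reproduce the adjusted odds ratios of 0.87 and 1.24 given in the text and the abstract.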

Discussion
In studying the association between homeownership and CDC's Healthy Days measure as an example, we demonstrated how different models can influence statistical inference, the process of drawing conclusions from empirical data. We did not find an independent association between homeownership and number of mentally unhealthy days by logistic, linear, or Poisson regression models. The negative binomial model showed that homeowners had a moderate but significantly lower number of unhealthy days than nonhomeowners. The zero-inflated negative binomial model indicated an association between homeownership and whether individuals reported any mentally unhealthy days but not the number of unhealthy days.
We found that a zero-inflated negative binomial model fit the observed number of mentally unhealthy days reported in BRFSS data better than any of the other models we tested. Despite its ability to model count data, Poisson regression did not fully address the problem of overdispersion. Overdispersion may result in misleading inferences about regression parameters (18). Likewise, negative binomial regression may be less able than zero-inflated negative binomial regression to address the problem of excess zeros. We did not test all possible models in this study. Other models (eg, hurdle regression, zero-inflated Poisson) can be used to model count data, and there are many methodological variations of the models we applied (18). Researchers should ensure that their analytic methods fit the data and also use statistical techniques that lead to meaningful interpretations (25). For example, a researcher may find that a zero-inflated negative binomial distribution best fits the data but that a negative binomial distribution without the zero-inflation also meets all statistical assumptions and lends itself to more practical interpretations. In such cases, we advise that researchers consider parsimony and practical interpretation of a model when choosing an analytical method.
The main purpose of this data analysis was not to establish or affirm the "true" relationships between homeownership and number of mentally unhealthy days. We applied various models to BRFSS Healthy Days data as an example to illustrate the importance of appropriate model selection. The study has several limitations. First, it was based on self-reported data from 12 states that elected to include the social context module in their 2009 BRFSS. Second, the survey was conducted through telephone interviews; people without telephones and those who used only cell phones were excluded, and these people may be less likely to be homeowners. Third, the BRFSS is a cross-sectional survey: information on the outcome measure (number of mentally unhealthy days) and characteristics (eg, homeownership) of the respondents was assessed at a single point in time. Hence, determining whether the characteristics preceded or followed the outcomes was not possible.
Any statistical inference requires some assumptions, and incorrect assumptions can invalidate statistical inference (26). Some researchers may ignore the underlying assumptions of their statistical approaches or select a simpler or familiar method as long as the results support their hypothesis. These approaches go against the primary goal of observational epidemiology, which is to assess the detail, strength, direction, shape, and pattern of the relationships between exposures and outcomes. This goal cannot be accomplished without using appropriate statistical methods.
We believe that when the assumptions of analytic techniques are carefully matched to the nature of the data distribution, the results will be more accurate and compelling. False results can mislead researchers, the public, and policy makers and are potentially detrimental to public health. The selection of data analytic techniques is not a trivial statistical matter. Using appropriate analytic procedures will maximize the accuracy and utility of the findings on factors that are of great importance in clinical, policy, and fiscal decisions.
