Estimating Undercoverage Bias of Internet Users

Introduction In the last decade, response rates to the Behavioral Risk Factor Surveillance System (BRFSS) surveys have been declining. Attention has turned to the possibility of using web surveys to complement or replace BRFSS, but web surveys can introduce coverage bias as a result of excluding noninternet users. The objective of this study was to describe undercoverage bias of internet use. Methods We used data from 402,578 respondents who completed BRFSS questions in 2017 on internet use, self-reported health, current smoking, and binge drinking. We examined undercoverage bias of internet use by partitioning it into a product of 2 components: proportion of noninternet use and difference in the prevalences of interest (self-reported health, current smoking, and binge drinking) between internet users and noninternet users. Results The overall weighted proportion of noninternet use was 15.0%; the proportion increased with an increase in age and a decrease in education and, by race/ethnicity, was lowest among non-Hispanic white respondents. The overall relative bias was −19.2% for self-reported health, −4.0% for current cigarette smoking, and 8.4% for binge drinking. For all 3 variables of interest, we found large biases and relative biases in some demographic subgroups. Conclusion Undercoverage bias of internet use existed in the 3 studied variables. Both proportion of noninternet users and difference in prevalences of studied variables between internet users and noninternet users contributed to the bias to different degrees. These findings have implications for helping health-related behavioral risk factor surveys transition to more cost-effective survey modes than telephone only.


Introduction
Each year, the Behavioral Risk Factor Surveillance System (BRFSS), the world's largest continuously run telephone-based health survey, collects data on participants' health, use of preventive services, health care access, and health-related behavioral risk factors, such as tobacco and alcohol use, sedentary lifestyle/regular physical activity, and other behaviors in 50 states, the District of Columbia, and participating US territories (1). The target population of BRFSS is noninstitutionalized US residents aged 18 years or older. As found in many other surveys, BRFSS response rates (2) have been decreasing, and the costs of maintaining an adequate response rate have been increasing correspondingly.
Web surveys save field operational costs, which is a major advantage over traditional telephone or face-to-face interviews. Their common disadvantage, because they are likely voluntary samples, is that they underrepresent some subgroups in a target population, which presents a challenge for estimating characteristics of the target population. Because internet use has become more prevalent in recent years (3,4), the application of web surveys to BRFSS could be an alternative to the current mode of data collection. For example, one could use telephone surveys to select a probability sample and send a link to potential respondents to participate in a web survey or conduct a telephone survey (as BRFSS has been doing) and switch to a web survey when a potential respondent discontinues the telephone survey. Not everyone uses the internet, however, and little is known about the degree of bias caused by noninternet use in surveillance of general health and health-related behavioral risk factors. The 2017 BRFSS included a question about internet use. In this study, we estimated undercoverage bias by using the information on internet use collected from an aggregated state-representative sample.

Methods
Data used in this study are from the 2017 BRFSS and were aggregated from all 50 states and the District of Columbia. A total of 450,016 eligible people completed the telephone interviews. We used binary answers (yes or no) to the question "Have you used the internet in the past 30 days?" to differentiate between internet users and noninternet users. We studied 3 outcome variables in BRFSS: self-reported fair or poor health, current cigarette smoking, and binge drinking of alcoholic beverages. Survey participants' self-rated health was dichotomized into fair or poor and good/very good/excellent. Current smoking was defined as participants who had smoked at least 100 cigarettes during their lifetime and were still smoking at the time the survey was conducted. Binge drinking was defined as participants who had an episode of consuming at least 5 drinks (for men) or 4 drinks (women) on 1 occasion during the previous 30 days. In this study, we used 402,578 observations that had complete information on the 4 variables: internet use, self-reported fair or poor health, current cigarette smoking, and binge drinking.
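The variable definitions above can be sketched in code. This is only an illustration under stated assumptions: the function names, the 1-to-5 health coding, the `max_drinks` input, and the record keys are hypothetical and are not BRFSS variable names or the authors' code.

```python
from typing import Optional


def fair_or_poor(general_health: Optional[int]) -> Optional[bool]:
    """Dichotomize self-rated health.

    Assumed (hypothetical) coding: 1=excellent, 2=very good, 3=good,
    4=fair, 5=poor. Any other value is treated as missing.
    """
    if general_health in (1, 2, 3):
        return False
    if general_health in (4, 5):
        return True
    return None


def current_smoker(smoked_100: Optional[bool],
                   smokes_now: Optional[bool]) -> Optional[bool]:
    """At least 100 lifetime cigarettes and still smoking at interview."""
    if smoked_100 is None:
        return None
    if not smoked_100:
        return False
    return smokes_now  # stays None when current status is missing


def binge_drinker(max_drinks: Optional[int], sex: str) -> Optional[bool]:
    """At least 5 drinks (men) or 4 drinks (women) on 1 occasion in 30 days.

    `max_drinks` (largest number of drinks on one occasion) is a
    simplifying, hypothetical input.
    """
    if max_drinks is None:
        return None
    threshold = 5 if sex == "male" else 4
    return max_drinks >= threshold


def is_complete(record: dict) -> bool:
    """Keep only records with all 4 analysis variables present."""
    keys = ("internet", "fair_poor", "smoker", "binge")
    return all(record.get(key) is not None for key in keys)
```

Applying `is_complete` as a filter corresponds to the complete-case restriction that reduced the 450,016 interviews to 402,578 observations.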
To assess the coverage bias, we assumed that all internet users in the past 30 days would participate in a web survey of similar content and that the response rate would be 100%. Furthermore, let $P$, $P_I$, and $P_{NI}$ be the prevalence of an outcome variable among adults overall, among internet users only, and among noninternet users only, respectively; let $N$ and $N_{NI}$ be the total number of adults and the number of noninternet users, respectively. We used

$$B = P_I - P = \frac{N_{NI}}{N}\,(P_I - P_{NI})$$

to assess the bias of undercoverage ($B$) when only internet users were included (ie, the difference between prevalences for internet users and for both internet users and noninternet users). In addition, we used relative bias,

$$RB = \frac{B}{P},$$

to measure the effect of web survey on undercoverage. The estimates of bias ($\hat{B}$) and relative bias ($\widehat{RB}$) were based on the corresponding sample formulas,

$$\hat{B} = \frac{n_{NI}}{n}\,(p_I - p_{NI}) \quad \text{and} \quad \widehat{RB} = \frac{\hat{B}}{p},$$

where lowercase letters represent the statistics that estimate the corresponding parameters using the BRFSS sample and the corresponding sample sizes. All estimates of prevalences, proportions, biases, and relative biases were weighted. Variances of these statistics were estimated by using Taylor linearization methods while taking complex survey design into account.
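A minimal sketch of the weighted point estimates described above, assuming the bias is the weighted share of noninternet users multiplied by the difference in prevalence between internet users and noninternet users, and the relative bias is that bias divided by the overall prevalence. The function name, the record layout `(weight, uses_internet, has_outcome)`, and the `BiasEstimate` container are hypothetical, and the sketch does not attempt the Taylor linearization variance estimation.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class BiasEstimate:
    share_noninternet: float    # weighted n_NI / n
    p_internet_users: float     # weighted prevalence among internet users
    p_noninternet_users: float  # weighted prevalence among noninternet users
    p_overall: float            # weighted prevalence among all respondents
    bias: float                 # share_noninternet * (p_I - p_NI)
    relative_bias: float        # bias / p_overall


def undercoverage_bias(
    records: Iterable[Tuple[float, bool, bool]]
) -> BiasEstimate:
    """Estimate undercoverage bias from (weight, uses_internet, has_outcome)."""
    w_i = w_ni = y_i = y_ni = 0.0
    for weight, uses_internet, has_outcome in records:
        if uses_internet:
            w_i += weight
            if has_outcome:
                y_i += weight
        else:
            w_ni += weight
            if has_outcome:
                y_ni += weight
    if w_i == 0.0 or w_ni == 0.0:
        raise ValueError("need both internet users and noninternet users")

    total = w_i + w_ni
    share_ni = w_ni / total
    p_i = y_i / w_i
    p_ni = y_ni / w_ni
    p = (y_i + y_ni) / total
    if p == 0.0:
        raise ValueError("relative bias is undefined when prevalence is 0")

    bias = share_ni * (p_i - p_ni)
    return BiasEstimate(share_ni, p_i, p_ni, p, bias, bias / p)
```

With toy data such as `[(10.0, True, True), (70.0, True, False), (10.0, False, True), (10.0, False, False)]`, the returned bias also equals the internet-user prevalence minus the overall prevalence, which is the identity behind the two-component partition.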

Results
Among 402,578 respondents in the 2017 BRFSS, 15.0% reported not using the internet in the 30 days before the interview. The proportion of noninternet users increased with an increase in age (Table 1): it was lowest (3.9%) in the group aged 18 to 34 and highest (49.6%) in the group aged 75 or older. We found the highest proportion of the noninternet users among respondents with the lowest level of education (<high school diploma, 45.3%) and the lowest level of household income (<100% federal poverty level, 29.7%) and, by race/ethnicity, among Hispanic (23.7%) and non-Hispanic black (22.0%) respondents.
Overall, the difference in prevalence between internet users and noninternet users was −23.4% for self-reported fair or poor health (Table 2), −4.3% for current cigarette smoking (Table 3), and 9.6% for binge drinking (Table 4). For self-reported fair or poor health, the largest absolute differences in the prevalence between internet users and noninternet users were among those aged 45 to 54 (−25.3%), aged 55 to 64 (−27.1%), and Hispanic (−25.9%), and the prevalences were lower among internet users than noninternet users in all demographic subgroups. For current cigarette smoking, the pattern of the differences of prevalence by age groups was similar to that for self-reported health, but we also observed larger absolute differences among men (−7.8%), non-Hispanic black respondents (−7.4%), and non-Hispanic other respondents (−9.3%). Among those who had less than a high school education, the prevalence of current cigarette smoking was higher among internet users than noninternet users (30.5% vs 22.3%). For binge drinking, we found large differences between internet users and noninternet users among respondents aged 18 to 34 (12.6%) and 35 to 44 years (6.5%), non-Hispanic white respondents (11.4%), Hispanic respondents (10.7%), and those with at least some college (>10%).
The overall bias and relative bias were −3.5% and −19.2% for self-reported fair or poor health (Table 2), −0.7% and −4.0% for current smoking (Table 3), and 1.4% and 8.4% for binge drinking (Table 4). The bias varied by demographic characteristics. For self-reported fair or poor health, the largest bias and relative bias occurred among the groups aged 55 or older, non-Hispanic black and Hispanic respondents, those with a low level of education, and those with a low household income. For current smoking, the bias and relative bias had a similar pattern to that for self-reported fair or poor health. For binge drinking, the largest relative bias was found among respondents aged 75 or older, women, Hispanic respondents, those with less than a high school education, and those with low household income.
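As a rough consistency check, the overall biases can be approximated from the rounded figures reported above by multiplying the proportion of noninternet users (15.0%) by each overall difference in prevalence. The arithmetic below uses only those published, rounded values, so small discrepancies from the reported biases are to be expected and presumably reflect rounding of the inputs.

```latex
\begin{align*}
\text{Fair or poor health:} \quad & 0.150 \times (-23.4\%) = -3.51\% \approx -3.5\% \\
\text{Current smoking:} \quad & 0.150 \times (-4.3\%) = -0.645\% \quad (\text{reported: } -0.7\%) \\
\text{Binge drinking:} \quad & 0.150 \times 9.6\% = 1.44\% \approx 1.4\%
\end{align*}
```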

Discussion
We found that undercoverage bias of internet use existed in the 3 studied variables. Both the proportion of noninternet use and the differences in prevalences of studied variables between internet users and noninternet users contributed to the bias to different degrees.
Our study showed that the proportion of noninternet use increased as age group increased and education level decreased and was lowest, by racial/ethnic group, among non-Hispanic white respondents, which is consistent with a previous report (5). Undercoverage in a web survey would not be a problem if internet users did not differ systematically from noninternet users. We found differences, however, between internet users and noninternet users. Internet users, therefore, could not be considered a random sample from the target population, meaning that valid conclusions could not be drawn from the web survey.
We found 3 types of significant biases. First, both the proportion of noninternet users and the absolute difference in prevalence between internet users and noninternet users were large in some groups, for example, among the group aged 55 to 64 and the group with less than a high school diploma for self-reported fair or poor health and current cigarette smoking. Second, the proportion of noninternet users was large, but the difference in prevalence between internet users and noninternet users was small in some groups, for example, among adults aged 75 years or older for current smoking and binge drinking. Third, the proportion of noninternet users was small, but the difference in prevalence between internet users and noninternet users was large in some groups, for example, among adults aged 18 to 34 years for binge drinking. The findings among these subgroups suggest that we need to examine both the proportion of noninternet users and the difference between internet users and noninternet users at the same time when we evaluate undercoverage bias, rather than focus on the proportion of noninternet users only (3).
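The 3 patterns above can be illustrated by computing the bias as the product of its 2 components. In this sketch the first 2 scenarios use hypothetical values chosen only to show the pattern; the third uses the figures reported in the Results for adults aged 18 to 34 and binge drinking (3.9% noninternet users, 12.6 percentage-point difference). The function name `bias_points` is an assumption for illustration.

```python
def bias_points(noninternet_share: float, difference_points: float) -> float:
    """Bias in percentage points: share of noninternet users x (p_I - p_NI)."""
    if not 0.0 <= noninternet_share <= 1.0:
        raise ValueError("noninternet_share must be a proportion in [0, 1]")
    return noninternet_share * difference_points


# (share of noninternet users, difference in prevalence in percentage points)
scenarios = {
    "large share, large difference (hypothetical)": (0.30, -25.0),
    "large share, small difference (hypothetical)": (0.50, -2.0),
    "small share, large difference (ages 18-34, binge drinking)": (0.039, 12.6),
}

results = {
    label: bias_points(share, diff)
    for label, (share, diff) in scenarios.items()
}

for label, value in results.items():
    print(f"{label}: {value:+.2f} percentage points")
```

Neither component alone predicts the size of the bias: a subgroup with few noninternet users can still carry a nontrivial bias when the 2 groups differ sharply, and a subgroup with many noninternet users can carry little bias when they differ only slightly.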
Although bias indicates the difference between prevalence among the internet users only and overall prevalence, the relative biases, after rescaling the biases with respect to overall prevalence, reflect the strength and direction of the biases. Self-reported health is a key variable in health surveys such as BRFSS and the National Health Interview Survey. If only internet users were included in a probability sample, the prevalence of self-reported fair or poor health would be underestimated with a large negative relative bias, possibly because self-reported fair or poor health and not using the internet are associated with lower socioeconomic status. The relative bias for self-reported fair or poor health was much larger in magnitude than for current cigarette smoking and binge drinking. For current cigarette smoking, another important indicator and a modifiable behavioral risk factor, the relative biases were similar to those for self-reported health among older participants but indicated overestimation rather than underestimation among those who had not completed high school. Binge drinking is another modifiable behavioral risk factor, but unlike for current smoking, the relative biases for binge drinking indicated overestimation for all sociodemographic subgroups. Because these findings could have a major effect on planning web health surveys, survey designers, samplers, and interviewers need to put more effort into including those subgroups.
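The relative bias can also be read as a scaling factor. Rearranging the definition of relative bias used in this study (bias divided by overall prevalence), and keeping the study's assumption that every internet user responds, shows how an internet-only estimate relates to the overall prevalence. The numeric factors plug in the overall relative biases reported in the Results; they are illustrative arithmetic, not additional findings.

```latex
\begin{align*}
RB = \frac{P_I - P}{P} \quad &\Longrightarrow \quad P_I = (1 + RB)\,P \\[4pt]
\text{Fair or poor health:} \quad & P_I \approx (1 - 0.192)\,P = 0.808\,P \\
\text{Current smoking:} \quad & P_I \approx (1 - 0.040)\,P = 0.960\,P \\
\text{Binge drinking:} \quad & P_I \approx (1 + 0.084)\,P = 1.084\,P
\end{align*}
```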
Findings from this study could be useful for some types of surveys on health-related risk factors. For example, one could select a probability sample using a telephone survey mode and then provide a link to a web survey in a text message. Self-reported health and/or current smoking are often used as key variables in health-related risk factor surveys. The differences in prevalence between internet users and noninternet users in some demographic groups indicate that the internet users studied are not a random sample of the general population; therefore, study designers should pay particular attention when conducting calibrations in weighting. Our findings could help web survey designers recruit larger sample sizes in certain demographic groups. Knowing that self-reported fair or poor health is underestimated and binge drinking is overestimated could also help analysts interpret web survey results. Another example would be a mixed mode survey. A participating state can conduct a much shorter version of BRFSS to increase the response rate and switch to a web survey to ask additional questions on smoking and drinking. In this scenario, our findings could be useful for bias corrections.
Our study has some limitations. First, it was based on a sample survey, BRFSS, instead of population-level data, such as a complete population registry. Second, BRFSS is a telephone survey; raking and other adjustments for a relatively large nonresponse rate added uncertainty (6). Third, we assumed that all adults who had used the internet during the previous month would respond to a web survey. This is generally not true in practice. Fourth, BRFSS nonrespondents likely also included internet users. Some people who decline to participate in a survey by telephone might agree to participate in a web-based format if they were reached and motivated (7). The inclusion of BRFSS nonrespondents in a cumulative sample of internet users might change our estimates in a general situation. Our study, however, examined one part of undercoverage bias.

PREVENTING CHRONIC DISEASE
The term "web survey" has so many meanings that it conveys little information about how a survey is conducted (8). Our study was a preliminary study of undercoverage biases of internet use that used data from a large population-based survey to gauge self-reported health, current cigarette smoking, and binge drinking, 3 important measures in public health. We detected different levels of undercoverage biases for overall prevalence and subgroup prevalences. Both proportions of noninternet users and differences in prevalence between internet users and noninternet users play a role in assessing undercoverage biases. Our research can affect the design of web surveys intended to complement or replace the existing BRFSS. Because the applicability of a new survey mode for BRFSS remains an open question, we need studies to collect new empirical evidence. Areas of future research could include the assessment of web survey response rates; the testing of mixed mode, mail/web, telephone/web, and other approaches to data collection and analysis; and the evaluation of the validity and reliability of future BRFSS questionnaire items with an external data source. As web surveys are increasingly used, the findings of our study and future research have implications for helping health and health-related behavioral risk factor surveys transition to more cost-effective survey modes.