Impact of Data Editing Methods on Estimates of Smoking Prevalence, Global Youth Tobacco Survey, 2007–2009

Accuracy of self-reported data may be improved by data editing, a mechanism to produce accurate information by excluding inconsistent data based on a set number of predetermined decision rules. We compared data editing methods in the Global Youth Tobacco Survey (GYTS) with other editing approaches and evaluated the effects of these on smoking prevalence estimates. We evaluated 5 approaches for handling inconsistent responses to questions regarding cigarette use: GYTS, do-nothing, gatekeeper, global, and preponderance. Compared with GYTS data edits, the do-nothing and gatekeeper approaches produced similar estimates, whereas the global approach resulted in lower estimates and the preponderance approach, higher estimates. Implications for researchers using GYTS include recognition of the survey’s data editing methods and documentation in their study methods to ensure cross-study comparability.


Objective
Accurate monitoring of cigarette smoking status among youth is important in addressing the tobacco use epidemic globally (1). However, the accuracy of self-reported health-risk behaviors in questionnaires may be compromised because of difficulties in recall, social desirability, and sensitivity of the question itself (2). Data editing is a mechanism to produce accurate information by excluding inconsistent data based on a set number of predetermined decision rules. Research suggests that editing procedures have potential effects on point estimates and cross-study comparability (3–5). This exploratory study compares the data editing method used in the Global Youth Tobacco Survey (GYTS) with other data editing approaches and evaluates the effect of these on estimates of smoking prevalence in GYTS to inform collaborators globally.

Methods
GYTS, a self-administered school-based survey, uses a 2-stage cluster sample design that is grade-based and produces representative samples of students with ages ranging from 10 to 17 years. A subset of students aged 13 to 15 years is used for comparing the data within and across World Health Organization (WHO) regions. In countries, such as small islands, where all students in the selected grades were surveyed, a census rather than a 2-stage cluster sample is conducted. The survey methods are described in detail elsewhere (6,7).
Eligible countries were selected on the basis of the following inclusion criteria: a nationally representative sample, recent completion of GYTS (2007–2009), large sample size (≥3,000 participants), and GYTS data publicly released. Of 35 eligible countries that met the inclusion criteria, 1 country from each WHO region was randomly selected for this study. Data analysis was performed on a subset of participants aged 13 to 15 years (n) among all ages in the grades selected for the survey (N). The selected countries and the year GYTS was conducted (values for n and N) are as follows: Ghana, 2009 (n/N = 4,171/8,295); Guatemala, 2008 (n/N = 3,838/5,565); Saudi Arabia, 2007 (n/N = 2,574/3,829); the Philippines, 2007 (n/N = 3,278/5,919); Slovakia, 2007 (n/N = 4,176/4,696); and Thailand, 2009 (n/N = 7,649/9,963). Some questions from the GYTS presented the opportunity for participants to contradict themselves when responding (Table 1). Self-reported cigarette smoking on 1 or more of the past 30 days was used to determine cigarette smoking status. For this series of questions, 5 approaches were taken for handling inconsistent responses to questions regarding cigarette use: GYTS, do-nothing, gatekeeper, global, and preponderance (Table 1).
We used Stata 11 software (StataCorp LP, College Station, Texas) to account for complex survey design and to calculate weighted point estimates and standard error (SE) of the estimates. Estimates with a relative SE (ratio of the SE of the estimate to the estimate, multiplied by 100) greater than 30% were considered statistically unreliable. Adjusted Wald tests were used to evaluate for statistical differences between point estimates derived from the GYTS approach and the 4 other data editing approaches. Significance was set at P < .05.
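The reliability criterion above can be illustrated with a short sketch. The analysis itself was run in Stata; the Python below is only an illustration, and the function names and the example numbers are hypothetical rather than taken from the study.

```python
def relative_se(estimate: float, se: float) -> float:
    """Relative standard error: SE divided by the estimate, times 100."""
    if estimate == 0:
        raise ValueError("relative SE is undefined for an estimate of 0")
    return 100.0 * se / estimate


def is_unreliable(estimate: float, se: float, threshold: float = 30.0) -> bool:
    """Flag an estimate whose relative SE is greater than the threshold (30%)."""
    return relative_se(estimate, se) > threshold


# Hypothetical values: a prevalence of 2.0% with an SE of 1.0 percentage point
# has a relative SE of 50%, so it would be flagged.
print(is_unreliable(2.0, 1.0))
```

Because the criterion is "greater than 30%", an estimate whose relative SE is exactly 30% is not flagged in this sketch.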

Results
Overall response rates of students interviewed (calculated as the school response rate multiplied by the class and student response rates) for all 6 countries were the following: 84.0% (Ghana), 79.6% (Guatemala), 82.1% (Saudi Arabia), 80.9% (Philippines), 86.1% (Slovakia), and 93.1% (Thailand). Data edit approaches resulted in variation of prevalence estimates of cigarette use; estimates ranged from 2.3% to 5.1% in Ghana, 8.9% to 12.4% in Guatemala, 4.9% to 6.5% in Saudi Arabia, 12.3% to 17.0% in the Philippines, 21.6% to 25.0% in Slovakia, and 9.6% to 11.9% in Thailand (Table 2). The global approach resulted in lower estimates and the preponderance approach, in general, higher estimates. The do-nothing and gatekeeper approaches produced estimates similar to those of the GYTS approach. The range and magnitude of differences in estimates derived from the global and preponderance approaches compared with those of the GYTS approach were greater among girls than boys. All comparisons of GYTS estimates were significantly different (P < .05) from estimates derived with the 4 other approaches, with several exceptions (Table 2). Consistent with the overall estimates, the global approach resulted in lower estimates, the preponderance approach higher estimates, and the do-nothing and gatekeeper approaches similar estimates, by sex across all selected countries.
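As an illustration of how the overall response rate is formed from its 3 components, the following minimal sketch multiplies the school, class, and student response rates. The function name and the component values in the example are hypothetical; the component rates for the 6 countries are not reported here.

```python
def overall_response_rate(school_pct: float, class_pct: float, student_pct: float) -> float:
    """Overall response rate (%) as the product of the 3 component rates (%)."""
    for pct in (school_pct, class_pct, student_pct):
        if not 0.0 <= pct <= 100.0:
            raise ValueError("response rates must be between 0 and 100")
    return (school_pct / 100.0) * (class_pct / 100.0) * (student_pct / 100.0) * 100.0


# Hypothetical components: all schools respond, 90% of classes, 95% of students.
print(round(overall_response_rate(100.0, 90.0, 95.0), 1))
```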

Discussion
We demonstrated the effect of decision rules for handling data inconsistencies in GYTS data to assist collaborators globally. Smoking prevalence estimates generated from surveys can vary with the data editing approach used. Compared with the GYTS data edits, the global approach resulted in lower estimates and the preponderance approach, higher estimates. It is noteworthy that the do-nothing and gatekeeper approaches produced estimates similar to those of the GYTS data editing method. In comparison to the GYTS approach (7 logic checks), data editing methods in the National Youth Tobacco Survey and Youth Risk Behavior Survey are more extensive (more than 30 logic checks for each), suggesting a need to provide a more comprehensive list of logic checks to account for all possible combinations of inconsistencies in GYTS data (8,9).
This study shows how different ways of removing inconsistent data influence estimates of cigarette smoking prevalence. Clearly described methods for handling inconsistent data are necessary for reproducibility and comparability of GYTS results. Multiple researchers across WHO regions use and publish GYTS data, and accurate comparisons between 2 studies can be made only if the same approach to handling inconsistent data is used. Resolving issues with data inconsistency may include piloting surveys before implementation and incorporating built-in skip patterns if electronic versions of the survey are explored in the future. A limitation of this study is that the list of sampled countries is not representative of, and therefore not generalizable to, all countries conducting GYTS.
Data cleaning and management, as essential aspects of quality assurance and determinants of study validity, require transparency and proper documentation of all procedures (10). Implications for researchers using GYTS include recognition of its data editing approach and documentation in their study methods to ensure cross-study comparability.

Survey Question / Response Options
a) I have never smoked cigarettes; b) 7 years old or younger; c) 8 or 9 years old; d) 10 or 11 years old; e) 12 or 13 years old; f) 14 or 15 years old; g) 16 years old or older
3. During the past 30 days, how many days did you smoke cigarettes? a) 0 days; b) 1 or 2 days; c) 3 to 5 days; d) 6 to 9 days; e) 10 to 19 days; f) 20 to 29 days; g) All 30 days
4. During the past 30 days, on the day(s) you smoked, how many cigarettes did you usually smoke? a) I did not smoke cigarettes during the past 30 days (1 month); b) Less than 1 cigarette per day; c) 1 cigarette per day; d) 2 to 5 cigarettes per day; e) 6 to 10 cigarettes per day; f) 11 to 20 cigarettes per day; g) More than 20 cigarettes per day
5. During the past 30 days, how did you usually get your own cigarettes? a) I did not smoke cigarettes during the past 30 days (1 month); b) I bought them in a store, shop, or from a street vendor; c) I bought them from a vending machine; d) I gave someone else money to buy them for me; e) I borrowed them from someone else; f) I stole them; g) An older person gave them to me; h) I got them some other way
6. During the past 30 days, did anyone refuse to sell you cigarettes because of your age? a) I did not try to buy cigarettes during the past 30 days (one month); b) Yes, someone refused to sell me cigarettes because of my age; c) No, my age did not keep me from buying cigarettes

GYTS
Logic checks for age in question 2 and logic checks for smoking status between questions 1 and 2, 1 and 3, 3 and 4. Inconsistent responses were considered missing.

Do-nothing
The response to each question was taken as the truth for that question, and inconsistencies among responses were ignored.

Gatekeeper
The response to the first question was taken as the truth, and all subsequent inconsistent responses were considered missing. If the response to question 1 (ever smoker) was no, regardless of the responses to subsequent questions, the current cigarette smoking status was assigned as noncurrent smoker. If the response to question 1 was yes, then current cigarette use status was defined by the response to question 3.

Global
Responses to all 6 questions were required to be consistent, and any inconsistent responses were considered missing.

Preponderance
Current cigarette smoking status, as defined by the answer to question 3, was assigned based on "preponderance of evidence" as determined by evaluation of responses. Responses to question 3 required consistency with responses on questions 4 through 6 regarding the past 30 days; otherwise, current cigarette use status was considered missing. Conversely, inconsistent or missing responses on current cigarette use status from question 3 could be reassigned if responses from questions 4 through 6 regarding the past 30 days were consistent.
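
The decision rules above can be made concrete with a minimal sketch of 3 of the approaches. This is an illustration under stated assumptions, not the survey's production editing code: question 1 is assumed to be coded as True (ever smoker) or False; questions 2 through 5 are assumed to be coded by response letter, with "a" being the never-smoked or did-not-smoke option; None stands for a missing response; missing inputs are assumed to yield a missing status; and the "global" function is a simplification that leaves out question 6, because the source does not spell out how its consistency was judged. All function names are hypothetical.

```python
from typing import Optional


def do_nothing(q3: Optional[str]) -> Optional[bool]:
    """Question 3 is taken at face value; other answers are not checked."""
    if q3 is None:
        return None
    return q3 != "a"  # "a" = 0 days in the past 30 days


def gatekeeper(q1_ever: Optional[bool], q3: Optional[str]) -> Optional[bool]:
    """Question 1 decides first; question 3 is used only for ever smokers."""
    if q1_ever is None:
        return None  # assumption: a missing gatekeeper answer gives a missing status
    if not q1_ever:
        return False  # never smoker is a noncurrent smoker, whatever follows
    if q3 is None:
        return None
    return q3 != "a"


def global_simplified(q1_ever: Optional[bool], q2: Optional[str], q3: Optional[str],
                      q4: Optional[str], q5: Optional[str]) -> Optional[bool]:
    """Simplified global rule: any disagreement makes the status missing."""
    if any(answer is None for answer in (q1_ever, q2, q3, q4, q5)):
        return None  # assumption: incomplete records are treated as missing
    ever_flags = {q1_ever, q2 != "a"}                   # ever-smoking indicators
    current_flags = {q3 != "a", q4 != "a", q5 != "a"}   # past-30-day indicators
    if len(ever_flags) > 1 or len(current_flags) > 1:
        return None  # indicators of the same behavior disagree
    current = q3 != "a"
    if current and not q1_ever:
        return None  # current smoker who reports never having smoked
    return current


# A respondent who reports never smoking but also smoking on 1 or 2 days:
print(do_nothing("b"), gatekeeper(False, "b"), global_simplified(False, "a", "b", "c", "b"))
```

For this contradictory respondent the 3 rules return a current smoker, a noncurrent smoker, and a missing status, respectively, which shows how the choice of editing approach can shift both the numerator and the denominator of a prevalence estimate.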