Task 1c: How to Identify and Recode Missing Data in NHANES Using Stata
The first task is to identify missing data and recode it. Here are the steps:
- Identify missing and unavailable values
- Recode unavailable values as missing
- Evaluate extent of missing data
Step 1: Identify missing and unavailable values
In this step, you will use the tabstat and nmissing
commands to check the number of missing values and the minimum and
maximum values of continuous variables, and the tabulate command to
look at the frequency distribution of categorical variables in your
master analytic dataset. The output from these commands provides the
number and percentage of missing values for each variable listed in
the command. Note that nmissing is a user-written command, so it must
be installed before first use.
Typically, the tabstat or summarize commands are used for continuous
variables, and tabulate is used for categorical variables. In the
following example, the tabstat and tabulate commands are provided on
the same set of variables without distinguishing continuous and
categorical variables. If you use the tabulate command on a continuous
variable with many values, the output could be extensive.
Use the tabstat and nmissing commands to determine the number of
nonmissing observations (n), the minimum values (min), the maximum
values (max), and the number of missing observations for the selected
variables for participants who were interviewed and examined in the
MEC and who were age 20 years and older.
tabstat bpq* mcq* if (ridageyr >=20 & ridageyr <.) & ridstatr==2, stat(n min max)
nmissing bpq* mcq* if (ridageyr >=20 & ridageyr <.) & ridstatr==2
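For readers who do not have nmissing installed, a similar check can be sketched with commands that ship with Stata. The example below is only an illustration: the five-row dataset is invented so that the block runs on its own, and the local macro name adults is a hypothetical convenience, not part of the tutorial. The variable names follow the ones used above.

```stata
* Toy data standing in for the NHANES analytic file (values are invented).
clear
input ridageyr ridstatr bpq020 mcq160b
45 2 1 2
67 2 2 .
19 2 1 1
30 1 . 2
52 2 . 9
end

* Same subset as in the text: MEC-examined participants aged 20 and older.
local adults "(ridageyr >= 20 & ridageyr < .) & ridstatr == 2"

* misstable is an official Stata command; the all option also lists
* variables that have no missing values.
misstable summarize bpq020 mcq160b if `adults', all

* Count missing observations for a single variable within the subset.
count if missing(bpq020) & `adults'
display "bpq020 missing among adults: " r(N)
```

On the real analytic dataset, only the clear and input lines would be dropped; the remaining commands apply unchanged.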
Use the tabulate
command to determine the frequency of each value of the variables listed for
participants who were interviewed and examined in the MEC and who were age 20
years and older. Use the missing option to display the missing values.
tabulate bpq010 if (ridageyr >=20 & ridageyr <.) & ridstatr==2, missing
Highlighted items from the tabstat, nmissing, and tabulate output:
- The row labeled "N" shows the number of observations
with data. This example has 9,376 observations for the variable
BPQ.020, labeled "Ever told had high blood pressure".
- The nmissing output shows the number of observations
without data. This example has 95 missing observations for the variable
BPQ.020.
- Each response value of a variable has a corresponding
frequency (check the codebook to determine the definition for each value).
In this example, the variable BPQ.010, labeled "Last blood pressure reading
by a doctor", has seven possible response values: "." (missing), "1", "2", "3", "4", "5", and "9".
- The column labeled "Freq" indicates the frequency
with which a particular response value occurs in the dataset. In this
example, two observations have a "." (missing) value and 6,759 observations
have a value of "1".
- The column labeled "Percent" indicates the percentage
of the total accounted for by each value of the variable. The "."
(missing) and "1" response values of BPQ.010 account for 0.02% and 71.37% of
the total, respectively.
- Note that for the variable BPQ.070, labeled "When blood
cholesterol last checked", one observation has a value of "7" and 52
observations have a value of "9". These are the frequencies of "refused" and
"don't know" responses, respectively, that were obtained for this question.
These observations will need to be recoded as missing, which is covered
in the next step.
Step 2: Recode unavailable values as missing
Two options can be used to recode the unavailable values as missing:
- assign missing values one variable at a time
using the if qualifier, or
- assign missing values by group using the
foreach loop command.
Option 1 - Assign Missing Values One Variable at a Time
Use the if qualifier to
recode "7" and "9" values of a variable as missing.
replace bpq010=. if bpq010==7 | bpq010==9
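As an alternative sketch, Stata's official mvdecode command performs the same recode as the replace command above, and can also map each code to a distinct extended missing value so that "refused" and "don't know" stay distinguishable. The six-row dataset is invented for illustration, and the variable bpq010_ext and the choice of .r and .d as codes are assumptions made here, not part of the tutorial.

```stata
* Toy data (invented): 7 = refused, 9 = don't know.
clear
input bpq010
1
2
7
9
9
.
end

* Keep a second copy so the two approaches can be compared side by side.
generate bpq010_ext = bpq010

* Same result as the replace command above: 7 and 9 become ".".
mvdecode bpq010, mv(7 9)

* Alternative: record the reason using extended missing values.
mvdecode bpq010_ext, mv(7=.r \ 9=.d)

tabulate bpq010_ext, missing
count if missing(bpq010)
```

Extended missing values such as .r and .d are excluded from estimation just like ".", and missing() returns true for all of them.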
Option 2 - Assign Missing Values by Group
Use the foreach loop command to
recode "7" and "9" values of a variable as missing.
Use this option when you want to recode
multiple variables that use the same numeric value for "refused" and "don't
know". Use the save command to create a new dataset with the recoded
values.
* Recode the refused (7) and unknown (9) codes to missing for each variable
foreach i in bpq020 bpq050a bpq100d bpq070 bpq080 mcq160b mcq160c mcq160d mcq160e mcq160f {
    replace `i' =. if `i' >=7
}
save C:\Nhanes\Data\demo_bp1, replace
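One detail worth keeping in mind when recoding by group: Stata treats missing values as larger than any number, so a condition such as x >= 7 is also true for rows that are already missing, and it would catch any valid code of 7 or above. That is harmless for the variables listed here, but a more explicit variant can be sketched with inlist(). The four-row dataset below is invented for illustration.

```stata
* Toy data (invented): 7 = refused, 9 = unknown.
clear
input bpq020 bpq080
1 2
2 7
9 1
. 9
end

* "." sorts above every number, so ">= 7" also matches the missing row.
count if bpq020 >= 7
display "rows matching bpq020 >= 7: " r(N)

* inlist() names the codes explicitly and skips rows already missing.
foreach v of varlist bpq020 bpq080 {
    quietly count if inlist(`v', 7, 9)
    display "`v': " r(N) " codes to recode"
    replace `v' = . if inlist(`v', 7, 9)
}
```

Using foreach with of varlist also makes Stata check that every listed variable exists before the loop starts.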
Step 3: Evaluate extent of missing data
In this step you will use the tabulate command to
ensure that the recoding done in the previous step was done
correctly. As a general rule, if 10% or less of your data
for a variable are missing from your analytic dataset,
it is usually acceptable to continue your analysis
without further evaluation or adjustment. However, if
more than 10% of the data for a variable are missing, you
may need to determine whether the missing values are
distributed equally across sociodemographic
characteristics, and decide whether imputation of
missing values or use of adjusted weights is necessary.
(Please see the Analytic Guidelines for more information.)
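The 10% rule of thumb can be checked directly. The sketch below computes the percentage missing for a variable, flags it against the 10% guideline, and cross-tabulates a missingness indicator by sex as one simple way to see whether missing values are spread evenly across a sociodemographic characteristic. The ten-row dataset is invented; riagendr is used on the assumption that it is the sex variable in your demographics file, and miss_bpq070, pct, and flag are hypothetical names chosen for this example. Survey weights are ignored here.

```stata
* Toy data (invented): 3 of 10 observations are missing bpq070.
clear
input riagendr bpq070
1 1
1 .
1 2
1 3
2 1
2 .
2 .
2 4
2 1
2 2
end

* Percent missing for each variable, flagged against the 10% guideline.
foreach v of varlist bpq070 {
    quietly count if missing(`v')
    local pct = 100 * r(N) / _N
    local flag = cond(`pct' > 10, "review", "ok")
    display "`v': " %5.2f `pct' " percent missing (`flag')"
}

* Missingness indicator cross-tabulated by sex, with row percentages.
generate byte miss_bpq070 = missing(bpq070)
tabulate riagendr miss_bpq070, row
```

With the real data, add the same if qualifier used elsewhere in this task so that the denominator is the analytic subset rather than the whole file.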
Check the extent of missing data
Use the
tabulate command to determine the frequency of each
value of the variables listed for participants who were
interviewed and examined in the MEC and who were age 20
years and older. Use the missing option to display
the missing values. Use the foreach loop command to
get the frequency of multiple variables.
tabulate bpq010 if (ridageyr >=20 & ridageyr <.) & ridstatr==2, missing
foreach i in bpq020 bpq070 bpq080 mcq160b mcq160c mcq160d mcq160e mcq160f {
    tabulate `i' if (ridageyr >=20 & ridageyr <.) & ridstatr==2, missing
}
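Alongside reading the tabulate output, the check can be automated with Stata's assert command, which stops with an error if the condition fails for any observation. This is a sketch on an invented four-row dataset that stands in for the recoded file.

```stata
* Toy data (invented) representing variables after recoding.
clear
input bpq020 bpq080
1 2
2 .
. 1
1 2
end

* assert halts with an error if any listed variable still holds a 7 or 9.
foreach v of varlist bpq020 bpq080 {
    assert !inlist(`v', 7, 9)
}
display "no 7 or 9 codes remain"
```

Variables that are part of a skip pattern should be left out of such a loop, since their "7" and "9" values are intentionally kept.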
Highlighted items from the tabulate output for
recoding missing values:
- In this example, the variable
BPQ.010, labeled "Last blood pressure reading by a
doctor", now has only five nonmissing response values
instead of the six observed before recoding; the value "9"
is no longer present. Also note that there
are now a total of 18 missing values (instead of two
originally).
- Review of this output indicates
that the "9" values have been successfully
recoded and are now classified as missing (.).
- Note that variable BPQ.030 ("Told had high blood
pressure - 2+ times") still
has a "9" value present, indicating "don't know"
responses. This value was not recoded because this
variable is part of a skip pattern. It is important
NOT to assign "refused" or "don't know" values as missing
for variables in a skip pattern, such as BPQ.030,
because missing values for skipped variables have an
entirely different meaning than missing values for
variables that are not part of a skip pattern. You will
review how to identify and treat skip patterns in the
next task.
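- Note that 34.71% of the
observations for variable BPQ.070, labeled "When
blood cholesterol last checked", have missing values.
This is above the 10% guideline described at the start of
this step, so further evaluation may be needed.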