## Task 1: How to Identify and Recode Missing Data in NHANES II Data

The first task is to identify missing data and recode it. Here are the steps:

### Step 1 Identify missing and unavailable values

In this step, you will use the proc means procedure to check for missing, minimum and maximum values of continuous variables, and the proc freq procedure to look at the frequency distribution of categorical variables in your master analytic dataset. The output from these procedures provides the number and frequency of missing values for each variable listed in the procedure statement.

WARNING

Typically, proc means is used for continuous variables, and proc freq is used for categorical variables. In the following example, we provide proc means and proc freq procedures on the same set of variables without distinguishing continuous and categorical variables. If you perform a proc freq on a continuous variable with many values, the output could be extensive.

proc means

| Statements | Explanation |
| --- | --- |
| `proc means data=demo1_nh2 N Nmiss min max;` | Use the proc means procedure to determine the number of missing observations (Nmiss), minimum values (min), and maximum values (max) for the selected variables. |
| `where n2ah0047 >= 20;` | Use the where statement to select the participants who were age 20 years and older. |
| `var n2pe0411 n2pe0771 n2pe0414 n2pe0774 n2bm0412 n2bm0418 n2lb0421;`<br>`run;` | Use the var statement to indicate the variables of interest. |

proc freq

| Statements | Explanation |
| --- | --- |
| `proc freq data=demo1_nh2;` | Use the proc freq procedure to determine the frequency of each value of the variables listed. |
| `where n2ah0047 >= 20;` | Use the where statement to select the participants who were age 20 years and older. |
| `table n2ah0495 n2ah0491 n2ah1089 n2ah0625 n2ah0626 n2ah0062 n2ah0064 n2sh0785 n2ah0055 n2ah0056 n2ah0260 n2ah1059 n2ah1060 n2ah1069 / list missing;`<br>`run;` | Use the table statement to indicate the variables of interest. Use the list missing option to display the missing values. |

Highlighted items from the proc means and proc freq output:

• The column labeled "N" shows the number of observations with data. In this example, 11,864 observations have data for the variable n2lb0421, labeled "Serum cholesterol (mg/dL)".
• The column labeled "N Miss" shows the number of observations without data. In this example, 3,500 observations are missing a value for n2lb0421.
• Each response value of a variable has a corresponding frequency (check the codebook for the definition of each value). In this example, the variable n2sh0785, labeled "Are you pregnant now?", has five response values: ".", "Yes", "No", "Blank, but applicable", and "Don't know".
• The column labeled "Frequency" shows how often a particular response value occurs in the dataset. In this example, 10,320 observations have the "." value, 99 have "Yes", 4,896 have "No", 27 have "Blank, but applicable", and 22 have "Don't know".
• The column labeled "Percent" shows the percentage of the total that each value of the variable accounts for. The "Don't know" and "." response values of n2sh0785 account for 0.14% and 67.17% of the total, respectively.
• For n2sh0785, the 10,320 "." values are already stored as missing, but the 27 "Blank, but applicable" and 22 "Don't know" responses are still stored as ordinary values. These observations need to be recoded as missing, which is covered in the next step.
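
As the warning above notes, running proc freq on a continuous variable can produce very long output. One common way around this is to apply a format that collapses every value into "Missing" or "Not missing" before tabulating. The following is a minimal sketch, not part of the original example: the format name `missfmt` is hypothetical, and the three variables are simply taken from the proc means list above.

```sas
/* Hypothetical format: groups numeric values into two categories.   */
/* Special missing values (.A-.Z, ._) would fall under 'Not missing'. */
proc format;
  value missfmt
    .     = 'Missing'
    other = 'Not missing';
run;

proc freq data=demo1_nh2;
  where n2ah0047 >= 20;
  /* proc freq tabulates formatted values, so each variable */
  /* yields at most two rows instead of one row per value.  */
  format n2pe0411 n2pe0771 n2lb0421 missfmt.;
  tables n2pe0411 n2pe0771 n2lb0421 / missing;
run;
```

Codes such as 888 or 999 are counted as "Not missing" in this output until they are recoded in Step 2.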

### Step 2 Recode unavailable values as missing

Two options can be used to recode the missing data:

• assign missing values one variable at a time using an if…then statement, or
• assign missing values by group using an array statement.

Option 1 - Assign Missing Values One Variable at a Time

| Statements | Explanation |
| --- | --- |
| `data temp2_nh2;`<br>`set demo1_nh2;` | Use the data statement to create a new dataset from your existing dataset; the name of the existing dataset is listed after the set statement. |
| `if n2ah0062 = 88 then n2ah0062 = .;`<br>`if n2sh0785 in (8, 9) then n2sh0785 = .;` | Use the if…then statement to recode the unavailable-value codes of a variable as missing: 88 for n2ah0062, and 8 and 9 for n2sh0785. |
| `run;` | End the data step. |

Option 2 - Assign Missing Values by Group Using an Array

| Statements | Explanation |
| --- | --- |
| `data demo2_nh2;`<br>`set demo1_nh2;` | Use the data statement to create a new dataset from your existing dataset; the name of the existing dataset is listed after the set statement. |
| `array _rdmiss n2ah0064 n2sh0785 n2ah1060 n2ah1089;`<br>`do over _rdmiss;`<br>`if _rdmiss in (8, 9) then _rdmiss = .;`<br>`end;` | Use the array statement to recode "8" and "9" values of several variables as missing. In this example, _rdmiss designates the name of the array. |
| `array _rgmiss n2pe0411 n2pe0771 n2pe0414 n2pe0774 n2bm0418 n2lb0421;`<br>`do over _rgmiss;`<br>`if _rgmiss in (888, 999) then _rgmiss = .;`<br>`end;` | Use a second array, _rgmiss, to recode "888" and "999" values as missing for the variables that use those codes. |
| `run;` | End the data step. |

Use this option when you want to recode multiple variables that use the same numeric values for "refused" and "Don't know". Assign missing values to the remaining variables one at a time.
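
Option 2 relies on `do over`, an older implicit-array syntax that SAS still accepts. For readers who prefer explicitly subscripted arrays, the same recoding can be sketched as follows. This is an illustrative rewrite rather than the original example: the array names `rdmiss` and `rgmiss` and the index variable `i` are hypothetical, and the sketch assumes demo1_nh2 contains no variable already named `i`.

```sas
data demo2_nh2;
  set demo1_nh2;

  /* Variables whose unavailable responses are coded 8 or 9 */
  array rdmiss{*} n2ah0064 n2sh0785 n2ah1060 n2ah1089;
  do i = 1 to dim(rdmiss);
    if rdmiss{i} in (8, 9) then rdmiss{i} = .;
  end;

  /* Variables whose unavailable responses are coded 888 or 999 */
  array rgmiss{*} n2pe0411 n2pe0771 n2pe0414 n2pe0774 n2bm0418 n2lb0421;
  do i = 1 to dim(rgmiss);
    if rgmiss{i} in (888, 999) then rgmiss{i} = .;
  end;

  drop i;  /* keep the loop index out of the output dataset */
run;
```

Using `dim()` as the loop bound means variables can be added to or removed from either array without changing the loop.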

### Step 3 Evaluate extent of missing data

In this step, you will use the proc freq procedure to confirm that the recoding in the previous step was done correctly. As a general rule, if 10% or less of your data for a variable are missing from your analytic dataset, it is usually acceptable to continue your analysis without further evaluation or adjustment. However, if more than 10% of the data for a variable are missing, you may need to determine whether the missing values are distributed equally across socio-demographic characteristics, and decide whether imputation of missing values or use of adjusted weights is necessary. (Please see Analytic Guidelines for more information.)

Check the extent of missing data

| Statements | Explanation |
| --- | --- |
| `proc freq data=demo2_nh2;` | Use the proc freq procedure to determine the frequency of each value of the variables listed. |
| `where n2ah0047 >= 20;` | Use the where statement to select the study group who were age 20 years and older. |
| `table n2ah0062 n2ah0064 n2ah1059 n2ah1060 n2ah1069 n2ah1089 n2pe0411 n2pe0771 n2pe0414 n2pe0774 n2sh0785 n2bm0418 n2lb0421 / list missing;`<br>`run;` | Use the table statement to indicate the variables of interest. Use the list missing option to display missing values. |

Highlighted items from the proc freq output for recoding missing values:

• In this example, the variable n2ah0064, labeled "Completed that grade", now has only two response values instead of the three observed before recoding; the value "Blank, but applicable" is no longer present. Also note that there are now a total of 494 missing values (instead of 141 originally).
• Review of this output indicates that the "Blank, but applicable" and "Don't know" values have been successfully recoded and are now classified as missing (.).
• Note that 11.72% of the observations for variable n2ah1089, the question labeled "Have you ever had a stroke", have missing values. Since more than 10% of observations have missing values, you should determine whether the missing values are distributed equally by socio-demographic characteristics, and decide whether imputation of missing values or use of adjusted weights is necessary.
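
One simple way to start checking whether missing values are spread evenly across groups is to build a missing-value indicator and cross-tabulate it against a socio-demographic variable. The sketch below is an assumption-laden illustration, not part of the original tutorial: the dataset name `check_nh2` and the indicator `stroke_miss` are hypothetical, and n2ah0055 is used only as a placeholder grouping variable taken from the Step 1 list (check the codebook to choose suitable variables). The tabulation is unweighted, so treat it as a first look only.

```sas
data check_nh2;
  set demo2_nh2;
  where n2ah0047 >= 20;
  /* Hypothetical indicator: 1 if the stroke question is missing, 0 otherwise */
  stroke_miss = missing(n2ah1089);
run;

proc freq data=check_nh2;
  /* Row percentages give the share of missing responses within each group */
  tables n2ah0055*stroke_miss / nocol nopercent;
run;
```

If the row percentages for `stroke_miss = 1` differ markedly between groups, that suggests the missing values are not distributed equally, and the Analytic Guidelines should be consulted on imputation or adjusted weights.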