## Task 1: How to Identify and Recode Missing Data in NHANES I

The first task is to identify missing data and recode it. Here are the steps:

### Step 1 Identify missing and unavailable values

In this step, you will use the proc means procedure to check for missing, minimum and maximum values of continuous variables, and the proc freq procedure to look at the frequency distribution of categorical variables in your master analytic dataset. The output from these procedures provides the number and frequency of missing values for each variable listed in the procedure statement.

WARNING

Typically, proc means is used for continuous variables, and proc freq is used for categorical variables. In the following example, we provide proc means and proc freq procedures on the same set of variables without distinguishing continuous and categorical variables. If you perform a proc freq on a continuous variable with many values, the output could be extensive.
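One way to keep proc freq output short for a continuous variable is to group its values with a format before tabulating. The following is a minimal sketch, not part of the original tutorial: the format name `missfmt` is a hypothetical name chosen for illustration, and the three variables are taken from the proc means example in this step.

```sas
/* Hypothetical format: collapse every numeric value into two groups. */
proc format;
  value missfmt
    .          = 'Missing'
    low - high = 'Not missing';
run;

/* proc freq tabulates the formatted values, so each variable       */
/* produces at most two rows instead of one row per distinct value. */
proc freq data=demo1_nh1;
  where N1BM0101 >= 20;
  format N1ME0228 N1ME0231 N1LB0237 missfmt.;
  tables N1ME0228 N1ME0231 N1LB0237 / missing;
run;
```

Unavailable codes such as 888 or 999 still fall in the 'Not missing' group at this point, so this sketch only counts blank values. The format statement applies only within this procedure and does not change the dataset.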

proc means for Continuous Variables

| Statements | Explanation |
| --- | --- |
| `proc means data=demo1_nh1 N nmiss min max;` | Use the proc means procedure to determine the number of nonmissing observations (N), the number of missing observations (nmiss), the minimum values (min), and the maximum values (max) for the selected variables. |
| `where N1BM0101 >= 20;` | Use the where statement to select the participants who were age 20 years and older. |
| `var N1ME0228 N1ME0231 N1ME0718 N1ME0721 N1BM0260 N1BM0266 N1LB0237;` | Use the var statement to indicate the variables of interest. |
| `run;` | |

proc freq for Categorical Variables

| Statements | Explanation |
| --- | --- |
| `proc freq data=demo1_nh1;` | Use the proc freq procedure to determine the frequency of each value of the variables listed. |
| `where N1BM0101 >= 20;` | Use the where statement to select the participants who were age 20 years and older. |
| `tables N1AH0284 N1AH0287 N1AH0288 N1AH0290 N1AH0293 N1AH0294 N1AH0423 N1AH0472 N1BM0101 N1BM0103 N1BM0104 N1BM0112 N1GM0378 N1GM0379 / list missing;` | Use the tables statement to indicate the variables of interest. Use the list missing option to display the missing values. |
| `run;` | |

Highlighted items from proc means and proc freq output:

• The column labeled "N" shows the number of observations with data. This example has 16,165 observations for the variable n1lb0237, labeled "Serum cholesterol (mg/dL)."
• The column labeled "N Miss" indicates the number of observations without data. This example has 0 missing observations for the variable n1lb0237.
• Each response value of a variable has a corresponding frequency (check the codebook to determine the definition for each value). In this example, the variable n1ah0290, labeled "Has a doctor ever told you that you ... [high blood pressure]?", has five possible response values labeled "1", "2", "3", "9", and ".".
• The column labeled "Freq" indicates the frequency with which a particular response value occurs in the dataset. In this example, 3,070 observations have the "." value, 1,910 observations have a value of "1", 10,224 observations have a value of "2", 686 observations have the "3" value, and 275 observations have the "9" value.
• The column labeled "Percent" indicates the percentage for which each value of the variable accounts, out of the total. The "9" and "." response values of n1ah0290 account for 1.70% and 18.99% of the total, respectively.
• Note that for the variable n1ah0284, labeled "Has a doctor ever told you that ... [heart failure]?", 12 observations have the "." value, and 17 observations have the "9" value. These represent the frequency of "missing" and "blank but not applicable" responses that were obtained for this question. These latter observations will need to be recoded as missing, which will be covered in the next step.
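The two counts discussed for n1ah0284 can also be pulled out directly rather than read off the frequency table. The following is a minimal sketch, assuming the same dataset and age restriction as above; the column names `n_blank` and `n_code9` are hypothetical labels chosen for illustration.

```sas
/* Sketch: count blank (.) and code-9 responses for one variable. */
proc sql;
  select nmiss(n1ah0284)   as n_blank,  /* observations coded "."   */
         sum(n1ah0284 = 9) as n_code9   /* observations coded "9"   */
  from demo1_nh1
  where N1BM0101 >= 20;
quit;
```

The comparison `n1ah0284 = 9` evaluates to 1 when true and 0 otherwise, so summing it gives a count. The same pattern can be repeated for any other unavailable code listed in the codebook.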

### Step 2 Recode unavailable values as missing

Two options can be used to recode the missing data:

• assign missing values one variable at a time using an if…then statement, or
• assign missing values by group using an array statement.

Option 1 - Assign Missing Values One Variable at a Time

| Statements | Explanation |
| --- | --- |
| `data temp1_nh1;` | Use the data statement to create a new dataset from your existing dataset; the name of the existing dataset is listed after the set statement. |
| `set demo1_nh1;` | |
| `if n1ah0284 = 9 then n1ah0284 = .;` | Use the if…then statement to recode "8" and "9" values of a variable as missing. |
| `if n1ah0472 in (8, 9) then n1ah0472 = .;` | |
| `run;` | |

Option 2 - Assign Missing Values by Group Using an Array

| Statements | Explanation |
| --- | --- |
| `data demo2_nh1;` | Use the data statement to create a new dataset from your existing dataset; the name of the existing dataset is listed after the set statement. |
| `set demo1_nh1;` | |
| `array _rdmiss n1ah0284 n1ah0290 n1ah0472 n1ah0423;` | Use the array statement to recode values such as "8" and "9" as missing for a group of variables. In this example, _rdmiss and _rgmiss are the names of the arrays. Use this option when you want to recode multiple variables that use the same numeric values for "blank, but applicable" and "blank, but not applicable." |
| `do over _rdmiss;` | |
| `if _rdmiss in (8, 9) then _rdmiss = .;` | |
| `end;` | |
| `array _rgmiss n1me0228 n1me0231 n1me0718 n1me0721;` | |
| `do over _rgmiss;` | |
| `if _rgmiss in (888, 999) then _rgmiss = .;` | |
| `end;` | |
| `if n1bm0112 in (88, 99) then n1bm0112 = .;` | Assign missing values to the remaining variables one at a time. |
| `if n1bm0266 = 8888 then n1bm0266 = .;` | |
| `if n1lb0237 in (8888, 9999) then n1lb0237 = .;` | |
| `if n1bm0260 = 88888 then n1bm0260 = .;` | |
| `run;` | |
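The do over loop relies on the older implicit-array form of the array statement. The same group recode can be written with explicitly subscripted arrays, as in the minimal sketch below. The output dataset name `demo2_alt`, the array names `rdmiss` and `rgmiss`, and the index variable `i` are hypothetical names chosen for illustration; the variable lists and codes are the ones used in Option 2.

```sas
/* Sketch: group recode with explicitly subscripted arrays. */
data demo2_alt;                 /* hypothetical output dataset name */
  set demo1_nh1;

  /* Variables that use 8 and 9 as unavailable codes */
  array rdmiss{*} n1ah0284 n1ah0290 n1ah0472 n1ah0423;
  do i = 1 to dim(rdmiss);
    if rdmiss{i} in (8, 9) then rdmiss{i} = .;
  end;

  /* Variables that use 888 and 999 as unavailable codes */
  array rgmiss{*} n1me0228 n1me0231 n1me0718 n1me0721;
  do i = 1 to dim(rgmiss);
    if rgmiss{i} in (888, 999) then rgmiss{i} = .;
  end;

  drop i;                       /* keep the loop index out of the output */
run;
```

This sketch covers only the two array groups; the single-variable recodes from Option 2 would still need to be added to the same data step.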

### Step 3 Evaluate extent of missing data

In this step, we will use the proc freq procedure to ensure that the recoding in the previous step was done correctly. As a general rule, if 10% or less of your data for a variable are missing from your analytic dataset, it is usually acceptable to continue your analysis without further evaluation or adjustment. Generally, if more than 10% of the data for a variable are missing, you may need to determine whether the missing values are distributed equally across socio-demographic characteristics, and decide whether imputation of missing values or use of adjusted weights is necessary.
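To compare a variable against the 10% guideline, the percentage of missing values can be computed directly from the recoded dataset. The following is a minimal sketch, assuming the recoded dataset demo2_nh1 from Step 2; the column names `n_total`, `n_missing`, and `pct_missing` are hypothetical labels chosen for illustration.

```sas
/* Sketch: percent missing for one recoded variable among adults 20+. */
proc sql;
  select count(*)                         as n_total,
         nmiss(n1ah0290)                  as n_missing,
         nmiss(n1ah0290) / count(*) * 100 as pct_missing format=6.2
  from demo2_nh1
  where n1bm0101 >= 20;
quit;
```

A value of `pct_missing` above 10 flags the variable for the further evaluation described above.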

However, in NHANES I, some of the data items were obtained only for a particular subsample. Consequently, some of these items appear to have a great deal of missing data (coded as blank) due to nonresponse, when in fact the data are missing because the design of NHANES I dictated that the item was to be obtained only for a particular subsample (see detailed notes for tape positions 158-193 in the documentation). To alert the user to this fact, asterisks were put on the tape description. One asterisk denotes that the data were obtained only on examinees at locations 1-65, and two asterisks denote that they were obtained only at locations 66-100. (Please see Analytic Guidelines for more information.)

Check the extent of missing data

| Statements | Explanation |
| --- | --- |
| `proc freq data=demo2_nh1;` | Use the proc freq procedure to determine the frequency of each value of the variables listed. |
| `where n1bm0101 >= 20;` | Use the where statement to select the study group who were age 20 years and older. |
| `table n1ah0284 n1ah0290 n1ah0472 n1ah0423 n1me0228 n1me0231 n1me0718 n1me0721 n1bm0266 n1lb0237 n1bm0260 n1bm0112 / list missing;` | Use the table statement to indicate the variables of interest. Use the list missing option to display the missing values. |
| `run;` | |

Highlighted items from the proc freq output for recoding missing values:

• In this example, the variable n1ah0290, labeled "Has a doctor ever told you that you ... [high blood pressure]?", now has only four response values instead of the original five observed before recoding: the value "9" is no longer present. Also note that there are now a total of 3,345 missing values (instead of 3,070 originally).
• Review of this output indicates that the "8" and "9" values have been successfully recoded and are now classified as missing (.).
• Note that 20.69% of the observations for variable n1ah0290 have missing values.

Because there are so many missing values, it is important to investigate further the reasons for them. In reading the documentation, you learn that most of the missing values for this question are due to the question being asked only of examinees at locations 1-65 (denoted by the single asterisk). Looking at the counts for the sample weights in the tape description, you are reminded that 3,059 persons aged 25-74 were included in stands 66-100, and 3,059 persons had blank values for stands 1-65. This means that the true number of missing values is now 286 or less (i.e., 3,345 - 3,059 = 286), or approximately 1.8% or less. (This tutorial will not investigate the possibility of other reasons since it is for illustrative purposes only.)
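The arithmetic behind the adjusted figure can be written out as follows. The denominator of 16,165 is not stated in this paragraph; it is taken here as the sum of the five frequencies reported for n1ah0290 in Step 1 (3,070 + 1,910 + 10,224 + 686 + 275), an assumption that also reproduces the 20.69% figure (3,345 / 16,165).

$$
3{,}345 - 3{,}059 = 286, \qquad \frac{286}{16{,}165} \approx 0.018 = 1.8\%
$$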