Task 1: How to Identify and Recode Missing Data

Here are the steps for identifying and recoding missing data:

 

Step 1: Identify Missing and Unavailable Values

In this step, you will use PROC MEANS to check the number of missing values and the minimum and maximum values of continuous variables, and PROC FREQ to look at the frequency distribution of categorical variables in your analytic dataset. The output from these procedures shows how many values are missing for each variable listed in the VAR or TABLES statement.

 

IMPORTANT NOTE

 The following examples will use the PROC MEANS and PROC FREQ procedures on the same set of variables without distinguishing continuous and categorical variables. However, if you use the PROC FREQ procedure on a continuous variable with many values, the output could be extensive.

 

In the examples below, you will check for missing values as well as minimum and maximum values for the osteoporosis variables used in the sample "Supplement" program.

 

Program to Identify Missing and Unavailable Values By Means

Sample Code

*----------------------------------------------------------------;
* Use PROC MEANS to determine the number of nonmissing           ;
* observations (N), the number of missing observations (Nmiss),  ;
* minimum values (min), and maximum values (max) for the         ;
* selected variables. Use the WHERE statement to keep records    ;
* with a positive INT (interview) sample weight and to select    ;
* females who are 20 years of age and older. Use the VAR         ;
* statement to list the variables of interest.                   ;
*----------------------------------------------------------------;

proc means data=DEMOOST N Nmiss min max;
    where WTINT2YR > 0 and RIAGENDR = 2 and RIDAGEYR >= 20;
    var OSQ060 OSQ070;
run;

 

Program to Identify Missing and Unavailable Values By Frequency

Sample Code

*-----------------------------------------------------------------;
* Use PROC FREQ to determine the frequency of each value of the   ;
* variables listed. Use the WHERE statement to keep records with  ;
* a positive INT (interview) sample weight and to select females  ;
* who are 20 years of age and older. Use the TABLES statement to  ;
* list the variables of interest. Use the LIST and MISSING        ;
* options to display missing values in the tables.                ;
*-----------------------------------------------------------------;

proc freq data=DEMOOST;
    where WTINT2YR > 0 and RIAGENDR = 2 and RIDAGEYR >= 20;
    tables OSQ060 OSQ070 / list missing;
run;

 


 

 

 

Step 2: Recode Unavailable Values as Missing

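To recode unavailable values as missing, assign the SAS missing value (.) one variable at a time using an IF…THEN statement, as demonstrated in the excerpt of the "Supplement" program below. In this example, a response of 9 ("Don't know") is set to missing for each variable.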

 

Program to Assign Missing Values One Variable at a Time

Sample Code

*-------------------------------------------------------------;
* Recode DONT KNOW responses to missing for OSQ060 and OSQ070 ;
*-------------------------------------------------------------;

data DEMOOST;
    set DEMOOST;
    if OSQ060 = 9 then OSQ060 = .;
    if OSQ070 = 9 then OSQ070 = .;
run;
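
One way to see exactly which codes were turned into missing is to keep a copy of each original variable and cross-tabulate it against the recoded one. The sketch below is not part of the "Supplement" program. The dataset name DEMOOST_CHK and the variables OSQ060_RAW and OSQ070_RAW are hypothetical, and the sketch assumes that DEMOOST still holds the original codes (that is, it is run in place of the recode above, not after it).

```sas
/* Hypothetical names: DEMOOST_CHK, OSQ060_RAW, OSQ070_RAW.        */
/* Assumes DEMOOST still contains the original (unrecoded) values. */
data DEMOOST_CHK;
    set DEMOOST;
    /* Keep copies of the original responses before recoding. */
    OSQ060_RAW = OSQ060;
    OSQ070_RAW = OSQ070;
    /* Same recode as above: 9 (DONT KNOW) becomes missing. */
    if OSQ060 = 9 then OSQ060 = .;
    if OSQ070 = 9 then OSQ070 = .;
run;

/* Each original value is listed next to its recoded value, so a   */
/* row with OSQ060_RAW = 9 should show OSQ060 as missing.          */
proc freq data=DEMOOST_CHK;
    where WTINT2YR > 0 and RIAGENDR = 2 and RIDAGEYR >= 20;
    tables OSQ060_RAW*OSQ060 OSQ070_RAW*OSQ070 / list missing;
run;
```

Writing to a separate dataset leaves DEMOOST unchanged, so the recode can be rerun or adjusted if the cross-tabulation shows something unexpected.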

 

Step 3: Evaluate Extent of Missing Data

In this step, you will rerun PROC FREQ to confirm that the variables were recoded correctly in the previous step and to see how many values are now missing for each variable. This example is from the "Supplement" program.

Program to Check the Extent of Missing Data

Sample Code

*-------------------------------------------------------------------;
* Use PROC FREQ to determine the frequency of each value of the     ;
* variables listed. Use the WHERE statement to keep records with a  ;
* positive INT (interview) sample weight and to select females who  ;
* are 20 years of age and older. Use the TABLES statement to list   ;
* the variables of interest. Use the LIST and MISSING options to    ;
* display missing values in the tables.                             ;
*-------------------------------------------------------------------;

proc freq data=DEMOOST;
    where WTINT2YR > 0 and RIAGENDR = 2 and RIDAGEYR >= 20;
    tables OSQ060 OSQ070 / list missing;
run;
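
To summarize the extent of missing data as a single count and percentage per variable, one common SAS idiom is to group values with a user-defined format before tabulating. The following is a minimal sketch and is not part of the "Supplement" program. The format name MISSFMT is hypothetical, and the sketch assumes the variables use only the standard numeric missing value (.).

```sas
/* Hypothetical format: collapses each variable into two groups. */
proc format;
    value missfmt
        .     = 'Missing'
        other = 'Not missing';
run;

/* PROC FREQ groups by formatted value, so each table has at most */
/* two rows. The MISSING option keeps the missing group in the    */
/* table and in the percentage denominator.                       */
proc freq data=DEMOOST;
    where WTINT2YR > 0 and RIAGENDR = 2 and RIDAGEYR >= 20;
    format OSQ060 OSQ070 missfmt.;
    tables OSQ060 OSQ070 / missing;
run;
```

Because the format is applied only inside this PROC FREQ step, the values stored in DEMOOST are not changed.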

 


 

 

 
