Task 4, Step 1: How to Identify, Recode, and Evaluate Missing Data

In the PAQ dataset, four variables are used to identify whether a participant engaged in household/yard, transportation, or leisure physical activity:

Physical Activity Variables

Physical Activity Domain | Variable Name | SAS Label                           | Age Range of Study Participants
Household/yard           | PAQ100        | Tasks around home/yard past 30 days | 16 years and older
Transportation           | PAD020        | Walked or biked over past 30 days   | 12 years and older
Leisure                  | PAD320        | Moderate activity over past 30 days | 12 years and older
Leisure                  | PAD200        | Vigorous activity over past 30 days | 12 years and older

Identify and Recode Missing Data

Each variable may be assigned one of the following values:

  • 1 - Yes
  • 2 - No
  • 3 - Unable to do activity
  • 7 - Refused
  • 9 - Don’t know

For the “PAQMSTR” dataset, we recode variables with responses of 7 (Refused) or 9 (Don’t know) as “.” (missing). For variables that have a value of 3 (unable to do activity), we interpret this response as “No” and recode to a value of 2 (No).

Your SAS code should resemble the following when embedded in your SAS program.  Note the creation of four new variables (RPAD020, RPAQ100, RPAD200, RPAD320, where the prefix “R” indicates “Recoded”).

Sample Code

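/* These statements belong inside a DATA step.                        */
/* Recode 3 (Unable to do activity) to 2 (No); set codes 7 (Refused)  */
/* and 9 to missing; copy every other value unchanged.                */

/* Transportation */
if PAD020 = 3 then RPAD020 = 2 ;
   else if PAD020 = 7 then RPAD020 = . ;
   else if PAD020 = 9 then RPAD020 = . ;
   else RPAD020 = PAD020 ;

/* Household/yard */
if PAQ100 = 3 then RPAQ100 = 2 ;
   else if PAQ100 = 7 then RPAQ100 = . ;
   else if PAQ100 = 9 then RPAQ100 = . ;
   else RPAQ100 = PAQ100 ;

/* Leisure, vigorous */
if PAD200 = 3 then RPAD200 = 2 ;
   else if PAD200 = 7 then RPAD200 = . ;
   else if PAD200 = 9 then RPAD200 = . ;
   else RPAD200 = PAD200 ;

/* Leisure, moderate */
if PAD320 = 3 then RPAD320 = 2 ;
   else if PAD320 = 7 then RPAD320 = . ;
   else if PAD320 = 9 then RPAD320 = . ;
   else RPAD320 = PAD320 ;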
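
The four recodes above follow the same pattern, so they can also be written once with arrays. The following is only a sketch of that alternative, not the tutorial's own code: it assumes the input dataset is named paq (as in the PROC FREQ examples in this step), and the output dataset name paq_recoded, the array names raw and recoded, and the index variable i are hypothetical.

```sas
data paq_recoded ;                  /* hypothetical output dataset name      */
   set paq ;                        /* assumes the input dataset is named paq */

   /* Original variables and their recoded counterparts, in matching order */
   array raw{4}     PAD020  PAQ100  PAD200  PAD320 ;
   array recoded{4} RPAD020 RPAQ100 RPAD200 RPAD320 ;

   do i = 1 to 4 ;
      if raw{i} = 3 then recoded{i} = 2 ;              /* Unable to do activity -> No */
      else if raw{i} in (7, 9) then recoded{i} = . ;   /* codes 7 and 9 -> missing    */
      else recoded{i} = raw{i} ;                       /* all other values unchanged  */
   end ;

   drop i ;                         /* loop index is not needed in the output */
run ;
```

The array form keeps the recoding rule in one place, so a change to the rule is applied to all four variables at once.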

Evaluate Missing Data

Questions in the Physical Activity Questionnaire are age-specific.  Thus, when evaluating missing data, you must clearly define the proportion of the study population expected to have been eligible to give a response.   

Use PROC FREQ to determine the frequency of each value of the recoded variables. List the variables of interest in the TABLES statement, and add the LIST and MISSING options to that statement so that missing values appear in the output and count toward the percentages. In the WHERE statement, restrict each run to records with a weight variable greater than 0 and an age within the eligible range for that question. The weight restriction ensures that you are selecting study participants who were sampled for the PAQ.

As a general rule, if 10% or less of your data for a variable are missing from your analytic dataset, it is usually acceptable to continue your analysis without further evaluation or adjustment.

Sample Code

proc freq data=paq ;
   where WTINT4CD > 0 and RIDAGEYR >= 16 ;
   tables RPAQ100 / list missing ;
run ;

proc freq data=paq ;
   where WTINT4CD > 0 and RIDAGEYR >= 12 ;
   tables RPAD020 / list missing ;
run ;

proc freq data=paq ;
   where WTINT4CD > 0 and RIDAGEYR >= 12 ;
   tables RPAD200 / list missing ;
run ;

proc freq data=paq ;
   where WTINT4CD > 0 and RIDAGEYR >= 12 ;
   tables RPAD320 / list missing ;
run ;
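
To compare each variable against the 10% guideline, it can help to see the count of nonmissing and missing values side by side. The following is a sketch only, assuming the same paq dataset, weight variable, and age restrictions as the sample code above. PROC MEANS with the N and NMISS statistics is a substitute used here for illustration, not part of the original sample code.

```sas
/* Household/yard question: eligible participants are 16 years and older */
proc means data=paq n nmiss ;
   where WTINT4CD > 0 and RIDAGEYR >= 16 ;
   var RPAQ100 ;
run ;

/* Transportation and leisure questions: eligible participants are 12 years and older */
proc means data=paq n nmiss ;
   where WTINT4CD > 0 and RIDAGEYR >= 12 ;
   var RPAD020 RPAD200 RPAD320 ;
run ;
```

For each variable, the percentage missing is NMISS / (N + NMISS) x 100, where N is the number of nonmissing values and NMISS is the number of missing values among the eligible participants selected by the WHERE statement.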

 
