## Task 4a: How to Recode Variables Based on Alternate Definitions Using SAS

This task reviews how to recode variables so they are appropriate for your analytic needs and how to check your derived variables.

### Step 1: Recode based on alternate definitions

Recoding is an important step in preparing an analytic dataset. In this step, you will view programs that recode variables using a different technique for each of the scenarios listed on the Clean & Recode Data: Key Concepts about Recoding Variables in NHANES page. In the summary table below, each group of statements required for recoding is listed together with its explanation.

Program to Recode as Necessary Based on Alternate Definitions
Statements Explanation

data demo_BP3;

set demo_BP2b;
Use the data and set statements to refer to your analytic dataset.

if ridreth1= 3 then raceth= 1 ; /*Non-Hispanic White*/

else if ridreth1= 4 then raceth= 2 ; /*Non-Hispanic Black*/

else if ridreth1= 1 then raceth= 3 ; /*Mexican American*/

else raceth= 4 ; /*Other*/

Use the if, then, and else statements to create a new, derived variable (e.g., raceth) based on re-grouping the ridreth1 values.

IMPORTANT NOTE

Note that the variable values are recoded in the new variable, and a new "Other" category is created by combining the leftover categories.
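The regrouping can be tried on a few hand-typed rows before it is run on the full file. The following is a minimal sketch: the dataset name `race_demo` and the input values are made up for illustration and are not NHANES records. It also shows that the final `else` assigns every remaining value, including a missing one, to the "Other" category.

```sas
/* Hypothetical test rows, not NHANES data: one row per ridreth1 code plus a missing value */
data race_demo;
  input ridreth1 @@;
  if ridreth1 = 3 then raceth = 1;       /* Non-Hispanic White */
  else if ridreth1 = 4 then raceth = 2;  /* Non-Hispanic Black */
  else if ridreth1 = 1 then raceth = 3;  /* Mexican American */
  else raceth = 4;                       /* Other: every remaining value, including missing */
  datalines;
1 2 3 4 5 .
;
run;

proc print data=race_demo noobs;
run;
```

If a source variable could contain missing values, one option is to replace the last statement with `else if not missing(ridreth1) then raceth = 4;` so that those rows stay missing in the derived variable.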

if ( 20 <= ridageyr <= 39 ) then age3cat= 1 ;

else if ( 40 <= ridageyr <= 59 ) then age3cat= 2 ;
else if ridageyr >= 60 then age3cat= 3 ;

Use the if, then, and else statements to create a categorical age variable (age3cat) from a continuous variable (ridageyr).
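To see how the cut-off points behave at their edges, the same statements can be run on a handful of boundary ages. This is a sketch with invented values (the dataset name `age_demo` is hypothetical), not output from NHANES.

```sas
/* Hypothetical ages chosen to sit on and around each cut-off point */
data age_demo;
  input ridageyr @@;
  if 20 <= ridageyr <= 39 then age3cat = 1;
  else if 40 <= ridageyr <= 59 then age3cat = 2;
  else if ridageyr >= 60 then age3cat = 3;
  /* Ages under 20 and missing ages match no condition, so age3cat stays missing */
  datalines;
19 20 39 40 59 60 85 .
;
run;

proc print data=age_demo noobs;
run;
```

Because no final `else` is used here, participants younger than 20 are left with a missing age3cat rather than being placed in a category.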

n_sbp = n(of bpxsy1-bpxsy4);
n_dbp = n(of bpxdi1-bpxdi4);
*Setting DBP values of 0 as missing for calculating average;
array _DBP bpxdi1-bpxdi4;
do over _DBP;
if (_DBP = 0 ) then _DBP = . ;
end;

Use the n function to count the number of nonmissing systolic and diastolic blood pressure readings. Then use the array statement (where _DBP is the name of the array) to set any diastolic blood pressure readings of "0" to missing, so that a reading of "0" does not affect the blood pressure means.
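The `do over` loop is implicit-array syntax. As a sketch of an alternative, the same reset can be written with an explicitly subscripted array; the array name `dbp`, the index `i`, the dataset name `bp_demo`, and the readings below are assumptions made for this example. The two counts illustrate that a count taken before the reset still includes readings of 0.

```sas
/* Hypothetical readings: the second row has a diastolic reading of 0 */
data bp_demo;
  input bpxdi1-bpxdi4;
  n_dbp_before = n(of bpxdi1-bpxdi4);   /* nonmissing readings, zeros included */
  array dbp[4] bpxdi1-bpxdi4;
  do i = 1 to dim(dbp);
    if dbp[i] = 0 then dbp[i] = .;      /* treat a reading of 0 as missing */
  end;
  n_dbp_after = n(of bpxdi1-bpxdi4);    /* nonmissing readings after the reset */
  drop i;
  datalines;
70 72 74 .
80 0 78 82
;
run;

proc print data=bp_demo noobs;
run;
```

In the second row the count drops from 4 to 3 once the reading of 0 is set to missing, so where the n function is placed relative to the loop determines which count is stored.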

mean_sbp = mean(of bpxsy1-bpxsy4);

mean_dbp = mean(of bpxdi1-bpxdi4);
Use these function statements to calculate mean systolic and diastolic blood pressures.
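The mean function averages only the nonmissing arguments, which is why setting unusable readings to missing keeps them out of the average. A small sketch with invented readings (the dataset name `mean_demo` is hypothetical):

```sas
/* Hypothetical readings: two valid values in the first row, none in the second */
data mean_demo;
  input bpxsy1-bpxsy4;
  n_sbp = n(of bpxsy1-bpxsy4);          /* number of nonmissing readings */
  mean_sbp = mean(of bpxsy1-bpxsy4);    /* average of the nonmissing readings */
  datalines;
120 124 . .
. . . .
;
run;

proc print data=mean_demo noobs;
run;
```

The first row averages its two available readings (n_sbp = 2, mean_sbp = 122). The second row has no readings, so n_sbp is 0 and mean_sbp is missing.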

if BPQ050a= 1 then HBP_trt= 1 ;

else if BPQ020 in ( 1 , 2 ) and BPQ050a < 7 then HBP_trt= 0 ;

if n_sbp> 0 and n_dbp> 0 then do ;
if mean_sbp>= 140 then SBP140= 1 ;
else SBP140= 0 ;
if mean_dbp>= 90 then DBP90= 1 ;
else DBP90= 0 ;
end ;
if HBP_trt>= 0 and SBP140>= 0 and DBP90>= 0 then do ;
if HBP_trt= 1 or SBP140= 1 or DBP90= 1 then HBP= 1 ;
else HBP= 0 ;
end;

Use the if, then, and else statements to define a new variable, hbp (high blood pressure = 1 or 0), based on a series of conditions that indicate hypertension from the questionnaire and examination variables.
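The comparisons `HBP_trt>= 0`, `SBP140>= 0`, and `DBP90>= 0` act as checks for nonmissing values, because SAS treats a missing numeric value as smaller than any number. As a sketch of a more explicit way to write the same guard for 0/1 indicators, the nmiss function can be used; the dataset name `hbp_demo` and the rows below are invented for illustration.

```sas
/* Hypothetical indicator values, including rows with a missing component */
data hbp_demo;
  input HBP_trt SBP140 DBP90;
  /* nmiss counts missing arguments; 0 means all three components are known */
  if nmiss(HBP_trt, SBP140, DBP90) = 0 then do;
    if HBP_trt = 1 or SBP140 = 1 or DBP90 = 1 then HBP = 1;
    else HBP = 0;
  end;
  datalines;
0 0 0
0 1 0
1 . 0
. 0 0
;
run;

proc print data=hbp_demo noobs;
run;
```

With this guard, as with the original comparisons, HBP is left missing in the last two rows because one component is missing, even though HBP_trt equals 1 in the third row.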

if BPQ100d= 1 then HLP_trt= 1 ;
else if BPQ080 in ( 1 , 2 ) and BPQ100d< 7 then HLP_trt= 0 ;
if lbxtc>= 240 then HLP_lab= 1 ;
else if lbxtc>= 0 then HLP_lab= 0 ;
if HLP_lab>= 0 and HLP_trt>= 0 then do ;
if HLP_lab= 1 or HLP_trt= 1 then HLP= 1 ;
else HLP= 0 ;

end ;
run ;

Use the if, then, and else statements to define a new variable, hlp (hyperlipidemia = 1 or 0), based on a series of conditions that indicate high lipid levels from the questionnaire and examination variables.

### Step 2: Check Recodes

In this step, you will check that the derived and recoded variables correctly correspond to the original variables.

Program to Check Recodes Using Cross Tabulations or proc means
Statements Explanation

proc freq data =demo_BP3;
where ridstatr= 2 and ridageyr>= 20 ;
table raceth*ridreth1/ list missing ;
table HBP_trt*BPQ020*BPQ050a HBP*HBP_trt*SBP140*DBP90/ list missing ;
table HLP_trt*BPQ080*BPQ100d HLP*HLP_trt*HLP_lab/ list missing ;
title 'Check regroup/recode/definitions categorical variables' ;

run ;
Use the proc freq procedure to create a cross tabulation of the original categorical variables for race/ethnicity, high blood pressure and hyperlipidemia by their respective recoded variables. Use the where statement to select the participants who were interviewed and examined in the MEC and who were age 20 years and older.
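A cross tabulation check of this kind can be rehearsed on a toy dataset. The sketch below assumes the `list` table option, which prints each combination of values on one row, and the `missing` option, which treats missing values as a level of their own. The dataset name `check_demo` and its rows are hypothetical.

```sas
/* Hypothetical original and recoded values, including one missing original value */
data check_demo;
  input ridreth1 raceth;
  datalines;
1 3
2 4
3 1
4 2
5 4
. 4
;
run;

proc freq data=check_demo;
  table raceth*ridreth1 / list missing;
  title 'Toy check of raceth against ridreth1';
run;
title;  /* clear the title so it does not carry over to later output */
```

Each row of the listing pairs one recoded value with one original value, so an unexpected pairing (for example, a missing original value placed in a non-missing category) is easy to spot.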

proc means data =demo_BP3 N min max ;
where ridstatr= 2 and ridageyr>= 20 ;
var ridageyr;
class age3cat;
title 'Check if each age category is in the corrected age range' ;

proc means data =demo_BP3 N min max ;
where ridstatr= 2 and ridageyr>= 20 ;
var mean_SBP;
class SBP140;
title 'Check if SBP >=140 is defined correctly' ;

proc means data =demo_BP3 N min max ;
where ridstatr= 2 and ridageyr>= 20 ;
var mean_DBP;
class DBP90;
title 'Check if DBP >=90 is defined correctly' ;

proc means data =demo_BP3 N min max ;
where ridstatr= 2 and ridageyr>= 20 ;
var lbxtc;
class HLP_lab;
title 'Check if TC>=240 is defined correctly' ;

run ;
Use the proc means procedure to calculate the number of observations (N), minimum, and maximum values for the original continuous variables. Use the where statement to select the participants who were interviewed and examined in the MEC and who were age 20 years and older. The class statement groups the original continuous variable by the categories of the derived variable. This is done to check that the coding of the derived variable, based on cut-off points of the continuous variable, is correct.
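The logic of this check is that, within each level of the derived variable, the minimum and maximum of the continuous variable should fall on the correct side of the cut-off point. A minimal sketch with invented values follows; the dataset name `cut_demo` is hypothetical, and the simple `>= 0` guard stands in for the fuller conditions used in the recoding program.

```sas
/* Hypothetical mean systolic values on both sides of the 140 cut-off */
data cut_demo;
  input mean_sbp;
  if mean_sbp >= 140 then SBP140 = 1;
  else if mean_sbp >= 0 then SBP140 = 0;  /* simplified guard, for this sketch only */
  datalines;
118
139
140
162
;
run;

proc means data=cut_demo n min max;
  class SBP140;
  var mean_sbp;
run;
```

For these rows, the SBP140 = 0 group ranges from 118 to 139 and the SBP140 = 1 group ranges from 140 to 162. A maximum of 140 or more in the 0 group, or a minimum below 140 in the 1 group, would point to a coding error.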

Highlighted items comparing recoded or derived variables to original variables:

• The output from the frequency tables (proc freq) shows that the derived categorical variables were assigned correctly. For example, 5,794 survey participants who were not treated for hypertension (HBP_trt = 0), who had systolic blood pressures less than 140 (SBP140 = 0), and who had diastolic blood pressures less than 90 (DBP90 = 0) were all assigned to HBP = 0 (no high blood pressure). Any observations with missing values for HBP_trt, SBP140, or DBP90 were assigned missing values for the HBP derived variable. Among the remaining observations, any with a value of "1" for HBP_trt, SBP140, or DBP90 were assigned a value of "1" for the HBP derived variable, indicating the presence of conditions used to define high blood pressure.
• The output from the proc means procedure shows that the newly derived categorical variables were assigned correctly, based on the cut-off points of the original continuous variables. For example, the derived categorical variable, age3cat, had values of "1," "2," and "3," which correspond correctly to the selected cut-off points for age (20-39, 40-59, and 60-85) of the original continuous variable.