## Task 4c: How to Recode Variables Based on Alternate Definitions Using Stata

This task reviews how to recode variables so they are appropriate for your analytic needs and how to check your derived variables.

### Step 1: Recode based on alternate definitions

Recoding is an important step in preparing an analytic dataset. In this step, you will review programs that recode variables using a different technique for each of the scenarios listed on the Clean & Recode Data: Key Concepts about Recoding Variables in NHANES page. Each statement required for recoding is listed below with an explanation.

#### Program to Recode as Necessary Based on Alternate Definitions

Use the generate and replace commands to create a new derived variable (raceth) by regrouping the ridreth1 values. Use the recode command to create a categorical age variable (age3cat) from the continuous variable ridageyr; ages under 20 are set to missing.

gen     raceth=1 if ridreth1==3
replace raceth=2 if ridreth1==4
replace raceth=3 if ridreth1==1
replace raceth=4 if ridreth1==2 | ridreth1==5

recode ridageyr (min/19=.) (20/39 = 1) (40/59 = 2) (60/85 = 3), generate(age3cat)
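The regrouping and recode statements above can be tried on a few made-up rows before running them on the full file. The sketch below is a minimal illustration: the five input rows are invented for the example and are not NHANES records, and the label name age3cat_lbl is a hypothetical addition.

```stata
* Toy data only: illustrative values, not NHANES records
clear
input ridreth1 ridageyr
1 15
2 20
3 39
4 40
5 85
end

* Same regrouping as in the program above
gen     raceth=1 if ridreth1==3
replace raceth=2 if ridreth1==4
replace raceth=3 if ridreth1==1
replace raceth=4 if ridreth1==2 | ridreth1==5

* Same age recode as above; ages under 20 become missing
recode ridageyr (min/19=.) (20/39 = 1) (40/59 = 2) (60/85 = 3), generate(age3cat)

* Optional value labels (age3cat_lbl is a hypothetical label name)
label define age3cat_lbl 1 "20-39" 2 "40-59" 3 "60-85"
label values age3cat age3cat_lbl

list ridreth1 raceth ridageyr age3cat
```

Listing the original and derived variables side by side makes it easy to see which input codes land in each new category.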

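Use the missing() function to count the non-missing systolic and diastolic blood pressure readings for each participant. Then use a foreach loop to set any diastolic blood pressure reading of 0 to missing. Because the counts are computed first, a diastolic reading of 0 is still counted in n_dbp.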

gen n_sbp= !missing(bpxsy1)+ !missing(bpxsy2)+ !missing(bpxsy3)+ !missing(bpxsy4)

gen n_dbp= !missing(bpxdi1)+ !missing(bpxdi2)+ !missing(bpxdi3)+ !missing(bpxdi4)

foreach i in bpxdi1 bpxdi2 bpxdi3 bpxdi4 {
    replace `i' =. if `i'==0
}
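To see how the order of these statements affects the counts, the following sketch runs them on three invented rows. It also uses egen's rownonmiss() function, which is not part of the program above, to count readings after the zeros have been set to missing; the variable name n_dbp_after is hypothetical.

```stata
* Toy data only: illustrative values, not NHANES records
clear
input bpxdi1 bpxdi2 bpxdi3 bpxdi4
70 72 .  .
0  68 70 .
.  .  .  .
end

* Count taken before zeros are recoded, as in the program above
gen n_dbp= !missing(bpxdi1)+ !missing(bpxdi2)+ !missing(bpxdi3)+ !missing(bpxdi4)

foreach i in bpxdi1 bpxdi2 bpxdi3 bpxdi4 {
    replace `i' =. if `i'==0
}

* Count taken after the zeros became missing (hypothetical variable name)
egen n_dbp_after = rownonmiss(bpxdi1 bpxdi2 bpxdi3 bpxdi4)

list
```

In the second row, the reading of 0 is included in n_dbp but not in n_dbp_after, so the two counts differ only for participants with a zero diastolic reading.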

Use the egen command with the rowmean function to calculate the mean systolic and diastolic blood pressures.

egen mean_sbp = rowmean(bpxsy1 bpxsy2 bpxsy3 bpxsy4)
egen mean_dbp = rowmean(bpxdi1 bpxdi2 bpxdi3 bpxdi4)
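The rowmean function averages only the non-missing readings in each row, and returns missing when every reading is missing. A minimal sketch on invented values (not NHANES records) shows this behavior:

```stata
* Toy data only: illustrative values, not NHANES records
clear
input bpxsy1 bpxsy2 bpxsy3 bpxsy4
120 130 .   .
140 .   .   .
.   .   .   .
end

* Mean of whatever readings are present in each row
egen mean_sbp = rowmean(bpxsy1 bpxsy2 bpxsy3 bpxsy4)

list
```

This is why participants with fewer than four readings still receive a mean, while participants with no readings are left missing.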

Use the following set of commands to define a new variable hbp (high blood pressure=1 or 0), based on a series of conditions that indicate hypertension from the questionnaire and examination variables.

gen hbp_trt=1 if bpq050a==1
replace hbp_trt=0 if hbp_trt !=1 & (bpq020==1 | bpq020==2) & (bpq050a !=7 & bpq050a !=9)

gen sbp140=1 if mean_sbp>=140 & mean_sbp<. & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))
replace sbp140=0 if sbp140 !=1 & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))
gen dbp90=1 if mean_dbp>=90 & mean_dbp<. & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))
replace dbp90=0 if dbp90 !=1 & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))
gen hbp=1 if (hbp_trt==1 | sbp140==1 | dbp90==1) & ((hbp_trt>=0 & hbp_trt<.) & (sbp140>=0 & sbp140<.) & (dbp90>=0 & dbp90<.))
replace hbp=0 if hbp !=1 & ((hbp_trt>=0 & hbp_trt<.) & (sbp140>=0 & sbp140<.) & (dbp90>=0 & dbp90<.))
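The repeated "<." conditions in these statements matter because Stata treats a missing value as larger than any number, so a test such as mean_sbp>=140 is true when mean_sbp is missing. The sketch below illustrates this on three invented rows; flag_naive and flag_safe are hypothetical variable names used only for the comparison.

```stata
* Toy data only: illustrative values, not NHANES records
clear
input mean_sbp
150
120
.
end

* Without the upper bound, the missing value is flagged as high
gen flag_naive = 1 if mean_sbp >= 140

* With the upper bound, the missing value stays missing
gen flag_safe = 1 if mean_sbp >= 140 & mean_sbp < .

list
```

Only the first row should be flagged; the naive version also flags the third row, where no blood pressure was recorded.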

Use the following set of commands to define a new variable, hlp (hyperlipidemia = 1 or 0), based on a series of conditions that indicate high lipid levels from the questionnaire and examination variables.

gen hlp_trt=1 if bpq100d==1
replace hlp_trt=0 if hlp_trt !=1 & (bpq080==1 | bpq080==2) & (bpq100d !=7 & bpq100d !=9)
gen hlp_lab=1 if lbxtc>=240 & lbxtc <.
replace hlp_lab=0 if hlp_lab !=1 & (lbxtc>=0 & lbxtc <.)
gen hlp=1 if ((hlp_lab >=0 & hlp_lab <.) & (hlp_trt >=0 & hlp_trt <.)) & (hlp_lab==1 | hlp_trt==1)
replace hlp=0 if hlp !=1 & ((hlp_lab >=0 & hlp_lab <.) & (hlp_trt >=0 & hlp_trt <.))

Use the save command to save all the derived variables to a new dataset (demo_bp3.dta).

save C:\Nhanes\Data\demo_bp3, replace

### Step 2: Check Recodes

In this step, you will confirm that the derived and recoded variables correspond correctly to the original variables.

#### Program to Check Recodes Using Cross Tabulations or tabstat

Use the tabulate command, with the bysort prefix where needed, to cross-tabulate the original categorical variables for race/ethnicity, high blood pressure, and hyperlipidemia against their respective recoded variables. The tables are restricted to participants who were interviewed and examined in the MEC and who were age 20 years and older.

tab raceth ridreth1 if (ridageyr >=20 & ridageyr <.) & ridstatr==2, missing
bysort hbp_trt: tab bpq020 bpq050a if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing
bysort hbp hbp_trt: tab sbp140 dbp90 if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing
bysort hlp_trt: tab bpq080 bpq100d if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing
bysort hlp: tab hlp_trt hlp_lab if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing
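As a complement to reading the cross tabulations by eye, the same mappings can be checked with the assert command, which stops with an error if any observation violates the stated rule. This is a sketch of one possible approach, shown on invented rows rather than NHANES records; it is not part of the original program.

```stata
* Toy data only: illustrative values, not NHANES records
clear
input ridreth1
1
2
3
4
5
end

* Same regrouping as in Step 1
gen     raceth=1 if ridreth1==3
replace raceth=2 if ridreth1==4
replace raceth=3 if ridreth1==1
replace raceth=4 if ridreth1==2 | ridreth1==5

* Each assert is silent when the rule holds and errors out otherwise
assert raceth == 1 if ridreth1 == 3
assert raceth == 2 if ridreth1 == 4
assert raceth == 3 if ridreth1 == 1
assert raceth == 4 if inlist(ridreth1, 2, 5)
assert !missing(raceth) if !missing(ridreth1)

* The cross tabulation shows the same information visually
tab raceth ridreth1, missing
```

Because assert halts a do-file on failure, checks like these can guard against a recode being changed later without the tables being re-read.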

Use the tabstat command to calculate the count, minimum, and maximum values of the original continuous variables for participants who were interviewed and examined in the MEC and who were age 20 years and older. The by option reports these statistics separately for each category of the derived variable. This checks that the derived variable was coded correctly from the cut-off points of the continuous variable.

tabstat ridageyr if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age3cat) stat(n min max)
tabstat mean_sbp if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(sbp140) stat(n min max)
tabstat mean_dbp if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(dbp90) stat(n min max)
tabstat lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(hlp_lab) stat(n min max)
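The cut-off check can be rehearsed on a small invented dataset whose ages sit exactly on the category boundaries. The sketch below runs the same tabstat call and then, as an optional extra that is not in the original program, uses the r(min) and r(max) results saved by summarize to test one category programmatically.

```stata
* Toy data only: boundary ages chosen for illustration, not NHANES records
clear
input ridageyr
20
39
40
59
60
85
end

recode ridageyr (min/19=.) (20/39 = 1) (40/59 = 2) (60/85 = 3), generate(age3cat)

* Minimum and maximum age within each derived category
tabstat ridageyr, by(age3cat) stat(n min max)

* Programmatic version of the same check for the middle category
summarize ridageyr if age3cat == 2
assert r(min) >= 40 & r(max) <= 59
```

If the minimum or maximum in any category falls outside its intended range, the cut-off points in the recode statement need to be revisited.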

Highlighted items comparing recoded or derived variables to original variables:

• The output from the cross tabulations shows that the derived categorical variables were assigned correctly. For example, the 5,794 survey participants who were not treated for hypertension (hbp_trt = 0), who had a mean systolic blood pressure below 140 (sbp140 = 0), and who had a mean diastolic blood pressure below 90 (dbp90 = 0) were all assigned hbp = 0 (no high blood pressure). Observations with a missing value for hbp_trt, sbp140, or dbp90 were assigned a missing value for the derived variable hbp. Among the remaining observations, those with a value of 1 for hbp_trt, sbp140, or dbp90 were assigned hbp = 1, indicating the presence of at least one condition used to define high blood pressure.
• The output from the tabstat command shows that the newly derived categorical variables were assigned correctly, based on the cut-off points of the original continuous variables. For example, the derived categorical variable age3cat has values of 1, 2, and 3, which correspond correctly to the selected cut-off points for age (20-39, 40-59, and 60-85) in the original continuous variable.