This task reviews how to recode variables so they are appropriate for your analytic needs and how to check your derived variables.

Recoding is an important step for preparing an analytical
dataset. In this step, you will view programs that **recode variables
using different techniques** for each of the scenarios listed on
the Clean & Recode Data:
Key Concepts about Recoding Variables in NHANES page. In the
code below, each statement required for recoding is listed
with explanations.

Use the *generate* command to create
a new, derived variable (e.g., *raceth*) by re-grouping the
*ridreth1* values. Use the *recode* command to create a categorical age
variable (*age3cat*) from the continuous variable *ridageyr*.

gen raceth=1 if ridreth1==3

replace raceth=2 if ridreth1==4

replace raceth=3 if ridreth1==1

replace raceth=4 if ridreth1==2 | ridreth1==5

recode ridageyr (min/19=.) (20/39 = 1) (40/59 = 2) (60/85 = 3), generate(age3cat)
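The recoding statements above can be tried on a small made-up dataset before running them on the full file. The sketch below is only an illustration: the six rows are invented values, not NHANES records, and the value label name *age3lbl* is a hypothetical name chosen for this example.

```stata
* Toy data: invented values, not NHANES records
clear
input ridreth1 ridageyr
1 18
2 25
3 40
4 59
5 60
3 85
end

* Regroup the race/ethnicity codes with generate and replace
gen raceth = 1 if ridreth1 == 3
replace raceth = 2 if ridreth1 == 4
replace raceth = 3 if ridreth1 == 1
replace raceth = 4 if ridreth1 == 2 | ridreth1 == 5

* Collapse age in years into three groups; ages under 20 become missing
recode ridageyr (min/19 = .) (20/39 = 1) (40/59 = 2) (60/85 = 3), generate(age3cat)

* Optional: attach value labels (age3lbl is a hypothetical label name)
label define age3lbl 1 "20-39" 2 "40-59" 3 "60-85"
label values age3cat age3lbl

* Spot checks: each assert stops the do-file if the condition fails
assert raceth == 4 if inlist(ridreth1, 2, 5)
assert missing(age3cat) if ridageyr < 20
assert age3cat == 2 if inrange(ridageyr, 40, 59)
```

If every *assert* passes silently, the toy recodes behave as described.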

Use the *missing()* function, negated with *!*, to count the number of
non-missing systolic and diastolic blood pressure readings for each participant. Then use the *foreach* loop
command to set any diastolic blood pressure readings of "0" to missing.

gen n_sbp= !missing(bpxsy1)+ !missing(bpxsy2)+ !missing(bpxsy3)+ !missing(bpxsy4)

gen n_dbp= !missing(bpxdi1)+ !missing(bpxdi2)+ !missing(bpxdi3)+ !missing(bpxdi4)

foreach i in bpxdi1 bpxdi2 bpxdi3 bpxdi4 {

replace `i' =. if `i'==0

}
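To see how the counting and the loop interact, the sketch below runs the same statements on three invented rows (not NHANES records). It also compares the hand-built count with *egen*'s *rownonmiss()* function, which is offered here as an alternative idiom rather than the method used in this tutorial; *n_sbp_alt* is a hypothetical variable name.

```stata
* Toy data: invented readings, not NHANES records
clear
input bpxsy1 bpxsy2 bpxsy3 bpxsy4 bpxdi1 bpxdi2 bpxdi3 bpxdi4
120 118 .   . 80 0  .  .
.   .   .   . .  .  .  .
142 140 138 . 92 90 88 .
end

* Each !missing() term is 1 for a recorded reading and 0 otherwise
gen n_sbp = !missing(bpxsy1) + !missing(bpxsy2) + !missing(bpxsy3) + !missing(bpxsy4)
gen n_dbp = !missing(bpxdi1) + !missing(bpxdi2) + !missing(bpxdi3) + !missing(bpxdi4)

* Alternative count using egen (n_sbp_alt is a hypothetical name)
egen n_sbp_alt = rownonmiss(bpxsy1 bpxsy2 bpxsy3 bpxsy4)
assert n_sbp == n_sbp_alt

* Set diastolic readings of 0 to missing
foreach i in bpxdi1 bpxdi2 bpxdi3 bpxdi4 {
    replace `i' = . if `i' == 0
}

* The counts were taken before the loop, so the 0 reading in row 1
* is still included in n_dbp
assert n_sbp == 2 & n_dbp == 2 in 1
assert n_sbp == 0 & n_dbp == 0 in 2
assert missing(bpxdi2) in 1
```

Because the counts are generated before the loop runs, a diastolic reading of 0 still counts as a reading that was taken, even though its value is later excluded from the mean.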

Use the *egen* command with the
*rowmean* function to calculate the mean systolic and diastolic blood
pressures.

egen mean_sbp = rowmean(bpxsy1 bpxsy2 bpxsy3 bpxsy4)

egen mean_dbp = rowmean(bpxdi1 bpxdi2 bpxdi3 bpxdi4)
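A point worth checking is how *rowmean* treats missing readings: it averages only the non-missing values in each row and returns missing when every reading is missing. The sketch below illustrates this on three invented rows (not NHANES records).

```stata
* Toy data: invented readings, not NHANES records
clear
input bpxsy1 bpxsy2 bpxsy3 bpxsy4
120 130 .   .
.   .   .   .
110 120 130 140
end

egen mean_sbp = rowmean(bpxsy1 bpxsy2 bpxsy3 bpxsy4)

* Row 1: mean of the two recorded readings, (120 + 130) / 2
assert mean_sbp == 125 in 1
* Row 2: no readings at all, so the mean is missing
assert missing(mean_sbp) in 2
* Row 3: mean of all four readings
assert mean_sbp == 125 in 3
```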

Use the following set of commands to define
a new variable *hbp* (high blood pressure=1 or 0), based
on a series of conditions that indicate hypertension from the questionnaire and
examination variables.

gen hbp_trt=1 if bpq050a==1

replace hbp_trt=0 if hbp_trt !=1 & (bpq020==1 | bpq020==2) & (bpq050a !=7 & bpq050a !=9)

gen sbp140=1 if mean_sbp>=140 & mean_sbp<. & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))

replace sbp140=0 if sbp140 !=1 & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))

gen dbp90=1 if mean_dbp>=90 & mean_dbp<. & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))

replace dbp90=0 if dbp90 !=1 & ((n_sbp >0 & n_sbp <.) & (n_dbp >0 & n_dbp <.))

gen hbp=1 if (hbp_trt==1 | sbp140==1 | dbp90==1) & ((hbp_trt>=0 & hbp_trt<.) & (sbp140>=0 & sbp140<.) & (dbp90>=0 & dbp90<.))

replace hbp=0 if hbp !=1 & ((hbp_trt>=0 & hbp_trt<.) & (sbp140>=0 & sbp140<.) & (dbp90>=0 & dbp90<.))
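The repeated *<.* conditions matter because Stata treats a numeric missing value as larger than any number, so a bare test such as *mean_sbp>=140* is true for participants with no readings. The sketch below contrasts an unguarded indicator (the hypothetical variable *naive*) with the guarded form on three invented rows (not NHANES records).

```stata
* Toy data: invented values, not NHANES records
clear
input mean_sbp n_sbp n_dbp
150 3 3
120 3 3
.   0 0
end

* Unguarded comparison: missing counts as "greater than 140"
* (naive is a hypothetical name used only for this contrast)
gen naive = mean_sbp >= 140

* Guarded version, as in the statements above
gen sbp140 = 1 if mean_sbp >= 140 & mean_sbp < . & ((n_sbp > 0 & n_sbp < .) & (n_dbp > 0 & n_dbp < .))
replace sbp140 = 0 if sbp140 != 1 & ((n_sbp > 0 & n_sbp < .) & (n_dbp > 0 & n_dbp < .))

* Row 3 has no readings: the naive flag is wrongly 1, the guarded flag stays missing
assert naive == 1 in 3
assert missing(sbp140) in 3
assert sbp140 == 1 in 1
assert sbp140 == 0 in 2
```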

Use the following set of commands to
define a new variable, *hlp* (hyperlipidemia = 1 or 0), based on a
series of conditions that indicate high lipid levels from the questionnaire and
examination variables.

gen hlp_trt=1 if bpq100d==1

replace hlp_trt=0 if hlp_trt !=1 & (bpq080==1 | bpq080==2) & (bpq100d !=7 & bpq100d !=9)

gen hlp_lab=1 if lbxtc>=240 & lbxtc <.

replace hlp_lab=0 if hlp_lab !=1 & (lbxtc>=0 & lbxtc <.)

gen hlp=1 if ((hlp_lab >=0 & hlp_lab <.) & (hlp_trt >=0 & hlp_trt <.)) & (hlp_lab==1 | hlp_trt==1)

replace hlp=0 if hlp !=1 & ((hlp_lab >=0 & hlp_lab <.) & (hlp_trt >=0 & hlp_trt <.))
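One consequence of this definition is that *hlp* is missing whenever either component is missing, even if the other component equals 1. The sketch below shows this on five invented rows (not NHANES records) and compares the result with a more compact one-line form; *hlp_alt* is a hypothetical name, and the compact form is offered as an equivalent under the assumption that both components take only the values 0, 1, or missing.

```stata
* Toy data: invented component values, not NHANES records
clear
input hlp_trt hlp_lab
1 0
0 1
0 0
. 1
1 .
end

* Two-step definition, as in the statements above
gen hlp = 1 if ((hlp_lab >= 0 & hlp_lab < .) & (hlp_trt >= 0 & hlp_trt < .)) & (hlp_lab == 1 | hlp_trt == 1)
replace hlp = 0 if hlp != 1 & ((hlp_lab >= 0 & hlp_lab < .) & (hlp_trt >= 0 & hlp_trt < .))

* Compact form: evaluate the condition only where both components are present
* (hlp_alt is a hypothetical name)
gen hlp_alt = (hlp_lab == 1 | hlp_trt == 1) if !missing(hlp_lab, hlp_trt)

assert hlp == 1 in 1/2
assert hlp == 0 in 3
assert missing(hlp) in 4/5
assert hlp == hlp_alt
```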

Use the *save* command to save all the
derived variables to a new dataset (*demo_bp3.dta*).

save C:\Nhanes\Data\demo_bp3, replace
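After saving, it can be useful to reload the file and confirm that the derived variables are present. The sketch below shows the idea with a temporary file so that it runs on any machine; the temporary file name *demo_check* and the three invented rows are assumptions for illustration, and in practice you would reload the dataset you actually saved.

```stata
* Toy data: invented values, not NHANES records
clear
set obs 3
gen seqn = _n
gen hbp = mod(_n, 2)

* Save to a temporary file (demo_check is a hypothetical name)
tempfile demo_check
save "`demo_check'", replace

* Reload and confirm that the variables and observations survived the round trip
clear
use "`demo_check'"
confirm variable seqn hbp
assert _N == 3
```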


In this step, you will check to confirm that derived and recoded
variables correctly **correspond to the original variables**.

Use the *tabulate* command with the *bysort* prefix to cross-tabulate
the original categorical variables for race/ethnicity, high blood
pressure, and hyperlipidemia by their respective recoded variables
for participants who were interviewed and examined in the MEC and
who were age 20 years and older.

tab raceth ridreth1 if (ridageyr >=20 & ridageyr <.) & ridstatr==2, missing

bysort hbp_trt: tab bpq020 bpq050a if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing

bysort hbp hbp_trt: tab sbp140 dbp90 if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing

bysort hlp_trt: tab bpq080 bpq100d if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing

bysort hlp: tab hlp_trt hlp_lab if (ridageyr >=20 & ridageyr <.) & ridstatr==2, row missing
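Cross tabulations are checked by eye; the same check can also be expressed with *assert*, which stops the do-file if a recode is inconsistent. The sketch below is one possible way to do this on invented rows (not NHANES records): it confirms that every original *ridreth1* code maps to exactly one *raceth* value and that no valid code was left unassigned.

```stata
* Toy data: invented values, not NHANES records
clear
input ridreth1
1
2
3
3
4
5
5
end

gen raceth = 1 if ridreth1 == 3
replace raceth = 2 if ridreth1 == 4
replace raceth = 3 if ridreth1 == 1
replace raceth = 4 if ridreth1 == 2 | ridreth1 == 5

* Visual check: each column (original code) should fall in a single row
tab raceth ridreth1, missing

* Programmatic check: within each original code, the smallest and largest
* derived values must be identical
bysort ridreth1 (raceth): assert raceth[1] == raceth[_N]

* No valid original code should be left without a derived value
assert !missing(raceth) if inrange(ridreth1, 1, 5)
```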

Use the *tabstat* command
to calculate the number of observations and the minimum and maximum values for the original continuous
variables for participants who were interviewed and examined in the MEC and who
were age 20 years and older. The *by* option will separate the original
continuous variable into categories of the derived variables. This is done to
check that coding of the derived variable, based on cut-off points of the
continuous variable, is correct.

tabstat ridageyr if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age3cat) stat(n min max)

tabstat mean_sbp if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(sbp140) stat(n min max)

tabstat mean_dbp if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(dbp90) stat(n min max)

tabstat lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(hlp_lab) stat(n min max)
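The minimum and maximum within each derived category should fall inside the intended cut-off points. The sketch below shows this check for *age3cat* on invented ages (not NHANES records), first with *tabstat* and then with *assert* and the *inrange()* function as a programmatic version of the same comparison.

```stata
* Toy data: invented ages chosen to sit on the category boundaries
clear
input ridageyr
15
20
39
40
59
60
85
end

recode ridageyr (min/19 = .) (20/39 = 1) (40/59 = 2) (60/85 = 3), generate(age3cat)

* Visual check: min and max of the original variable within each category
tabstat ridageyr, by(age3cat) stat(n min max)

* Programmatic check of the same boundaries
assert inrange(ridageyr, 20, 39) if age3cat == 1
assert inrange(ridageyr, 40, 59) if age3cat == 2
assert inrange(ridageyr, 60, 85) if age3cat == 3
assert missing(age3cat) if ridageyr < 20
```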

Highlighted items comparing recoded or derived variables to original variables:

- The output from the tables shows that the derived categorical variables were assigned correctly. For example, 5,794 survey participants who were not treated for hypertension (*hbp_trt* = 0), who had mean systolic blood pressures less than 140 (*sbp140* = 0), and who had mean diastolic blood pressures less than 90 (*dbp90* = 0) were all assigned to *hbp* = 0 (no high blood pressure). Any observations with missing values for *hbp_trt*, *sbp140*, or *dbp90* were assigned missing values for the *hbp* derived variable. Any observations with non-missing values for all three variables and a "1" value for *hbp_trt*, *sbp140*, or *dbp90* were assigned a value of "1" for the *hbp* derived variable, indicating the presence of conditions used to define high blood pressure.
- The output from the *tabstat* command shows that the newly derived categorical variables were assigned correctly, based on the cut-off points of the original continuous variables. For example, the derived categorical variable, *age3cat*, had values of "1," "2," and "3," which correspond correctly to the selected cut-off points for age (20-39, 40-59, and 60-85) of the original continuous variable.