Task 2c: How to Use Stata Code to Perform Logistic Regression

In this module, you will use simple logistic regression to analyze NHANES data to assess the association between gender (riagendr) — the exposure or independent variable — and the likelihood of having hypertension (based on bpxsar, bpxdar) — the outcome or dependent variable, among participants 20 years old and older.  You will then use multiple logistic regression to assess the relationship after controlling for selected covariates.  The covariates include age (ridageyr), cholesterol (lbxtc), body mass index (bmxbmi) and fasting triglycerides (lbxtr).

 

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Use svyset to define survey design variables

Remember that you need to declare the survey design with svyset before using the svy series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

 

To define the survey design variables for this analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)
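As a quick check that the design was declared as intended, you can replay the settings and summarize the strata and PSUs. The sketch below is illustrative only: the eight records are made-up values, not NHANES data, and it uses the PSU-first form of svyset documented for Stata 9 and later, on the assumption that it is equivalent to the command above. With the real file loaded, only the last three commands would be needed.

```stata
* Toy data for illustration only: 2 strata with 2 PSUs each
clear
input sdmvstra sdmvpsu wtmec4yr
1 1 1000
1 1 1200
1 2 900
1 2 1100
2 1 1500
2 1 1300
2 2 800
2 2 950
end

* Declare the design: PSU variable first, sampling weight in brackets, then strata
svyset sdmvpsu [pweight=wtmec4yr], strata(sdmvstra) vce(linearized)

* Typing svyset with no arguments replays the current settings
svyset

* Summarize the number of PSUs and observations in each stratum
svydescribe
```

If a stratum shows only one PSU in the svydescribe listing, variance estimation will fail for that stratum, so this listing is worth reading before fitting any model.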

Step 2: Create dependent dichotomous variable

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice).  The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations. 

For the dependent variable, you will create a dichotomous variable, hyper, which defines people as having (or not having) hypertension. Specifically, a person is classified as having hypertension if their systolic blood pressure (measured in the MEC) is 140 mmHg or higher, their diastolic blood pressure is 90 mmHg or higher, or they are taking blood pressure medication. Participants who meet none of these criteria but are missing either blood pressure reading are left as missing. Remember that for logistic regression to work in Stata, this variable needs to be defined as 0 (meaning the outcome did not occur; here, the person does not have hypertension) or 1 (the outcome occurred; here, the person has hypertension). The code to create this variable is below:

gen hyper=1 if (bpxsar>=140 & bpxsar<. | bpxdar>=90 & bpxdar<.) | bpq050a==1
replace hyper=0 if hyper !=1 & (bpxsar !=. & bpxdar !=.)
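To see how these two lines treat missing readings, it can help to run them on a few hand-made records. The sketch below uses six made-up rows (not NHANES data) chosen to exercise each branch of the definition.

```stata
* Toy records for illustration only
clear
input bpxsar bpxdar bpq050a
150 80 2
120 95 2
118 76 1
118 76 2
.   .  2
130 .  2
end

* Same two commands as in the tutorial
gen hyper=1 if (bpxsar>=140 & bpxsar<. | bpxdar>=90 & bpxdar<.) | bpq050a==1
replace hyper=0 if hyper !=1 & (bpxsar !=. & bpxdar !=.)

* Rows 1-3 are coded 1 (high systolic, high diastolic, on medication),
* row 4 is coded 0, and rows 5-6 stay missing because a reading is missing
list
tab hyper, missing
```

Running tab hyper, missing on the real data shows how many participants drop out of the analysis because hyper could not be determined.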

 

Step 3: Create independent categorical variables

In addition to creating the dichotomous dependent variable, this example will also create independent categorical variables (age, hichol, bmigrp) from the continuous age, cholesterol, and BMI variables to use in this analysis.

Code to generate independent categorical variables (each variable is followed by its code)
Age

gen age=1 if ridageyr >=20 & ridageyr <40
replace age=2 if ridageyr >=40 & ridageyr <60
replace age=3 if ridageyr >=60 & ridageyr <.

High cholesterol

gen hichol =1 if lbxtc >=240 & lbxtc<. | bpq100d==1
replace hichol =0 if hichol ~=1 & lbxtc !=.

BMI category

gen bmigrp=1 if bmxbmi<25
replace bmigrp=2 if bmxbmi>=25 & bmxbmi <30
replace bmigrp=3 if bmxbmi>=30 & bmxbmi <.
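After recoding, it is worth confirming that each category covers the intended range and that out-of-range or missing values were not swept into a group. The sketch below checks the age recode on made-up boundary values (not NHANES data); the same tabstat check could be applied to bmigrp with bmxbmi.

```stata
* Toy ages at the category boundaries, plus one under-20 and one missing value
clear
input ridageyr
19
20
39
40
59
60
85
.
end

gen age=1 if ridageyr >=20 & ridageyr <40
replace age=2 if ridageyr >=40 & ridageyr <60
replace age=3 if ridageyr >=60 & ridageyr <.

* Minimum, maximum, and count of the source variable within each new category
tabstat ridageyr, by(age) statistics(min max n)

* Records that received no category (under 20 or missing age)
list ridageyr if missing(age)
```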

 

Step 4: Transform highly skewed variables

Because the triglycerides variable (lbxtr) is highly skewed, you will use a log transformation to create a new variable to use in this analysis.

gen logtrig = log(lbxtr)
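One way to confirm that the transformation helped is to compare the skewness reported by summarize, detail before and after. The sketch below uses made-up triglyceride values (not NHANES data); note that Stata's log() is the natural logarithm and returns missing when the input is missing.

```stata
* Toy right-skewed values for illustration only
clear
input lbxtr
50
80
100
120
150
400
1200
.
end

* Skewness of the raw variable
summarize lbxtr, detail
scalar skew_raw = r(skewness)

* Natural-log transformation, as in the tutorial
gen logtrig = log(lbxtr)

* Skewness after transformation
summarize logtrig, detail
scalar skew_log = r(skewness)

display "skewness raw = " skew_raw "   skewness log = " skew_log
```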

 

Step 5: Choose reference groups for categorical variables

For all categorical variables, you need to decide which category to use as the reference group. If you do not specify the reference group options, Stata will choose the lowest numbered group by default. You can use the following general command to tell Stata the reference group:

char var [omit] reference_group_value

 

For your analyses, use the following commands to specify the following reference groups:

Code to specify reference groups

| Variable | Code to specify reference group | Reference group |
| --- | --- | --- |
| Gender | char riagendr [omit] 2 | Women |
| Age | char age [omit] 2 | 40-59 year olds |
| BMI | char bmigrp [omit] 2 | Overweight (BMI 25 to <30) |
| Cholesterol | char hichol [omit] 1 | High cholesterol (hichol = 1: total cholesterol ≥240 mg/dL or bpq100d = 1) |
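To confirm which category will be left out, you can run xi on its own and look at the indicator variables it creates. The sketch below uses a made-up six-row dataset; the characteristic is written without a space before [omit], the form shown in the Stata manual entry for xi.

```stata
* Toy data for illustration only
clear
input age
1
2
3
1
2
3
end

* Mark category 2 as the one to omit (the reference group)
char age[omit] 2

* xi creates _Iage_1 and _Iage_3; no indicator is created for category 2
xi i.age
describe _Iage_*
list
```

The indicator names produced here (_Iage_1, _Iage_3) are the ones the regression output and any later test commands will refer to.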

 

Step 6: Create eligibility variable

Because not every participant in NHANES responded to every question asked, there may be a different level of item non-response to each variable.  To ensure that your analyses are done on the same number of respondents, create a variable called eligible which is 1 for individuals who have a non-blank value for each of the variables used in the analyses, and 0 otherwise.  Although this is a univariate analysis using only exam variables, the fasting subsample weight (wtsaf4yr) is included in determining the eligible variable. This is because you will be conducting a multivariate analysis using the triglycerides variable later and will limit the sample to persons included in both analyses. The Stata code defining eligible is:

gen eligible=1 if wtsaf4yr!=. & hyper!=. & riagendr!=. & age!=. & hichol!=. & bmigrp!=. & logtrig!=. & wtsaf4yr!=0
replace eligible=0 if eligible!=1

 

Step 7: Create simple logistic regression model to understand relationships

The association between the dependent (or outcome) and independent (or exposure) variables is expressed using the svy:logit command. The dependent variable must be a dichotomous variable and the independent variables may be either discrete, ordinal, or continuous.

The general form of the command to get beta coefficients is:

xi: svy, subpop(condition): logit depvar i.indvar

 

To get odds ratios with the logit command, use the or option:

xi: svy, subpop(condition): logit depvar i.indvar, or

 

Odds ratios are automatically produced by the logistic command:

xi: svy, subpop(condition): logistic depvar i.indvar

 

An example command analyzing the relationship between gender and hypertension using the logistic command is shown below:

xi: svy, subpop(if eligible==1): logistic hyper i.riagendr

 

In this example, the output for the logistic command is:

[Image: output for the logistic command]

Highlights in the output include:

 

 

Step 8: Specify multiple logistic regression model

Multiple logistic regression uses the same command structure but now includes other independent variables. If you want to create indicator variables for categorical variables, you will want to use the xi prefix. However, the general structure remains the same:

xi: svy, subpop(condition): logistic depvar indvar i.indvar

 

For this example, you will be using these commands to analyze the effects of gender, age, high cholesterol, BMI, and triglycerides on hypertension. Please note that the svyset command uses the fasting subsample weight, wtsaf4yr, because this analysis includes the triglycerides variable, which was collected only on a subsample of the survey.

svyset [w=wtsaf4yr], psu(sdmvpsu) strata(sdmvstra)

xi: svy, subpop(if eligible==1): logistic hyper i.riagendr i.age i.hichol i.bmigrp logtrig
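In Stata 11 and later, factor-variable notation offers an alternative to the xi prefix and the char ... [omit] settings: the prefix ib#. declares a variable as categorical with category # as the base. The sketch below runs on simulated data so that it is self-contained; every value in it is made up, the model is reduced to two covariates, and it is not a substitute for the NHANES analysis above.

```stata
* Simulated survey data for illustration only (not NHANES values)
clear
set seed 2024
set obs 600
* 4 strata with 3 PSUs in each
gen sdmvstra = mod(_n, 4) + 1
gen sdmvpsu  = mod(floor(_n/4), 3) + 1
* Made-up weights, age groups, log triglycerides, and outcome
gen wtsaf4yr = 1000 + 10*mod(_n, 7)
gen age      = mod(floor(_n/12), 3) + 1
gen logtrig  = 4.5 + rnormal()
gen hyper    = runiform() < invlogit(-3 + 0.8*age + 0.3*logtrig)
gen eligible = 1

svyset sdmvpsu [pweight=wtsaf4yr], strata(sdmvstra)

* ib2.age: age is categorical with group 2 as the reference; no xi prefix needed
svy, subpop(if eligible==1): logistic hyper ib2.age logtrig
```

Under the same assumption, the NHANES model could be written as svy, subpop(if eligible==1): logistic hyper ib2.riagendr ib2.age ib1.hichol ib2.bmigrp logtrig. Coefficients are then named 1.age, 3.age, and so on rather than _Iage_1 and _Iage_3, so any later test command would need to use those names.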

 

In this example, this output is:

[Image: output of the effects of gender, age, high cholesterol, BMI, and triglycerides on hypertension]

Highlights from the output include:

 

 

Step 9: Compare results of simple and multiple logistic regressions

To understand how much adjustment matters, it is helpful to compare the odds ratios from the simple and multiple regression models. The following tables summarize the results.

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Sex

| Sex | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
| --- | --- | --- | --- | --- | --- |
| Men | 27% | 0.89 (0.75 - 1.05) | 0.94 (0.76 - 1.16) | 0.16 | 0.55 |
| Women | 30% | Reference Group | Reference Group | Reference Group | Reference Group |

 

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Age

| Age (years) | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
| --- | --- | --- | --- | --- | --- |
| 20-39 | 9% | 0.25 (0.18 - 0.34) | 0.28 (0.21 - 0.38) | <0.001 | <0.001 |
| 40-59 | 28% | Reference Group | Reference Group | Reference Group | Reference Group |
| 60+ | 66% | 4.87 (3.76 - 6.3) | 5.27 (4.00 - 6.94) | <0.001 | <0.001 |

 

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - BMI

| BMI | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
| --- | --- | --- | --- | --- | --- |
| Underweight/normal | 18% | 0.58 (0.46 - 0.72) | 0.67 (0.51 - 0.87) | <0.001 | 0.004 |
| Overweight | 28% | Reference Group | Reference Group | Reference Group | Reference Group |
| Obese | 42% | 1.85 (1.52 - 2.25) | 2.18 (1.70 - 2.80) | <0.001 | <0.001 |

 

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Cholesterol

| Cholesterol | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
| --- | --- | --- | --- | --- | --- |
| High | 43% | Reference Group | Reference Group | Reference Group | Reference Group |
| Low/Normal | 24% | 0.41 (0.34 - 0.49) | 0.78 (0.62 - 0.97) | <0.001 | 0.028 |

 

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Triglycerides

| Triglycerides | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
| --- | --- | --- | --- | --- | --- |
| Triglycerides (log-transformed) | N/A | 1.98 (1.65 - 2.37) | 1.28 (1.03 - 1.61) | <0.001 | 0.029 |

 

Step 10: Post-estimation

You may want to know whether comparisons other than those against the reference categories you specified are significant. In that case, you can use a post-estimation command (i.e., a command that can only be run after you have run the logit or logistic model command). To request the unadjusted Wald F test, the command takes the general form:

test vargroup, nosvyadjust

 

This example uses this command to test whether the likelihood of having hypertension in the youngest age group differs significantly from that in the oldest age group:

test _Iage_1 = _Iage_3, nosvyadjust

 

The results for this example are:

F(1, 29) = 443.30; Prob > F = 0.0000
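The test command reports only whether the two coefficients differ. If you also want the size of the difference, lincom with the or option reports the odds ratio for one group relative to the other, with its confidence interval. The sketch below runs on simulated data so that it is self-contained; all values are made up, and the model is reduced to age and logtrig.

```stata
* Simulated survey data for illustration only (not NHANES values)
clear
set seed 2024
set obs 600
* 4 strata with 3 PSUs in each
gen sdmvstra = mod(_n, 4) + 1
gen sdmvpsu  = mod(floor(_n/4), 3) + 1
gen wtsaf4yr = 1000 + 10*mod(_n, 7)
gen age      = mod(floor(_n/12), 3) + 1
gen logtrig  = 4.5 + rnormal()
gen hyper    = runiform() < invlogit(-3 + 0.8*age + 0.3*logtrig)

svyset sdmvpsu [pweight=wtsaf4yr], strata(sdmvstra)

* Age group 2 is the reference, as in the tutorial
char age[omit] 2
xi: svy: logistic hyper i.age logtrig

* Unadjusted Wald test that the age 1 and age 3 coefficients are equal
test _Iage_1 = _Iage_3, nosvyadjust

* Odds ratio for age group 1 relative to age group 3
lincom _Iage_1 - _Iage_3, or
```

Because the coefficients are on the log-odds scale, the difference _Iage_1 - _Iage_3 exponentiates to the ratio of the two odds ratios reported against the reference group.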

 
