In this example, you will assess the
association between high density lipoprotein (HDL) cholesterol — the outcome variable —
and body mass index (*bmxbmi*) — the exposure variable —
after controlling for
selected covariates in NHANES 1999-2002. These covariates include
gender (*riagendr*), race/ethnicity (*ridreth1*), age (*ridageyr*), smoking (*smoke*r, derived
from *SMQ020* and *SMQ040; smoker* =1 if non-smoker, 2 if
past smoker and 3 if current smoker) and education (*dmdeduc*).

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the survey design variables for your high density lipoprotein
cholesterol analysis, use the weight variable
for four-years of MEC data (*wtmec4yr*), the PSU variable (*sdmvpsu*),
and strata variable (*sdmvstra*) .The* vce* option specifies the
method for calculating the variance and the default is "linearized" which is
Taylor linearization. Here is the *svyset* command for
four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice). The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.

It is important to exam the data both ways, since the assumption that a dependent variable has a continuous relationship with the outcome may not be true. Looking at the categorical version of the variable will help you to know whether this assumption is true.

In this example, you could look at BMI as a continuous variable or convert it into a categorical variable based on standard BMI definitions of underweight, normal weight, overweight and obese. Here is how categorical BMI variables are created:

Code to generate categorical BMI variables | BMI Category |
---|---|

gen bmicat=1 if bmxbmi>=0 & bmxbmi<18.5 |
underweight |

replace bmicat=2 if bmxbmi>=18.5 & bmxbmi<25 |
normal weight |

replace bmicat=3 if bmxbmi>=25 |
overweight |

replace bmicat=4 if bmxbmi>=30 & bmxbmi<. |
obese |

For all categorical variables, you need to decide which category to use as the reference group. If you do not specify the reference group options, Stata will choose the lowest numbered group by default.

Use the following general command to specify the reference group:

char var[omit]reference group value

For these analyses, use the following commands to specify the following reference groups.

Stata command | Reference group |
---|---|

char ridreth1[omit]3 |
Non-Hispanic White |

char smoker[omit]3 |
Current Smokers |

char educ[omit]3 |
Greater than high school education |

char bmicat[omit]2 |
Normal weight |

Before you perform a regression on the data, the data needs to meet
a requirement — the dependent variable must be
a continuous variable and the independent variables may be either discrete,
ordinal, or continuous. The association between the
dependent (or outcome) and independent (or exposure) variables is
expressed using the *svy:regress* command. The general form of the command is:

svy:regress depvar indvar

Here is the command (and output) for the BMI-HDL example. This example uses
the *subpop (if eligible==1) *statement to restrict the analysis to individuals with
complete data for all the variables used in the final multiple regression
model. The * eligible *variable is defined in the program available
on the Sample Code and Datasets page.

svy, subpop(if eligible==1): regress lbdhdl bmxbmi

And, here is the output of the statement.

This analysis says that for each 1 unit increase of BMI, on average, HDL decreases by 0.69 mg/dl. Or, you could do the simple regression using the BMI categories:

To perform the same analysis using the categorical BMI variable, *bmicat*,
the statement would be:

svy, subpop(if eligible==1): regress lbdhdl bmicat

And, the output of that statement would be:

This model says that, on average, HDL levels decrease by 5.6 mg/dl between the underweight BMI category and the normal weight BMI category, or the normal weight BMI category to the overweight BMI category.

Delving deeper into this relationship, you will look at each comparison
separately to see whether this continuous relationship really holds. Stata has
a function (called* xi *or *interaction expansion*) which creates the
"indicator variables" to allow you to see these relationships. The
*xi *function will
expand terms containing categorical variables (denoted *i.varname*)
into indicator (also called dummy) variable sets.
It has this
general form:

xi:svy:regress depvar i.indvar

For this example, you will use the HDL variable as the dependent variable and
the BMI categorical variable (*bmicat)* as the independent variable,
denoted with the *i. *prefix. This example uses the *subpop(if ridageyr
>= 20 & ridageyr =.) *statement to select participants who were age 20 years
and older and did not have a missing value for the age variable.

xi:svy, subpop(if ridageyr >=20 & ridageyr =.): regress lbdhdl i.bmicat

Here are the
results for this analysis which use "normal weight" - *bmicat2* as the reference
category:

This analysis using the BMI categorical variable (*BMICAT*) shows that the relationship is not linear and the difference in HDL is
between underweight and normal is 3.2 compared to a 7.5 difference between
normal weight and overweight.

IMPORTANT NOTE

You
can also just use the* xi *option to generate the indicator variables or interaction terms
(rather than using it with the *model *command). The advantage of creating the
indicators prior to the model, is that you do not need to write a command to set
the reference category (you will do this implicitly by selecting the indicators
you include in the model) and the output is easier to read (the *xi model* command
repeats the individual components in interaction terms).

It is also
possible to generate indicator variables using the t*ab, generate* command:

tab var,
gen(*newvar*); for example: tab bmicat, gen(ibmicat)

This command generates four variables:
*ibmicat1*,* ibmicat2*, *ibmicat3*, and *ibmicat4.*

Multiple linear regression uses the same command structure but now includes
other independent variables. And if you want to create indicator variables for
categorical variables, you will want to use the* xi *option. So, the general
structure looks the same:

xi: svy: regress depvar indvar i.var

This example will use the HDL variable (*lbdhdl*) as the dependent variable.
The independent categorical variables (*riagendr*, *ridreth1*, *smoker*, *
educ*, and *bmicat*) are specified with the* i.* prefix, while *
ridageyr* remains an independent continuous variable. Again, it uses the *subpop(if ridageyr >= 20 &
ridageyr =.) *statement to select participants who were age 20 years and
older and did not have a missing value for the age variable.

xi: svy, subpop(if ridageyr >=20 & ridageyr <.): regress lbdhdl i.riagendr i.ridreth1 ridageyr i.smoker i.educ i.bmicat

In this example, the output is:

Later in this module, the results of this multiple regression will be presented in a summary table comparing it with the univariate regression.

Sometimes you may want to calculate means which are
adjusted for the covariates specified in the model to allow you to see the
effect of a given predictor variable. Stata has a built in command, *
adjust,* to do this. *Adjust * is a post-estimation command

The *adjust *command uses only the sample mean, not the mean based on the
survey design, when performing its computations. Therefore, if you want to use the
survey mean, you would need to calculate it first and specify it explicitly in the
*adjust* command. The following commands use *summarize* which is an rclass command and
will not cause any trouble if run between the s*vy: regress* and *adjust
*commands; whereas *svy:mean*
is an eclass* *command and cannot be used in between these commands. Here is the
general form of the command:

sum _cat_1 [aw=weight] if conditions & e(sample)

local cat1 = r(mean)

...

adjust indvar1=cat1 indvar2=cat2.... if e(sample), by(indvar3)

The variables have to appear just as they are in the
regression model. Independent categorical variables are specified with the*
_I *prefix, while the independent continuous variable, *ridageyr*,
doesn't not require the prefix. The following command, will generate mean HDL levels for
BMI
categories (*by (bmicat)*), adjusting for every other variable in the model.

sum _Ibmicat_1 [aw=wtmec4yr] if ridageyr >=20 & ridageyr <. & e(sample)

local bmicat1 = r(mean)

sum _Ibmicat_3 [aw=wtmec4yr] if ridageyr >=20 & ridageyr <. & e(sample)

local bmicat3 = r(mean)

sum _Ibmicat_4 [aw=wtmec4yr] if ridageyr >=20 & ridageyr <. & e(sample)

local bmicat4 = r(mean)

adjust _Iriagendr_2=`riagendr2' _Iridreth1_1=`rid1' _Iridreth1_2=`rid2' ///

_Iridreth1_4=`rid4' _Iridreth1_5=`rid5' ridageyr=`ridage' ///

_Ieduc_1=`educ1' _Ieduc_2=`educ2' _Ismoker_1=`smoke1' ///

_Ismoker_2=`smoke2' if ridageyr >=20 & ridageyr <. & e(sample), by(bmicat) se

The output for this example is:

Later in this module, the results of this adjusted means calculation will be presented in a summary table comparing it with the crude mean.

To understand how much adjustment matters, it is helpful to compare the regression coefficient from the simple and multiple regression models. To help you review the results, the following summary tables present the crude analysis (simple linear regression) and adjusted analysis (multiple linear regression).

BMI |
Crude Analysis Mean HDL |
Adjusted Analysis Mean HDL |
Crude Analysis Coefficient* (95% CI) |
Adjusted Analysis Coefficient* (95% CI) |
Crude Analysis p value |
Adjusted Analysis p value |
---|---|---|---|---|---|---|

underweight |
60.44 |
59.40 |
3.26 |
2.30 |
.028 |
.097 |

normal |
57.18 |
57.10 |
Reference |
Reference |
Reference |
Reference |

overweight |
49.69 |
50.55 |
-7.48 |
-6.55 |
<.001 |
<.001 |

obese |
45.94 |
45.10 |
-11.24 |
-12.00 |
<.001 |
<.001 |

Here are the summary tables of the results for the other covariates in the model.

Smoking |
Crude Analysis Mean HDL |
Adjusted Analysis Mean HDL |
Crude Analysis Coefficient* (95% CI) |
Adjusted Analysis Coefficient* (95% CI) |
Crude Analysis p value |
Adjusted Analysis p value |
---|---|---|---|---|---|---|

current | 49.35 | 51.27 | Reference Group | Reference Group | ||

past | 51.64 | 52.38 |
2.76 (1.29 – 4.23) |
2.33 (1.00 – 3.66) |
.001 | .001 |

never | 52 | 50.05 |
2.32
(.85 – 3.79) |
1.22 (-.22 – 2.66) |
.003 | .095 |

Sex |
Crude Analysis Mean HDL |
Adjusted Analysis Mean HDL |
Crude Analysis Coefficient* (95% CI) |
Adjusted Analysis Coefficient*(95% CI) |
Crude Analysis p value |
Adjusted Analysis p value |
---|---|---|---|---|---|---|

men | 45.91 | 46.08 | Ref erence Group | Ref erence Group | ||

women | 56.21 | 56.06 |
10.30
(9.54 – 11.06) |
9.98
(9.3 – 10.64) |
<.001 | <.001 |

Race/Ethnicity |
Crude Analysis Mean HDL |
Adjusted Analysis Mean HDL |
Crude Analysis Coefficient* (95% CI) |
Adjusted Analysis Coefficient* (95% CI) |
Crude Analysis p value |
Adjusted Analysis p value |
---|---|---|---|---|---|---|

Non-Hispanic White | 51.38 | 50.88 | Ref erence Group | Ref erence Group | ||

Non-Hispanic Black | 54.5 | 55.83 |
3.12
(1.54 – 4.70) |
4.95 (3.61 – 6.29 |
<.001 | <.001 |

Other Hispanic | 47.71 | 48.64 |
-3.67 (-5.47 – -1.88) |
-2.24 (-3.59 – -.89) |
<.001 | .002 |

Mexican-American | 48.92 | 51.55 |
-2.46
(-3.59 – -1.33) |
.67
(-.46 – 1.80) |
<.001 | .235 |

Other | 50.91 | 50.32 |
-.47 (-3.28 – 2.33) |
-.56 (-2.83 – 1.71) |
.733 | .619 |

Education |
Crude Analysis Mean HDL |
Adjusted Analysis Mean HDL |
Crude Analysis Coefficient* (95% CI) |
Adjusted Analysis Coefficient* (95% CI) |
Crude Analysis p value |
Adjusted Analysis p value |
---|---|---|---|---|---|---|

< high school | 49.37 | 49.34 |
-3.10
(-4.41 – -1.79) |
-3.03
(-4.16 – -1.90) |
<.001 | <.001> |

high school | 50.30 | 50.52 |
-2.18 (-3.23 – -1.12) |
-1.85
(-2.98 – -.73) |
<.001 | .002 |

> high school | 52.47 | 52.37 | Reference Group | Reference Group |

Age |
Crude Analysis Mean HDL |
Adjusted Analysis Mean HDL |
Crude Analysis Coefficient* (95% CI) |
Adjusted Analysis Coefficient* (95% CI) |
Crude Analysis p value |
Adjusted Analysis p value |
---|---|---|---|---|---|---|

Age (years) |
- |
- |
.11 (.07 – .12) |
- |
<.001 | <.001 |

Use the *test *post estimation command to produce the Wald F statistic
and the corresponding p-value. Use the *nosvyadjust *option to produce the
unadjusted Wald F. In the example, the command *test* is used to test all
coefficients together; all coefficients separately; and to test the hypothesis
that HDL cholesterol for non-smokers is the same as that for past smokers.

The general form of this statement is below.

test indvar 1 ind var 2 ..., [nosvyadjust]

This example tests all of the coefficients.

test

Here are the results of that statement:

The results of this test will be discussed in the next step.

This example tests the gender coefficient and, using the *nosvyadjust *
option, produces the unadjusted Wald F. Tests for the additional variables are
included in the program available on the
Sample Downloads and Datasets page.

test _Iriagendr_2, nosvyadjust

Here are the results of that statement:

The results of this test will be discussed in the next step.

This example tests the hypothesis that HDL cholesterol for non-smokers is the same as that for past smokers.

test _Ismoker_1 - _Ismoker_2 = 0

Here are the results of the statement:

The results of this test will be discussed in the next step.

In this step, the Stata output is reviewed.

- HDL cholesterol is 6.55 mg/dL higher for overweight adults compared to normal weight adults, as defined by BMI.
- HDL cholesterol is 12.00 mg/dL higher for obese adults compared to normal weight adults, as defined by BMI.
- HDL cholesterol is 2.30 mg/dL lower for underweight adults compared to normal weight adults, as defined by BMI.
- HDL cholesterol is 9.98 mg/dL higher for women than for men, after adjusting for all other variables in the model.
- The F test for gender shows a significant effect (p < 0.001) of gender for HDL cholesterol when controlling for other covariates in the model.
- HDL cholesterol is 4.95 mg/dL higher for non-Hispanic Blacks compared to non-Hispanic Whites, after adjusting for all other variables in the model.
- HDL cholesterol increases 0.11 mg/dL per unit increase in age.

- Printer-friendly annotated table of commands
- Can't view the demonstration? Try our Tech Tips for troubleshooting help.

If you want to look for interactions, use the *xi* option to create interaction terms. The general form
for interaction terms is:

i.var1*i.var2

See the sample code in Sample Datasets and Code for a model with interaction term included.