In cross-sectional surveys such as NHANES,
linear regression analyses can be used to examine the
association between multiple covariates and a health outcome.
For example, we will assess the association between high density
lipoprotein cholesterol (Y) and selected covariates (X_{i})
in this module. The covariates in this example will include
race/ethnicity, age, sex, body mass index (BMI), smoking status,
and education level.

Use simple linear regression when you have a single independent variable, and multiple linear regression when you have more than one independent variable (i.e., an exposure and one or more covariates). Multiple regression lets you understand the effect of the exposure of interest on the outcome after accounting for the effects of other variables (called covariates or confounders).

*Simple linear regression* is used to explore associations between one
(continuous, ordinal or categorical) exposure and one (continuous) outcome
variable. Simple linear regression lets you answer questions like, "How does HDL level vary with age?"

*Multiple linear regression* is used to explore associations between two
or more exposure variables (which may be continuous, ordinal or categorical) and
one (continuous) outcome variable. The purpose of multiple linear regression is
to let you isolate the relationship between the exposure variable and the
outcome variable from the effects of one or more other variables called
covariates. For example, suppose that HDL levels tend to be higher among people
with more income, and that people with more income tend to be older. In this case,
inferences about HDL and age are confused by the effect of income on HDL.
This kind of "confusion" is called confounding (and these covariates are
sometimes called confounders). Confounders are variables that are
associated with both the exposure and the outcome of interest. This
relationship is shown in the following figure.

You can use multiple linear regression to see through confounding and isolate the relationship of interest. In this example, the relationship is between HDL cholesterol level and age. That is, multiple linear regression lets you answer the question, "How does HDL level vary with age, after accounting for (that is, unconfounded by, or independent of) income?" As mentioned, you can include many covariates at one time. The process of accounting for covariates is also called adjustment.

Comparing the results of simple and multiple linear regressions can help to answer the question "How much did the covariates in the model distort the relationship between exposure and outcome (i.e., how much confounding was there)?"
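The comparison of crude and adjusted results can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values are made up (not NHANES data), the names `cross_dev`, `crude_slope`, and `adjusted_slope` are hypothetical, the fit is unweighted (a real NHANES analysis must account for the survey weights and design), and the adjusted slope uses the closed-form solution that applies only when there are exactly two predictors.

```python
def cross_dev(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def crude_slope(x, y):
    """Slope from a simple regression of y on x alone."""
    return cross_dev(x, y) / cross_dev(x, x)


def adjusted_slope(x, z, y):
    """Slope for x from a regression of y on both x and z (two-predictor closed form)."""
    sxx, szz, sxz = cross_dev(x, x), cross_dev(z, z), cross_dev(x, z)
    sxy, szy = cross_dev(x, y), cross_dev(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


# Made-up illustrative values, NOT NHANES data. HDL is constructed as
# 40 + 0.1*age + 0.5*income, so the age slope is 0.1 by design.
age = [20, 30, 40, 50, 60, 70]        # years
income = [20, 35, 30, 55, 50, 65]     # thousands of dollars (hypothetical)
hdl = [40 + 0.1 * a + 0.5 * i for a, i in zip(age, income)]

crude = crude_slope(age, hdl)
adjusted = adjusted_slope(age, income, hdl)
print(f"crude age slope:    {crude:.3f}")
print(f"adjusted age slope: {adjusted:.3f}")
print(f"change due to adjustment: {crude - adjusted:.3f}")
```

In this constructed example the crude age slope is several times larger than the adjusted one, because older people in the made-up data also have higher incomes. The gap between the two slopes is one way to gauge how much confounding by income was present.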

Note that getting statistical packages like SUDAAN, SAS Survey, and Stata to run analyses is the easy part of regression. What is not easy is knowing which variables to include in your analyses, how to represent them, when to worry about confounding, how to determine whether your models are any good, and how to interpret them. These tasks require thought, training, experience, and respect for the underlying assumptions of regression. Remember: garbage in, garbage out.

Finally, remember that NHANES analyses can only establish associations and not causal relationships. This is because the data are cross-sectional, so there is no way to establish temporal sequences (i.e., which came first, the "exposure" or the "outcome"?).

This module will assess the association between high density lipoprotein cholesterol (the outcome variable) and selected covariates to show how to use linear regression with Stata. The covariates in this example will include race/ethnicity, age, sex, body mass index (BMI), smoking status, and education level. In other words, what is the effect of each of these variables, independent of the effect of the other variables?

In the simplest case, you plot the values of a continuous dependent
variable Y against a continuous independent variable X_{1} (i.e., a
scatterplot) and find the best-fit line that can be drawn through the points.

The first thing to do is make sure the relationship of interest is linear (since linear regression draws a straight line through data points). The best way to do this is to look at a scatterplot. If the relationship between variables is linear, continue (see panels a and b below). If it is not linear, do not use linear regression: Stata will draw a line, but that line won't adequately describe the data (see panels c and d below). In this case, you can try to transform the data or use other forms of regression such as polynomial regression.

[Figure: four scatterplots. Panels A and B show linear relationships; Panels C and D show relationships that are not linear.]

This relationship between X_{1} and Y
can be expressed as

**(1)** $Y = b_{0} + b_{1}X_{1} + \varepsilon$

b_{0}, also known as the intercept, denotes the point
at which the line intersects the vertical axis;
b_{1}, or the slope, denotes the change in the dependent variable, Y,
per unit change in the independent variable, X_{1}; and
ε indicates the degree to which the plot
of Y against X_{1} differs from a straight line. Note that for survey data,
the points never fall exactly on a line, so ε is never 0 for every observation.
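As a rough illustration of the intercept, slope, and error term described above, the following Python sketch fits a straight line by ordinary least squares. The function name `fit_simple` and the data values are hypothetical (not NHANES data), and the fit is unweighted, so it ignores the survey design that a real NHANES analysis in Stata would need to account for.

```python
from statistics import mean


def fit_simple(x, y):
    """Unweighted least squares for Y = b0 + b1*X1 + error."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope: change in Y per unit change in X1
    b0 = y_bar - b1 * x_bar     # intercept: value of the line at X1 = 0
    return b0, b1


# Made-up illustrative values, NOT NHANES data.
age = [20, 30, 40, 50, 60]               # X1, years
hdl = [48.0, 51.0, 50.0, 54.0, 55.0]     # Y, mg/dL

b0, b1 = fit_simple(age, hdl)

# Residuals estimate the error term: observed Y minus the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(age, hdl)]
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals fall on both sides of the fitted line, and for a least-squares fit with an intercept they sum to zero. What matters is that they are not all zero: the points scatter around the line rather than lying on it.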

You can further extend equation (1) to include any number of independent
variables X_{i}, where i = 1, ..., n. These variables may be continuous
(e.g., 0-100) or discrete (e.g., 0/1 or yes/no).

**Equation for Multiple Regression Model**

**(2)** $Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + \dots + b_{n}X_{n} + \varepsilon$

The choice of variables to include in equation (2) can be based on results of
univariate analyses, where X_{i} and Y have a demonstrated association.
It also can be based on empirical evidence where a definitive association
between Y and an independent variable has been demonstrated in previous studies.
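To make the multiple regression model more concrete, here is a minimal Python sketch that estimates the coefficients for one continuous and one discrete (0/1) independent variable by solving the least-squares normal equations. It is an illustration under stated assumptions, not the method used by Stata: the helper names `solve` and `fit_multiple` are hypothetical, the data are made up so that the true coefficients are known, and the fit is unweighted and ignores the survey design.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple(rows, y):
    """Unweighted least squares for Y = b0 + b1*X1 + ... + bn*Xn + error."""
    design = [[1.0] + list(r) for r in rows]          # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up illustrative values, NOT NHANES data: (age in years, female 0/1).
# HDL is constructed as 45 + 0.1*age + 8*female, so the coefficients are known.
rows = [(20, 0), (30, 1), (40, 0), (50, 1), (60, 0), (70, 1)]
hdl = [47.0, 56.0, 49.0, 58.0, 51.0, 60.0]

b0, b_age, b_female = fit_multiple(rows, hdl)
print(f"intercept = {b0:.2f}, age = {b_age:.3f}, female = {b_female:.2f}")
```

Each coefficient is read as the change in Y per unit change in that variable with the other variables held fixed. For the 0/1 variable, the coefficient is the difference in Y between the two groups at the same age.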

It is possible to have two continuous variables, Y and X_{1},
on sampled individuals such that if the values of Y are plotted
against the values of X_{1},
the resulting plot would resemble a parabola (i.e., the value of Y
could increase with increasing values of X_{1}, level off, and then
decline). A polynomial regression model is used to describe this
relationship between X_{1} and
Y and is expressed as

**Equation for Polynomial Regression**

**(3)** $Y = b_{0} + b_{1}X_{1} + b_{2}X_{1}^{2} + \varepsilon$
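A quadratic polynomial regression can be treated as a multiple regression in which the square of X_{1} is entered as an additional independent variable. The Python sketch below illustrates this idea under stated assumptions: the helper names `solve_linear_system` and `fit_quadratic` are hypothetical, the data are made up to follow a rise, plateau, and decline, and the fit is unweighted.

```python
def solve_linear_system(a, b):
    """Gauss-Jordan elimination with partial pivoting for a * coef = b."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(x, y):
    """Unweighted least squares for Y = b0 + b1*X1 + b2*X1^2 + error."""
    design = [[1.0, xi, xi ** 2] for xi in x]   # X1 squared is just another column
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Made-up illustrative values: Y rises, levels off, then declines.
# Constructed as Y = 30 + 12*X1 - X1^2, so the coefficients are known.
x1 = [2, 3, 4, 5, 6, 7]
y = [50.0, 57.0, 62.0, 65.0, 66.0, 65.0]

b0, b1, b2 = fit_quadratic(x1, y)
turning_point = -b1 / (2 * b2)   # value of X1 where the fitted curve peaks
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
print(f"fitted curve peaks near X1 = {turning_point:.2f}")
```

A negative coefficient on the squared term indicates a curve that rises and then falls, and the turning point shows where the direction of the relationship changes.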

Consider the situation described in equation
(2), where a discrete independent variable, X_{2}, and a
continuous independent variable, X_{1}, affect a continuous
dependent variable, Y. This relationship would yield two straight
lines, one showing the relationship between Y and X_{1} for
X_{2}=0, and the other showing the relationship between Y and X_{1}
for X_{2}=1. If these straight lines were parallel, the
rate of change of Y per unit change in X_{1} would be the
same for X_{2}=0 as for X_{2}=1, and therefore
there would be **no interaction** between X_{1} and X_{2}.
If the two lines were not parallel, the relationship between Y and X_{1}
would depend upon the value of X_{2}, and
therefore there would be an **interaction** between X_{1}
and X_{2}.
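One simple way to see whether the two lines are parallel is to fit a separate straight line within each level of X_{2} and compare the slopes. The Python sketch below does this with made-up values. The function name `slope_and_intercept` is hypothetical, the fits are unweighted, and in practice a formal assessment would usually add a product term (X_{1} multiplied by X_{2}) to the model and test its coefficient rather than comparing slopes by eye.

```python
def slope_and_intercept(x, y):
    """Unweighted least-squares line for y on x; returns (slope, intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up illustrative values, NOT NHANES data.
# Constructed so that Y = 40 + 0.1*X1 when X2 = 0 and Y = 45 + 0.3*X1 when X2 = 1.
x1 = [20, 40, 60, 20, 40, 60]
x2 = [0, 0, 0, 1, 1, 1]
y = [42.0, 44.0, 46.0, 51.0, 57.0, 63.0]

# Split the observations by the level of the discrete variable X2.
group0 = [(a, c) for a, b, c in zip(x1, x2, y) if b == 0]
group1 = [(a, c) for a, b, c in zip(x1, x2, y) if b == 1]

slope0, _ = slope_and_intercept(*zip(*group0))
slope1, _ = slope_and_intercept(*zip(*group1))
slope_difference = slope1 - slope0

print(f"slope when X2 = 0: {slope0:.2f}")
print(f"slope when X2 = 1: {slope1:.2f}")
print(f"difference in slopes: {slope_difference:.2f}")
```

A difference in slopes near zero corresponds to parallel lines (no interaction). Here the slopes differ by construction, so the association between Y and X_{1} depends on the value of X_{2}. In a model that includes the product term, this difference is what the product term's coefficient estimates.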