Key Concepts About Linear Regression

In cross-sectional surveys such as NHANES, linear regression analyses can be used to examine the association between multiple covariates and a health outcome.  For example, we will assess the association between high density lipoprotein cholesterol (Y) and selected covariates (Xi) in this module.  The covariates in this example will include race/ethnicity, age, sex, body mass index (BMI), smoking status, and education level. 

You use simple linear regression when you have a single independent variable, and multiple linear regression when you have more than one independent variable (i.e., an exposure and one or more covariates). Multiple regression lets you understand the effect of the exposure of interest on the outcome after accounting for the effects of other variables (called covariates or confounders).

Simple linear regression is used to explore associations between one (continuous, ordinal or categorical) exposure and one (continuous) outcome variable. Simple linear regression lets you answer questions like, "How does HDL level vary with age?"

Multiple linear regression is used to explore associations between two or more exposure variables (which may be continuous, ordinal or categorical) and one (continuous) outcome variable. Its purpose is to let you isolate the relationship between the exposure variable and the outcome variable from the effects of one or more other variables, called covariates. For example, say that HDL levels tend to be higher among people with more income, and people with more income tend to be older. In this case, inferences about HDL and age get confused by the effect of income on HDL. This kind of "confusion" is called confounding (and these covariates are sometimes called confounders). Confounders are variables that are associated with both the exposure and the outcome of interest. This relationship is shown in the following figure.

[Figure: Diagram of the relationship between the exposure, the outcome, and the confounder (or third variable)]


You can use multiple linear regression to see through confounding and isolate the relationship of interest. In this example, the relationship is between HDL cholesterol level and age. That is, multiple linear regression lets you answer the question, "How does HDL level vary with age, after accounting for income (that is, unconfounded by, or independent of, income)?" As mentioned, you can include many covariates at one time. The process of accounting for covariates is also called adjustment.

Comparing the results of simple and multiple linear regressions can help to answer the question, "How much did the covariates in the model distort the relationship between exposure and outcome (i.e., how much confounding was there)?"
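The comparison between an unadjusted and an adjusted model can be sketched with simulated data. This is only an illustration: the variable names, sample size, and coefficients are all made up, the data are not NHANES data, and no survey weights or design variables are used. Income is constructed to be associated with both age and HDL, so it acts as a confounder.

```python
import numpy as np

# Simulated data; every number below is invented for illustration.
rng = np.random.default_rng(0)
n = 5000
income = rng.normal(0.0, 1.0, n)                     # confounder (standardized)
age = 40 + 5 * income + rng.normal(0.0, 5.0, n)      # exposure, associated with income
hdl = 50 + 0.1 * age + 3 * income + rng.normal(0.0, 2.0, n)  # outcome


def fit_ols(columns, y):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, then one slope per column, in order]


crude_slope = fit_ols([age], hdl)[1]             # simple regression: HDL on age only
adjusted_slope = fit_ols([age, income], hdl)[1]  # multiple regression: age plus income

print(f"crude age slope:    {crude_slope:.3f}")
print(f"adjusted age slope: {adjusted_slope:.3f}")
```

By construction, the age slope used to generate the data is 0.1. The crude estimate comes out considerably larger because it absorbs part of the income effect, while the adjusted estimate is close to 0.1. The gap between the two is one way to gauge how much confounding was present.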

Note that getting statistical packages like SUDAAN, SAS Survey, and Stata to run analyses is the easy part of regression. What is not easy is knowing which variables to include in your analyses, how to represent them, when to worry about confounding, how to determine whether your models are any good, and how to interpret them. These tasks require thought, training, experience, and respect for the underlying assumptions of regression. Remember: garbage in, garbage out.

Finally, remember that NHANES analyses can only establish associations and not causal relationships. This is because the data are cross-sectional, so there is no way to establish temporal sequences (i.e., which came first, the "exposure" or the "outcome"?).

This module will assess the association between high density lipoprotein cholesterol (the outcome variable) and selected covariates to show how to use linear regression with Stata. The covariates in this example will include race/ethnicity, age, sex, body mass index (BMI), smoking status, and education level. In other words, what is the association of each of these variables with HDL cholesterol, independent of the other variables?


Simple Linear Regression Model

In the simplest case, you plot the values of a continuous dependent variable Y against a continuous independent variable X1 (i.e., you look at their correlation) and find the best-fit line that can be drawn through the points.

The first thing to do is make sure the relationship of interest is linear (since linear regression draws a straight line through data points). The best way to do this is to look at a scatterplot. If the relationship between variables is linear, continue (see Panels A and B below). If it is not linear, do not use linear regression: Stata will draw a line, but that line won't adequately describe the data (see Panels C and D below). In this case, you can try to transform the data or use other forms of regression, such as polynomial regression.


Example of a Linear Relationship
Panel A: scatterplot of mileage and weight.
Panel B: scatterplot of mileage and weight with a fitted line, demonstrating a linear relationship.

Example of a Non-linear Relationship
Panel C: scatterplot of mileage and parab2.
Panel D: scatterplot of mileage and parab2 with a fitted line, demonstrating poor fit and a non-linear relationship.
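A scatterplot remains the first check for linearity. As a rough numerical complement, the sketch below fits a straight line to two small made-up data sets (not the mileage data from the panels) and reports how much of the variation each line explains. The helper name `r_squared` and the data values are illustrative assumptions.

```python
import numpy as np


def r_squared(x, y):
    """Fraction of the variance in y explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


x = np.arange(0.0, 11.0)              # 0, 1, ..., 10
y_linear = 3.0 + 2.0 * x              # straight-line relationship
y_parabola = 25.0 - (x - 5.0) ** 2    # rises, levels off, then declines

r2_linear = r_squared(x, y_linear)
r2_parabola = r_squared(x, y_parabola)

print(f"R^2, linear data:    {r2_linear:.3f}")
print(f"R^2, parabolic data: {r2_parabola:.3f}")
```

The straight line explains essentially all of the variation in the linear data and essentially none in the symmetric parabola, even though Y depends strongly on X in both cases. A low value alone does not prove non-linearity, so the scatterplot should still be inspected.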


This relationship between X1 and Y can be expressed as

Equation for Simple Linear Regression
$$Y = b_0 + b_1 X_1 + \varepsilon \qquad (1)$$


b0, also known as the intercept, denotes the point at which the line intersects the vertical axis; b1, or the slope, denotes the change in the dependent variable, Y, per unit change in the independent variable, X1; and ε, the error term, indicates the degree to which the plot of Y against X1 differs from a straight line. Note that for survey data the points never fall exactly on a straight line, so ε is never zero for every observation.
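The intercept, slope, and error term can be illustrated with a small unweighted least-squares fit. The five data points are invented for this sketch, and the closed-form estimates shown are the ordinary least-squares formulas; survey software such as Stata additionally accounts for weights and design, which this sketch does not.

```python
import numpy as np

# Five made-up observations of X1 and Y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

# Closed-form least-squares estimates for Y = b0 + b1*X1 + error
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x
residuals = y - fitted   # estimated error for each observation

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print("residuals:", np.round(residuals, 2))
```

For these numbers the slope is about 1.97 and the intercept about 0.09. The residuals show how far each point lies from the fitted line, and with an intercept in the model they sum to zero.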


Multiple Regression Model

You can further extend equation (1) to include any number of independent variables Xi, where i = 1, ..., n. These variables can be continuous (e.g., 0-100) or discrete (e.g., 0/1 or yes/no).


Equation for Multiple Regression Model
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n + \varepsilon \qquad (2)$$
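A model of the form in equation (2) can be sketched with one continuous and one discrete independent variable. The data are simulated, the coefficient values in `true_b` are arbitrary assumptions, and the fit is an unweighted least-squares fit rather than a survey-weighted one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0.0, 100.0, n)            # continuous predictor (range 0-100)
x2 = rng.integers(0, 2, n).astype(float)   # discrete predictor (0 or 1)

true_b = np.array([10.0, 0.5, -4.0])       # assumed b0, b1, b2 used to simulate Y
y = true_b[0] + true_b[1] * x1 + true_b[2] * x2 + rng.normal(0.0, 0.1, n)

# Design matrix: a column of ones for the intercept, then one column per X_i
X = np.column_stack([np.ones(n), x1, x2])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated b0, b1, b2:", np.round(b_hat, 2))
```

Each estimated coefficient describes the change in Y per unit change in its own variable with the other variable held fixed, which is what adjustment means in practice.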


The choice of variables to include in equation (2) can be based on results of univariate analyses, where Xi and Y have a demonstrated association. It also can be based on empirical evidence where a definitive association between Y and an independent variable has been demonstrated in previous studies.


Polynomial Regression

It is possible to have two continuous variables, Y and X1, on sampled individuals such that if the values of Y are plotted against the values of X1, the resulting plot would resemble a parabola (i.e., the value of Y could increase with increasing values of X1, level off and then decline). A polynomial regression model is used to describe this relationship between X1 and Y and is expressed as

Equation for Polynomial Regression
$$Y = b_0 + b_1 X_1 + b_2 X_1^2 + \varepsilon \qquad (3)$$
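A quadratic fit of this kind can be sketched as follows. The data are made up and noise-free so that the shape is easy to see, and `turning_point` is a name introduced here for the X1 value at which the fitted curve stops rising.

```python
import numpy as np

x = np.arange(0.0, 9.0)               # 0, 1, ..., 8
y = 10.0 + 4.0 * x - 0.5 * x ** 2     # rises, peaks at x = 4, then declines

# np.polyfit returns coefficients from the highest power down
b2, b1, b0 = np.polyfit(x, y, 2)
turning_point = -b1 / (2.0 * b2)      # X1 value where the fitted curve peaks

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
print(f"fitted curve peaks near X1 = {turning_point:.2f}")
```

A negative coefficient on the squared term corresponds to the "increase, level off, then decline" pattern. The model is still linear in the coefficients, which is why ordinary least squares can fit it.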



Consider the situation described in equation (2), where a discrete independent variable, X2, and a continuous independent variable, X1, affect a continuous dependent variable, Y. This relationship would yield two straight lines, one showing the relationship between Y and X1 for X2=0, and the other showing the relationship between Y and X1 for X2=1. If these straight lines were parallel, the rate of change of Y per unit change in X1 would be the same for X2=0 as for X2=1, and therefore there would be no interaction between X1 and X2. If the two lines were not parallel, the relationship between Y and X1 would depend upon the value of X2, and therefore there would be an interaction between X1 and X2.
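The two-lines picture can be sketched numerically. One common way to represent an interaction, which is an assumption here because the text does not give a formula for it, is to add the product X1*X2 as an extra column. The coefficient on that product is then the difference between the two slopes. The data are made up and noise-free.

```python
import numpy as np

x1 = np.tile(np.arange(0.0, 6.0), 2)   # X1 = 0..5 within each group
x2 = np.repeat([0.0, 1.0], 6)          # group indicator X2
# Group X2=0: Y = 1 + 2*X1; group X2=1: Y = 3 + 5*X1 (non-parallel lines)
y = np.where(x2 == 0, 1.0 + 2.0 * x1, 3.0 + 5.0 * x1)

# Columns: intercept, X1, X2, and the product X1*X2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
(b0, b1, b2, b3), *_ = np.linalg.lstsq(X, y, rcond=None)

slope_when_x2_is_0 = b1        # slope of Y on X1 in the X2=0 group
slope_when_x2_is_1 = b1 + b3   # slope of Y on X1 in the X2=1 group

print(f"slope for X2=0: {slope_when_x2_is_0:.2f}")
print(f"slope for X2=1: {slope_when_x2_is_1:.2f}")
print(f"product-term coefficient b3: {b3:.2f}")
```

If the two lines were parallel, the product-term coefficient would be zero and the model would reduce to the form in equation (2).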

