Logistic regression is used to assess the likelihood of a disease or health condition as a function of a risk factor (and covariates). Both simple and multiple logistic regression assess the association between independent variable(s) (Xi), sometimes called exposure or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable. Logistic regression analysis tells you how much an increment in a given exposure variable affects the odds of the outcome.
Simple logistic regression is used to explore associations between one (dichotomous) outcome and one (continuous, ordinal, or categorical) exposure variable. Simple logistic regression lets you answer questions like, "How does gender affect the probability of having hypertension?"
Multiple logistic regression is used to explore associations between one (dichotomous) outcome variable and two or more exposure variables (which may be continuous, ordinal, or categorical). The purpose of multiple logistic regression is to let you isolate the relationship between the exposure variable and the outcome variable from the effects of one or more other variables (called covariates or confounders). Multiple logistic regression lets you answer the question, "How does gender affect the probability of having hypertension, after accounting for (or unconfounded by, or independent of) age, income, etc.?" This process of accounting for covariates or confounders is also called adjustment.
Comparing the results of simple and multiple logistic regression can help to answer the question "how much did the covariates in the model alter the relationship between exposure and outcome (i.e., how much confounding was there)?"
In this module, you will assess the association between gender (the exposure variable) and the likelihood of having hypertension (the outcome). You will look at simple logistic regression first and then multiple logistic regression. The multiple logistic regression will include the covariates of age, cholesterol, body mass index (BMI), and fasting triglycerides. This analysis will answer the question, "What is the effect of gender on the likelihood of having hypertension, after controlling for age, cholesterol, BMI, and fasting triglycerides?"
As noted, the dependent variable Yi for a logistic regression is dichotomous, which means that it can take on one of two possible values. NHANES includes many questions where people must answer either "yes" or "no", such as "has the doctor ever told you that you have congestive heart failure?". You can also create dichotomous variables by setting a threshold (e.g., "diabetes" = fasting blood sugar > 126), or by combining information from several variables. In this module, you will create a dichotomous variable called "hyper" based on two variables: measured blood pressure and use of blood pressure medications. In SUDAAN, SAS Survey, and Stata, the dependent variable is coded as 1 (for having the outcome) and 0 (for not having the outcome). In this example, people with elevated measured blood pressure or reported use of blood pressure medication would have a hypertension value of 1, while people with normal measured blood pressure who are not taking blood pressure medication would have a value of 0.
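As a sketch of this kind of recoding (the 140/90 mmHg cutoffs and the variable names here are illustrative assumptions, not necessarily the module's exact definitions):

```python
def make_hyper(systolic, diastolic, on_bp_meds):
    """Dichotomous outcome: 1 = hypertensive, 0 = not.

    Assumed (hypothetical) rule: measured blood pressure at or above
    140/90 mmHg, or reported use of blood pressure medication.
    """
    return 1 if systolic >= 140 or diastolic >= 90 or on_bp_meds else 0
```

Under this rule, a person measured at 150/85 with no medication is coded 1, and a person at 118/72 with no medication is coded 0.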
The independent variables Xj can be dichotomous (e.g., gender, "high cholesterol"), ordinal (e.g., age groups, BMI categories), or continuous (e.g., fasting triglycerides).
Since you are trying to find associations between risk factors and a condition, you need a formula that will allow you to link these variables. The logit function that you use in logistic regression is also known as the link function because it connects, or links, the values of the independent variables to the probability of occurrence of the event defined by the dependent variable.
The logit formula is:

logit(pi) = log(pi / (1 - pi)) = b0 + b1Xi1 + b2Xi2 + ... + bjXij

In the logit formula above, E(Yi) = pi implies that the expected value of Yi equals the probability that Yi = 1. In this case, "log" indicates the natural log.
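A minimal numeric sketch of the link function and its inverse (plain Python, with no survey design or weighting):

```python
import math

def logit(p):
    # The link function: maps a probability in (0, 1)
    # to a log odds anywhere from -infinity to +infinity.
    return math.log(p / (1 - p))

def inverse_logit(x):
    # Maps a linear predictor (b0 + b1*X1 + ...) back to a probability.
    return 1 / (1 + math.exp(-x))
```

For example, a probability of 0.5 corresponds to a log odds of 0, and applying `inverse_logit` to `logit(p)` recovers p.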
Conceptually, linear and logistic regression are very similar: you are trying to show the relationship between some exposures and an outcome. So why is the logit formula, with the log term, so fearsome?
The difference is in the nature of the outcome variable. It is continuous in linear regression, but dichotomous in logistic regression, and that creates a problem.
Imagine you wanted to see how blood pressure level (a continuous variable) relates to age (a continuous variable). The output of linear regression, the b coefficient, a number anywhere from -∞ to +∞, estimates how much a person's blood pressure level changes with every 1-year change in age.
Now you want to see how the chance, or probability, of having hypertension (a dichotomous variable) relates to age. The linear regression approach won't work if the outcome variable is a probability. That is because probabilities only range from 0 (i.e., no chance) to 1 (i.e., certain), but a linear model's predictions could be negative (i.e., even less than "no chance") or greater than 1 (i.e., greater than "certain").
Logistic regression is a really clever way around this problem. It works by transforming probabilities to odds. "Odds" are a ratio:

odds = p / (1 - p)

where p is the probability that X happens and (1 - p) is the probability that X does not happen.
Transforming to odds takes care of the "negative number" problem since, as is clear from the formula, odds range from 0 to infinity. Now comes the cleverest part: the odds are then further transformed into log form:

log odds = log(p / (1 - p))
Why? Because log odds range from -∞ to +∞; that means the results of the logistic regression equation (i.e., the beta coefficient) can be interpreted just like those of linear regression: how much does the likelihood of the outcome change with a 1-unit change in the exposure. But here, "likelihood" is not a probability, but the log odds. Consider the relationship between having hypertension and gender. If the chance of having hypertension is p, then:
The log odds of hypertension if you are a male:

log(odds_male) = log(p_male / (1 - p_male))

The log odds of hypertension if you are a female:

log(odds_female) = log(p_female / (1 - p_female))

The effect of gender then is just the difference:

log(odds_female) - log(odds_male)

As you may remember, the difference of logs is the same as the log of their ratio, so:

log(odds_female) - log(odds_male) = log(odds_female / odds_male)

The output of a logistic regression analysis, the b coefficient, would show the relationship between hypertension and gender as follows:

b = log(odds_female / odds_male)

Pay attention to the ratio: it's the log of the ratio of odds. A ratio of odds is (reasonably enough) called an "odds ratio". So the beta coefficient is actually the log odds ratio, which is easily transformed into a regular odds ratio, the usual output of logistic regression:

odds ratio = e^b = odds_female / odds_male
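To make the algebra concrete, here is a short sketch with hypothetical prevalences (the 30% and 25% figures are made up for illustration):

```python
import math

p_female, p_male = 0.30, 0.25            # hypothetical probabilities of hypertension

odds_female = p_female / (1 - p_female)  # about 0.43
odds_male = p_male / (1 - p_male)        # about 0.33

b = math.log(odds_female) - math.log(odds_male)  # the beta coefficient: a log odds ratio
odds_ratio = math.exp(b)                          # back-transform: the odds ratio itself
```

Exponentiating b recovers exactly odds_female / odds_male, which is why e^b is reported as the odds ratio.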
Bottom line: Logistic regression analysis tells you how much an increment in a given exposure variable affects the odds of the outcome.
The statistics of primary interest in logistic regression are the b coefficients ( b1,b2,b3... ), their standard errors, and their p-values. Like other statistics, the standard errors are used to calculate confidence intervals around the beta coefficients.
The interpretation of the beta coefficients for different types of independent variables is as follows:
If Xj is a dichotomous variable with values of 1 or 0, then the b coefficient represents the log odds ratio of the event, comparing a person with Xj = 1 to a person with Xj = 0. In a multivariate model, this b coefficient is the independent effect of variable Xj on Yi after adjusting for all other covariates in the model.
If Xj is a continuous variable, then e^b represents the odds ratio of the event, comparing a person with Xj = m + 1 to a person with Xj = m. In other words, for every one-unit increase in Xj, the odds of having the event Yi are multiplied by e^b, adjusting for all other covariates in a multivariate model.
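For instance (with a made-up coefficient), the multiplicative behavior on the odds scale looks like this:

```python
import math

b = 0.05                         # hypothetical coefficient for a continuous exposure

per_unit = math.exp(b)           # odds multiply by this factor per 1-unit increase
per_10_units = math.exp(10 * b)  # effects add on the log odds scale,
                                 # so they multiply on the odds scale
```

A 10-unit increase multiplies the odds by (e^b)^10, i.e., `per_10_units` equals `per_unit ** 10`.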
A summary table about interpretation of beta coefficients is provided below:
| Independent Variable Type | Example Variables | The b coefficient in simple logistic regression | The b coefficient in multiple logistic regression |
| --- | --- | --- | --- |
| Continuous | height, weight, LDL | The change in the log odds of the dependent variable per 1-unit change in the independent variable. | The change in the log odds of the dependent variable per 1-unit change in the independent variable, after controlling for the confounding effects of the covariates in the model. |
| Categorical (also known as discrete) | sex (two subgroups, men and women; this example uses men as the reference group) | The difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between women and the reference group, men). | The difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between women and the reference group, men), after controlling for the confounding effects of the covariates in the model. |
It is easy to transform the b coefficients into a more interpretable format, the odds ratio, as follows:
e^b = odds ratio
Odds and odds ratios are not the same as risk and relative risks.
Odds and probability are two different ways to express the likelihood of an outcome.
Here are their definitions and some examples.
|  | Example: getting heads in a single flip of a coin | Example: getting a 1 in a single roll of a die |
| --- | --- | --- |
| Odds = (# of times something happens) / (# of times it does not happen) | = 1/1 = 1 (or 1:1) | = 1/5 = 0.2 (or 1:5) |
| Probability = (# of times something happens) / (total # of possible outcomes) | = 1/2 = .5 (or 50%) | = 1/6 ≈ .17 (or 17%) |
Few people think in terms of odds. Many people equate odds with probability and thus equate odds ratios with risk ratios. When the outcome of interest is uncommon (i.e., it occurs less than 10% of the time), such confusion makes little difference, since odds ratios and risk ratios are approximately equal. When the outcome is more common, however, the odds ratio increasingly overstates the risk ratio. So, to avoid confusion, when event rates are high, odds ratios should be converted to risk ratios (Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians' referrals for cardiac catheterization. N Engl J Med 1999;341:279-83). There are simple methods of conversion for both crude and adjusted data (Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280:1690-1691. Davies HT, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;316:989-991).
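One such conversion is the correction described by Zhang and Yu (JAMA 1998); a sketch, where p0 is the probability of the outcome in the reference (unexposed) group:

```python
def or_to_rr(odds_ratio, p0):
    """Approximate risk ratio from an odds ratio (Zhang & Yu correction).

    p0: probability of the outcome in the unexposed/reference group.
    """
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)
```

With a rare outcome (p0 = 0.01), an odds ratio of 2.0 converts to a risk ratio of about 1.98, nearly identical; with a common outcome (p0 = 0.5), the same odds ratio corresponds to a risk ratio of only about 1.33.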
The following formulas demonstrate how you can go between probability and odds:

odds = p / (1 - p)

p = odds / (1 + odds)
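A sketch of both directions of the conversion:

```python
def prob_to_odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def odds_to_prob(odds):
    # p = odds / (1 + odds)
    return odds / (1 + odds)
```

Using the die example above: a probability of 1/6 of rolling a 1 corresponds to odds of 1/5 (0.2), and converting back recovers 1/6.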