Task 2a: How to Use SUDAAN Code to Perform Linear Regression

In this example, you will assess the association between high-density lipoprotein (HDL) cholesterol (lbdhdl), the outcome variable, and body mass index (bmxbmi), the exposure variable, after controlling for selected covariates in NHANES 1999-2002. These covariates include gender (riagendr), race/ethnicity (ridreth1), age (ridageyr), smoking (smoker, derived from SMQ020 and SMQ040; smoker=1 if non-smoker, 2 if past smoker, and 3 if current smoker), and education (dmdeduc).

Step 1: Specify the variables in the model

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice).  The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.

It is important to examine the data both ways, since the assumption that an independent variable has a continuous (linear) relationship with the outcome may not be true. Looking at the categorical version of the variable will help you judge whether this assumption holds.

In this example, you could look at BMI as a continuous variable or convert it into a categorical variable based on standard BMI definitions of underweight, normal weight, overweight and obese.  Here is how categorical BMI variables are created:

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Table of code to generate categorical BMI and eligibility variables

Code                                                        Category
if 0 le bmxbmi lt 18.5 then bmicat=1;                       underweight
else if 18.5 le bmxbmi lt 25 then bmicat=2;                 normal weight
else if 25 le bmxbmi lt 30 then bmicat=3;                   overweight
else if bmxbmi ge 30 then bmicat=4;                         obese
if (lbdhdl^=. and riagendr^=. and ridreth1^=. and           eligibility
ridageyr^=. and smoker^=. and dmdeduc^=. and bmxbmi^=.)
and wtmec4yr>0 and (ridageyr>=20) then eligible=1;
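The fragments above are meant to sit inside a SAS data step. The following is a minimal sketch of such a step, not the tutorial's own program. The input dataset name nhanes_merged is hypothetical, and the smoker derivation assumes the usual NHANES coding of SMQ020 (1 = yes, 2 = no) and SMQ040 (1 = every day, 2 = some days, 3 = not at all); check the codebook for your survey cycle before relying on it. The closing proc freq gives unweighted counts so that you can confirm no category is sparse.

```sas
/* Sketch only: nhanes_merged is a hypothetical name for the merged
   NHANES 1999-2002 demographic, exam, lab, and questionnaire files. */
data analysis_data;
  set nhanes_merged;

  /* Smoking status (assumed coding, see the note above).
     Refused / don't know responses leave smoker missing. */
  if smq020 = 2 then smoker = 1;                             /* non-smoker     */
  else if smq020 = 1 and smq040 = 3 then smoker = 2;         /* past smoker    */
  else if smq020 = 1 and smq040 in (1, 2) then smoker = 3;   /* current smoker */

  /* BMI categories. A missing bmxbmi sorts below 0 in SAS, so it fails
     every condition and bmicat stays missing. */
  if 0 le bmxbmi lt 18.5 then bmicat = 1;            /* underweight   */
  else if 18.5 le bmxbmi lt 25 then bmicat = 2;      /* normal weight */
  else if 25 le bmxbmi lt 30 then bmicat = 3;        /* overweight    */
  else if bmxbmi ge 30 then bmicat = 4;              /* obese         */

  /* Complete-case flag for adults aged 20+ with a positive exam weight */
  if (lbdhdl ^= . and riagendr ^= . and ridreth1 ^= . and
      ridageyr ^= . and smoker ^= . and dmdeduc ^= . and bmxbmi ^= .)
     and wtmec4yr > 0 and (ridageyr >= 20) then eligible = 1;
run;

/* Unweighted counts, used only to check for sparse categories */
proc freq data=analysis_data;
  where eligible = 1;
  tables bmicat smoker;
run;
```

The where statement is used here only for the unweighted cell-size check; the design-based analyses still select the subgroup with subpopn.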

Step 2: Create a simple linear regression

The association between the dependent and independent variables is expressed using the model statement in the proc regress procedure. The dependent variable must be a continuous variable and always appears on the left-hand side of the equation. The variables on the right-hand side of the equation are the independent variables and may be discrete or continuous.

Discrete variables are specified using a subgroup or a class statement. In proc regress, the dependent variable is NEVER specified in a subgroup or a class statement because it must be a continuous variable.
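As a hedged illustration of the two ways to declare a discrete variable, the sketch below shows the subgroup form. It assumes, based on general SUDAAN usage rather than on this tutorial, that a subgroup statement is paired with a levels statement giving the number of categories of each listed variable; the class statement used elsewhere in this task does not need the levels list. It also assumes analysis_data is already sorted by sdmvstra and sdmvpsu.

```sas
/* Sketch: declaring bmicat as discrete with subgroup/levels
   (assumed to be equivalent to "class bmicat;"). */
proc regress data=analysis_data;
  subpopn eligible=1;
  nest sdmvstra sdmvpsu;
  weight wtmec4yr;
  subgroup bmicat;   /* discrete independent variable        */
  levels 4;          /* bmicat takes the values 1 through 4  */
  model lbdhdl = bmicat;
run;
```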

Option 1. SUDAAN proc regress Procedure for Simple Linear Regression
Statements Explanation
proc sort data=analysis_data; by sdmvstra sdmvpsu; run;

Use the proc sort procedure to sort the data by strata and primary sampling units (PSU) before running the procedure.

proc regress data=analysis_data;

Use the SUDAAN procedure, proc regress, to run the linear regression.

subpopn eligible=1;

Use the subpopn eligible=1 statement to restrict the analysis to individuals with complete data for all the variables used in the final multiple regression model.

Because only those 20 years and older are of interest in this example, use the subpopn statement to select this subgroup. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than select the study subgroup in the SAS program while preparing the data file.

nest sdmvstra sdmvpsu;

Use the nest statement to apply design-based methods of analysis.

weight wtmec4yr;

Use the weight statement to account for differential selection probabilities and to adjust for non-response. In this example, the examination weight for 4 years of data (wtmec4yr) is used. (For more information on how to select the correct weight for your analysis, see the Weighting module, Task 1.)

model lbdhdl=bmxbmi;

Use the model statement to define the associations to be assessed. Specify the dependent variable on the left-hand side of the equation and the independent variable on the right. This model estimates the change in HDL cholesterol associated with a one-unit increase in BMI.

run;
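For reference, the Option 1 statements above can be assembled into a single program, sketched below using only the statements and dataset name from the table. The commented-out data step illustrates the approach the explanation advises against: subsetting the file before analysis removes records that SUDAAN needs for design-based variance estimation.

```sas
/* Not recommended: subsetting in SAS before the analysis, e.g.
     data adults; set analysis_data; if eligible = 1; run;
   Keep the full file and select the subgroup with subpopn instead. */

/* Sort by the design variables before calling SUDAAN */
proc sort data=analysis_data;
  by sdmvstra sdmvpsu;
run;

proc regress data=analysis_data;
  subpopn eligible=1;       /* adults 20+ with complete data        */
  nest sdmvstra sdmvpsu;    /* strata and PSUs                      */
  weight wtmec4yr;          /* 4-year MEC examination weight        */
  model lbdhdl = bmxbmi;    /* HDL cholesterol on continuous BMI    */
run;
```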

Option 2. SUDAAN proc regress Procedure for Simple Linear Regression with Categorical BMI Variable
Statements Explanation
proc regress data=analysis_data;
subpopn eligible=1;
nest sdmvstra sdmvpsu;
weight wtmec4yr;
model lbdhdl=bmicat;
run;

Use the SUDAAN procedure, proc regress, to run the linear regression. Because bmicat does not appear in a class statement, it is treated as a continuous variable, so this model shows the change in HDL cholesterol for each one-category increase in BMI category.

Option 3. SUDAAN proc regress Procedure for Simple Linear Regression with Categorical BMI Variable and Reference Level
Statements Explanation
proc regress data=analysis_data;
subpopn eligible=1;
nest sdmvstra sdmvpsu;
weight wtmec4yr;
class bmicat/nofreq;
reflevel bmicat=2;
model lbdhdl=bmicat;
rformat bmicat bmicat.;
run;

Use the SUDAAN procedure, proc regress, to run the linear regression. This model uses the normal weight BMI category as the reference category, so each coefficient is the difference in HDL cholesterol between a BMI category and the normal weight category.

Highlights from the output include:

• The results from the first model indicate that for each 1-unit increase in BMI, HDL decreases by 0.69 mg/dL on average.
• The results from the second model indicate that, on average, HDL decreases by 5.6 mg/dL for each one-category increase in BMI, for example from underweight to normal weight or from normal weight to overweight.
• The results from the third model indicate that the relationship is not linear: the difference in HDL between the underweight and normal weight categories is 3.2 mg/dL, compared with 7.5 mg/dL between the normal weight and overweight categories.
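To make the contrast between the three models concrete, they can be written out as equations. This is a sketch in generic notation (the symbols are not from the tutorial); the estimate values are the ones quoted in the highlights above.

```latex
\begin{align*}
\text{Model 1 (continuous BMI):}\quad
  & \mathrm{HDL}_i = \beta_0 + \beta_1\,\mathrm{bmxbmi}_i + \varepsilon_i,
  \qquad \hat\beta_1 \approx -0.69 \\[4pt]
\text{Model 2 (bmicat as a number, } 1,\dots,4\text{):}\quad
  & \mathrm{HDL}_i = \beta_0 + \beta_1\,\mathrm{bmicat}_i + \varepsilon_i,
  \qquad \hat\beta_1 \approx -5.6 \\[4pt]
\text{Model 3 (bmicat as a class variable, reference } 2\text{):}\quad
  & \mathrm{HDL}_i = \beta_0
    + \gamma_1\,\mathbf{1}[\mathrm{bmicat}_i = 1]
    + \gamma_3\,\mathbf{1}[\mathrm{bmicat}_i = 3]
    + \gamma_4\,\mathbf{1}[\mathrm{bmicat}_i = 4]
    + \varepsilon_i
\end{align*}
```

Model 2 forces the same step, $\beta_1$, between every pair of adjacent categories. Model 3 estimates a separate difference from normal weight for each category, which is why it can reveal that the underweight-to-normal and normal-to-overweight differences are unequal.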

Step 3: Create a multiple regression

SUDAAN proc regress Procedure for Multiple Linear Regression
Statements Explanation
proc sort data=analysis_data; by sdmvstra sdmvpsu; run;

Use the proc sort procedure to sort the data by strata and primary sampling units (PSU) before running the procedure.

proc regress data=analysis_data;

Use the SUDAAN procedure, proc regress, to run multiple regression.

subpopn eligible=1;

Use the subpopn eligible=1 statement to restrict the analysis to individuals with complete data for all the variables used in the final multiple regression model.

Because only those 20 years and older are of interest in this example, use the subpopn statement to select this subgroup. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than select the study subgroup in the SAS program while preparing the data file.

nest sdmvstra sdmvpsu;

Use the nest statement to apply design-based methods of analysis.

weight wtmec4yr;

Use the weight statement to account for differential selection probabilities and to adjust for non-response. In this example, the examination weight for 4 years of data (wtmec4yr) is used. (For more information on how to select the correct weight for your analysis, see the Weighting module, Task 1.)

class riagendr ridreth1 smoker dmdeduc bmicat/nofreq;

Use the class statement to specify discrete  variables.  Note that any variables not specified in the class statement are treated as continuous. The dependent variable should NOT appear in the class statement. The nofreq option is used to suppress the printing of frequencies.

reflevel bmicat=2 ridreth1=3 riagendr=1;

Use the reflevel statement to change the reference level of a categorical variable. By default, the reference level for a discrete variable is set to the last category. For bmicat the default would be category 4 (obese); this statement changes the reference level to category 2 (normal weight). For ridreth1 the default would be category 4 (Other race/ethnic groups); this statement changes the reference level to category 3 (non-Hispanic whites). For riagendr the default reference level is category 2 (females); this statement changes the reference level to category 1 (males).

model lbdhdl= riagendr ridreth1 ridageyr smoker dmdeduc bmicat;

Use the model statement to define the associations to be assessed. Specify the dependent variable on the left-hand side of the equation and the independent variables on the right.

effects smoker=(1 -1 0)/name="Never smoker vs. past smoker";

Use the effects statement to test the hypothesis that HDL cholesterol for non-smokers is the same as that for past smokers.

lsmeans bmicat;

Use the lsmeans statement to produce means for the BMI categories (bmicat) and their standard errors. These means will be adjusted for age, smoking, gender, race/ethnicity, and education.

Use the test statement to produce statistics and p-values for the Satterthwaite-adjusted chi-square (satadjchi) and the Satterthwaite-adjusted F (satadjf), along with the Satterthwaite-adjusted degrees of freedom (printed by default). If this statement is omitted, the nominal degrees of freedom, the Wald F statistic, and the p-value corresponding to the Wald F are produced.
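Assembled into one program, the multiple regression might look like the sketch below. The table above describes a test statement without showing it, so the form test satadjchi satadjf; is an assumption built from the keywords named in the explanation. The commented-out second effects statement is a hypothetical contrast that follows the same pattern as the first and is not part of the tutorial.

```sas
proc sort data=analysis_data;
  by sdmvstra sdmvpsu;
run;

proc regress data=analysis_data;
  subpopn eligible=1;
  nest sdmvstra sdmvpsu;
  weight wtmec4yr;
  class riagendr ridreth1 smoker dmdeduc bmicat/nofreq;
  reflevel bmicat=2 ridreth1=3 riagendr=1;
  model lbdhdl = riagendr ridreth1 ridageyr smoker dmdeduc bmicat;

  /* Contrast coefficients follow the order of the smoker levels 1, 2, 3 */
  effects smoker=(1 -1 0)/name="Never smoker vs. past smoker";
  /* Hypothetical additional contrast, same pattern:
     effects smoker=(1 0 -1)/name="Never smoker vs. current smoker"; */

  lsmeans bmicat;            /* adjusted means by BMI category       */
  test satadjchi satadjf;    /* assumed form, see the note above     */
run;
```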

Step 4: Review Output and Highlights of the Results

In this step, the SUDAAN output is reviewed.

• HDL cholesterol is 6.55 mg/dL lower for overweight adults compared to normal weight adults, as defined by BMI.
• HDL cholesterol is 12.00 mg/dL lower for obese adults compared to normal weight adults, as defined by BMI.
• HDL cholesterol is 2.30 mg/dL higher for underweight adults compared to normal weight adults, as defined by BMI.
• HDL cholesterol is 9.98 mg/dL higher for women than for men, after adjusting for all other variables in the model.
• The F test for gender shows a significant effect (p < 0.001) of gender for HDL cholesterol when controlling for other covariates in the model.
• HDL cholesterol is 4.95 mg/dL higher for non-Hispanic Blacks compared to non-Hispanic Whites, after adjusting for all other variables in the model.
• HDL cholesterol increases by 0.11 mg/dL for each one-year increase in age, after adjusting for all other variables in the model.
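As a small worked example of reading a continuous-variable coefficient, the age estimate above can be scaled to any age difference. The ages 40 and 60 are chosen purely for illustration.

```latex
\[
\Delta\,\mathrm{HDL}
  = \hat\beta_{\text{age}} \times (\text{age}_2 - \text{age}_1)
  = 0.11 \times (60 - 40)
  = 2.2 \ \text{mg/dL}
\]
```

That is, holding the other covariates fixed, the model predicts HDL cholesterol about 2.2 mg/dL higher for a 60-year-old than for a 40-year-old.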

Special Topic: Interactions

When interactions are included in the model, they are denoted with an asterisk, *, between the two variables. An interaction can occur between a discrete and a continuous variable, or between two discrete variables. An interaction term will always appear on the right hand side of an equation.

See the sample code in Sample Datasets and Code for a model with interaction term included.
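As an illustration of the asterisk notation only, the sketch below adds a gender-by-BMI-category interaction to the multiple regression model. The choice of riagendr*bmicat is hypothetical and may differ from the interaction in the tutorial's sample code; the test statement is the same assumed form built from the keywords named in Step 3.

```sas
/* Sketch: multiple regression with one interaction term.
   Assumes analysis_data is already sorted by sdmvstra sdmvpsu. */
proc regress data=analysis_data;
  subpopn eligible=1;
  nest sdmvstra sdmvpsu;
  weight wtmec4yr;
  class riagendr ridreth1 smoker dmdeduc bmicat/nofreq;
  reflevel bmicat=2 ridreth1=3 riagendr=1;
  /* The interaction appears on the right-hand side, joined by "*" */
  model lbdhdl = riagendr ridreth1 ridageyr smoker dmdeduc bmicat
                 riagendr*bmicat;
  test satadjchi satadjf;
run;
```

Both variables in this interaction are discrete, so both also appear in the class statement; an interaction with a continuous variable such as ridageyr would be written the same way on the model statement.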