## Task 5a: How to Perform Linear Regression Using SUDAAN

This example uses the demoadv dataset (download at Sample Code and Datasets). In this example, you will assess the association between systolic blood pressure (mean_sbp), the outcome variable, and calcium supplement use (anycalsup), the exposure variable, after controlling for selected covariates in NHANES 2003-2004. These covariates include gender (riagendr), race/ethnicity (ridreth1), age (ridageyr), and body mass index (BMI) (bmxbmi).

### Step 1: Specify the variables in the model

For continuous variables, you have a choice of using the variable in its original form (continuous) or creating a categorical variable (e.g., based on standard cutoffs, quartiles, or common practice). The categories should reflect the underlying distribution of the continuous variable and should not leave any category with only a few observations. It is important to examine the data both ways, because the assumption that an independent variable has a linear relationship with the outcome may not hold. Looking at the categorical version of the variable will help you judge whether this assumption is reasonable. For example, you could model BMI as a continuous variable or convert it into a categorical variable based on the standard BMI definitions of underweight, normal weight, overweight, and obese. Here is how the categorical BMI variable and the eligibility variable are created:

#### Table of Code to Generate Categorical BMI and Eligibility Variables

| Code | Category |
| --- | --- |
| `if 0 le bmxbmi lt 18.5 then bmicat=1;` | underweight |
| `else if 18.5 le bmxbmi lt 25 then bmicat=2;` | normal weight |
| `else if 25 le bmxbmi lt 30 then bmicat=3;` | overweight |
| `else if bmxbmi ge 30 then bmicat=4;` | obese |
| `if (dxdtobmd^=. and ridreth1^=. and ridageyr^=. and bmxbmi^=. and anycalsup^=.) and wtmec2yr>0 and (ridageyr>=20) then eligible=1;` | eligibility |

The demoadv dataset for this example only includes those with MEC weights (wtmec2yr>0).
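Before modeling, it can help to confirm that no BMI category is too sparse and that each category covers the intended range. The following is a minimal sketch in base SAS, assuming that bmicat and eligible have already been created in demoadv as shown in the table above. The counts are unweighted and serve only as a data check, so a where statement is acceptable here even though the design-based analysis should use subpopn.

```sas
/* Unweighted counts per BMI category among eligible respondents.     */
/* Assumes bmicat and eligible already exist in demoadv.              */
proc freq data=demoadv;
  where eligible = 1;
  tables bmicat / missing;   /* missing: also show unclassified rows  */
run;

/* Confirm that each category spans the intended BMI range.           */
proc means data=demoadv n min max;
  where eligible = 1;
  class bmicat;
  var bmxbmi;
run;
```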

### Step 2: Sort Data

Before running any SUDAAN procedure, sort the data by strata and PSUs, using the PROC SORT procedure.

proc sort data=demoadv;
by sdmvstra sdmvpsu;
run;

### Step 3: Create a simple linear regression

The association between the dependent and independent variables is expressed using the model statement in the proc regress procedure. The dependent variable must be a continuous variable and always appears on the left-hand side of the equation. The variables on the right-hand side of the equation are the independent variables and may be discrete or continuous.

Discrete variables are specified using a subgroup or a class statement. In proc regress, the dependent variable is NEVER specified in a subgroup or a class statement because it must be a continuous variable.

IMPORTANT NOTE

These programs use variable formats listed in the sample program. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.
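The formats named in the rformat statements must exist in your SAS session before the SUDAAN procedure runs. The sketch below shows one way they might be defined. The format names (yesnos, race, bmicat, gender) come from the rformat statements in this task, and the codes for riagendr and ridreth1 follow the NHANES 2003-2004 demographics file. The label text and the coding of anycalsup (1 = yes, 2 = no) are assumptions, so check them against the sample program.

```sas
proc format;
  /* riagendr: NHANES gender codes */
  value gender
    1 = 'Male'
    2 = 'Female';
  /* ridreth1: NHANES 2003-2004 race/ethnicity codes */
  value race
    1 = 'Mexican American'
    2 = 'Other Hispanic'
    3 = 'Non-Hispanic White'
    4 = 'Non-Hispanic Black'
    5 = 'Other race, including multiracial';
  /* bmicat: categories created in Step 1 */
  value bmicat
    1 = 'Underweight'
    2 = 'Normal weight'
    3 = 'Overweight'
    4 = 'Obese';
  /* anycalsup: ASSUMED coding, verify against the sample program */
  value yesnos
    1 = 'Yes'
    2 = 'No';
run;
```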

#### Option 1. SUDAAN proc regress Procedure for Simple Linear Regression

| Statements | Explanation |
| --- | --- |
| `proc sort data=demoadv; by sdmvstra sdmvpsu; run;` | Use the proc sort procedure to sort the data by strata and primary sampling units (PSUs) before running the procedure. |
| `proc regress data=demoadv;` | Use the SUDAAN procedure, proc regress, to run linear regression. |
| `subpopn eligible=1;` | Use the subpopn eligible=1 statement to restrict the analysis to individuals with complete data for all the variables used in the final multiple regression model. Because only those 20 years and older are of interest in this example, the subpopn statement also selects this age group. For accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis rather than to select the study subgroup in the SAS data step while preparing the data file. |
| `nest sdmvstra sdmvpsu;` | Use the nest statement to apply design-based methods of analysis. |
| `weight wtmec2yr;` | Use the weight statement to account for differential selection probabilities and to adjust for non-response. In this example, the examination weight for 2 years of data (wtmec2yr) is used. (For more information on how to select the correct weight for your analysis, see Module 5.) |
| `model mean_sbp=anycalsup;` | Use the model statement to define the associations to be assessed. Specify the dependent variable on the left-hand side of the equation and the independent variable on the right. This model estimates the difference in mean systolic blood pressure between calcium supplement users and non-users. |
| `rformat anycalsup yesnos.;` | Use the rformat statement to read the SAS formats into SUDAAN. |
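Put together, the statements in the table form a single program. The sketch below assumes SAS-callable SUDAAN. The filetype= and design= options are not shown in the table and are included here as assumptions (design=wr requests with-replacement, Taylor-linearization variance estimation). The outcome is written as mean_sbp to match the multiple regression model in Step 4; substitute the name used in your copy of the dataset if it differs.

```sas
/* Sort by the design variables before calling SUDAAN. */
proc sort data=demoadv;
  by sdmvstra sdmvpsu;
run;

/* Simple linear regression of systolic blood pressure on supplement use. */
/* filetype= and design= are assumptions, not taken from the tutorial.    */
proc regress data=demoadv filetype=sas design=wr;
  subpopn eligible=1;            /* analyze the eligible subpopulation   */
  nest sdmvstra sdmvpsu;         /* strata and PSU variables             */
  weight wtmec2yr;               /* 2-year MEC examination weight        */
  model mean_sbp = anycalsup;    /* outcome = exposure                   */
  rformat anycalsup yesnos.;     /* label output with the SAS format     */
run;
```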

Highlights from the output include:

• 4,392 respondents ages 20 and older were included in this analysis with complete data on the dependent and independent variables.
• The results from the first model indicate that for calcium supplement users, on average, systolic blood pressure is higher by 2.04 mm Hg.
• This value is significantly greater than 0 (p-value = 0.0014).

### Step 4: Fit a Multiple Regression


#### SUDAAN proc regress Procedure for Multiple Linear Regression

| Statements | Explanation |
| --- | --- |
| `proc sort data=demoadv; by sdmvstra sdmvpsu; run;` | Use the proc sort procedure to sort the data by strata and primary sampling units (PSUs) before running the procedure. |
| `proc regress data=demoadv;` | Use the SUDAAN procedure, proc regress, to run multiple regression. |
| `subpopn eligible=1;` | Use the subpopn eligible=1 statement to restrict the analysis to individuals with complete data for all the variables used in the final multiple regression model. Because only those 20 years and older are of interest in this example, the subpopn statement also selects this age group. For accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis rather than to select the study subgroup in the SAS data step while preparing the data file. |
| `nest sdmvstra sdmvpsu;` | Use the nest statement to apply design-based methods of analysis. |
| `weight wtmec2yr;` | Use the weight statement to account for differential selection probabilities and to adjust for non-response. In this example, the examination weight for 2 years of data (wtmec2yr) is used. (For more information on how to select the correct weight for your analysis, see the Weighting module, Task 1.) |
| `class riagendr anycalsup ridreth1 bmicat/nofreq;` | Use the class statement to specify discrete variables. Any variable not specified in the class statement is treated as continuous. The dependent variable should NOT appear in the class statement. The nofreq option suppresses the printing of frequencies. |
| `reflevel bmicat=2 ridreth1=3 riagendr=1;` | Use the reflevel statement to change the reference level of a categorical variable. By default, the reference level for a discrete variable is the last category. |
| `model mean_sbp=riagendr ridreth1 ridageyr anycalsup bmicat;` | Use the model statement to define the associations to be assessed. Specify the dependent variable on the left-hand side of the equation and the independent variables on the right. |
| `effects anycalsup=(1 -1)/name="use calcium supp vs. no use";` | Use the effects statement to test the hypothesis that the systolic blood pressure for supplement users is the same as that for non-users. |
| `lsmeans anycalsup;` | Use the lsmeans statement to produce means for the calcium supplement use categories and their standard errors. These means are adjusted for the other variables in the model. |
| `test satadjchi satadjf;` | Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi) and the Satterthwaite adjusted F (satadjf), along with the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, the nominal degrees of freedom, the Wald F, and the p-value corresponding to the Wald F are produced. |
| `rformat anycalsup yesnos.; rformat ridreth1 race.; rformat bmicat bmicat.; rformat riagendr gender.;` | Use the rformat statements to read the SAS formats into SUDAAN. |
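Assembled into one program, the multiple regression might look like the sketch below. It assumes SAS-callable SUDAAN and a demoadv dataset that is already sorted by sdmvstra and sdmvpsu. The filetype= and design= options and the order of the keywords on the test statement are assumptions; the remaining statements are taken from the table.

```sas
/* Multiple linear regression of systolic blood pressure.               */
/* filetype= and design= are assumptions, not taken from the tutorial.  */
proc regress data=demoadv filetype=sas design=wr;
  subpopn eligible=1;
  nest sdmvstra sdmvpsu;
  weight wtmec2yr;
  class riagendr anycalsup ridreth1 bmicat / nofreq;  /* discrete terms */
  reflevel bmicat=2 ridreth1=3 riagendr=1;            /* reference levels */
  model mean_sbp = riagendr ridreth1 ridageyr anycalsup bmicat;
  /* Contrast: supplement users versus non-users */
  effects anycalsup=(1 -1) / name="use calcium supp vs. no use";
  lsmeans anycalsup;                                  /* adjusted means */
  test satadjchi satadjf;                             /* Satterthwaite tests */
  rformat anycalsup yesnos.;
  rformat ridreth1 race.;
  rformat bmicat bmicat.;
  rformat riagendr gender.;
run;
```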

### Step 5: Review Output and Highlights of the Results

In this step, the SUDAAN output is reviewed.

• There are 4,327 observations used for the subpopulation.
• Systolic blood pressure is 0.42 mm Hg lower in supplement users compared to non-users, after adjusting for the other variables in the model.
• The Satterthwaite adjusted F test for calcium supplement use indicates that this effect is not significant (p = 0.47).
• Therefore, the null hypothesis is not rejected at the 0.05 level: there is no evidence that mean systolic blood pressure differs between calcium supplement users and non-users after adjusting for gender, race/ethnicity, age, and BMI category. Note that this result differs from that of the unadjusted model, in which supplement users had significantly higher systolic blood pressure.
• The adjusted mean systolic blood pressure is 122.98 mm Hg for supplement users and 123.40 mm Hg for non-users.
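As a quick consistency check, the adjusted difference reported above equals the difference between the two adjusted means:

$$
\bar{y}_{\text{users}} - \bar{y}_{\text{non-users}} = 122.98 - 123.40 = -0.42 \ \text{mm Hg}
$$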