## Task 5b: How to Perform Linear Regression Using SAS Survey Procedures

This example uses the demoadv dataset (download at Sample Code and Datasets). In this example, you will assess the association between systolic blood pressure (mean_sbp), the outcome variable, and calcium supplement use (anycalsup), the exposure variable, after controlling for selected covariates in NHANES 2003-2004. These covariates include gender (riagendr), race/ethnicity (ridreth1), age (ridageyr), and body mass index (BMI) (bmxbmi).

### Step 1: Specify the variables in the model

The demoadv dataset for this example only includes those with MEC weights (wtmec2yr>0).

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g., based on standard cutoffs, quartiles, or common practice). The categorical variable should reflect the underlying distribution of the continuous variable and should not create categories that contain only a few observations. It is important to examine the data both ways, since the assumption that an independent variable has a continuous relationship with the outcome may not be true. Looking at the categorical version of the variable will help you judge whether this assumption holds. For example, you could model BMI as a continuous variable or convert it into a categorical variable based on the standard BMI definitions of underweight, normal weight, overweight, and obese. Here is how categorical BMI variables and eligibility variables are created:

#### Table of Code to Generate Categorical BMI and Eligibility Variables

| Code | Category |
|---|---|
| `if 0 le bmxbmi lt 18.5 then bmicat=1;` | underweight |
| `else if 18.5 le bmxbmi lt 25 then bmicat=2;` | normal weight |
| `else if 25 le bmxbmi lt 30 then bmicat=3;` | overweight |
| `else if bmxbmi ge 30 then bmicat=4;` | obese |
| `if (dxdtobmd^=. and ridreth1^=. and ridageyr^=. and bmxbmi^=. and anycalsup^=.) and wtmec2yr>0 and (ridageyr>=20) then eligible=1;` | eligibility |
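The statements in the table only run inside a data step. The following is a minimal sketch of one way to assemble them. It assumes that demoadv sits in the WORK library and that overwriting it in place is acceptable; neither detail is stated in the tutorial. The closing proc freq is an added, unweighted sanity check and is not part of the original program.

```sas
/* Sketch only: assumes demoadv is in the WORK library and may be overwritten. */
data demoadv;
  set demoadv;

  /* BMI categories from the table; a missing bmxbmi leaves bmicat missing */
  if 0 le bmxbmi lt 18.5 then bmicat = 1;        /* underweight   */
  else if 18.5 le bmxbmi lt 25 then bmicat = 2;  /* normal weight */
  else if 25 le bmxbmi lt 30 then bmicat = 3;    /* overweight    */
  else if bmxbmi ge 30 then bmicat = 4;          /* obese         */

  /* Eligibility flag; the condition is copied as given in the table */
  if (dxdtobmd ^= . and ridreth1 ^= . and ridageyr ^= . and bmxbmi ^= .
      and anycalsup ^= .) and wtmec2yr > 0 and (ridageyr >= 20)
    then eligible = 1;
run;

/* Unweighted counts, used only to confirm the new variables were created */
proc freq data=demoadv;
  tables bmicat eligible / missing;
run;
```

Participants who do not meet the condition keep a missing value for eligible, because the statement has no else branch.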

IMPORTANT NOTE

These programs use variable formats listed in the sample program. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

### Step 2: Fit a simple linear regression model

The association between the dependent and independent variables is expressed using the model statement in the proc surveyreg procedure. The dependent variable must be a continuous variable and always appears on the left-hand side of the equation. The variables on the right-hand side of the equation are the independent variables and may be discrete or continuous.

Discrete variables are specified using a class statement. In proc surveyreg, the dependent variable is NEVER specified in a subgroup or a class statement because it must be a continuous variable.

#### Code to Fit a Simple Linear Regression Model

| Statements | Explanation |
|---|---|
| `proc surveyreg` | Use the proc surveyreg procedure to fit the linear regression model. |
| `stratum sdmvstra;` | Use the stratum statement to define the strata variable (sdmvstra). |
| `cluster sdmvpsu;` | Use the cluster statement to define the PSU variable (sdmvpsu). |
| `class anycalsup;` | Use the class statement to define a dummy variable for the independent variable (anycalsup). |
| `model mean_sbp=anycalsup;` | Use the model statement to specify the dependent variable (mean_sbp) and the independent variable (anycalsup). |
| `domain eligible;` | Use the domain statement to define the subpopulation of interest. This example uses the eligible participants. |
| `weight wtmec2yr;` | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 2 years of data (wtmec2yr) is used. |
| `format anycalsup yesnos.;` | Formats the anycalsup variable. |
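Put together, the statements above might look like the following sketch. It makes three assumptions: the input dataset is demoadv and already contains the eligible variable from Step 1; the yesnos. format has been defined as in the sample program; and the solution option, which the statement list does not show for this model, is added so that parameter estimates are printed when a class statement is present.

```sas
/* Sketch only: simple linear regression of mean_sbp on anycalsup.      */
/* Assumes yesnos. is already defined and eligible was created earlier. */
proc surveyreg data=demoadv;
  stratum sdmvstra;                         /* strata variable            */
  cluster sdmvpsu;                          /* PSU variable               */
  class anycalsup;                          /* treat exposure as discrete */
  model mean_sbp = anycalsup / solution;    /* solution added (assumption) */
  domain eligible;                          /* subpopulation of interest  */
  weight wtmec2yr;                          /* 2-year MEC weight          */
  format anycalsup yesnos.;
run;
```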

Highlights from the output include:

• 4,392 respondents aged 20 years and older with complete data for the dependent and independent variables were included in this analysis.
• The results from the first model indicate that, on average, systolic blood pressure is 2.04 mm Hg higher in calcium supplement users than in non-users.
• This difference is significantly different from 0 (p-value = 0.0014).
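One way to read the 2.04 mm Hg estimate is through the fitted equation, sketched below. The symbols $\hat\beta_0$, $\hat\beta_1$, and the indicator $x$ are notation introduced here for illustration. Only the 2.04 estimate comes from the highlights, and the intercept is left symbolic because it is not reported in this section.

```latex
\widehat{\text{mean\_sbp}} = \hat\beta_0 + \hat\beta_1 x,
\qquad
x =
\begin{cases}
1 & \text{calcium supplement user} \\
0 & \text{non-user}
\end{cases},
\qquad
\hat\beta_1 = 2.04 \text{ mm Hg}
```

Under this parameterization, $\hat\beta_0$ is the estimated mean systolic blood pressure among non-users and $\hat\beta_0 + 2.04$ is the estimated mean among users.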

### Step 3: Fit a multiple linear regression model

#### Code to Fit a Multiple Linear Regression Model

| Statements | Explanation |
|---|---|
| `proc surveyreg` | Use the proc surveyreg procedure to fit the linear regression model. |
| `stratum sdmvstra;` | Use the stratum statement to define the strata variable (sdmvstra). |
| `cluster sdmvpsu;` | Use the cluster statement to define the PSU variable (sdmvpsu). |
| `class riagendr anycalsup ridreth1 bmicat;` | Use the class statement to define dummy variables for the discrete independent variables. |
| `model mean_sbp=anycalsup riagendr ridreth1 ridageyr bmicat/solution;` | Use the model statement to specify the dependent variable (mean_sbp) and the independent variables (anycalsup riagendr ridreth1 ridageyr bmicat). |
| `domain eligible;` | Use the domain statement to define the subpopulation of interest. This example uses the eligible participants. |
| `weight wtmec2yr;` | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 2 years of data (wtmec2yr) is used. |
| `format anycalsup yesnos.;` | Formats the anycalsup variable. |
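Assembled into one step, the multiple regression might be written as in the sketch below. It assumes that the input dataset is demoadv, that bmicat and eligible were created as in Step 1, and that the yesnos. format is already defined as in the sample program.

```sas
/* Sketch only: multiple linear regression of mean_sbp on calcium       */
/* supplement use, adjusted for gender, race/ethnicity, age, and BMI.   */
proc surveyreg data=demoadv;
  stratum sdmvstra;                            /* strata variable     */
  cluster sdmvpsu;                             /* PSU variable        */
  class riagendr anycalsup ridreth1 bmicat;    /* discrete predictors */
  model mean_sbp = anycalsup riagendr ridreth1 ridageyr bmicat / solution;
  domain eligible;                             /* eligible participants */
  weight wtmec2yr;                             /* 2-year MEC weight     */
  format anycalsup yesnos.;
run;
```

Age (ridageyr) is left out of the class statement so that it enters the model as a continuous variable.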

### Step 4: Review Output and Highlights of the Results

In this step, the SAS output is reviewed.

• There are 4,327 observations used for the subpopulation.
• Systolic blood pressure is 0.42 mm Hg lower in supplement users compared to non-users, after adjusting for the other variables in the model.
• The F test for calcium supplement use indicates that this effect is not significant (p = 0.47).
• Therefore, the null hypothesis is not rejected at the 0.05 level: after adjusting for gender, race/ethnicity, age, and BMI category, there is no evidence that mean systolic blood pressure differs between calcium supplement users and non-users.
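Step 1 recommends examining BMI both as a categorical and as a continuous variable. The sketch below shows one way the continuous version could be fitted for comparison. It is not part of the original program and no results are reported for it here. It assumes the same demoadv dataset, the eligible variable, and the yesnos. format as the models above.

```sas
/* Sketch only: same model as Step 3, with BMI entered as continuous.  */
proc surveyreg data=demoadv;
  stratum sdmvstra;
  cluster sdmvpsu;
  class riagendr anycalsup ridreth1;    /* bmxbmi is not listed: continuous */
  model mean_sbp = anycalsup riagendr ridreth1 ridageyr bmxbmi / solution;
  domain eligible;
  weight wtmec2yr;
  format anycalsup yesnos.;
run;
```

Comparing the calcium supplement estimate from this model with the bmicat version gives a rough sense of whether the conclusion is sensitive to how BMI is specified.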