## Task 2a: How to Use SUDAAN Code to Perform Logistic Regression

In this module, you will use simple logistic regression to analyze NHANES data and assess the association between gender (riagendr), the exposure or independent variable, and the likelihood of having hypertension (based on bpxsar and bpxdar), the outcome or dependent variable, among participants 20 years old and older. You will then use multiple logistic regression to assess the relationship after controlling for selected covariates. The multiple regression model includes gender (riagendr), age (ridageyr), cholesterol (lbxtc), body mass index (bmxbmi) and fasting triglycerides (lbxtr).

### Step 1: Create dependent dichotomous variable

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice).  The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.

For the dependent variable, you will create a dichotomous variable, hyper, which defines people as having (or not having) hypertension. Specifically, a person is said to have hypertension if their systolic blood pressure (measured in the MEC) is 140 or higher, or their diastolic blood pressure is 90 or higher, or they are taking blood pressure medication. Remember that for logistic regression to work in SUDAAN, this variable needs to be coded as 0 (the outcome did not occur; here, the person does not have hypertension) or 1 (the outcome occurred; here, the person has hypertension). A person who meets none of the three criteria is coded 0 only if both blood pressure readings are present; otherwise hyper is left missing. The code to create this variable is below:

if (bpxsar >= 140 or bpxdar >= 90 or bpq050a = 1 ) then Hyper = 1 ;

else if (bpxsar ne . and bpxdar ne . ) then Hyper = 0 ;
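The same decision rule can be sketched outside SAS to make the handling of missing values explicit. The Python function below is only an illustration: `hyper_flag` is a hypothetical helper name, `None` stands in for a SAS missing value, and the value 2 for bpq050a is used simply as an example of a response other than 1.

```python
from typing import Optional


def hyper_flag(bpxsar: Optional[float],
               bpxdar: Optional[float],
               bpq050a: Optional[int]) -> Optional[int]:
    """Return 1 (hypertension), 0 (no hypertension), or None (undetermined)."""
    # A missing reading never satisfies a ">=" threshold test, which matches
    # how the SAS comparison behaves for missing numeric values.
    high_systolic = bpxsar is not None and bpxsar >= 140
    high_diastolic = bpxdar is not None and bpxdar >= 90
    on_medication = bpq050a == 1

    if high_systolic or high_diastolic or on_medication:
        return 1
    # Code 0 only when both readings are present; otherwise stay missing.
    if bpxsar is not None and bpxdar is not None:
        return 0
    return None


print(hyper_flag(150, 80, None))  # elevated systolic reading
print(hyper_flag(120, 75, 2))     # normal readings, not on medication
print(hyper_flag(None, 75, 2))    # systolic reading missing
```

The three calls return 1, 0, and None respectively. The third case shows why some participants drop out of the analysis: without both readings, a "no hypertension" code cannot be justified.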

### Step 2: Create independent categorical variables

In addition to creating the dependent dichotomous variable, this example creates independent categorical variables (age, hichol, bmigrp) from the continuous age, cholesterol, and BMI variables to use in this analysis.

Code to generate independent categorical variables

Age (age)

if 20 <= ridageyr < 40 then age = 1 ;
else if 40 <= ridageyr < 60 then age = 2 ;
else if ridageyr >= 60 then age = 3 ;

High cholesterol (hichol)

if (lbxtc >= 240 or bpq100d = 1 ) then HiChol = 1 ;
else if (lbxtc ne . ) then HiChol = 0 ;

BMI category (bmigrp)

if 0 <= bmxbmi < 25 then bmigrp = 1 ;
else if 25 <= bmxbmi < 30 then bmigrp = 2 ;
else if bmxbmi >= 30 then bmigrp = 3 ;
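As a cross-check on the cut points, the grouping logic can be written as two small Python functions. This is a sketch under stated assumptions: the intended groups are taken to be ages 20-39, 40-59, and 60 or older, and BMI below 25, 25 to below 30, and 30 or higher; `age_group` and `bmi_group` are hypothetical helper names, and `None` represents a missing value.

```python
from typing import Optional


def age_group(ridageyr: Optional[float]) -> Optional[int]:
    """Map age in years to 1 (20-39), 2 (40-59), or 3 (60+)."""
    # Ages under 20 fall outside the analysis population and stay missing.
    if ridageyr is None or ridageyr < 20:
        return None
    if ridageyr < 40:
        return 1
    if ridageyr < 60:
        return 2
    return 3


def bmi_group(bmxbmi: Optional[float]) -> Optional[int]:
    """Map BMI to 1 (< 25), 2 (25 to < 30), or 3 (>= 30)."""
    if bmxbmi is None or bmxbmi < 0:
        return None
    if bmxbmi < 25:
        return 1
    if bmxbmi < 30:
        return 2
    return 3


print([age_group(a) for a in (19, 20, 39, 40, 60, 85)])
print([bmi_group(b) for b in (22.4, 25.0, 29.9, 30.0, None)])
```

Checking the boundary values (20, 40, 60 for age; 25 and 30 for BMI) is a quick way to confirm that every participant lands in exactly one category and that no category is left nearly empty.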

### Step 3: Transform highly skewed variables

Because the triglycerides variable (lbxtr) is highly skewed, you will use a natural log transformation to create a new variable, logtrig, to use in this analysis.

logtrig=log(lbxtr);
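The transformation uses the natural logarithm, and a missing or non-positive input should produce a missing result rather than an error. A minimal Python sketch of that behavior follows; `log_trig` is a hypothetical helper name, `None` stands in for a missing value, and the triglyceride values are made up for illustration.

```python
import math
from typing import Optional


def log_trig(lbxtr: Optional[float]) -> Optional[float]:
    """Natural log of triglycerides, or None when it cannot be computed."""
    # The log is undefined for zero or negative values, so treat those
    # the same way as a missing measurement.
    if lbxtr is None or lbxtr <= 0:
        return None
    return math.log(lbxtr)


# Made-up values: a right-skewed set becomes far more evenly spaced.
raw = [60, 100, 150, 400, 1200]
print([round(log_trig(v), 2) for v in raw])
print(log_trig(None))
```

On the raw scale the largest value is 20 times the smallest; on the log scale the spread is much narrower, which is the purpose of transforming a skewed covariate before modeling.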

### Step 4: Create eligibility variable

Because not every participant in NHANES responded to every question asked, there may be a different level of item non-response to each variable.  To ensure that your analyses are done on the same number of respondents, create a variable called eligible which is 1 for individuals who have a non-blank value for each of the variables used in the analyses, and 0 otherwise.  Although this is a univariate analysis using only exam variables, the fasting subsample weight (wtsaf4yr) is included in determining the eligible variable. This is because you will be conducting a multivariate analysis using the triglycerides variable later and will limit the sample to persons included in both analyses. The SAS code defining eligible is:

if hyper ne . and hichol ne . and bmigrp ne . and age ne . and logtrig ne . and wtsaf4yr ne 0 then eligible = 1 ;
else eligible = 0 ;
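The eligibility rule is a complete-case flag: a record counts only if every analysis variable is present and the fasting weight is not zero. The Python sketch below mirrors that rule for a single record held in a dictionary; `eligible_flag` is a hypothetical helper name, `None` represents a missing value, and the example records are invented.

```python
from typing import Any, Dict

# Analysis variables that must all be non-missing.
REQUIRED = ("hyper", "hichol", "bmigrp", "age", "logtrig")


def eligible_flag(record: Dict[str, Any]) -> int:
    """Return 1 if the record can be used in both models, else 0."""
    complete = all(record.get(name) is not None for name in REQUIRED)
    # Mirrors the "wtsaf4yr ne 0" test: only a weight of exactly 0
    # excludes the record here.
    nonzero_weight = record.get("wtsaf4yr") != 0
    return 1 if complete and nonzero_weight else 0


in_subsample = {"hyper": 0, "hichol": 1, "bmigrp": 2, "age": 3,
                "logtrig": 4.9, "wtsaf4yr": 51234.7}
not_fasting = dict(in_subsample, wtsaf4yr=0)
no_trig = dict(in_subsample, logtrig=None)

print(eligible_flag(in_subsample), eligible_flag(not_fasting),
      eligible_flag(no_trig))
```

Building the flag once and reusing it in both the univariate and multivariate runs is what keeps the two models on the same set of respondents.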

### Step 5: Set up SUDAAN univariate logistic procedure

This step introduces you to the SUDAAN Univariate Logistic Regression procedure (proc rlogist).  You can read the explanations in the summary table below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Univariate Logistic Procedure

| Statements | Explanation |
|---|---|
| `proc sort data=analysis_data;` `by sdmvstra sdmvpsu;` `run;` | Use the SAS procedure, proc sort, to sort the data by strata and primary sampling units (PSU) before running the procedure. |
| `proc rlogist data=analysis_data;` | Use the SUDAAN procedure, proc rlogist, to run logistic regression. |
| `nest sdmvstra sdmvpsu;` | Use the nest statement with strata and PSU to account for the design effects. |
| `weight` | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data is used. |
| `subpopn eligible=1;` | Use the subpopn statement to limit the sample to the observations included in the final logistic model. For accurate estimates, it is preferable to select a subgroup with subpopn in SUDAAN rather than to subset the data in the SAS program while preparing the data file. |
| `class riagendr;` | Use a class statement for categorical variables in SUDAAN version 9.0 and later. In earlier versions, you need subgroup and levels statements. |
| `reflevel riagendr=2;` | Use the reflevel statement to choose the reference group for each categorical variable. By default, SUDAAN uses the highest category as the reference group. |
| `model hyper=riagendr;` | Use the model statement to specify the dependent variable and independent variable(s) in your logistic regression model. |
|  | Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi), the Satterthwaite adjusted F (satadjf), and the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, SUDAAN reports the nominal degrees of freedom, the Wald F statistic (WALDF), and its p-value (WALDP). |
| `rformat riagendr sexfmt.;` `rformat hyper bpfmt.;` | Use the rformat statement to read the SAS formats into SUDAAN. |

### Step 6: Review SUDAAN univariate logistic regression output

In this step, the SUDAAN output is reviewed.

• 1,304 respondents have hypertension and 2,515 do not.
• The point estimate suggests that men have lower odds of hypertension than women: their odds are 0.89 times the odds for women (the reference group).
• Assuming a p-value less than 0.05 indicates statistical significance, gender is not significantly associated with hypertension. The Satterthwaite χ2 and F tests give the overall p-value for gender, and the Satterthwaite adjusted F gives the most conservative estimate of the test statistics. Its p-value of 0.156 indicates that the relationship is not statistically significant.
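An odds ratio is the exponential of the corresponding logistic regression coefficient, so the reported 0.89 can be converted back to the log-odds scale or flipped to the opposite reference group. The Python sketch below shows that arithmetic only; the function names are hypothetical, and it does not reproduce SUDAAN's design-based estimation.

```python
import math


def odds_ratio_from_beta(beta: float) -> float:
    """Convert a logistic regression coefficient to an odds ratio."""
    return math.exp(beta)


def flip_reference(odds_ratio: float) -> float:
    """Odds ratio for the same contrast with the reference group swapped."""
    return 1.0 / odds_ratio


or_men_vs_women = 0.89                      # value reported in the output
beta_men = math.log(or_men_vs_women)        # coefficient on the log-odds scale
or_women_vs_men = flip_reference(or_men_vs_women)

print(round(beta_men, 3))         # -0.117
print(round(or_women_vs_men, 2))  # 1.12
```

A negative coefficient corresponds to an odds ratio below 1. Choosing men as the reference group instead would report the reciprocal, about 1.12, with the same p-value.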

### Step 7: Set up SUDAAN multivariate logistic procedure

The SUDAAN Multivariate Logistic Regression procedure is similar to the univariate procedure explained in the table above. You can follow the steps outlined below to perform a multivariate logistic regression.

SUDAAN Multivariate Logistic Procedure

| Statements | Explanation |
|---|---|
| `proc sort data=analysis_data;` `by sdmvstra sdmvpsu;` `run;` | Use the SAS procedure, proc sort, to sort the data by strata and primary sampling units (PSU) before running the procedure. |
| `proc rlogist data=analysis_data;` | Use the SUDAAN procedure, proc rlogist, to run logistic regression. |
| `nest sdmvstra sdmvpsu;` | Use the nest statement with strata and PSU to account for the design effects. |
| `weight WTSAF4YR;` | Use the fasting subsample weight because the log of fasting triglycerides variable comes from a subsample of the lab data file. Not all respondents were tested for triglycerides. |
| `subpopn eligible=1;` | Use the subpopn statement to limit the sample to the observations included in the final logistic model. For accurate estimates, it is preferable to select a subgroup with subpopn in SUDAAN rather than to subset the data in the SAS program while preparing the dataset. |
| `class age riagendr hichol bmigrp;` | Use the class statement to specify all categorical variables in the model. The class statement is available in SUDAAN version 9.0 and later. In earlier versions, you need subgroup and levels statements. |
| `reflevel age=2 2` | Use the reflevel statement to choose the reference group for each categorical variable. By default, SUDAAN uses the highest category as the reference group. |
| `model hyper=age riagendr hichol bmigrp logtrig;` | Use the model statement to specify the dependent variable and all independent variables in your logistic regression model. |
|  | Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi), the Satterthwaite adjusted F (satadjf), and the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, SUDAAN reports the nominal degrees of freedom, the Wald F statistic (WALDF), and its p-value (WALDP). |

### Step 8: Review SUDAAN multivariate logistic procedure output

This step reviews the SUDAAN multivariate logistic procedure output.

• 1,304 respondents have hypertension and 2,515 do not.
• All covariates except gender are statistically significant at p < 0.05. The Satterthwaite adjusted F gives the most conservative estimate of the test statistics.
• Odds ratios should be interpreted as adjusted odds ratios because there are multiple covariates in the model. The adjusted odds ratio for hypertension is 1.29 (95% CI 1.03-1.61) for each one-unit increase in the log of triglycerides.
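A one-unit increase in a natural-log covariate corresponds to multiplying the original variable by e (about 2.72), which is not an intuitive change in triglycerides. The odds ratio can be re-expressed for a more familiar change such as doubling. The Python sketch below shows the conversion, assuming logtrig is the natural log of lbxtr; `rescale_or` is a hypothetical helper name.

```python
import math


def rescale_or(or_per_log_unit: float, fold_change: float) -> float:
    """Odds ratio when the original variable is multiplied by fold_change.

    If OR = exp(beta) per one-unit increase in ln(x), then multiplying x
    by k raises ln(x) by ln(k), giving an odds ratio of OR ** ln(k).
    """
    return or_per_log_unit ** math.log(fold_change)


# Reported adjusted odds ratio and 95% confidence limits per log unit.
estimate, lower, upper = 1.29, 1.03, 1.61

doubling = [round(rescale_or(v, 2), 2) for v in (estimate, lower, upper)]
print(doubling)  # [1.19, 1.02, 1.39]
```

Under these assumptions, doubling fasting triglycerides is associated with roughly 1.19 times the adjusted odds of hypertension (about 1.02 to 1.39 when the confidence limits are rescaled the same way).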