Task 3a: How to Perform Logistic Regression Using SUDAAN

This example uses the demoadv dataset (download at Sample Code and Datasets). This dataset contains a created variable called anycalsup that has a value of 1 for those who report calcium supplement use, and a value of 2 for those who do not. A participant was considered not to have any calcium supplement use if the daily average amount of calcium supplement use was zero; otherwise, a participant was considered a supplement user (see Supplement Code under Sample Code and Module 9, Task 4 for more information). In this module, you will use simple logistic regression to analyze NHANES data to assess the association between calcium supplement use (anycalsup), the exposure or independent variable, and the likelihood of receiving treatment for osteoporosis (osteo), the outcome or dependent variable, among participants aged 20 years and older. You will then use multiple logistic regression to assess the relationship after controlling for selected covariates. The covariates include gender (riagendr), age (ridageyr), race/ethnicity (ridreth1), and body mass index (bmxbmi).

Step 1: Determine the appropriate weight for the data used

It is always important to check the weights associated with all the variables in the model and to use the weight for the smallest subsample that includes all of those variables (the “smallest common denominator”). In this example, the 2-year MEC weight is used because the osteoporosis variable is from the MEC examination. The demoadv dataset for this example only includes those with MEC weights (wtmec2yr>0).

• See the Weighting module in the Continuous NHANES Web Tutorial for more information on weighting and combining weights.
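
A quick check along the following lines can confirm that every record carries a positive MEC weight before any model is fit. This is only a minimal sketch; it assumes the demoadv dataset is available in the WORK library.

```sas
/* Sketch: confirm that every record in demoadv has a positive 2-year MEC weight. */
/* Assumes demoadv is in the WORK library.                                        */
proc means data=demoadv n nmiss min max sum;
  var wtmec2yr;   /* expect nmiss = 0 and min greater than 0 */
run;
```

The sum of wtmec2yr is included because, for a full MEC sample, it approximates the size of the population the sample represents.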

Step 2: Create dependent dichotomous variable

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g., based on standard cutoffs, quartiles or common practice). The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations. One reason a categorical variable might be created would be to explore a non-linear relationship between the continuous variable and the log odds.

For the dependent variable, a dichotomous variable, treatosteo, has already been created; it indicates whether a person is receiving treatment for osteoporosis. A participant was coded as treated if he or she responded “yes” to OSQ.070 (“{Were you/Was SP} treated for osteoporosis?”) from the osteoporosis questionnaire, and as not treated if he or she responded “no” to OSQ.070 or to OSQ.060 (“Has a doctor ever told {you/SP} that {you/s/he} had osteoporosis, sometimes called thin or brittle bones?”) from the osteoporosis questionnaire. (The SAS code to create this variable is found in the “Supplement Program” sample SAS code.) For logistic regression to work in SUDAAN, the dependent variable must be coded as 0 (the outcome did not occur; here, the person is not treated for osteoporosis) or 1 (the outcome occurred; here, the person is treated for osteoporosis). Therefore, a new variable, osteo, was created with this coding. The code to create this variable is below:

if treatosteo = 1 then osteo = 1;
else if treatosteo = 2 then osteo = 0;
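
To see the recode in context, the two statements can be placed in a data step and checked with a cross-tabulation. This is a minimal sketch: the output dataset name demoadv_chk is hypothetical, and it assumes treatosteo is present in demoadv with the 1/2 coding described above.

```sas
/* Sketch: recode treatosteo (1 = treated, 2 = not treated) into osteo (1/0). */
/* The output dataset name demoadv_chk is hypothetical.                       */
data demoadv_chk;
  set demoadv;
  if treatosteo = 1 then osteo = 1;
  else if treatosteo = 2 then osteo = 0;
  /* any other value of treatosteo, including missing, leaves osteo missing */
run;

/* Verify the mapping; the MISSING option shows records with missing values. */
proc freq data=demoadv_chk;
  tables treatosteo*osteo / list missing;
run;
```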

Step 3: Create independent categorical variables

In addition to creating the dependent dichotomous variable, this example will also create additional independent categorical variables (age, bmigrp) from the continuous age (ridageyr) and BMI (bmxbmi) variables to use in this analysis.

Code to generate independent categorical variables
Independent variable Code to generate independent categorical variables
Age

if 20 <= ridageyr < 40 then age = 1;
else if 40 <= ridageyr < 60 then age = 2;
else if ridageyr >= 60 then age = 3;

BMI category

if 0 <= bmxbmi < 25 then bmigrp = 1;
else if 25 <= bmxbmi < 30 then bmigrp = 2;
else if bmxbmi >= 30 then bmigrp = 3;
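
One way to confirm that the cutoffs behave as intended is to summarize the continuous variable within each new category. The sketch below assumes ridageyr and bmxbmi are present in demoadv; the output dataset name demoadv_cat is hypothetical.

```sas
/* Sketch: build the categories, then check the range of each one. */
/* The output dataset name demoadv_cat is hypothetical.            */
data demoadv_cat;
  set demoadv;
  if 20 <= ridageyr < 40 then age = 1;
  else if 40 <= ridageyr < 60 then age = 2;
  else if ridageyr >= 60 then age = 3;

  if 0 <= bmxbmi < 25 then bmigrp = 1;
  else if 25 <= bmxbmi < 30 then bmigrp = 2;
  else if bmxbmi >= 30 then bmigrp = 3;
run;

/* MISSING keeps records whose category is missing as a separate row. */
proc means data=demoadv_cat n min max missing;
  class age;
  var ridageyr;
run;

proc means data=demoadv_cat n min max missing;
  class bmigrp;
  var bmxbmi;
run;
```

Participants younger than 20 years, or with a missing BMI, fall into none of the conditions and therefore keep a missing value for age or bmigrp.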

Step 4: Create eligibility variable

Because not every participant in NHANES responded to every question asked, there may be a different level of item non-response to each variable.  Persons who have missing data for any variable in the model will not be included in the analysis. To ensure that your analyses are done on the same sample, create a variable called eligible which is 1 for individuals who have a non-blank value for each of the variables used in the analyses, and 0 otherwise.  The SAS code defining eligible is:

if anycalsup > . and osteo > . and bmigrp > . and age > . and riagendr > . then eligible = 1;
else eligible = 0;
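
The same flag can be written more compactly with the SAS nmiss function, which counts missing numeric arguments. This is an alternative sketch, not the tutorial's own code; it assumes osteo, age, and bmigrp have already been created in demoadv, and the output dataset name demoadv_elig is hypothetical.

```sas
/* Sketch: eligible = 1 when none of the model variables is missing, else 0. */
/* The output dataset name demoadv_elig is hypothetical.                     */
data demoadv_elig;
  set demoadv;
  /* a SAS comparison evaluates to 1 (true) or 0 (false) */
  eligible = (nmiss(anycalsup, osteo, bmigrp, age, riagendr) = 0);
run;

/* Count how many records will enter the models. */
proc freq data=demoadv_elig;
  tables eligible / missing;
run;
```

Because age is only assigned for participants aged 20 years and older, requiring a non-missing age also restricts the analysis to adults. The nmiss version treats special missing values (.A to .Z) as missing, whereas a comparison such as age > . does not, so the two forms differ only if such values are present.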

Step 5: Set up a simple logistic regression model in SUDAAN

This step introduces you to the SUDAAN logistic regression procedure (proc rlogist).  Explanations of the statements used are provided in the summary table below.

IMPORTANT NOTE

These programs use variable formats listed in the sample program. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Procedure for Simple Logistic Regression
Statements Explanation

proc sort data=demoadv;
  by sdmvstra sdmvpsu;
run;

Use the proc sort procedure to sort the data by strata and primary sampling units (PSU) before running the procedure.

Use the SUDAAN procedure, proc rlogist, to run logistic regression.

nest sdmvstra sdmvpsu;

Use the nest statement with strata and PSU to account for the design effects.

weight wtmec2yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 2 years of data is used.

subpopn eligible=1 ;

Use the subpopn statement to limit the sample to the observations included in the final logistic model.

Because only a subpopulation is of interest, use the subpopn statement to select this subgroup. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than create the study subgroup in the SAS datastep while preparing the data file.

class anycalsup;

Use a class statement for categorical variables in version 9.0 and later. In earlier versions, you need subgroup and levels statements.

reflevel anycalsup=2 ;

Use the reflevel statement to choose your reference group for the categorical variables. By default SUDAAN uses the highest category as the reference group.

model osteo=anycalsup;

Use the model statement to specify the dependent variable and independent variable(s) in your logistic regression model.

Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi) and the Satterthwaite adjusted F (satadjf), and to print the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, the nominal degrees of freedom will be printed, and the Wald F statistic (WALDF) and its corresponding p-value (WALDP) will be produced.

rformat anycalsup yesnos. ;

rformat osteo yesnof. ;

Use the rformat statement to read the SAS formats into SUDAAN.
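
Assembled into one program, the statements in the table above might look like the following sketch. The text of the proc rlogist and test statements is not shown in the table, so those two lines are assumptions reconstructed from their explanations. The sketch also assumes that SAS-callable SUDAAN is installed and that the yesnos. and yesnof. formats have already been defined.

```sas
/* Sketch: simple logistic regression of osteo on anycalsup in SUDAAN. */
proc sort data=demoadv;
  by sdmvstra sdmvpsu;
run;

proc rlogist data=demoadv;     /* assumed: statement text not shown in the table */
  nest sdmvstra sdmvpsu;       /* strata and PSU                                 */
  weight wtmec2yr;             /* 2-year MEC weight                              */
  subpopn eligible=1;          /* analyze eligible records only                  */
  class anycalsup;
  reflevel anycalsup=2;        /* non-users are the reference group              */
  model osteo=anycalsup;
  test satadjchi satadjf;      /* assumed: keywords taken from the explanation   */
  rformat anycalsup yesnos.;   /* formats assumed to be defined beforehand       */
  rformat osteo yesnof.;
run;
```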

Step 6: Review SUDAAN simple logistic regression output

In this step, the SUDAAN output is reviewed.

• 238 respondents receive osteoporosis treatment and 4,385 do not.
• 2,155 use calcium supplements and 2,468 do not.
• Calcium supplement users are more likely to be receiving treatment for osteoporosis. Their odds of osteoporosis treatment are 2.71 times the odds of non-users (the reference group).
• Assuming a p-value less than 0.05 indicates statistical significance, supplement use is significantly associated with osteoporosis treatment based on the p-value for the Satterthwaite adjusted χ2 or F test, which gives the overall p-value for supplement use. The Satterthwaite adjusted F gives the most conservative test statistic. The p-value of 0.0003 indicates that this relationship is statistically significant.
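
For reference, the odds ratio reported above is the exponentiated regression coefficient for supplement use. The notation below is introduced here for illustration only (x is an indicator for supplement users, and β₀ and β₁ are the intercept and slope), and the value of β₁ is back-calculated from the reported odds ratio rather than read from the output.

```latex
\[
\log\frac{P(\text{osteo}=1)}{1-P(\text{osteo}=1)} = \beta_0 + \beta_1 x,
\qquad
x = 1 \text{ if anycalsup} = 1,\quad x = 0 \text{ if anycalsup} = 2
\]
\[
\text{OR} = e^{\beta_1} = 2.71
\quad\Longrightarrow\quad
\beta_1 = \ln 2.71 \approx 1.00
\]
```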

Step 7: Set up a multiple logistic model in SUDAAN

The SUDAAN programming for fitting a multiple logistic regression model is similar to the programming explained in the table above for fitting a simple logistic regression model. A sample program for fitting a multiple logistic regression model is described in the table below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Procedure for Multiple Logistic Regression
Statements Explanation

by sdmvstra sdmvpsu;
run ;

Use the SAS procedure, proc sort, to sort the data by strata and primary sampling units (PSU) before running the procedure.

proc rlogist data=demoost;

Use the SUDAAN procedure, proc rlogist, to run logistic regression.

nest sdmvstra sdmvpsu;

Use the nest statement with strata and primary sampling unit to account for design effects.

weight wtmec2yr;

Use the weight statement to account for the unequal probability of sampling and non-response.  In this example, the MEC weight for 2 years of data is used.

subpopn eligible=1 ;

Use the subpopn statement to limit the sample to the observations included in the final logistic model.

Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than select the study subgroup in the SAS datastep while preparing the dataset.

class age riagendr ridreth1 bmigrp anycalsup;

Use the class statement to specify all categorical variables in the model.

Use a class statement for categorical variables in version 9.0 and later. In earlier versions, you need subgroup and levels statements.

reflevel riagendr=1 age=2 bmigrp=2 anycalsup=2;

Use the reflevel statement to choose your reference group for the categorical variables. By default, SUDAAN uses the highest category as the reference group.

model osteo=anycalsup riagendr age ridreth1 bmigrp;

Use the model statement to specify the dependent variable and all independent variable(s) in your logistic regression model.

Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi) and the Satterthwaite adjusted F (satadjf), and to print the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, the nominal degrees of freedom will be printed, and the Wald F statistic (WALDF) and its corresponding p-value (WALDP) will be produced.

rformat anycalsup yesnos. ;

rformat osteo yesnof. ;

rformat age agegrp. ;

rformat riagendr gender. ;

rformat ridreth1 race. ;

rformat bmigrp bmifmt. ;

Use the rformat statements to read the SAS formats into SUDAAN.
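
Put together, the multiple logistic regression program might look like the sketch below. The proc sort line and the test statement are not shown in the table, so both are assumptions; the dataset name demoost is taken from the proc rlogist row. The sketch also assumes that SAS-callable SUDAAN is installed and that all six formats have been defined.

```sas
/* Sketch: multiple logistic regression of osteo on supplement use and covariates. */
proc sort data=demoost;        /* assumed: this line is not shown in the table */
  by sdmvstra sdmvpsu;
run;

proc rlogist data=demoost;
  nest sdmvstra sdmvpsu;
  weight wtmec2yr;
  subpopn eligible=1;
  class age riagendr ridreth1 bmigrp anycalsup;
  /* ridreth1 has no reflevel entry, so its highest category is the reference */
  reflevel riagendr=1 age=2 bmigrp=2 anycalsup=2;
  model osteo=anycalsup riagendr age ridreth1 bmigrp;
  test satadjchi satadjf;      /* assumed: keywords taken from the explanation */
  rformat anycalsup yesnos.;
  rformat osteo yesnof.;
  rformat age agegrp.;
  rformat riagendr gender.;
  rformat ridreth1 race.;
  rformat bmigrp bmifmt.;
run;
```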

Step 8: Review SUDAAN multiple logistic regression output

This step reviews the SUDAAN multiple logistic regression output.

• 238 respondents receive osteoporosis treatment and 4,385 do not.
• 2,155 use calcium supplements and 2,468 do not.
• Odds ratios should be interpreted as adjusted odds ratios because there are multiple covariates in the model. The adjusted odds ratio for osteoporosis treatment is 1.37 (95% C.I. 0.89-2.12) for supplement users compared to non-users. Because the confidence interval includes 1, there is no statistically significant evidence that calcium supplement users are more likely to be receiving osteoporosis treatment than non-users, after adjusting for the other covariates.
• All other covariates are statistically significant at p-value<0.05, except for BMI. The Satterthwaite adjusted F gives the most conservative estimate of the test statistics.
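
As a consistency check on the interval for supplement use: confidence limits for an odds ratio are usually computed on the log-odds scale and then exponentiated, in which case the point estimate is the geometric mean of the two limits. Assuming that is how the interval reported above was constructed, the numbers agree:

```latex
\[
\sqrt{0.89 \times 2.12} = \sqrt{1.8868} \approx 1.37
\]
```

This also explains why the interval is not symmetric around 1.37 on the odds-ratio scale.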

If you ran both the SAS Survey and SUDAAN programs (or reviewed the output provided on the Sample Code and Datasets page), you may have noticed slight differences in the output. These differences can be caused by missing data in any paired PSU or by how each software program handles degrees of freedom.
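
For comparison, a roughly equivalent simple model in the SAS survey procedures might be sketched as follows. This is not the tutorial's SAS Survey program. It assumes no formats are attached to anycalsup or osteo (otherwise the ref= and event= values must be the formatted values), and it uses the domain statement as the counterpart of subpopn.

```sas
/* Sketch: simple logistic regression with PROC SURVEYLOGISTIC. */
proc surveylogistic data=demoadv;
  strata sdmvstra;                           /* design strata                        */
  cluster sdmvpsu;                           /* primary sampling units               */
  weight wtmec2yr;                           /* 2-year MEC weight                    */
  domain eligible;                           /* results reported per eligible level  */
  class anycalsup (ref='2') / param=ref;     /* non-users as the reference group     */
  model osteo(event='1') = anycalsup;        /* model the odds of being treated      */
run;
```

The domain statement reports results for each level of eligible as well as for the full sample; the eligible=1 domain is the one to compare with the SUDAAN output.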