Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters. In this example, the SUDAAN procedure, *proc descript*, is used and the name of the dataset is *DS1*. *Proc descript* is being used as a generic example, but these statements apply to all SUDAAN procedures.

To carry out the appropriate SUDAAN design option for NHANES data, the data from *DS1* must first be sorted by strata and then by PSU (unless the data have already been sorted by PSU within strata). The *proc sort* procedure in SAS must precede any SUDAAN statements.

Data must always be sorted in SAS before doing analyses in SUDAAN.
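The required ordering (strata first, then PSU within strata) can be illustrated outside SAS. The following is a minimal Python sketch, not SAS or SUDAAN syntax; the records and the `seqn` values are hypothetical, and only the variable names `sdmvstra` and `sdmvpsu` come from the text.

```python
# Hypothetical records; only sdmvstra and sdmvpsu are names taken from the text.
records = [
    {"seqn": 1, "sdmvstra": 2, "sdmvpsu": 1},
    {"seqn": 2, "sdmvstra": 1, "sdmvpsu": 2},
    {"seqn": 3, "sdmvstra": 2, "sdmvpsu": 2},
    {"seqn": 4, "sdmvstra": 1, "sdmvpsu": 1},
]


def sort_by_design(rows):
    """Return rows ordered by stratum, then by PSU within stratum."""
    return sorted(rows, key=lambda r: (r["sdmvstra"], r["sdmvpsu"]))


def is_sorted_by_design(rows):
    """Check that rows are already in stratum-then-PSU order."""
    keys = [(r["sdmvstra"], r["sdmvpsu"]) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


sorted_records = sort_by_design(records)
```

The check function mirrors the exception noted above: data that are already sorted by PSU within strata do not need to be sorted again.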

Generally, a *proc* statement in SUDAAN immediately follows the *sort* statement. In this example, the *proc descript* statement is used. In addition, the *data* option specifies *DS1* as the SAS dataset being used, the *design* option specifies **with replacement (WR)** as the design, and the *noprint* option suppresses printed output because the results are written to a SAS data file.

Use the *DEFT2* option statement to request the calculation of the design effect using SUDAAN Method 2 (see SUDAAN manual for details), which is the method recommended by NCHS for NHANES data.

The *nest* statement lists the variables that identify the strata and the PSU. The *nest* statement is required so that SUDAAN accounts for the design effects of the NHANES sample. As in the *sort* statement, the *nest* statement lists the stratum variable (i.e., *sdmvstra*) first, followed by the PSU variable (i.e., *sdmvpsu*).

The *weight* statement accounts for the unequal probability of sampling and nonresponse. For more information on selecting the correct weight, please see Selecting the Correct Weight in the Weighting module. In this example we use an adjusted weight that was calculated based on linkage eligibility using methods described in *Course 3, Module 7, Non-response and weighting issues with the NHANES-CMS Linked Data.*

The *subpopn* statement sets the subgroup. It is recommended that you **use the subpopn statement instead of subsetting the data in the data step** in SAS. Please see Creating Appropriate Subsets of Data for NHANES Analyses in the Weighting module for more information.

The *var* statement sets the variable of interest. The *subgroup* and *levels* statements set the categorical variables of interest and the number of levels corresponding to each categorical variable; the example in the table below uses a *class* statement for this purpose instead. The *tables* statement requests output stratified by the categorical variables.

In this step, you will specify how the results are saved to a file because the output in the *proc descript* procedure was suppressed using the *noprint* option. The *filetype* option determines the type of data file to be produced and the *filename* option sets the name of the file to which your results will be saved. If you use the *replace* option, then every time you run the program, your results will be overwritten with the newer results.

In SUDAAN, one must specify the *ATLEVEL1* and *ATLEVEL2* options in the *proc* statement in *proc descript* or *proc crosstab* to request that PSUs and strata are counted. The *ATLEVEL1=1* and *ATLEVEL2=2* options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. The values 1 and 2 are the positions on the *nest* statement of the variables used to designate the stages of sampling. These options are associated with the keywords *ATLEV1* and *ATLEV2*, respectively, on the print or output statements. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
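To make the two counts concrete, here is a minimal Python sketch of the idea: count the distinct strata and the distinct PSUs that contribute at least one valid observation to a subdomain, then take the difference. It illustrates the counting logic only and is not SUDAAN's implementation; the function `design_counts` and the data rows are hypothetical, and the sketch assumes that PSU codes repeat across strata, so a PSU is identified by the (stratum, PSU) pair.

```python
def design_counts(rows, in_subdomain, value_key):
    """Count strata and PSUs with at least one valid observation in a subdomain."""
    strata = set()
    psus = set()
    for r in rows:
        # A "valid" observation here means: in the subdomain and not missing.
        if in_subdomain(r) and r[value_key] is not None:
            strata.add(r["sdmvstra"])
            # Assumption: PSU codes repeat across strata, so key on the pair.
            psus.add((r["sdmvstra"], r["sdmvpsu"]))
    return len(strata), len(psus)


# Hypothetical data: three strata with two PSUs each.
rows = [
    {"sdmvstra": 1, "sdmvpsu": 1, "racecat": 1, "carrier05": 100},
    {"sdmvstra": 1, "sdmvpsu": 2, "racecat": 1, "carrier05": 0},
    {"sdmvstra": 2, "sdmvpsu": 1, "racecat": 1, "carrier05": 100},
    {"sdmvstra": 2, "sdmvpsu": 2, "racecat": 2, "carrier05": 0},
    {"sdmvstra": 3, "sdmvpsu": 1, "racecat": 1, "carrier05": None},  # missing
    {"sdmvstra": 3, "sdmvpsu": 2, "racecat": 2, "carrier05": 100},
]

numstrat, numpsu = design_counts(rows, lambda r: r["racecat"] == 1, "carrier05")
df = numpsu - numstrat  # degrees of freedom for this subdomain
```

In this toy data the subdomain `racecat == 1` has valid observations in two strata and three PSUs, so it has fewer degrees of freedom than the full sample would.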

The *mean* and *semean* options output the mean and estimated standard error of the mean to the data file. The *nsum* option outputs the number of observations in each level in each subdomain to the data file. The *deff* option outputs the design effect for each subdomain to the data file.

The *rformat* option specifies the formats of the levels of each categorical variable in the *tables* statement. Format statements for each variable must be listed individually. The *rtitle* option sets the title for the procedure's output. These options are necessary only when printing the results.

After outputting the strata and PSU variables needed to calculate the degrees of freedom in SUDAAN, you can use SAS to calculate the Wald 95% confidence interval using the correct degrees of freedom for a subdomain.
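As a compact summary of that calculation, the interval can be written as below. The notation is introduced here for illustration only: $\hat{p}$ is the estimate output by SUDAAN, $SE(\hat{p})$ is its estimated standard error, and $n_{\text{PSU}}$ and $n_{\text{strata}}$ are the ATLEV2 and ATLEV1 counts for the subdomain.

```latex
df = n_{\text{PSU}} - n_{\text{strata}},
\qquad
\text{95\% CI} = \hat{p} \pm t_{0.975,\,df} \cdot SE(\hat{p})
```

Because the t-distribution is symmetric, $t_{0.025,\,df} = -t_{0.975,\,df}$, so adding the lower quantile times the standard error gives the lower limit.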

The following table shows how to combine the statements described above to properly calculate 95% confidence limits. The procedure proc descript is being used as an example, but the design and nest statements can be used in the same manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.

Statements | Explanation |
---|---|

proc sort data=DS1; by sdmvstra sdmvpsu; run; |
Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN. |

proc descript data=DS1 design=WR
DEFT2 ATLEVEL1=1 ATLEVEL2=2 ; |
Use proc descript to specify the dataset (DS1), specify the sample design using the design option WR (with replacement).
Use a DEFT2 statement to request the calculation of the design effect using method 2 (see SUDAAN manual for details on the differing methods for calculating design effect). Method 2 is the method recommended by NCHS for NHANES data.
The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom. |

nest sdmvstra sdmvpsu; |
Use the nest statement to specify the strata (sdmvstra) and PSU (sdmvpsu) variables to account for the sample design effects. |

Weight wt_linkage_adj; | Use the weight statement to account for the unequal probability of sampling and nonresponse. In this example, the adjusted weight for “linkage non-response” is used for six years of data. |

SUBPOPN ridageyr>=65 and cms_medicare_match=1 and obese=1; | Use a subpopn statement to select the subgroup of interest. In this example, it selects people aged 65 or older (ridageyr>=65) who linked to the Medicare files at some point during 1999-2007 (cms_medicare_match=1) and were obese (obese=1). |

Var carrier05; | Use a var statement to set the variable of interest, here the percent of people with records on the 2005 Carrier file. (In order to generate results that are expressed in percent for a group of interest, this variable was coded as 0=not on 2005 carrier file (on_carrier_2005=0) and 100=on 2005 carrier file (on_carrier_2005=1).) |

class racecat; | Use a class statement to set the categorical variables of interest. In this example, race and Hispanic origin groups (racecat). |

Tables racecat; | Use a tables statement to request the percent with records on the 2005 Carrier file, stratified by race and Hispanic origin group (racecat). |

Output atlev1=numstrat atlev2=numpsu mean=mean semean=semean NSUM=N / filetype=SAS filename=test1 replace; |
Use the output statement to request output of results to a SAS data file (filetype=SAS) called test1 (filename=test1). Use the replace option to overwrite this file with the latest results each time the program is run. Use the atlev1 option to create the SAS variable numstrat, which holds the number of strata with at least one valid observation in each requested subdomain. Use the atlev2 option to create the SAS variable numpsu, which holds the number of PSUs with at least one valid observation in each requested subdomain. Use the mean option to output the mean to the SAS dataset. In this example, the mean is the percent of individuals at each level with records on the 2005 Carrier file. Use the semean option to output the standard error of that mean to the SAS dataset. Use the nsum option to create the variable N in the SAS dataset, which gives the number of observations in each level of each subdomain requested in the tables statement. |

Rformat racecat racef.; | Use an rformat option to specify formats for the levels of each categorical variable in the tables statement as needed. Format statements for each variable must be listed individually. In this example, you are setting the formats for the race and Hispanic origin group (racecat). |

Rtitle "Percent with records on the 2005 carrier file, aged 65 and older that were obese by race and Hispanic origin"; | Use the rtitle option to set the title for the procedure's output. |

Statements | Explanation |
---|---|

Proc sort data=test1; |
Use the SAS procedure, proc sort, to sort the data. |

DATA test2; SET test1; |
Use the data statement to create a new dataset (test2) and the set statement to read in the data file created in SUDAAN. |

percent=round(mean,.01); sepercent=round(semean,.01); |
Create the variables percent and sepercent and set them equal to a rounded value of the estimates using the round function. |

df=numpsu-numstrat; |
Calculate the degrees of freedom by subtracting the number of strata (ATLEV1) from the number of PSUs (ATLEV2). |

tlow=tinv(.025,df); tup=tinv(.975,df); |
Calculate the lower and upper critical values of the t-distribution using the tinv function, which returns the quantile of the t-distribution for a given probability and degrees of freedom (df). |

rse=round((semean/mean)*100,.01); rsese=round((1/sqrt(df)),.01); |
Calculate the relative standard error and the relative standard error of the standard error. These are useful for determining the reliability of estimates but are not used for confidence limit calculations. |

ll=round((mean+tlow*semean),.01); ul=round((mean+tup*semean),.01); |
Calculate the lower (ll) and upper (ul) confidence limits. Because tlow is negative, adding tlow*semean to the mean gives the lower limit. |

proc print; |
Use the proc print procedure to output the results. |

VAR Racecat Percent Sepercent df tup ll ul; |
Use the var statement to list the variables to print: race and Hispanic origin (racecat); percent of people in the category (percent); standard error of the percent (sepercent); degrees of freedom (df); upper critical value of the t-distribution (tup); lower confidence limit (ll); and upper confidence limit (ul). |

title1 'Degrees of Freedom and Wald 95% Confidence Interval'; run; |
This title statement identifies the contents of the output. |

Output:

RACE | NSUM | PERCENT | SEPERCENT | DF | TUP | LL | UL |
---|---|---|---|---|---|---|---|
Overall | 843 | 73.06 | 2.32 | 44 | 2.01 | 68.37 | 77.74 |
non-Hispanic black | 148 | 75.64 | 4.22 | 14 | 2.14 | 66.6 | 84.69 |
Mexican American | 176 | 76.5 | 5.22 | 9 | 2.26 | 64.7 | 88.31 |
non-Hispanic white and others | 519 | 72.62 | 2.54 | 38 | 2.02 | 67.49 | 77.75 |

Given the small number of degrees of freedom for Mexican Americans (9), some caution may be needed when analyzing these data by race and Hispanic origin.
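The limits in the output can be checked by hand. The sketch below is an illustrative Python re-computation, not the SAS code: `t_quantile` is a hypothetical stand-in for the role `tinv` plays above, implemented with numerical integration and bisection from the standard library only, and `wald_ci` applies the interval formula to the rounded percent, standard error, and degrees of freedom printed in the table.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Upper-tail quantile (0.5 <= p < 1) by bisection; assumes the answer is below 100."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def wald_ci(percent, se, df):
    """Wald 95% confidence limits using the t-distribution with df degrees of freedom."""
    tup = t_quantile(0.975, df)
    return percent - tup * se, percent + tup * se


# (label, percent, sepercent, df) copied from the output table above.
table = [
    ("Overall", 73.06, 2.32, 44),
    ("non-Hispanic black", 75.64, 4.22, 14),
    ("Mexican American", 76.5, 5.22, 9),
    ("non-Hispanic white and others", 72.62, 2.54, 38),
]

for label, percent, se, df in table:
    ll, ul = wald_ci(percent, se, df)
    print(f"{label}: ll={ll:.2f} ul={ul:.2f}")
```

Because the inputs here are the rounded values from the table rather than the unrounded estimates, the recomputed limits can differ from the printed LL and UL in the second decimal place. The wider interval for Mexican Americans reflects both the larger standard error and the larger critical value that comes with 9 degrees of freedom.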
