# Task 3c: How to Calculate Degrees of Freedom for Performing Statistical Tests and Confidence Limits Using the NHANES-CMS Linked data

Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters.  In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is DS1.  Proc descript is being used as a generic example, but these statements apply to all SUDAAN procedures.

## Step 1: Sorting in SAS

To carry out the appropriate SUDAAN design option for NHANES data, the data from DS1 must first be sorted by strata and then by PSU (unless the data have already been sorted by PSU within strata). The proc sort procedure in SAS must precede any SUDAAN statements.

### Warning

Data must always be sorted in SAS before doing analyses in SUDAAN.

## Step 2: Use proc statement in SUDAAN

Generally, a proc statement in SUDAAN immediately follows the sort statement. In this example, the proc descript statement is used.  In addition, the data option specifies DS1 as the SAS dataset being used, the design option specifies with replacement (WR) as the design, and the noprint option suppresses printing of results as the results will output to a SAS data file.

Use the DEFT2 option statement to request the calculation of the design effect using SUDAAN Method 2 (see SUDAAN manual for details), which is the method recommended by NCHS for NHANES data.

## Step 3: Specify design parameters in SUDAAN

The nest statement lists the variables that identify the strata and the PSU. The nest statement is required so that SUDAAN accounts for the NHANES sample design when estimating variances. As in the sort statement, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).

The weight statement accounts for the unequal probability of sampling and nonresponse. For more information on selecting the correct weight, please see Selecting the Correct Weight in the Weighting module. In this example we use an adjusted weight that was calculated based on linkage eligibility using methods described in Course 3, Module 7, Non-response and weighting issues with the NHANES-CMS Linked Data.

The subpopn statement sets the subgroup. It is recommended that you use the subpopn statement instead of subsetting the data in the data step in SAS. Please see Creating Appropriate Subsets of Data for NHANES Analyses in the Weighting module for more information.

The var statement sets the variable of interest. The class statement (or, alternatively, the subgroup and levels statements) sets the categorical variables of interest; with subgroup and levels, you also give the number of levels corresponding to each categorical variable. The tables statement requests a stratified output of the categorical variables.

## Step 4: Specify output

In this step, you will specify how the results are saved to a file because the output in the proc descript procedure was suppressed using the noprint option. The filetype option determines the type of data file to be produced and the filename option sets the name of the file to which your results will be saved. If you use the replace option, then every time you run the program, your results will be overwritten with the newer results.

In SUDAAN, one must specify the ATLEVEL1 and ATLEVEL2 options in the proc statement in proc descript or proc crosstab to request that PSUs and strata are counted. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. The values 1 and 2 are the positions on the nest statement of the variables used to designate the stages of sampling. These options are associated with the keywords ATLEV1 or ATLEV2 respectively on the print or output statements. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
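To make the meaning of these counts concrete, the following is a minimal sketch in plain SAS of what "strata and PSUs with at least one valid observation" means for each table cell. It is not SUDAAN's implementation. The dataset `toy`, its values, and the analysis variable `y` are hypothetical; only the variable names sdmvstra, sdmvpsu, and racecat come from this example.

```sas
/* Hypothetical toy data, not NHANES: 3 strata with 2 PSUs each. */
/* y is a stand-in analysis variable; one value is missing.      */
data toy;
  input sdmvstra sdmvpsu racecat y;
  datalines;
1 1 1 100
1 2 1 0
2 1 1 100
2 2 2 0
3 1 2 100
3 2 2 .
;
run;

/* Count strata and PSUs that contribute at least one valid      */
/* (non-missing) observation to each racecat cell.               */
proc sql;
  create table counts as
  select racecat,
         count(distinct sdmvstra) as numstrat,
         count(distinct catx('_', sdmvstra, sdmvpsu)) as numpsu,
         count(distinct catx('_', sdmvstra, sdmvpsu))
           - count(distinct sdmvstra) as df
  from toy
  where y is not missing
  group by racecat;
quit;

proc print data=counts noobs;
run;
```

In this toy data, racecat 2 has valid observations in two strata but in only one PSU of each, so its PSU count equals its stratum count and the difference is 0. Sparse subdomains lose degrees of freedom in the same way.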

The mean and semean options output the mean and estimated standard error of the mean to the data file. The nsum option outputs the number of observations in each level in each subdomain to the data file. The deff option outputs the design effect for each subdomain to the data file.

The rformat option specifies the formats of the levels of each categorical variable in the tables statement. Format statements for each variable must be listed individually. The rtitle option sets the title for the procedure's output. These options are necessary only when printing the results.

## Step 5: Use SAS to calculate degrees of freedom and Wald 95% confidence intervals from SUDAAN output

After outputting the strata and PSU counts needed to calculate the degrees of freedom from SUDAAN, you can use SAS to calculate the Wald 95% confidence interval using the correct degrees of freedom for each subdomain.
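The calculation in this step can be written compactly. The notation here is an assumption introduced for illustration: $n_{\text{PSU}}$ and $n_{\text{strata}}$ are the counts of PSUs and strata with at least one valid observation in the subdomain, $\bar{y}$ is the estimated mean (here a percent), and $\mathrm{SE}(\bar{y})$ is its estimated standard error.

$$
\mathrm{df} = n_{\text{PSU}} - n_{\text{strata}}, \qquad
\text{95\% CI} = \bar{y} \pm t_{0.975,\,\mathrm{df}} \times \mathrm{SE}(\bar{y})
$$

Because the t-distribution is symmetric, $t_{0.025,\,\mathrm{df}} = -t_{0.975,\,\mathrm{df}}$, so adding the lower percentile times the standard error gives the lower limit and adding the upper percentile times the standard error gives the upper limit.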

## Summary: SUDAAN code to output estimates for calculating 95% confidence limits

The following table shows how to combine the statements described above to properly calculate 95% confidence limits. The procedure proc descript is being used as an example, but the design and nest statements can be used in the same manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.

SUDAAN proc descript Procedure
Statements Explanation
proc sort data=DS1;
by sdmvstra sdmvpsu;

run;

Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN.
proc descript data=DS1 design=WR

DEFT2 ATLEVEL1=1 ATLEVEL2=2 ;

Use proc descript to specify the dataset (DS1) and the sample design, using the design option WR (with replacement).
Use a DEFT2 statement to request the calculation of the design effect using method 2 (see SUDAAN manual for details on the differing methods for calculating design effect).  Method 2 is the method recommended by NCHS for NHANES data.

The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
nest sdmvstra sdmvpsu;
Use the nest statement to specify the strata (sdmvstra) and PSU (sdmvpsu) variables to account for the sample design effects.
Weight wt_linkage_adj; Use the weight statement to account for the unequal probability of sampling and nonresponse. In this example, the adjusted weight for “linkage non-response” is used for six years of data.
SUBPOPN ridageyr>=65 and cms_medicare_match=1 and obese=1; Use a subpopn statement to subset on the subgroup of interest. In this example, it selects people aged 65 or older (ridageyr>=65) who linked to the Medicare files at some point during 1999-2007 (CMS_Medicare_match=1) and were obese (obese=1).
Var carrier05; Use a var statement to set the variable of interest, the percent of those on the 2005 Carrier file. (To generate results expressed in percent for a group of interest, this variable was coded as 0=not on 2005 Carrier file (on_carrier_2005=0) and 100=on 2005 Carrier file (on_carrier_2005=1).)
class racecat; Use a class statement to set the categorical variables of interest. In this example, the categorical variable is race and Hispanic origin group (racecat).
Tables racecat; Use a tables statement to request the percent on the 2005 Carrier file stratified by race and Hispanic origin group (racecat).
Output   atlev1=numstrat atlev2=numpsu  mean=mean semean=semean
NSUM=N /  filetype=SAS filename=test1 replace;
Use the output statement to request output of results to a SAS data file (filetype=SAS) called test1 (filename=test1).
Use the replace option to overwrite this file with the latest results each time the program is run.
Use an atlev1 option to create the SAS data variable, numstrat, with the value obtained from counting the number of strata in each subdomain requested with at least one valid observation.
Use an atlev2 option to create a SAS variable, numpsu, with the value obtained from counting the number of PSUs in each subdomain requested with at least one valid observation.
Use the mean option to output the mean to the SAS data set.  In this example, the mean is the percent of individuals at each level with records on the 2005 Carrier file.
Use the semean option to output the standard error of the mean estimated above to the SAS dataset.
Use the nsum option to create the variable N in the SAS dataset which gives the number of observations in each level in each subdomain requested in the tables statement.
Rformat racecat racef.; Use an rformat option to specify formats for the levels of each categorical variable in the tables statement as needed. Format statements for each variable must be listed individually. In this example, you are setting the formats for the race and Hispanic origin group (racecat).
Rtitle "Percent with records on the 2005 carrier file, aged 65 and older that were obese by race and Hispanic origin" ; Use the rtitle option to set the title for the procedure's output.
Calculate 95% confidence intervals with SAS
Statements Explanation

Proc sort data=test1;
by racecat;
run;

Use the SAS procedure, proc sort, to sort the data.

DATA test2;  SET test1;

Use the data statement to create a new dataset (test2) and the set statement to read in the data file created in SUDAAN.

percent=round(mean,.01);

sepercent=round(semean,.01);

Create the variables percent and sepercent and set them equal to a rounded value of the estimates using the round function.

df=numpsu-numstrat;

Calculate the degrees of freedom by subtracting the number of strata (numstrat, created by the atlev1 option) from the number of PSUs (numpsu, created by the atlev2 option).

tlow=tinv(.025,df);

tup=tinv(.975,df);

Calculate the lower and upper critical values of the t-distribution using the tinv function, which returns the percentile of the t-distribution for the given probability and degrees of freedom (df).

rse=round((semean/mean)*100,.01); rsese=round((1/sqrt(df)),.01);

Calculate the relative standard error and the relative standard error of the standard error. These are useful for determining the reliability of estimates but are not used for confidence limit calculations.

ll=round((mean+tlow*semean),.01);

ul=round((mean+tup*semean),.01);

Calculate the upper and lower confidence limits.

proc print;

Use the proc print procedure to output the results.

VAR Racecat Percent Sepercent df tup ll ul;

Use the var statement to indicate variables of interest race and Hispanic origin (racecat); percent of people in the category (percent); standard error of the percent (sepercent);  degrees of freedom (df); t-statistic upper limit (tup); lower confidence limit (ll); and upper confidence limit (ul).

title1 'Degrees of Freedom and Wald 95% Confidence Interval';
title2 'Percent Linked to 2005 Carrier file by Race and Hispanic Origin';
title3 'for those who were age 65+ and obese: NHANES 1999-2004';

run;

These title statements identify the contents of the output.

Output:

Degrees of Freedom and Wald 95% Confidence Interval
Percent Linked to 2005 Carrier file by Race and Hispanic Origin
for those who were age 65+ and obese: NHANES 1999-2004

| RACE | NSUM | PERCENT | SEPERCENT | DF | TUP | LL | UL |
|---|---|---|---|---|---|---|---|
| Overall | 843 | 73.06 | 2.32 | 44 | 2.01 | 68.37 | 77.74 |
| non-Hispanic black | 148 | 75.64 | 4.22 | 14 | 2.14 | 66.6 | 84.69 |
| Mexican American | 176 | 76.5 | 5.22 | 9 | 2.26 | 64.7 | 88.31 |
| non-Hispanic white and others | 519 | 72.62 | 2.54 | 38 | 2.02 | 67.49 | 77.75 |

Given the small number of degrees of freedom for Mexican Americans (9), some caution may be needed when analyzing these data by race and Hispanic origin.
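As a rough check on the output and on why few degrees of freedom call for caution, the sketch below recomputes the limits from the rounded PERCENT, SEPERCENT, and DF values printed above and compares the t critical value with the normal value of about 1.96. The dataset name `published` and the variables `z` and `widen_pct` are introduced here for illustration only. Because the inputs are already rounded, the recomputed limits may differ from the printed ones in the second decimal place.

```sas
/* Rows copied from the printed output: percent, sepercent, df */
data published;
  length race $30;
  infile datalines dlm=',';
  input race $ percent sepercent df;
  tup = tinv(.975, df);      /* t critical value for this row's df      */
  z   = probit(.975);        /* normal critical value, about 1.96       */
  ll  = round(percent - tup*sepercent, .01);   /* lower limit           */
  ul  = round(percent + tup*sepercent, .01);   /* upper limit           */
  widen_pct = round((tup/z - 1)*100, .1);      /* % wider than normal   */
  datalines;
Overall,73.06,2.32,44
non-Hispanic black,75.64,4.22,14
Mexican American,76.50,5.22,9
non-Hispanic white and others,72.62,2.54,38
;
run;

proc print data=published noobs;
  var race df tup ll ul widen_pct;
run;
```

With 9 degrees of freedom, the t critical value is about 2.26, so the interval for Mexican Americans is roughly 15 percent wider than one based on the normal distribution, on top of that group's larger standard error.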