Estimating Variance, Analyzing Subgroups and Calculating Degrees of Freedom for Performing Statistical Tests and Calculating Confidence Intervals
Purpose
This module introduces the basic concepts of variance (sampling error) estimation for NHANES data. You will learn how the complex survey design of NHANES and clustering of the data affect variance estimation, which methods are appropriate to use when calculating variance for NHANES data, and how to calculate degrees of freedom and construct confidence limits for NHANES estimates. In addition, there will be specific examples using the NHANES-CMS linked data.
Task 1: Explain Variance Estimation within NHANES
This first task describes the importance of accounting for the complex sampling structure of NHANES when estimating variances. In general, using statistical weights that reflect the probability of selection and propensity of response for sampled individuals will affect parameter estimates. Incorporating the attributes of the complex sample design (i.e., differential weighting, clustering and stratification) will affect variance estimates (estimated standard errors and thereby test statistics and confidence intervals).
NHANES survey design affects variance estimates
As stated in the module on sampling in NHANES, NHANES has a complex, multistage, probability cluster design. Typically, individuals within a cluster (e.g., county, school, city, census block) are more similar to one another than to those in other clusters, and this homogeneity of individuals within a given cluster is measured by the intracluster correlation. When working with a complex sample, you ideally want to decrease the amount of correlation between sample persons within clusters. To achieve this, you would sample fewer people within each cluster but sample more clusters. However, because of operational limitations (e.g., the cost of moving the survey MECs and the geographic distances between primary sampling units [PSUs]), NHANES can only sample 30 PSUs within a 2-year survey cycle. The sample size in each PSU is roughly equal, and the design is intended to yield about 5,000 examined persons per year.
In a complex sample survey setting such as NHANES, variance estimates computed using standard statistical software packages that assume simple random sampling are generally too low (i.e., significance levels are overstated) and biased because they do not account for the differential weighting and the correlation among sample persons within a cluster.
Warning: Standard statistical software packages that assume simple random sampling calculate variance estimates that are generally too low and biased because they do not account for differential weighting and the correlation among sample persons within a cluster.
The impact of the complex sample design upon variance estimates is measured by the design effect (DEFF). It is defined as the ratio of the variance of a statistic which accounts for the complex sample design to the variance of the same statistic based on a hypothetical simple random sample of the same size.

Design Effect = Variance estimate (cluster) / Variance estimate (simple random sample)
If the DEFF is 1, the variance for the estimate under the cluster sampling is the same as the variance under simple random sampling. The DEFFs for NHANES are typically greater than 1.
When the DEFF is greater than 1, the effective sample size is less than the number of sample persons but greater than the number of clusters. The effective sample size is calculated by dividing the sample size in a subgroup by the DEFF. Another way to think about clustering is that there is a loss of precision and a reduction in the effective sample size because individuals are chosen within clusters instead of being sampled randomly throughout the population.
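As a rough numerical illustration of the design effect and the effective sample size described above, the following SAS data step works through the arithmetic. The two variance values and the sample size are invented for illustration and are not NHANES estimates.

```sas
/* Hypothetical inputs for illustration only; not NHANES estimates */
data deff_example;
  var_complex = 0.00040;  /* variance that accounts for the complex design */
  var_srs     = 0.00025;  /* variance under simple random sampling, same n  */
  n           = 1000;     /* number of sample persons in the subgroup       */

  deff  = var_complex / var_srs;  /* design effect                    */
  n_eff = n / deff;               /* effective sample size = n / DEFF */

  put deff= n_eff=;               /* write both values to the SAS log */
run;
```

With these made-up inputs the design effect is 1.6, so 1,000 sample persons carry about as much information as a simple random sample of 625.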
Moving from a 6-year data release (i.e., NHANES III) to a 2-year data release in the continuous NHANES, the survey sample is smaller in both the number of persons sampled and the number of geographic areas (PSUs) sampled. Because of the smaller sample size in each 2-year cycle of the continuous NHANES, the data are subject to larger sampling variation. For example, standard errors for a variable in NHANES 1999-2000 will be approximately 70% greater than for the corresponding variable in NHANES III (or when combining three cycles of the continuous NHANES). This is roughly what one would expect, because standard errors scale approximately with the inverse square root of the sample size, and a sample one-third the size inflates them by a factor of about √3 ≈ 1.7.
Design effects for a variable can differ across race/ethnicity or age groups. Within the continuous NHANES survey, the DEFF can also be very different for different variables because of differences in variation by geography, in household intraclass correlation, and in demographic heterogeneity. Because DEFFs vary so much across variables within each 2-year cycle of the continuous NHANES, it is difficult to set a single minimum sample size for analysis. The general statistical consideration is that an estimated proportion should have a relative standard error of 30% or less. The NHANES III Analytic Guidelines contain sample sizes required for reliable estimates and for testing differences between subdomains. The required sample size depends on the DEFF for the variable of interest. The sample size tables in Appendix B of the NHANES III Analytic Guidelines provide guidance, but it is best to compute an estimate of the sampling error of a statistic and apply a reliability cut-point such as a 30% relative standard error.
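The 30% reliability cut-point can be checked with one line of arithmetic once an estimate and its design-based standard error are available. The sketch below uses invented values for the estimate and standard error; the variable names are illustrative only.

```sas
/* Hypothetical estimate and standard error; not NHANES results */
data rse_check;
  length reliable $3;
  estimate = 12.0;  /* estimated prevalence, in percent          */
  se       = 3.0;   /* design-based standard error of that value */

  rse = 100 * se / estimate;  /* relative standard error, in percent */

  if rse <= 30 then reliable = 'yes';
  else reliable = 'no';

  put rse= reliable=;         /* write the result to the SAS log */
run;
```

Here the relative standard error is 25%, which falls under the 30% cut-point.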
Important note: Software such as SUDAAN or SAS Survey procedures that account for the sampling design effect must be used to calculate an asymptotically unbiased estimate of the variance and should be used for all statistical tests and the construction of confidence limits. These procedures require information on the first stage of the sample design (identification of the PSU and stratum) for each sample person.
References
Park, I and Lee, H. Design Effects for the weighted mean and total estimators under complex survey sampling. Survey methodology 30:183-193. 2004.
Task 2: Describe Methods for Variance Estimation Used in NHANES
The second task briefly describes the method used to calculate variance estimates with NHANES data and the sample design parameters required.
Brief description of variance estimation procedures used with NHANES data
Variances of estimates (sampling errors) should be calculated for all survey estimates to aid in determining statistical reliability. For complex sample surveys, exact mathematical formulas for variance estimates are usually not available. Variance approximation procedures are required to provide reasonable estimates of sampling error. Two variance approximation procedures that account for the complex sample design and compute design effects are replication methods and Taylor Series Linearization. Initially, the delete-1 jackknife method, a replication method, was used to estimate variances based on data from the NHANES 1999-2000 survey. Balanced repeated replication was used for NHANES III.
Currently, NCHS recommends the use of Taylor Series Linearization methods for variance estimation in all NHANES surveys. SUDAAN, Stata, and the SAS Survey procedures can be used to obtain variance estimates by this method. Survey design variables identifying strata and PSU are required in order to use these software packages. If replication methods are used, you must compute your own replicate weights.
Taylor Linearization Procedures
For either linearization or replication, strata and PSU variables must be available on the survey data file. Because of confidentiality issues associated with a two-year data release, true PSUs cannot be released. In order to use the Taylor Series Linearization approach for variance estimation, Masked Variance Units (MVUs) were created and provided on the demographic data files. MVUs are equivalent to the Pseudo-PSUs used to estimate sampling errors in past NHANES. The MVUs on the data file are not the “true” design PSUs, but they produce variance estimates that closely approximate the variances that would have been estimated using the “true” design. See Sampling module: Key Concepts about NHANES Survey Design.
These MVUs have been created and provided for the continuous NHANES (e.g., NHANES 1999-2000, NHANES 2001-2002, etc.) and will be added to the demographic data files for all two-year survey cycles. They can also be used when combining four or more years of data. The stratum variable is sdmvstra and the PSU variable is sdmvpsu. See Sampling module: Key Concepts about NHANES Survey Design.
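Before running any design-based procedure, it can help to confirm that both design variables are present and to see how the masked PSUs are distributed across strata. A minimal sketch, assuming the analysis dataset is the BP_analysis_Data file used in the examples later in this task:

```sas
/* List every stratum/PSU combination with its unweighted count of
   sample persons; the missing option also shows records that lack
   a value for either design variable */
proc freq data=BP_analysis_Data;
  tables sdmvstra*sdmvpsu / list missing;
run;
```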
Software such as SUDAAN, Stata, SPSS, and SAS Survey procedures can all be used to estimate sampling errors by the Taylor series (linearization) method.
Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters. In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is BP_analysis_Data. Proc descript is being used as a generic example, but the following statements apply to all SUDAAN procedures.
Step 1: Sorting in SAS
To carry out the appropriate SUDAAN design option for NHANES data, the data from BP_analysis_Data must first be sorted by strata and then by PSU (i.e., sdmvstra first, followed by sdmvpsu, unless the data have already been sorted by PSU within strata). The proc sort procedure must be conducted in SAS before SUDAAN is used to run the analyses.
Warning: Data must always be sorted in SAS before doing analyses in SUDAAN.
Step 2: Use proc statement in SUDAAN
The proc statement immediately follows the proc sort statement. In this example, the proc descript statement is used. In addition, the data option specifies BP_analysis_Data as the SAS dataset being used and the design option specifies with replacement (WR) as the design.
Step 3: Use nest statement in SUDAAN
The nest statement lists the variables that identify the strata and the PSU. The nest statement is required for the appropriate design option for NHANES to be used. See the Sample Design module for further explanation of design options in SUDAAN.
As in the sort statement, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).
Summary: Sample SUDAAN code for calculating variance using Taylor Linearization procedures
The following table shows how to combine the statements described above to properly calculate variance estimates. In this example, the proc descript procedure is used to calculate variance. However, the design and nest statements shown below can be used in a similar manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.
Statements | Explanation |
---|---|
proc sort data=BP_analysis_Data; by sdmvstra sdmvpsu; run; | Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN. |
proc descript data=BP_analysis_Data design=WR; | Use the proc statement to specify the SUDAAN procedure being used (proc descript here), the data set (BP_analysis_Data), and the sample design (with replacement — WR). |
nest sdmvstra sdmvpsu; | Use the nest statement to specify the strata (sdmvstra) and PSU (sdmvpsu) variables to account for the sample design. |
Reference: RTI (2004). SUDAAN User’s Manual, Release 9.0. Research Triangle Park, NC: Research Triangle Institute.
The code needed to calculate variance estimates using SAS Survey procedures is described below. In this example, the SAS Survey procedure, proc surveymeans, is used and the name of the dataset is BP_analysis_Data. Proc surveymeans is being used as a generic example, but the strata, cluster, and weight statements apply to all SAS Survey procedures.
Step 1: Use data statement
When using SAS Survey procedures, the input dataset must be identified. However, the dataset does not have to be presorted by the sample design variables as it does in SUDAAN. Rather, the design variables—strata and PSU—are specified in subsequent steps.
Step 2: Use strata statement
The strata statement names the variables that form the strata. For the Continuous NHANES, the variable that identifies the sample strata is named sdmvstra.
Step 3: Use cluster statement
The cluster statement names the variables that identify the clusters in a clustered sample design such as NHANES. Clusters are nested within the strata.
In NHANES, the variable that represents the sample clusters is named sdmvpsu (masked PSUs).
Summary: Sample SAS Survey Procedure code specifying sampling design parameters
The following table shows how to combine the statements described above to properly specify the sample design parameters using SAS Survey procedures. In this example, the proc surveymeans procedure is used to calculate variance. However, the strata and cluster statements can be used in a similar manner for all SAS Survey procedures. The steps in this task identify the most basic statements used in SAS Survey procedures to account for the complex sample design of NHANES. Additional procedure options can be added to these statements to customize the variance estimates, statistics, and the output from your procedure to suit individual analytic needs. Please consult the SAS/STAT manual for specifications on the options for each SAS Survey procedure.
Statements | Explanation |
---|---|
proc surveymeans data=BP_analysis_Data; | Use the SAS Survey procedure, proc surveymeans, to calculate means and standard errors, and specify the dataset (BP_analysis_Data). |
strata sdmvstra; | Use the strata statement to specify the strata (sdmvstra); this accounts for the design effects of stratification. |
cluster sdmvpsu; | Use the cluster statement to specify the primary sampling unit (sdmvpsu); this accounts for the design effects of clustering. |
Reference: SAS Institute Inc., SAS/STAT User’s Guide, Version 9.1; see: Survey Means Procedure
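The introduction to this example mentions a weight statement, which the table above does not show. The sketch below puts the design statements together with a weight and an analysis variable. The weight (wtmec4yr) and the analysis variable (hbp) are borrowed from the SUDAAN example later in this module and are assumptions here; substitute the weight and variable appropriate to your own analysis. The statistic keywords on the proc statement are optional.

```sas
/* Design-based mean, standard error, degrees of freedom, and
   confidence limits for one analysis variable */
proc surveymeans data=BP_analysis_Data mean stderr df clm;
  strata  sdmvstra;   /* masked variance strata                  */
  cluster sdmvpsu;    /* masked PSUs, nested within strata       */
  weight  wtmec4yr;   /* assumed: 4-year MEC examination weight  */
  var     hbp;        /* assumed: high blood pressure indicator  */
run;
```

Note that no prior sort is needed, in contrast to the SUDAAN workflow.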
Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters. In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is DS1. Proc descript is being used as a generic example, but the following statements apply to all SUDAAN procedures.
Step 1: Sorting in SAS
To carry out the appropriate SUDAAN design option for NHANES data, the data from DS1 must first be sorted by strata and then by PSU (i.e., sdmvstra first, followed by sdmvpsu, unless the data have already been sorted by PSU within strata). The proc sort procedure must be conducted in SAS before SUDAAN is used to run the analyses.
Warning: Data must always be sorted in SAS before doing analyses in SUDAAN. SDMVSTRA and SDMVPSU are found on the NHANES Demographic File.
Step 2: Use proc statement in SUDAAN
The proc statement immediately follows the proc sort statement. In this example, the proc descript statement is used. In addition, the data option specifies DS1 as the SAS dataset being used and the design option specifies with replacement (WR) as the design.
Step 3: Use nest statement in SUDAAN
The nest statement lists the variables that identify the strata and the PSU. The nest statement is required for the appropriate design option for NHANES to be used. See the Sample Design module for further explanation of design options in SUDAAN.
As in the sort statement, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).
Summary: Sample SUDAAN code for calculating variance using Taylor Linearization procedures
The following table shows how to combine the statements described above to properly calculate variance estimates. In this example, the proc descript procedure is used to calculate variance. However, the design and nest statements shown below can be used in a similar manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.
Statements | Explanation |
---|---|
proc sort data=DS1; by sdmvstra sdmvpsu; run; | Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN. |
Proc descript data=DS1 design=WR; | Use the proc statement to specify the SUDAAN procedure being used (proc descript here), the data set (DS1), and the sample design (with replacement — WR). |
nest sdmvstra sdmvpsu; | Use the nest statement to specify the strata (sdmvstra) and PSU (sdmvpsu) variables to account for the sample design. |
Reference: RTI (2004). SUDAAN User’s Manual, Release 9.0. Research Triangle Park, NC: Research Triangle Institute.
Task 3: Calculate Degrees of Freedom for Performing Statistical Tests and Calculating Confidence Limits
This third task will discuss calculation of degrees of freedom using SUDAAN and SAS Survey procedures. The accurate determination of degrees of freedom is important for performing statistical tests and calculating confidence limits.
Degrees of Freedom and NHANES Subgroups
Estimates are often calculated for various subgroups of interest within the total NHANES population. When the number of first stage sampling units (PSUs) is small, the z-statistic should be replaced by a value from a t-distribution when computing confidence limits for these estimates (see the SUDAAN 1995 reference cited in the NHANES III Analytic Guidelines).
To calculate the correct value for the t-statistic from a t-distribution and a selected level of significance, you must calculate the proper degrees of freedom for the estimate.
In addition, it is important to examine the number of degrees of freedom on which a standard error estimate is based. Continuing research on the stability of variance estimates in subdomains of NHANES has been published and shows that standard error estimates based on small numbers of paired PSUs (i.e., few degrees of freedom) are prone to instability.
The reliability of the estimated standard error, as measured by its relative standard error (i.e., (standard error of the standard error of the estimate/standard error of the estimate)*100), is inversely proportional to its degrees of freedom. As the number of degrees of freedom increases, the relative standard error decreases and the reliability of the estimate increases. The NHANES guidelines recommended a relative standard error of at most 30%. This corresponds to at least 12 degrees of freedom.
Degrees of freedom are properly calculated by subtracting the number of clusters in the first level of sampling (strata) from the number of clusters in the second level of sampling (PSUs) for each subgroup you are analyzing, as shown in the equation below.
Equation for Degrees of Freedom

Degrees of freedom = Number of PSUs - Number of strata
Differences in Degrees of Freedom for Subgroups in SUDAAN and SAS Survey Procedures
For both SUDAAN and SAS Survey procedures, the degrees of freedom are calculated in the same way when looking at the entire sample population or in subgroups where all strata and PSUs are represented.
However, when you analyze data on a subgroup of sample persons who may not be represented in all strata and PSUs (e.g., Mexican Americans), the degrees of freedom provided in the output may differ. For example, SUDAAN will correctly count the number of PSUs and strata with at least one valid observation for each cell of the table being requested. In contrast, SAS 9.1 Survey procedures, such as proc surveymeans, compute the degrees of freedom as the number of clusters (PSUs) in the non-empty strata minus the number of non-empty strata. This means that if your data have empty strata (no persons in the population for either PSU), the number of degrees of freedom will increase. This is incorrect, and SAS is currently working on correcting this problem. For more information on methods of correctly calculating degrees of freedom using SAS 9.1 Survey procedures, please see the following two SAS 9.1 Survey procedures macros.
%SMSUB macro provides additional capabilities for SAS 9.1 proc surveymeans
http://support.sas.com/ctx/samples/index.jsp?sid=541
Purpose: Provides additional subgroup capabilities beyond those provided by the domain statement in proc surveymeans. This includes:
- presenting subgroup and overall estimates in one table (TABLES=),
- computing ratio estimates for subgroups (RATIO=),
- computing contrasts for means, totals, and ratios (CONTRAST=),
- restricting table requests to a subpopulation (SUBPOP=), and
- incorporating missing values into the variance computations.
%SREGSUB macro provides additional capabilities for SAS 9.1 proc surveyreg
http://support.sas.com/ctx/samples/index.jsp?sid=483
Purpose: Provides linear regression capabilities currently not available in proc surveyreg. This includes:
- restricting the regression analysis to a subpopulation (SUBPOP= ), and
- incorporating missing values into the variance computations
NOMCAR option provides additional capabilities in SAS 9.2
NOMCAR requests that the procedure treat missing values in the variance computation as not missing completely at random (NOMCAR) for Taylor series variance estimation. When you specify the NOMCAR option, PROC SURVEYREG computes variance estimates by analyzing the nonmissing values as a domain or subpopulation, where the entire population includes both nonmissing and missing domains. See the section Missing Values for more details.
By default, PROC SURVEYREG completely excludes an observation from analysis if that observation has a missing value, unless you specify the MISSING option. Note that the NOMCAR option has no effect on a classification variable when you specify the MISSING option, which treats missing values as a valid nonmissing level.
The NOMCAR option applies only to Taylor series variance estimation. The replication methods, which you request with the VARMETHOD=BRR and VARMETHOD=JACKKNIFE options, do not use the NOMCAR option.
VADJUST option on the model statement provides additional capabilities in SAS 9.2
VADJUST=DF | NONE specifies whether to use a degrees-of-freedom adjustment in the computation of the matrix for the variance estimation. If you do not specify the VADJUST= option, by default, PROC SURVEYREG uses the degrees-of-freedom adjustment that is equivalent to the VADJUST=DF option. If you do not want to use this variance adjustment, you can specify the VADJUST=NONE option.
SOURCE: SAS 9.2 Documentation SAS/STAT(R) 9.2 User’s Guide
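The two SAS 9.2 options described above sit in different places: NOMCAR goes on the proc statement, while VADJUST= goes on the model statement. A minimal sketch of where each belongs, assuming the same dataset, design variables, and weight used elsewhere in this module; the model itself (hbp regressed on age in years, ridageyr) is only a placeholder to make the syntax complete and is not a recommended analysis.

```sas
/* NOMCAR: treat observations with missing values as a domain
   rather than dropping them from the variance computation */
proc surveyreg data=BP_analysis_Data nomcar;
  strata  sdmvstra;
  cluster sdmvpsu;
  weight  wtmec4yr;                    /* assumed weight              */
  model hbp = ridageyr / vadjust=df;   /* vadjust=df is the default;
                                          shown here only for clarity */
run;
```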
Both SAS Survey procedures (proc surveymeans) and SUDAAN version 9.1 (proc descript) produce 95% confidence intervals (CI). These 95% CIs are calculated using the Wald method, which is based on a t-statistic for the number of degrees of freedom in the entire NHANES sample. However, they do not correct for the reduction in the degrees of freedom in subdomains where not all strata and PSUs are represented. Details on how to correctly produce 95% confidence intervals (CI) will be discussed in the next task, How to Perform Statistical Tests and Calculate Confidence Limits with Degrees of Freedom. Also, the Wald method should not be used when the proportion is close to 0% or 100% (see Alternate methods for calculating 95% confidence limits section below for more information).
Warning: The 95% confidence intervals calculated using the Wald method are based on a t-statistic for the number of degrees of freedom in the entire NHANES sample. The proc surveymeans procedure in SAS 9.1 Survey Procedures and the proc descript procedure in SUDAAN ver 9.1, DO NOT correct for the reduction in the degrees of freedom in subdomains where not all strata and PSUs are represented.
Alternate methods for calculating 95% confidence limits.
For prevalence estimates near 0% or near 100%, standard methods of calculating confidence limits, such as the Wald method, may produce lower limits less than 0% or upper limits greater than 100%. In these cases, it is often recommended to use alternative methods for calculating 95% confidence limits using transformations (such as the logit or arcsine transformation), using the Wilson method, or calculating exact confidence limits such as the Clopper-Pearson approach. For applications to survey data, see Korn and Graubard.
References
Wilson EB. (1927). Probable Inference, the Law of Succession, and Statistical Inference. JASA, 22, 209-212.
Clopper CJ and Pearson ES. (1934). The Use of Fiducial Limits Illustrated in the Case of the Binomial. Biometrika, 26, 404-413.
Korn EL and Graubard BI. (1999). Analysis of Health Surveys. Wiley Series in Probability and Statistics. New York, New York.
For the arcsine and logit transformations, see Appendix C of Wolter K. Introduction to Variance Estimation. Springer-Verlag. New York.
In How to Perform Statistical Tests and Calculate Confidence Limits with Degrees of Freedom, you will learn how to save the results from your SUDAAN procedure as a SAS data file (or you may specify an ASCII data file). You can then use the mean (prevalence), standard error, and degrees of freedom estimates, with the degrees of freedom correctly calculated from the number of strata and the number of PSUs, in a SAS data set, spreadsheet, or other software program to calculate confidence limits using one of the approaches listed above or other alternative methods.
SAS code to calculate 95% confidence limits using the arcsine transformation, log transformation, or the Clopper-Pearson method of calculating exact confidence limits is available at the Sample Code and Datasets page.
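As a rough sketch of one of the transformation-based approaches mentioned earlier in this section, the data step below computes 95% confidence limits on the logit scale and transforms them back, which keeps the limits between 0 and 1. This is not the code from the Sample Code and Datasets page. The prevalence, standard error, and degrees of freedom are invented for illustration, and the standard error on the logit scale uses the usual first-order (delta method) approximation.

```sas
/* Hypothetical inputs; not NHANES results */
data logit_ci;
  p  = 0.03;    /* estimated prevalence, as a proportion        */
  se = 0.008;   /* design-based standard error of p             */
  df = 15;      /* number of PSUs minus number of strata        */

  t_crit   = tinv(0.975, df);        /* t quantile for a 95% interval */
  logit_p  = log(p / (1 - p));       /* logit of the estimate         */
  se_logit = se / (p * (1 - p));     /* delta-method SE of the logit  */

  /* Back-transform the limits from the logit scale to proportions */
  lower = 1 / (1 + exp(-(logit_p - t_crit * se_logit)));
  upper = 1 / (1 + exp(-(logit_p + t_crit * se_logit)));

  put lower= upper=;                 /* write the limits to the SAS log */
run;
```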
To learn more
To understand more about variance estimation methods, you may wish to review the Analytic Guidelines for NHANES analysis on the NHANES web site; read the text by Korn and Graubard (Korn EL and Graubard BI. Analysis of Health Surveys. Wiley Series in Probability and Statistics. 1999. New York, New York.); or take a course in SUDAAN or complex survey sampling.
Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters. In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is BP_analysis_Data. Proc descript is being used as a generic example, but these statements apply to all SUDAAN procedures.
Step 1: Sorting in SAS
To carry out the appropriate SUDAAN design option for NHANES data, the data from BP_analysis_Data must first be sorted by strata and then by PSU (unless the data have already been sorted by PSU within strata). The proc sort procedure in SAS must precede any SUDAAN statements.
Warning: Data must always be sorted in SAS before doing analyses in SUDAAN.
Step 2: Use proc statement in SUDAAN
Generally, a proc statement in SUDAAN immediately follows the sort statement. In this example, the proc descript statement is used. In addition, the data option specifies BP_analysis_Data as the SAS dataset being used, the design option specifies with replacement (WR) as the design, and the noprint option suppresses printing of results as the results will output to a SAS data file.
Use the DEFT2 option statement to request the calculation of the design effect using SUDAAN Method 2 (see SUDAAN manual for details), which is the method recommended by NCHS for NHANES data.
Step 3: Specify design parameters in SUDAAN
The nest statement lists the variables that identify the strata and the PSU. The nest statement is required to indicate the appropriate design effect used in NHANES. As in the sort statement, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).
The weight statement accounts for the unequal probability of sampling and nonresponse. For more information on selecting the correct weight, please see Selecting the Correct Weight in the Weighting module.
The subpopn statement sets the subgroup. It is recommended that you use the subpopn statement instead of subsetting the data in the data step in SAS. Please see Creating Appropriate Subsets of Data for NHANES Analyses in the Weighting module for more information.
The var statement sets the variable of interest. The subgroup and levels statements set the categorical variables of interest and the number of levels corresponding to each categorical variable; the summary table below uses a class statement for the same purpose. The tables statement requests a stratified output of the categorical variables.
Step 4: Specify output
In this step, you will specify how the results are saved to a file because the output in the proc descript procedure was suppressed using the noprint option. The filetype option determines the type of data file to be produced and the filename option sets the name of the file to which your results will be saved. If you use the replace option, then every time you run the program, your results will be overwritten with the newer results.
In SUDAAN, one must specify the ATLEVEL1 and ATLEVEL2 options in the proc statement in proc descript or proc crosstab to request that PSUs and strata are counted. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. The values 1 and 2 are the positions on the nest statement of the variables used to designate the stages of sampling. These options are associated with the keywords ATLEV1 and ATLEV2, respectively, on the print or output statements. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
The mean and semean options output the mean and estimated standard error of the mean to the data file. The nsum option outputs the number of observations in each level in each subdomain to the data file. The deffmean option outputs the design effect for each subdomain to the data file.
The rformat option specifies the formats of the levels of each categorical variable in the tables statement. Format statements for each variable must be listed individually. The rtitle option is used to set the title for the output from the procedure. These options are necessary only when printing the results.
Step 5: Use SAS to calculate degrees of freedom and Wald 95% confidence intervals from SUDAAN output
After outputting the strata and PSU variables needed to calculate the degrees of freedom in SUDAAN, you can use SAS to calculate the Wald 95% confidence interval using the correct degrees of freedom for a subdomain.
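A minimal sketch of this step, assuming the SUDAAN results were saved to a SAS data file named test1 with the variable names numstrat, numpsu, mean, and semean, as assigned in the output statement of the summary table below. The names of the new variables (df, t_crit, lower95, upper95) and of the output dataset are illustrative.

```sas
/* Compute subdomain-specific degrees of freedom and Wald 95%
   confidence limits from the SUDAAN output file */
data test1_ci;
  set test1;

  df = numpsu - numstrat;   /* PSUs minus strata with valid observations */

  if df > 0 then do;
    t_crit  = tinv(0.975, df);            /* t quantile for this subdomain */
    lower95 = mean - t_crit * semean;
    upper95 = mean + t_crit * semean;
  end;
run;

proc print data=test1_ci;
run;
```

Rows with no degrees of freedom are left with missing limits rather than being given an interval. For prevalences near 0% or 100%, the Wald limits can still fall outside the valid range, so one of the alternative methods described earlier may be preferable.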
Summary: SUDAAN code to output estimates for calculating 95% confidence limits
The following table shows how to combine the statements described above to properly calculate 95% confidence limits. The procedure proc descript is being used as an example, but the design and nest statements can be used in the same manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.
Statements | Explanation |
---|---|
proc sort data=BP_analysis_Data; by sdmvstra sdmvpsu; run; | Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN. |
proc descript data=BP_analysis_Data design=WR noprint DEFT2 ATLEVEL1=1 ATLEVEL2=2; | Use proc descript to specify the dataset (BP_analysis_Data) and the sample design, using the design option WR (with replacement). Use the noprint option to suppress printing of results. In this example, you will be sending the results to a SAS data file for further calculations and printing in SAS. Use the DEFT2 option to request the calculation of the design effect using method 2 (see the SUDAAN manual for details on the differing methods for calculating the design effect). Method 2 is the method recommended by NCHS for NHANES data. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom. |
nest sdmvstra sdmvpsu; | Use the nest statement to specify the strata and PSU variables to account for the sample design effects. |
Weight wtmec4yr; | Use the weight statement to account for the unequal probability of sampling and nonresponse. In this example, the MEC weight for four years of data is used. |
SUBPOPN riagendr=2 and ridageyr > 19 and ridageyr < 60 ; | Use a subpopn statement to subset on the subgroup of interest. In this example, it is females (riagendr=2) between ages 20-59 years (ridageyr > 19 and ridageyr < 60). |
Var hbp; | Use a var statement to set the variable of interest as the percent with high blood pressure. (In order to generate results that are expressed in percent with the condition, this variable was coded as 0=no high blood pressure and 100=those with high blood pressure.) |
class race educ; | Use a class statement to set the categorical variables of interest. In this example, race/ethnic group (race) and education level (educ). |
Tables race*educ ; | Use a tables statement to request prevalence of high blood pressure stratified on education level (educ) within each race/ethnic group (race). |
Output atlev1=numstrat atlev2=numpsu mean=mean semean=semean NSUM=N deffmean / filetype=SAS filename=test1 replace; | Use the output statement to request output of results to a SAS data file (filetype=SAS) called test1 (filename=test1). Use the replace option to replace this file each time this program is run and updated with the latest results. Use the atlev1 option to create the SAS data variable, numstrat, with the value obtained from counting the number of strata in each subdomain requested with at least one valid observation. Use the atlev2 option to create a SAS variable, numpsu, with the value obtained from counting the number of PSUs in each subdomain requested with at least one valid observation. Use the mean option to output the mean to the SAS dataset. In this example, the mean is the percent of individuals at each level with high blood pressure. Use the semean option to output the standard error of the mean estimated above to the SAS dataset. Use the nsum option to create the variable N in the SAS dataset, which gives the number of observations in each level in each subdomain requested in the tables statement. Use the deffmean option to output the design effect for each subdomain requested. |
Rformat race racef.; Rformat educ educf.; | Use an rformat statement to specify formats for the levels of each categorical variable in the tables statement as needed. Format statements for each variable must be listed individually. In this example, you are setting the formats for the race/ethnic group (race) and education level (educ) variables. |
Rtitle "Prevalence of high blood pressure by race and education level for women age 20-59"; | Use the rtitle statement to set the title for the procedure's output. |
Statements | Explanation |
---|---|
Proc sort data=test1; by race educ; run; | Use the SAS procedure, proc sort, to sort the data. |
DATA test2; SET test1; | Use the data statement to create a new dataset (test2) and the set statement to read in the data file created in SUDAAN. |
if race=4 then delete; | Use an if-then statement to delete the other race (race=4) category. This subgroup is not reported for analysis. |
percent=round(mean,.01); sepercent=round(semean,.01); | Create the variables percent and sepercent and set them equal to a rounded value of the estimates using the round function. |
df=atlev2-atlev1; | Calculate the degrees of freedom by subtracting the number of strata (atlev1) from the number of PSUs (atlev2). |
tlow=tinv(.025,df); tup=tinv(.975,df); | Calculate the lower and upper critical values of the t-distribution using the tinv function, which returns the requested percentile of the t-distribution with the given degrees of freedom (df). |
rse=round((semean/mean)*100, .01); rsese=round((1/sqrt(df)), .01); | Calculate the relative standard error and the relative standard error of the standard error. These are useful for determining the reliability of estimates but are not used for confidence limit calculations. |
ll=round((mean+tlow*semean),.01); ul=round((mean+tup*semean),.01); | Calculate the lower and upper confidence limits. |
proc print; VAR Race Educ Percent Sepercent df deffmean tup ll ul; title1 'Degrees of Freedom and Wald 95% Confidence Interval'; title2 'Race and Education of High Blood Pressure NHANES 1999-2002'; run; | Use the proc print procedure to output the results. Use the var statement to indicate the variables of interest: race/ethnic group (race); education level (educ); percent of people in the category (percent); standard error of the percent (sepercent); degrees of freedom (df); design effect (deffmean); upper critical value of the t-distribution (tup); lower confidence limit (ll); and upper confidence limit (ul). |
Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters. In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is DS1. Proc descript is being used as a generic example, but these statements apply to all SUDAAN procedures.
Step 1: Sorting in SAS
To carry out the appropriate SUDAAN design option for NHANES data, the data from DS1 must first be sorted by strata and then by PSU (unless the data have already been sorted by PSU within strata). The proc sort procedure in SAS must precede any SUDAAN statements.
Warning: Data must always be sorted in SAS before doing analyses in SUDAAN.
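The sorting step can be sketched as follows. The proc sort call uses the dataset and design variables named in this example (DS1, sdmvstra, sdmvpsu). The proc freq check after it is an optional addition for illustration and is not part of the tutorial code.

```sas
/* Sort by stratum first, then by PSU within stratum,
   in the same order used later on the SUDAAN nest statement */
proc sort data=DS1;
  by sdmvstra sdmvpsu;
run;

/* Optional check (an illustrative addition, not tutorial code):
   list the stratum-by-PSU combinations present in the sorted data */
proc freq data=DS1;
  tables sdmvstra*sdmvpsu / list missing;
run;
```

The listing gives a quick view of how many PSUs appear in each stratum before any subgroup is defined.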
Step 2: Use proc statement in SUDAAN
Generally, a proc statement in SUDAAN immediately follows the sort statement. In this example, the proc descript statement is used. In addition, the data option specifies DS1 as the SAS dataset being used, the design option specifies with replacement (WR) as the design, and the noprint option suppresses printing of results as the results will output to a SAS data file.
Use the DEFT2 option to request the calculation of the design effect using SUDAAN Method 2 (see SUDAAN manual for details), which is the method recommended by NCHS for NHANES data.
Step 3: Specify design parameters in SUDAAN
The nest statement lists the variables that identify the strata and the PSU. The nest statement is required to indicate the appropriate design effect used in NHANES. As in the sort statement, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).
The weight statement accounts for the unequal probability of sampling and nonresponse. For more information on selecting the correct weight, please see Selecting the Correct Weight in the Weighting module. In this example we use an adjusted weight that was calculated based on linkage eligibility using methods described in Course 3, Module 7, Non-response and weighting issues with the NHANES-CMS Linked Data.
The subpopn statement sets the subgroup. It is recommended that you use the subpopn statement instead of subsetting the data in the data step in SAS. Please see Creating Appropriate Subsets of Data for NHANES Analyses in the Weighting module for more information.
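One way to keep every record in the analysis file while still identifying the subgroup is to build an indicator variable in a data step rather than deleting observations. The sketch below assumes the variable names used in this example (ridageyr, cms_medicare_match, obese). The dataset name DS1_flag and the variable insub are hypothetical names introduced only for illustration.

```sas
/* Keep all records; flag the subgroup instead of subsetting.
   DS1_flag and insub are hypothetical names, not from the tutorial. */
data DS1_flag;
  set DS1;
  /* A SAS logical expression evaluates to 1 (true) or 0 (false).
     Records with a missing value in any of the three variables
     fail the comparison and are coded 0. */
  insub = (ridageyr >= 65 and cms_medicare_match = 1 and obese = 1);
run;
```

A flag like this could then be referenced on the subpopn statement (for example, subpopn insub=1;) so that all strata and PSUs remain available to SUDAAN for variance estimation.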
The var statement sets the variable of interest. The class statement sets the categorical variables of interest; alternatively, the subgroup and levels statements set the categorical variables and the number of levels corresponding to each one. The tables statement requests a stratified output of the categorical variables.
Step 4: Specify output
In this step, you will specify how the results are saved to a file because the output in the proc descript procedure was suppressed using the noprint option. The filetype option determines the type of data file to be produced and the filename option sets the name of the file to which your results will be saved. If you use the replace option, then every time you run the program, your results will be overwritten with the newer results.
In SUDAAN, one must specify the ATLEVEL1 and ATLEVEL2 options in the proc statement in proc descript or proc crosstab to request that PSUs and strata are counted. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. The values 1 and 2 are the positions on the nest statement of the variables used to designate the stages of sampling. These options are associated with the keywords ATLEV1 or ATLEV2 respectively on the print or output statements. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
The mean and semean options output the mean and estimated standard error of the mean to the data file. The nsum option outputs the number of observations in each level in each subdomain to the data file. The deffmean option, when requested, outputs the design effect for each subdomain to the data file.
The rformat statement specifies the formats of the levels of each categorical variable in the tables statement. Format statements for each variable must be listed individually. The rtitle statement sets the title for the procedure's output. These statements are necessary only when printing the results.
Step 5: Use SAS to calculate degrees of freedom and Wald 95% confidence intervals from SUDAAN output
After outputting the strata and PSU variables needed to calculate the degrees of freedom in SUDAAN, you can use SAS to calculate the Wald 95% confidence interval using the correct degrees of freedom for a subdomain.
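The calculation in this step can be written compactly as follows. The symbols are notation introduced here for illustration and do not come from the SUDAAN or SAS documentation: n_PSU and n_strata stand for the ATLEV2 and ATLEV1 counts for the subdomain, p-hat for the estimated mean (here a percent), and SE for its estimated standard error.

```latex
\mathrm{df} = n_{\mathrm{PSU}} - n_{\mathrm{strata}}, \qquad
\mathrm{CI}_{95\%} = \hat{p} \;\pm\; t_{0.975,\,\mathrm{df}} \times \mathrm{SE}(\hat{p})
```

Because the t-distribution is symmetric, tinv(.025,df) equals the negative of tinv(.975,df). That is why the SAS code computes the lower limit as mean+tlow*semean and the upper limit as mean+tup*semean.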
Summary: SUDAAN code to output estimates for calculating 95% confidence limits
The following table shows how to combine the statements described above to properly calculate 95% confidence limits. The procedure proc descript is being used as an example, but the design and nest statements can be used in the same manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.
Statements | Explanation |
---|---|
proc sort data=DS1; by sdmvstra sdmvpsu; run; | Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN. |
proc descript data=DS1 design=WR DEFT2 ATLEVEL1=1 ATLEVEL2=2; | Use proc descript to specify the dataset (DS1) and the sample design, using the design option WR (with replacement). Use the DEFT2 option to request the calculation of the design effect using method 2 (see SUDAAN manual for details on the differing methods for calculating design effect). Method 2 is the method recommended by NCHS for NHANES data. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, the number of strata is level 1, and the number of PSUs is level 2) for which you want counts per table cell. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom. |
nest sdmvstra sdmvpsu; | Use the nest statement to specify the strata (sdmvstra) and PSU (sdmvpsu) variables to account for the sample design effects. |
Weight wt_linkage_adj; | Use the weight statement to account for the unequal probability of sampling and nonresponse. In this example, the weight adjusted for “linkage non-response” is used for six years of data. |
SUBPOPN ridageyr>=65 and cms_medicare_match=1 and obese=1; | Use a subpopn statement to subset on the subgroup of interest. In this example, it selects people aged 65 or older (ridageyr>=65) who linked to the Medicare files at some point during 1999-2007 (cms_medicare_match=1) and were obese (obese=1). |
Var carrier05; | Use a var statement to set the variable of interest as the percent of those on the 2005 Carrier file. (In order to generate results that are expressed in percent for a group of interest, this variable was coded as 0=not on 2005 Carrier file (on_carrier_2005=0) and 100=on 2005 Carrier file (on_carrier_2005=1).) |
class racecat; | Use a class statement to set the categorical variables of interest. In this example, race and Hispanic origin groups (racecat). |
Tables racecat; | Use a tables statement to request the percent on the 2005 Carrier file stratified by race and Hispanic origin group (racecat). |
Output atlev1=numstrat atlev2=numpsu mean=mean semean=semean NSUM=N / filetype=SAS filename=test1 replace; | Use the output statement to request output of results to a SAS data file (filetype=SAS) called test1 (filename=test1). Use the replace option to replace this file each time this program is run and updated with the latest results. Use the atlev1 option to create the SAS data variable, numstrat, with the value obtained from counting the number of strata in each subdomain requested with at least one valid observation. Use the atlev2 option to create a SAS variable, numpsu, with the value obtained from counting the number of PSUs in each subdomain requested with at least one valid observation. Use the mean option to output the mean to the SAS dataset. In this example, the mean is the percent of individuals at each level with records on the 2005 Carrier file. Use the semean option to output the standard error of the mean estimated above to the SAS dataset. Use the nsum option to create the variable N in the SAS dataset, which gives the number of observations in each level in each subdomain requested in the tables statement. |
Rformat racecat racef.; | Use an rformat option to specify formats for the levels of each categorical variable in the tables statement as needed. Format statements for each variable must be listed individually. In this example, you are setting the formats for the race and Hispanic origin group (racecat). |
Rtitle "Percent with records on the 2005 carrier file, aged 65 and older that were obese by race and Hispanic origin"; | Use the rtitle statement to set the title for the procedure's output. |
Statements | Explanation |
---|---|
Proc sort data=test1; by racecat; run; | Use the SAS procedure, proc sort, to sort the data. |
DATA test2; SET test1; | Use the data statement to create a new dataset (test2) and the set statement to read in the data file created in SUDAAN. |
percent=round(mean,.01); sepercent=round(semean,.01); | Create the variables percent and sepercent and set them equal to a rounded value of the estimates using the round function. |
df=atlev2-atlev1; | Calculate the degrees of freedom by subtracting the number of strata (atlev1) from the number of PSUs (atlev2). |
tlow=tinv(.025,df); tup=tinv(.975,df); | Calculate the lower and upper critical values of the t-distribution using the tinv function, which returns the requested percentile of the t-distribution with the given degrees of freedom (df). |
rse=round((semean/mean)*100,.01); rsese=round((1/sqrt(df)),.01); | Calculate the relative standard error and the relative standard error of the standard error. These are useful for determining the reliability of estimates but are not used for confidence limit calculations. |
ll=round((mean+tlow*semean),.01); ul=round((mean+tup*semean),.01); | Calculate the lower and upper confidence limits. |
proc print; | Use the proc print procedure to output the results. |
VAR Racecat Percent Sepercent df tup ll ul; | Use the var statement to indicate the variables of interest: race and Hispanic origin (racecat); percent of people in the category (percent); standard error of the percent (sepercent); degrees of freedom (df); upper critical value of the t-distribution (tup); lower confidence limit (ll); and upper confidence limit (ul). |
title1 'Degrees of Freedom and Wald 95% Confidence Interval'; title2 'Percent Linked to 2005 Carrier file by Race and Hispanic Origin'; title3 'for those who were age 65+ and obese: NHANES 1999-2004'; run; | These title statements identify the contents of the output. |
Output:
RACE | NSUM | PERCENT | SEPERCENT | DF | TUP | LL | UL |
---|---|---|---|---|---|---|---|
Overall | 843 | 73.06 | 2.32 | 44 | 2.01 | 68.37 | 77.74 |
non-Hispanic black | 148 | 75.64 | 4.22 | 14 | 2.14 | 66.6 | 84.69 |
Mexican American | 176 | 76.5 | 5.22 | 9 | 2.26 | 64.7 | 88.31 |
non-Hispanic white and others | 519 | 72.62 | 2.54 | 38 | 2.02 | 67.49 | 77.75 |
Given the small number of degrees of freedom for Mexican Americans (9), some caution may be needed when analyzing these data by race and Hispanic origin.
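As a rough cross-check of the output above, the confidence limits can be recomputed from the printed PERCENT, SEPERCENT and DF columns alone. This is only a sketch. The dataset name ci_check and the shortened group labels are hypothetical. Because the inputs are the rounded values from the table rather than the unrounded SUDAAN estimates, the recomputed limits may differ from the printed LL and UL by about 0.01.

```sas
/* Recompute Wald 95% limits from the rounded values printed in the table.
   ci_check and the short group labels are hypothetical names. */
data ci_check;
  length racecat $ 16;
  input racecat $ percent sepercent df;
  tup = tinv(.975, df);                       /* upper t critical value */
  ll  = round(percent - tup*sepercent, .01);  /* lower confidence limit */
  ul  = round(percent + tup*sepercent, .01);  /* upper confidence limit */
  datalines;
Overall        73.06 2.32 44
NH_black       75.64 4.22 14
Mex_American   76.50 5.22  9
NH_white_other 72.62 2.54 38
;
run;

proc print data=ci_check noobs;
  var racecat percent sepercent df tup ll ul;
run;
```

The Mexican American row shows the effect of few degrees of freedom: its critical value is the largest of the four groups, which widens the interval beyond what the standard error alone would give.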