These Frequently Asked Questions (FAQs) and answers cover the most common questions encountered when working with Continuous NHANES (1999 and on), NHANES III, NHANES II, and NHANES I data. The FAQs are arranged by tutorial module topic.
Question 1. I noticed your survey acronyms changed from NHES to NHANES. Are there any differences between these surveys?
Answer: The first three National Health Examination Surveys (NHES) were conducted between 1960 and 1970, each with its own specific age groupings as a target population. These three surveys were known as NHES I, II, and III.
Between 1971 and 1975, a large nutrition component was added to the fourth series and all subsequent surveys. The name of the survey was thus changed to the National Health and Nutrition Examination Survey (NHANES). Two subsequent periodic surveys were conducted in 1976-1980 and 1988-1994. These three surveys are known as NHANES I, II, and III. Since 1999, NHANES has been conducted annually.
Question 2. Have the data contents remained constant across surveys?
Answer: No. Survey contents have changed from NHES to NHANES. Compared with the NHES series, the NHANES series not only added a nutrition component, but also incorporated additional examination components (in certain years), as well as an environmental health component and a dental component.
NHANES added an in-person interview at the participant's home, as well as additional interviewer-administered questionnaires during the MEC interview, including a dietary questionnaire and questionnaires on selected special topics.
In NHANES III and continuous NHANES, selected groups of participants are also asked to complete questionnaires on other sensitive topics, such as illicit drug use or sexual behaviors using a self-administered interview (Audio-CASI).
For the dietary component, participants who have completed the MEC dietary questionnaire take part in two follow-up questionnaire activities: a telephone interview with survey staff using Computer-Assisted Telephone Interview (CATI), and a food frequency questionnaire that they fill out themselves.
Please also be mindful that certain examination components have been added or removed in different NHANES surveys.
Question 3. Why do you call NHANES conducted after 1999 "continuous NHANES"?
Answer: The previous NHANES I, II and III were periodic surveys, each conducted over a fixed span of years with intervals between waves. Since 1999, NHANES has been conducted annually without interruption.
The continuous NHANES program provides increased flexibility in changing survey contents to meet emerging needs. It also allows for increased timeliness in releasing two-year datasets and estimates on topics of public health interest.
Navigate the NHANES Website
Question 1. Can I access current and historic data conducted by your agency from the website?
Answer: Yes. All publicly available data and related documentation are released and updated in a centralized location: the NHANES website.
The website contains the public-use data files from the initial National Health Examination Survey (NHES) dataset up to the most current continuous NHANES dataset. Codebooks and documentation accompany each dataset released.
Previously, some datasets were released on CD or DVD. However, the most up-to-date versions of these datasets are available on the website.
Question 2. Why are certain variables or data files not publicly released on the website?
Answer: Some variables (for example, geographic identifiers) or entire data files are not publicly released due to disclosure concerns. These files are only available through the Research Data Center (RDC). You may review the Data Release and Access Policy for more information.
You can find more information on this topic from the link on the main section of the homepage: Access to Public and Non-Public NHANES Data Sets
Question 3. How do you decide the contents for each NHANES survey?
Answer: The NHANES program solicits survey proposals from various federal agencies such as NIH, USDA, EPA, and many other CDC centers. These proposals are reviewed by expert panels, and content is determined after a rigorous evaluation process including consideration of criteria such as public health importance, feasibility of the proposed survey items and burden to survey participants.
If you would like to find more information on this topic, you can click on the link on the main section of the homepage: Proposal Guidelines for New Survey Content.
Question 4. I noticed there is a lot of information in the data documentation. Do I have to read these documents?
Answer: The data documentation contains important and relevant information for your analysis. Therefore, it is very important that you consult the documentation to help you determine the scope of your analysis and which variables to include.
These documents are released with each survey dataset, and contain information pertaining to:
- Survey Contents, which shows the years components were collected and when changes to the components occurred,
- Sample Person Questionnaire protocols, and
- MEC Component Descriptions, which contain the examination and laboratory protocol for obtaining these measures in the survey.
Data Structure & Contents
Question 1. Are NHANES data structured the same way throughout the years?
Answer: No. The data structure and content of the NHANES surveys have changed over the years. This is important to understand when searching across different NHANES surveys. When you use the tutorial, please consult the NHANES Data Structure and Contents module to understand how the continuous NHANES (1999 – present) and historic NHANES (I, II and III) are laid out differently.
Question 2. What are the main differences in data structure between the continuous NHANES and NHANES III?
Answer: The continuous NHANES (1999-present) is conducted annually and released in two-year cycles, so its data are organized by survey cycle, namely, NHANES 1999-2000, 2001-2002, 2003-2004, etc.
Each cycle is divided into four sections labeled by data collection method: Demographics, Examination, Laboratory, and Questionnaire. Within each section are many components – groups of related variables packaged in a data file according to topics.
NHANES III was conducted in two phases, from 1988-91 and from 1992-94. However, the NHANES III data releases were not structured according to survey phases. Instead, they were organized according to the time of the data component release. Two specific releases contain the majority of data of interest to most researchers – these are Series 11, No. 1a and Series 11, No. 2a.
Series 11, No. 1a contains the majority of the data and corresponding documentation for the survey interview and examination components. It is further divided into five separate data files:
- NHANES III Household Adult Data File
- NHANES III Household Youth Data File
- NHANES III Examination Data File
- NHANES III Laboratory Data File
- NHANES III Dietary Recall Data Files
Question 3. I cannot find my variables in NHANES III series 11 No. 1a or 2a. Where else shall I search?
Answer: There are a number of other data releases (e.g. series No. 3a through No. 25a) which contain additional data based on a subsample of the survey, very specific topic areas which were delayed in their original release, or data files based on surplus sera projects. Also, new data releases on NHANES III continue to occur, so please check the NHANES website for these new releases.
Preparing an Analytic Dataset
Locate Variables
Question 1. How are continuous NHANES data (1999-present) organized on the publicly accessible website?
Answer: The data are organized as follows:
- The current NHANES data are first grouped into survey cycles (i.e. 2005-2006, 2003-2004, 2001-2002, 1999-2000);
- Within each cycle, the data files are organized into four major components: Demographic, Examination, Laboratory, and Questionnaire files;
- Survey variables are stored in these different components, and the component's variable list contains the list of all the publicly released variables and their file locations.
Question 2. Why are there so many data files?
Answer: The data files have been separated to reduce the time needed to download data and documentation from the Internet, and to make it easier to produce, edit, and validate the data files. This does require that you merge files together for analysis. Please refer to the NHANES data merge code example (SAS) to learn how to merge files together.
Question 3. How do I know which component contains the variables of interest to me?
Answer: Generally speaking, the continuous NHANES data files are organized by their collection method. For instance, if the information is collected via household interview or MEC interview, the items are most likely stored in the Questionnaire component. If the variables of interest are lab items, they are in the Laboratory component. In summary:
- Demographics files: survey design (e.g. weights, design strata) and demographic variables
- Examination files: information collected through physical exams, dental exams, and dietary interview components (Note: not every survey participant agreed to a physical examination)
- Laboratory files: results from specimens such as blood, urine, hair, air, tuberculosis skin test, and household dust and water specimens
- Questionnaire files: data collected through household interview and mobile examination center (MEC) interview
Question 4. Once I know which cycle and component to search for my variables, what is the fastest way to find them?
Answer: The continuous NHANES component's variable list contains the list of all the publicly released variables and their file locations. You can use the Adobe search function to speed up the process of finding your variables of interest. The variable lists include the following information:
- filename the variable is found in,
- name of the component,
- variable name, and
- variable's English label (a short description of the variable).
Question 5. My search resulted in a long list of variables. Which one is appropriate for my analysis?
Answer: Not every result returned will be relevant to your analysis. To decide which variables are appropriate, you need to review the data file documentation. The search returns variables with similar names, but some of them may be auxiliary variables, or may be used for excluding sample persons from certain measures, rather than the actual measurements you are interested in. Therefore, you have to READ THE DOCUMENTATION!
To find the relevant documentation, you should first write down the file names that the variables are stored in, and then use this introduction to identify the data file and documentation to download. The data file documentation will guide you to include the appropriate variables for your analysis.
Question 6. What kinds of NHANES documents are available and how is it best to use them?
Answer: There are three types of data documentation for each data file that you need to consult, and each provides valuable background information on your variables. The three documentation types are:
- the codebook,
- the data file documentation, and
- the frequency files.
Before analyzing the data, you will need to know how each variable is coded; how the data were collected, edited, and processed; and the frequency (or sample size) of the variable. The codebook lists all the variables in the data file. Use it to determine what the values associated with a variable mean. Use the data file documentation to determine if the collection or measurement is appropriate for your analysis. The frequency files for each data file contain the frequency count for each item in the data file and can be used to verify the sample size for a particular data item.
Question 7. On the NHANES 2003-2004 data page I see links for data. How do I access the data from these links?
Answer: The Docs and Procedures files are in Adobe PDF format, so you should be able to view these directly in your browser, if configured with Adobe Acrobat. A PDF file can be saved from this view using the "File/Save As..." menu and specifying a location on your local computer or network to store the file. Or you can right-click the file name directly on the webpage and select "Save Target As..." from the popup box, then specify a location to save the file on your computer.
Clicking on the Data link will open a dialog box from which you can specify a location to store the file (using the "Save" button) or open it directly with SAS (using the "Open" button.)
Question 8. Next to the name of each questionnaire section, laboratory component, or exam component on the NHANES 2003-2004 data page there are links that appear as follows: [Data, Docs, Procedures]. What are these links for?
Answer: In previous years (1999-2002), the codebook, documentation, file frequencies, and SAS transport dataset were each accessible through a separate link, in brackets, next to the data section name.
Starting with data years 2003-2004 the documentation, codebook, and frequencies have been combined in a single Adobe PDF file, accessible from the single Docs link. A new, direct link to Procedures is also being provided for each data section on the data page.
For exam sections this will link to the examination procedures manual for that section, and for questionnaire sections this will link to the questionnaire instrument. Procedure manuals for laboratory sections are also available but as there may be multiple documents for a given lab, these will remain on a separate page, accessible by clicking the Laboratory Procedures Manuals link at the top of the list of laboratory data sections.
Question 9. I know NHANES collected information on certain topics, but I couldn't find them on the variable lists. What happened to those data collected?
Answer: This is because all variables listed in the component variable lists are those that are publicly released. These data files are available for download from the NHANES website. If you wish to use variables that are NOT listed in a component variable list, you will need to use the Research Data Center. You can review the NCHS Research Data Center for more information about how to obtain access to non-publicly released variables.
Question 10. Why isn't the adolescent data on alcohol use, smoking, sexual behavior, reproductive health and drug use available as a public release file?
Answer: These files have not been released on the NHANES website due to confidentiality concerns. Adolescent data files containing this sensitive information will be made available at the NCHS Research Data Center.
Download Data Files
Question 1. Where can I access NHANES data files?
Answer: Both current and historic NHANES data are made available for download on the NHANES website. The only exceptions would be those files not publicly released yet due to disclosure risk, or still under processing.
Question 2. What format are the data files in? Can they be used with SAS, SPSS, or Stata?
Answer: The Continuous NHANES files are in SAS transport file format (.xpt). They can be used with any package that supports this file format. For statistical/analytical packages that do not support SAS transport file format, you need to convert the file to a different format using an appropriate software package. Users desiring alternate data formats can use the SAS System Viewer — a free download from SAS Institute — to convert the transport file into space-, tab-, or comma-delimited text files for use in additional software programs, such as Microsoft Excel. All prior NHANES data files are in ASCII format.
Question 3. I have downloaded the files, but I cannot run any of them with my statistical program. What happened?
Answer: This might be due to the fact that you have NOT extracted the files yet. These transport files are NOT usable until you first extract them using the XPORT engine. Then you will need to use proc copy and save them as permanent SAS datasets.
Question 4. What operating system do I need to extract these files?
Answer: NHANES data files can be extracted on Windows, UNIX, or Macintosh based systems.
Question 5. Why do I need to go through the trouble of extracting and saving these files? Can't I just double click on the files and let the SAS program extract and save them automatically?
Answer: It is true that some users might be able to double click on the downloaded files, and your SAS software will extract and save them as temporary WORK files automatically. However, depending on the operating system or version of SAS program, some users may not be able to do the double clicking directly. That is why we provided you the instructions for extracting and saving these files.
Even for those users who can double click and create temporary files, we still recommend that you save them as permanent SAS datasets, because as soon as you exit the SAS program, the temporary (WORK) files will no longer exist.
Append & Merge Datasets
Question 1. I noticed that continuous NHANES data files are released in 2-year cycles. What do I do if I need to combine different years together?
Answer: When you want to combine multiple years, you need to append the data files from different survey cycles on the same variables.
Question 2. Why do I have to check the contents of the data files before appending the data? What do I do if I find variables named or labeled differently?
Answer: You should always check the variable lists first because variable names may be different from cycle to cycle, or recoded or derived variables may be added in different cycles.
- If the names or labels of the variables of interest are identical in the selected cycles, you can append the data files directly.
- If the names or labels of the variables of interest have changed, you will have to find out whether the wording, definition, and/or response categories have been modified, and then recode the variables to make their names and response categories consistent before appending.
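The appending step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an official NHANES procedure: the records are made up, the variable names XYZ010 and XYZ011 are placeholders rather than real NHANES variables, and the helper names `harmonize` and `append_cycles` are hypothetical. It assumes you have already confirmed from the documentation that the two variables measure the same thing.

```python
# Made-up records from two survey cycles. XYZ010 and XYZ011 are placeholder
# names, not real NHANES variables.
cycle_a = [{"SEQN": 1, "XYZ010": 2}, {"SEQN": 2, "XYZ010": 1}]
cycle_b = [{"SEQN": 10001, "XYZ011": 1}, {"SEQN": 10002, "XYZ011": 2}]


def harmonize(records, rename, cycle_label):
    """Rename variables to a common name and tag each row with its cycle."""
    out = []
    for row in records:
        new_row = {rename.get(name, name): value for name, value in row.items()}
        new_row["CYCLE"] = cycle_label
        out.append(new_row)
    return out


def append_cycles(*cycles):
    """Stack rows from several cycles; every row must carry the same variables."""
    combined = [row for cycle in cycles for row in cycle]
    expected = set(combined[0]) if combined else set()
    for row in combined:
        if set(row) != expected:
            raise ValueError(f"inconsistent variables for SEQN {row.get('SEQN')}")
    return combined


# The second cycle's variable is renamed before appending, as the answer advises.
combined = append_cycles(
    harmonize(cycle_a, {}, "A"),
    harmonize(cycle_b, {"XYZ011": "XYZ010"}, "B"),
)
```

Raising an error on mismatched variable sets is a deliberate choice here: it forces the renaming or recoding decision to be made explicitly rather than silently producing rows with gaps.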
Question 3. The variables I'm interested in come from interview, examination and laboratory components. How do I combine them together?
Answer: If you want to join variables from different components, you need to merge these data files.
The first step in merging data is to sort each of the data files by a unique identifier. Then you merge the data by that unique identifier.
Question 4. What is the unique identifier in NHANES data that we need to append or merge data by?
Answer: In NHANES, each sampled person is identified by a unique sequence number, and the variable name for it is SEQN. Every time you extract variables from an NHANES data file, you should always include the SEQN variable in your selection. The SEQN will later be used for sorting the data, and for appending and merging the data files.
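As a rough illustration of merging by SEQN, the sketch below joins records from two components in Python. The data are made up, the variable names AGE_X and EXAM_X are placeholders (not real NHANES variables), and `merge_by_seqn` is a hypothetical helper. It keeps every sample person and leaves a variable empty (`None`) when a person has no record in a component, which mirrors the fact that not every participant has examination data.

```python
def merge_by_seqn(*files):
    """Merge several lists of records on SEQN, keeping every sample person."""
    merged = {}
    for records in files:
        for row in records:
            merged.setdefault(row["SEQN"], {}).update(row)
    # Return rows sorted by SEQN; variables absent for a person become None.
    all_vars = sorted({name for row in merged.values() for name in row})
    return [
        {name: merged[seqn].get(name) for name in all_vars}
        for seqn in sorted(merged)
    ]


# Made-up demographic and examination records (placeholder variable names).
demo = [{"SEQN": 3, "AGE_X": 40}, {"SEQN": 1, "AGE_X": 25}, {"SEQN": 2, "AGE_X": 61}]
exam = [{"SEQN": 2, "EXAM_X": 27.5}, {"SEQN": 1, "EXAM_X": 22.0}]

merged_rows = merge_by_seqn(demo, exam)
```

In this example the person with SEQN 3 appears in the merged result with `EXAM_X` set to `None`, so the missing examination data stay visible instead of the person being dropped.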
Clean & Recode Data
Question 1. What percent of missing data is usually acceptable for NHANES data analysis?
Answer: As a general rule, if 10% or less of your data for a variable are missing from your analytic dataset, it is usually acceptable to continue your analysis without further evaluation or adjustment. However, if more than 10% of the data for a variable are missing, you may need to determine whether the missing values are distributed equally across socio-demographic characteristics, and decide whether further imputation of missing values or use of adjusted weights is necessary.
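The 10% rule of thumb can be checked with a short calculation. The following Python sketch uses made-up records and a placeholder variable name (XYZ010); `percent_missing` is a hypothetical helper, and it assumes that "refused" and "don't know" codes have already been recoded to missing (`None`).

```python
def percent_missing(records, variable):
    """Return the percentage of records in which `variable` is missing (None)."""
    if not records:
        return 0.0
    n_missing = sum(1 for row in records if row.get(variable) is None)
    return 100.0 * n_missing / len(records)


# Made-up analytic dataset: 20 sample persons, 3 with a missing value.
records = [{"SEQN": i, "XYZ010": None if i <= 3 else 1} for i in range(1, 21)]

pct = percent_missing(records, "XYZ010")
# Flag the variable for further evaluation when more than 10% is missing.
needs_review = pct > 10
```

Here 3 of 20 values are missing (15%), so the variable would be flagged for a closer look at how the missing values are distributed.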
Question 2. How are missing values, "blank but applicable", "don't know" and other values coded?
Answer: There are codes for refused (7-fill: that is 7, or 77, or 777…, depending on the number of digits required for a particular data value), don't know (9-fill), and missing values (a blank field), which means the person was not asked the question or given the test. There is no longer a specific code for those cases where the variable response is "blank but applicable"; for such cases the values are designated as missing values. For laboratory data there are special considerations. When a laboratory value was less than the lower limit of detection (LOD), a "fill" value based on the LOD was used instead of the sample value, as the sample value was deemed "not detectable." An indicator variable (taking the value 0 or 1) is used to identify which values are real and which values are fill values.
Question 3. Why do I have to check the missing data?
Answer: If you fail to identify "refusal" or "don't know" as types of missing data, and treat the assigned values for "refused" or "don't know" as real values, you will get distorted results in your statistical analyses. Therefore, it is important to recode "refused" or "don't know" responses as missing values (either as a period (.) for numeric variables or as a blank for character variables).
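The recoding described above might look like the following Python sketch. It is only an illustration: `fill_codes` and `recode_missing` are hypothetical helpers, `None` stands in for a SAS missing value, and the field width of each variable is an input you would take from the codebook. Always confirm in the codebook that a variable actually uses 7-fill and 9-fill codes, because for some variables a value such as 7 may be a valid response.

```python
def fill_codes(width):
    """Return the 'refused' (7-fill) and 'don't know' (9-fill) codes for a field width."""
    return {int("7" * width), int("9" * width)}


def recode_missing(value, width):
    """Recode refused / don't know codes to None; leave other values unchanged."""
    if value is None:
        return None
    return None if value in fill_codes(width) else value


# Made-up responses for a one-digit item and a two-digit item.
one_digit = [1, 2, 7, 9, None]
two_digit = [7, 12, 77, 99]

clean_one = [recode_missing(v, 1) for v in one_digit]
clean_two = [recode_missing(v, 2) for v in two_digit]
```

Note that the value 7 is recoded for the one-digit item but kept for the two-digit item, where the refused code is 77.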
Question 4. How do I determine the skip patterns for a questionnaire section?
Answer: The first step is to review all of the documentation for the questionnaires. To review skip patterns look at the complete questionnaire specifications. Please note that not all questionnaire items are released due to small sample sizes and confidentiality/sensitivity issues, but all skip pattern integrity was maintained and validated.
The significance of a skip pattern depends on the question leading to the skip pattern, the questions within that skip pattern, and the variables you intend to analyze. If you fail to check for skip patterns, you may obtain only a proportion of the population, instead of the entire study population. Check the codebook to determine if a skip pattern affects the variables in your analysis.
Question 5. How do I check for outliers, and what do I do with influential outliers?
Answer: For continuous variables, you identify outliers by using a univariate analysis to check for normality. If the distribution is highly skewed, you can do a data transformation to make the distribution of the data closer to normal.
After checking the distribution and normality of the data, plot the survey weight against the variable to determine which of the extreme values identified in the univariate analysis are outliers. You must also determine whether the outliers represent valid values and, if so, whether they also carry extremely large survey weights.
Outliers with extremely large weights could have an influential impact on your estimates. You will have to decide whether to keep these influential outliers in your analysis or not. It is up to the analyst to make that decision.
Please consult the Analytical Guidelines for more information on this topic.
Format & Label Data
Question 1. Do I have to format and label all variables?
Answer: No. Formatting and labeling variables in SAS is optional and does not need to be done for all variables in the dataset. However, it is especially useful for frequently used variables and for clarity in your output.
Question 2. Are there rules on how to format and label variables?
Answer: Formats and labels are user-defined tools that provide a convenient way to define variables in your SAS or SUDAAN output. Formatting is used to assign descriptive text names to numeric and character values of a variable. Labeling, on the other hand, allows you to assign descriptive titles to variable names.
Survey Design Factors
Question 1. What do you mean by the phrase "NHANES is a complex survey"?
Answer: We frequently refer to NHANES as a complex survey because the data are not obtained using a simple random sample. Rather, a complex, multistage, probability sampling design is used to select participants representative of the civilian, non-institutionalized US population.
Question 2. How do you draw an NHANES sample?
Answer: The NHANES study draws its sample in four stages described below:
- Stage 1: Primary sampling units (PSUs) are selected with probability proportional to a measure of size (PPS). These are mostly single counties or, in a few cases, groups of contiguous counties.
- Stage 2: The PSUs are divided up into segments (generally city blocks or their equivalent). As with the PSUs, sample segments are selected with PPS.
- Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas.
- Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household.
Question 3. What is a Sample Weight?
Answer: A sample weight is assigned to each sample person. It is a measure of the number of people in the population represented by that sample person in NHANES, reflecting the unequal probability of selection, nonresponse adjustment, and adjustment to independent population controls. When unequal selection probability is applied, as in the NHANES sample, the sample weights are used to produce an unbiased national estimate. More information about sample weights and how they are created can be found in the Weighting module.
Question 4. Do I have to use sample weights and other survey design variables?
Answer: Yes. For NHANES datasets, the use of sampling weights and sample design variables is recommended for all analyses because the sample design is a clustered design and incorporates differential probabilities of selection. Accounting for the complex sampling design of NHANES is especially critical when calculating statistical estimates and estimating standard errors of means, geometric means, percentages and other statistics.
If you fail to account for the sampling parameters, you may obtain biased estimates and overstate significance levels.
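To see why the weights matter for point estimates, consider a toy calculation in Python. The values and weights below are invented purely for illustration, and `weighted_mean` is a hypothetical helper. This sketch covers only the weighted point estimate; correct standard errors also require the stratum and PSU design variables and survey-aware software such as SUDAAN or the SAS survey procedures.

```python
def weighted_mean(values, weights):
    """Weighted mean: each value counts in proportion to its sample weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two invented sample persons: the second represents three times as many
# people in the population as the first.
values = [10.0, 20.0]
weights = [1000.0, 3000.0]

unweighted = sum(values) / len(values)      # treats both persons equally
weighted = weighted_mean(values, weights)   # reflects the population they represent
```

With these numbers the unweighted mean is 15.0 while the weighted mean is 17.5, because the person with the larger weight represents more of the population.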
Question 5. What are Masked Variance Units (MVUs) and why do we need them in analyses?
Answer: Primary Sampling Units (PSU) are selected from strata defined by geography and proportions of minority populations. Most strata contain two PSUs. Together, these strata and the PSUs represent the variance units (sampling units used to estimate sampling error).
To protect the confidentiality of data obtained from sample persons, Masked Variance Units (MVU) are constructed. MVUs are equivalent to Pseudo-PSUs used to estimate sampling errors in past NHANES. The MVUs on the data file are not the "true" design PSUs. They are a collection of secondary sampling units aggregated into groups for the purpose of variance estimation. They produce variance estimates that closely approximate the variances that would have been estimated using the "true" design variables.
These MVUs have been created for each two-year cycle of NHANES and have been created in a way that allows them to be used for any combination of data cycles. These MVUs are used to define the strata and PSU variables on the public release files. The variable name for the stratum is sdmvstra and the variable name for the PSU is sdmvpsu.
Question 6. Why does NHANES oversample some groups but not others? Do you oversample different groups over the years?
Answer: NHANES typically samples larger numbers of certain subgroups who are of particular public health interest. Oversampling is done to increase the reliability and precision of estimates of health status indicators for these population subgroups. Other subgroups may not be oversampled because doing so is cost prohibitive, operationally not feasible, or both.
Which subgroups get oversampled does change from cycle to cycle. Therefore, it is critical to carefully review the documentation for each survey cycle to determine which subgroups were oversampled.
Specifying Weighting Parameters
Question 1. How are NHANES weights constructed?
Answer: In general, a sample person is assigned a base weight that is equivalent to the reciprocal of his/her probability of selection. However, calculating the base weight in NHANES is much more complicated due to the survey's complex, multistage design. In summary, NHANES weights are:
- based on the final probability of selection through the four sampling stages;
- adjusted for nonresponse to the in-home interview when creating interview weights, and further adjusted for nonresponse to the MEC exam when creating exam weights; and
- post-stratified to match the population control totals for each sampling subdomain.
Question 2. How do NHANES weights account for different response rates to the in-home interview and MEC exam?
Answer: In NHANES, an individual can be classified as a non-respondent to the interview portion of the survey and/or the exam portion. An individual is considered a non-respondent to the interview if he/she was selected to be in the sample, but did not participate in the in-home interview. Similarly, an individual who agreed to complete the interview but did not agree to, or come in for, the MEC portion of the survey is considered a non-respondent to the exam.
Adjustments made for survey non-response account only for sample person interview or exam non-response, but not for component/item non-response (e.g., a sample person declined to have their blood pressure measured in the examination component but completed all other examination components).
To produce estimates appropriately adjusted for survey non-response it is important to check all of the variables in your analysis and select the weight of the smallest analysis subpopulation. All interview and MEC exam weights can be found on the demographic file for the respective survey. Weights for a given component conducted on only a subsample of the original NHANES sample are available on the data file for that particular component.
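The "smallest analysis subpopulation" rule can be written down as a small decision function. The Python sketch below is one possible reading of that rule for a single two-year cycle, not an official algorithm: `choose_weight` and the source labels "interview", "mec", and "subsample:<name>" are hypothetical conventions invented for this example. The weight names wtint2yr and wtmec2yr are the interview and MEC weights on the demographic file; subsample weight names vary by component, so the function only points to the component's data file.

```python
def choose_weight(sources):
    """Pick the weight for the smallest analysis subpopulation.

    `sources` lists where each analysis variable was collected: "interview",
    "mec", or "subsample:<name>" (a labeling convention used only in this sketch).
    """
    if not sources:
        raise ValueError("at least one variable source is required")
    subsamples = {s for s in sources if s.startswith("subsample:")}
    if len(subsamples) > 1:
        # Subsample weights are not designed to be combined.
        raise ValueError("variables come from more than one subsample")
    if subsamples:
        name = next(iter(subsamples)).split(":", 1)[1]
        return f"subsample weight on the {name} data file"
    if "mec" in sources:
        return "wtmec2yr"
    return "wtint2yr"


interview_only = choose_weight(["interview", "interview"])
with_exam = choose_weight(["interview", "mec"])
with_subsample = choose_weight(["interview", "mec", "subsample:fasting"])
```

The order of the checks reflects nesting: subsample participants are a subset of MEC-examined persons, who are in turn a subset of interviewed persons.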
Question 3. Will data and weights be available on public use files for single years such as 1999, 2000, 2001, or 2002?
Answer: No. Even though each single year in NHANES comprises a nationally representative sample of the U.S. population, two years of data are necessary for sufficient sample sizes. Hence, the data are released in two-year cycles, and no single-year weights will be available to the public.
Sometimes, even two years of data do not guarantee sufficient sample size to produce statistically reliable estimates. This is especially true when you are dealing with rare events or demographic subdomains (e.g., sex-age-race/ethnicity groups).
Therefore, combining two or more 2-year cycles of the continuous NHANES is strongly recommended, whereas analyzing a single year of NHANES data is discouraged.
However, since some components were collected across three years during 1999-2002, single year datasets for 1999-2002 are available in the Research Data Center (RDC).
Question 4. I was told I have to use the 4-year weights provided on public use files for 1999-2002. Why can't I combine weights together myself?
Answer: Sample weights for NHANES 1999-2000 were based on population estimates developed by the Bureau of the Census before the Year 2000 Decennial Census counts became available. The 2-year sample weights for NHANES 2001-2002, and all other subsequent 2-year cycles, are based on population estimates that incorporate the year 2000 Census counts.
Because different population bases were used, the 2-year weights for 1999-2000 and 2001-2002 are not directly comparable. Therefore, when combining 1999-2000 with 2001-2002 survey years in analyses, you must use the 4-year sample weights provided by NCHS since these have been created to account for the two different reference populations.
For both 1999-2000 and 2001-2002 survey cycles, the demographic file contains the weight variables for your use:
- wtint2yr and wtint4yr for all interviewed sample persons,
- wtmec2yr and wtmec4yr for the sample persons who have MEC data items, and
- two-year and four-year (for subsample datasets with consistent data elements across two survey cycles) subsample weights for selected sample persons.
Question 5. How do I calculate 6-year weights?
Answer: Researchers should calculate six-year sample weights for NHANES 1999-2004 themselves. Because the weights for the first two cycles (NHANES 1999-2002) are already combined in the 4-year weight, the six-year weight is WT99-04 = (2/3) x WT99-02 + (1/3) x WT03-04, where WT99-02 is the variable WTMEC4YR from the NHANES 2001-2002 demographic file dataset, and WT03-04 is the variable WTMEC2YR from the NHANES 2003-2004 demographic file dataset. Each sample person belongs to only one survey cycle, so only the term for that person's cycle applies. Please refer to the NHANES Analytic Guidelines provided with the data release files to determine the appropriate methodology for analyses of combined years of data.
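The six-year weight calculation can be sketched in Python. This is a minimal illustration rather than NCHS code: the function name and the cycle labels are assumptions made for the example, and only the 2/3 and 1/3 factors and the variable names WTMEC4YR and WTMEC2YR come from the answer above.

```python
def six_year_mec_weight(cycle, wtmec4yr=None, wtmec2yr=None):
    """Return a 1999-2004 MEC weight for one sample person.

    cycle    -- hypothetical label for the person's survey cycle
    wtmec4yr -- 4-year MEC weight (persons examined in 1999-2002)
    wtmec2yr -- 2-year MEC weight (persons examined in 2003-2004)
    """
    if cycle in ("1999-2000", "2001-2002"):
        # 1999-2002 covers two of the three cycles, hence the 2/3 factor.
        return (2.0 / 3.0) * wtmec4yr
    if cycle == "2003-2004":
        # 2003-2004 is one of the three cycles, hence the 1/3 factor.
        return (1.0 / 3.0) * wtmec2yr
    raise ValueError("cycle is outside NHANES 1999-2004: %r" % (cycle,))


# Made-up weights, for illustration only.
w_early = six_year_mec_weight("2001-2002", wtmec4yr=30000.0)
w_late = six_year_mec_weight("2003-2004", wtmec2yr=30000.0)
```

The same pattern would apply to interview weights, using the interview weight variables in place of the MEC weights.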
Question 6. What are the subsample weights and how are they constructed?
Answer: NHANES respondents are asked to participate in a variety of survey components that are statistically defined (or random) subsamples of the NHANES MEC-examined sample. These include a variety of lab, nutrition/dietary, environmental, or mental health components. (Please see the respective survey protocol/documentation for more specific information.)
For example, some, but not all, participants are selected to give a fasting blood sample on the morning of their MEC exam. The subsamples selected for these components are chosen at random with a specified sampling fraction (for example, 1/2 or 1/3 of the total examined group) according to the protocol for that component. Each component subsample has its own designated weight, which accounts for the additional probability of selection into the subsample component, as well as the additional nonresponse.
Question 7. Can you combine subsample weights?
Answer: No. Subsample weights are not designed to be combined. In fact, many subsamples are mutually exclusive. If it is necessary to combine two or more subsamples for your analyses, then appropriate weights would need to be recalculated. However, details on how to recalculate weights when combining subsamples go well beyond the scope of this tutorial. Therefore, it is strongly advised that you do not attempt to combine subsamples in any analysis.
Question 8. When I subset NHANES data, should I do it in SUDAAN or in SAS data steps?
Answer: For SUDAAN procedures it is important that you do not create a smaller subgroup based on any non weight-related groups of interest (e.g. demographic, laboratory or examination variables) in the SAS data step before executing the SUDAAN procedure. Instead, it is highly recommended that you create a subset of your sample population using the subpopn statement in the SUDAAN procedure itself and not in the SAS data step. In addition, SUDAAN procedures require that all observations in the dataset being read into a procedure have the same sample weight. Therefore, prior to the SUDAAN procedure you should create a subset of your data to include only those observations with the appropriate sample weight for your analysis.
For SAS Survey procedures, there is no subpopn statement. Instead, most SAS Survey procedures use a domain statement for domain analysis, also known as subgroup analysis or subpopulation analysis.
Question 1. What kind of sampling features may affect the variance estimates of NHANES data?
Answer: NHANES has a complex, multistage, probability cluster design, and statistical analyses must take these sample design features into account. Specifically, attributes of the complex sample design such as differential weighting, clustering, and stratification all affect variance estimates and estimated standard errors, and thereby test statistics and confidence intervals.
Question 2. What would happen to the variance estimates if standard statistical software for simple random samples is used?
Answer: In a complex sample survey setting such as NHANES, variance estimates computed using standard statistical software packages that assume simple random sampling are generally too low and therefore biased. As a result, significance levels are overstated and type I error is more likely to occur. This is because these software packages do not account for the differential weighting and the correlation among sample persons within a cluster.
Question 3. How do you estimate the impact of a complex sample design on variance estimates?
Answer: The impact of the complex sample design upon variance estimates is measured by the design effect (DEFF). It is defined as the ratio of the variance of a statistic which accounts for the complex sample design to the variance of the same statistic based on a hypothetical simple random sample of the same size.
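As a rough illustration of this ratio, the sketch below computes a design effect for a prevalence estimate. The function names and the numbers are hypothetical, and the simple random sample variance uses the common approximation p(1-p)/n. In practice, survey software reports the design-based variance and the design effect directly.

```python
def srs_variance_of_proportion(p, n):
    """Approximate variance of a proportion under simple random sampling."""
    return p * (1.0 - p) / n


def design_effect(design_variance, srs_variance):
    """DEFF: design-based variance divided by the simple random sample variance."""
    if srs_variance <= 0:
        raise ValueError("srs_variance must be positive")
    return design_variance / srs_variance


# Made-up example: prevalence of 20% among 1,000 sample persons,
# with a design-based standard error of 0.02.
design_var = 0.02 ** 2
srs_var = srs_variance_of_proportion(0.20, 1000)
deff = design_effect(design_var, srs_var)  # about 2.5
```

A design effect above 1 indicates that the complex design yields a larger variance than a simple random sample of the same size would.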
Question 4. Are there specific mathematical formulas you recommend to use for computing variance estimates for complex survey data?
Answer: For complex sample surveys, exact mathematical formulas for variance estimates are usually not available. Variance approximation procedures are required to provide reasonable estimates of sampling error.
Two variance approximation procedures that account for the complex sample design and compute design effects are replication methods and Taylor Series Linearization. Initially, the delete-1 jackknife method, a replication method, was used to estimate variances based on data from the NHANES 1999-2000 survey. Balanced repeated replication was used for NHANES III. Currently, NCHS recommends the use of Taylor Series Linearization methods for variance estimation in all NHANES surveys. SUDAAN, Stata, and the SAS Survey procedures can be used to obtain variance estimates by this method. Survey design variables identifying strata and PSUs are required in order to use these software packages. If replication methods are used, you must compute your own replicate weights.
Question 5. Why do you emphasize degrees of freedom so much in your NHANES tutorial?
Answer: There are several reasons for the emphasis given to the proper calculation of the degrees of freedom:
- To calculate the correct value for the t-statistic from a t-distribution and a selected level of significance, you must calculate the proper degrees of freedom for the estimate.
- Continuing research on the stability of variance estimates in subdomains of NHANES has shown that standard error estimates based on small numbers of paired PSUs (i.e., few degrees of freedom) are prone to instability. Therefore, it is important to examine the number of degrees of freedom on which a standard error estimate is based.
- The reliability of the estimated standard error, as measured by its relative standard error (i.e., (standard error of the standard error of the estimate/standard error of the estimate)*100), is inversely related to its degrees of freedom. As the number of degrees of freedom increases, the relative standard error decreases and the reliability of the estimate increases. The NHANES guidelines recommended a relative standard error of at most 30%. This corresponds to at least 12 degrees of freedom.
Question 6. How do you properly calculate the degrees of freedom?
Answer: Degrees of freedom are properly calculated by subtracting the number of clusters in the first level of sampling (strata) from the number of clusters in the second level of sampling (PSUs) for each subgroup you are analyzing.
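The count described above can be sketched as follows. This is an illustrative helper, not part of any survey package. It assumes each sample person in the subgroup is represented by a (stratum, PSU) pair, and the function name and example data are hypothetical.

```python
def degrees_of_freedom(stratum_psu_pairs):
    """Number of PSUs containing observations minus number of strata containing observations.

    stratum_psu_pairs -- iterable of (stratum, psu) identifiers, one per
                         sample person in the subgroup being analyzed
    """
    # PSU numbers repeat across strata, so a PSU is identified by the pair.
    psus = set(stratum_psu_pairs)
    strata = {stratum for stratum, _ in psus}
    return len(psus) - len(strata)


# Made-up subgroup: strata 1 and 2 have both PSUs represented,
# stratum 3 has only one PSU represented.
subgroup = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
df = degrees_of_freedom(subgroup)  # 5 PSUs - 3 strata = 2
```

Running the count on the subgroup rather than on the full sample shows how the degrees of freedom shrink when some PSUs contain no members of the subgroup.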
Question 7. Are there any differences between SAS and SUDAAN software in terms of handling the degrees of freedom?
Answer: For both SUDAAN and SAS Survey procedures, the degrees of freedom are calculated in the same way when looking at the entire sample population or in subgroups where all strata and PSUs are represented.
However, when you analyze data on a subgroup of sample persons who may not be represented in all strata and PSUs (e.g., Mexican Americans), the degrees of freedom provided in the SUDAAN and SAS Survey Procedures output may differ. For example, SAS Survey procedures, such as proc surveymeans, compute the degrees of freedom as the number of clusters (PSUs) in the non-empty strata minus the number of non-empty strata.
This means that if your data have empty strata (no persons in the population for either PSU) the number of degrees of freedom will increase. This is incorrect and SAS is currently working on correcting this problem.
Question 8. How do you generate confidence intervals using SAS or SUDAAN?
Answer: Both SAS Survey procedures (proc surveymeans) and SUDAAN version 9.1 (proc descript) produce 95% confidence intervals (CI). These 95% CIs are calculated using the Wald method, which is based on a t-statistic for the number of degrees of freedom in the entire NHANES sample.
However, they do not correct for the reduction in the degrees of freedom in subdomains where not all strata and PSUs are represented. Please see the Variance Estimation module for instructions on how to correctly calculate a 95% confidence interval. Also, the Wald method should not be used when the proportion is close to 0% or 100%, because it may produce lower limits less than 0% or upper limits greater than 100%. In these cases, it is often recommended to calculate 95% confidence limits with alternative methods: transformations (such as the logit or arcsine transformation), the Wilson method, or exact confidence limits such as the Clopper-Pearson approach.
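To illustrate one of the alternatives mentioned above, the sketch below builds a confidence interval on the logit scale and transforms the endpoints back to proportions. It is a simplified example under stated assumptions: the function name is hypothetical, the standard error on the logit scale comes from the delta method, and the t critical value is supplied by the user for the appropriate degrees of freedom (2.179 is the two-sided 95% value for 12 degrees of freedom).

```python
import math


def logit_confidence_interval(p, se, t_crit):
    """Confidence limits for a proportion, computed on the logit scale.

    p      -- estimated proportion, strictly between 0 and 1
    se     -- design-based standard error of p
    t_crit -- t critical value for the chosen level and degrees of freedom
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    logit = math.log(p / (1.0 - p))
    se_logit = se / (p * (1.0 - p))  # delta-method standard error of the logit
    lower = logit - t_crit * se_logit
    upper = logit + t_crit * se_logit

    def expit(x):
        # Inverse of the logit transformation.
        return 1.0 / (1.0 + math.exp(-x))

    return expit(lower), expit(upper)


# Made-up example: 2% prevalence with a standard error of 1.5 percentage points.
wald_lower = 0.02 - 2.179 * 0.015  # negative, i.e. below 0%
lo, hi = logit_confidence_interval(0.02, 0.015, 2.179)  # both limits stay inside (0, 1)
```

The back-transformed interval is asymmetric around the estimate, which is expected for proportions near 0% or 100%.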
Question 1. In the tutorial, you recommend checking the frequency distribution of each variable before analysis. Why?
Answer: A frequency distribution not only presents an organized picture of how individual scores are distributed on a measurement scale, but also reveals extreme values and outliers which may affect the analysis. Researchers can make decisions on whether and how to recode or perform data transformation based on the distribution statistics.
Question 2. Is it a good idea to get frequency tables for all variables in your analysis, and print them out for reference?
Answer: In general, it is a good idea to check the frequency distribution for all variables before analysis. However, you may want your frequency distributions to be structured as either tables or graphs. Because NHANES data have very large sample sizes with a potentially long list of different values for continuous variables, it is recommended that you use a graphic format to check the distribution for continuous variables, and either frequency tables or graphic forms for nominal or interval variables. If you request frequency tables for all variables, you should always examine the length of your printout before you press the "print" button, as there may be hundreds of pages involved.
Question 3. If the statistics for normality turn out to be significant in my analysis, does that mean I cannot use parametric tests any more?
Answer: Not necessarily. Statistics of normality do reveal whether a data distribution is normal or not, and help determine whether parametric or non-parametric tests should be used, or whether data transformation is needed. However, since NHANES is a large, representative sample of the U.S. population, most continuous variables from this sample are expected to be normally distributed. If you conduct only tests for normality, results on most variables would be significant, i.e., even the slightest deviation from normality could result in rejecting the null hypothesis because of the extremely large sample sizes. Therefore, you should not rely solely on these tests when making your decision.
A Q-Q plot, or quantile-quantile plot, may offer additional information. A Q-Q plot is a graphical data analysis technique for assessing whether data follow a particular distribution. In a Q-Q plot, the quantiles of the variable in question are plotted against the quantiles of a normal distribution. The variable of interest is approximately normally distributed if the plotted points fall along a straight diagonal (45-degree) reference line. Based on your tests of normality and the Q-Q plot, you can make a more informed decision about parametric or non-parametric tests.
Question 4. What do you use percentiles for?
Answer: Compared with raw scores, percentiles provide additional information about the distribution of values. For example, if you are told that a boy is 27 inches tall and weighs 30 pounds, information such as the average height and weight for his age group, or the number of boys who score above or below this boy in his group would be very helpful. It is much more informative if you could transform the height and weight of the boy into percentile rank, such as 75th percentile in height, and 50th percentile in weight for his age group.
Question 5. Can you generate percentiles with SAS Survey Procedures?
Answer: No, not with the current versions of SAS.
Question 6. When should you use geometric means instead of arithmetic means?
Answer: In instances where the data are highly skewed, geometric means should be used. A geometric mean, unlike an arithmetic mean, minimizes the effect of very high or low values, which could bias the mean if a straight average (arithmetic mean) were calculated. The geometric mean is obtained by back-transforming the mean of the log-transformed data. Equivalently, it is the Nth root of the product of N numbers.
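The relationship between the log transformation and the geometric mean can be shown with a short, unweighted sketch. The function name and the values are made up for illustration. An actual NHANES analysis would need sample weights and design-based methods.

```python
import math


def geometric_mean(values):
    """Unweighted geometric mean: exponentiate the mean of the natural logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up skewed data: one very high value pulls the arithmetic mean upward.
data = [1.0, 10.0, 100.0]
arithmetic = sum(data) / len(data)  # 37.0
geometric = geometric_mean(data)    # cube root of 1 * 10 * 100, about 10.0
```

Here the geometric mean stays near the middle value, while the arithmetic mean is dominated by the largest observation.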
Question 7. In the Descriptive Statistics module you demonstrated how to calculate prevalence for hypertension. But the definition you used in this tutorial is different from the one I usually use. Why is that?
Answer: Definitions of many conditions or risk factors, such as hypertension, diabetes, osteoporosis, or obesity have changed over time. In addition, definitions also vary by different health or medical organizations' guidelines. Over the years, publications using historic or current NHANES have reflected these changes in definitions.
In this tutorial, all the definitions used are for illustration purposes only, rather than definitive guidelines. For the appropriate or most recent definition of medical conditions, please consult official publications from corresponding medical or public health agencies.
Question 1. Can we use the Student's t-test for NHANES data?
Answer: Yes. The Student's t-test assumes that the data have a normal distribution, and that the covariance is small. NHANES data meet both assumptions on most occasions, provided that you do not divide the data into very small subdomains. Therefore, the t-test is frequently used for NHANES data to detect differences in health outcomes or risk factors between subpopulations.
Question 2. How should I handle the degrees of freedom when conducting hypothesis testing with NHANES data?
Answer: Unlike with simple random samples, you cannot simply use n-1 as the degrees of freedom in NHANES since it is a multi-stage, area probability sample. So the number of independent pieces of information, or degrees of freedom, depends upon the number of PSUs rather than on the number of sample persons. Therefore, the degrees of freedom are calculated as the number of first stage units (PSUs) containing observations minus the number of strata (please see Sample Design module for more information).
Question 3. When I calculate confidence intervals for a point estimate in NHANES, should I use the t score or the Z score in the formula?
Answer: You should use a t-statistic with degrees of freedom equal to the difference between the number of PSUs and the number of strata containing observations.
Question 4. Do I have to use weights and design based methods when calculating confidence intervals?
Answer: Yes. Sample weights must be incorporated in calculating the estimate and its standard error, and design-based methods must be used to estimate the standard error. Taylor Series Linearization is one example of a design-based method. The design variables needed to obtain estimates of standard errors through this method are provided on the demographic files for the continuous NHANES.
Question 5. Can I get confidence intervals for highly skewed variables?
Answer: If you have highly skewed variables, transformations are recommended before constructing the confidence intervals. One of the most common transformations used in the literature is the natural logarithm (log base e). We recommend that users verify that the transformed variable is normally distributed before proceeding to construct confidence intervals, because the log transformation does not always yield a normally distributed random variable. Furthermore, in instances in which 0 is a plausible value, the log is undefined. We recommend that users try other transformations, for example the square root, in these instances.
Question 6. Can I obtain geometric means and their confidence intervals using SAS proc surveymeans?
Answer: At the present time, SAS proc surveymeans does not have an option to produce geometric means and their standard errors. However, they can be obtained by running proc surveymeans on the log transformed variables to produce means and standard errors of the log transformed variable, constructing the confidence interval on the log-transformed scale, and then back transforming the endpoints.
In our tutorial, we demonstrated how you can obtain the geometric mean and its standard error directly from SUDAAN proc descript. If you have both software packages, you can then output the results to a SAS dataset where the confidence interval can be constructed directly.
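The back-transformation approach described for proc surveymeans can be sketched as below. This assumes the mean and standard error of the log-transformed variable have already been produced by design-based survey software. The function name, the input values, and the critical value of 2.0 are placeholders for illustration.

```python
import math


def geometric_mean_ci(mean_log, se_log, t_crit):
    """Geometric mean and confidence limits from log-scale survey estimates.

    mean_log -- design-based mean of the natural-log-transformed variable
    se_log   -- design-based standard error of that mean
    t_crit   -- t critical value for the appropriate degrees of freedom
    """
    # Build the interval on the log scale, then exponentiate each endpoint.
    lower = math.exp(mean_log - t_crit * se_log)
    upper = math.exp(mean_log + t_crit * se_log)
    return math.exp(mean_log), lower, upper


# Made-up log-scale estimates.
gm, gm_lower, gm_upper = geometric_mean_ci(math.log(50.0), 0.05, 2.0)
```

Because the endpoints are exponentiated, the resulting interval is not symmetric around the geometric mean on the original scale.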
Question 7. What procedures would you recommend for chi square testing?
Answer: For a complex sample like NHANES, we recommend that you use SAS proc surveyfreq (CHISQ, based on the Rao-Scott chi-square with an adjusted F statistic). This takes the survey design into account, with degrees of freedom equal to the number of PSUs minus the number of strata containing observations.
If you use SUDAAN, this test can be run with proc crosstab in SUDAAN version 9.0. It provides limited chi-square statistics based on the Wald chi-square but does not provide an F-adjusted p-value. However, SUDAAN regression models do provide F-adjusted chi-square statistics, which are recommended for analyzing NHANES data.
The Cochran-Mantel-Haenszel test, an extension of the Pearson chi-square test, can be applied to stratified two-way tables to test for homogeneity or independence in a non-survey setting. For a complex sample, its analogue can be obtained in SUDAAN proc crosstab (cmh).
Question 1. When do I have to use age standardization?
Answer: Age standardization, or age adjustment, is used when comparing two or more populations at one point in time, or one population at two or more points in time. In other words, age-adjusted rates make two groups that differ in their age distribution more comparable. This method is particularly relevant when the populations being compared have different age structures, as is true, for example, of the U.S. white and Hispanic populations. In addition to being associated with population structure, age is also frequently associated with many health outcomes and their risk factors. Therefore, age standardization becomes a necessary method to control for the confounding effects of age.
Question 2. Are age-adjusted rates usually different from unadjusted rates in NHANES data?
Answer: That depends mainly on two factors: 1) whether the two subgroups being compared are very different in age distribution; and 2) whether the health outcomes or risk factors being compared are associated with age. If the answer to both is yes, you will usually find the age-adjusted rates considerably different from the crude rates. This suggests that age standardization is necessary. Nevertheless, it is generally good practice to use age-adjusted estimates when comparing health outcomes among subgroups, or at least to compare the age-adjusted estimates with the crude rates to make sure there are no substantial differences before using the crude estimates.
Question 3. There are different methods for age-standardization. Which do you recommend for NHANES data?
Answer: For NHANES analysis, we usually adopt the direct method for age standardization. This involves three steps:
- selecting a standard population, typically a US Census population
- calculating age-standardizing proportions for age categories of interest
- applying the adjustment factors to subpopulations under comparison
For continuous NHANES, we recommend using the 2000 Census population. A spreadsheet with the year 2000 U.S. population structure by age is included in the tutorial for your convenience.
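The three steps of the direct method can be sketched in Python. The function name, the age groups, the standard population counts, and the age-specific rates below are all made up for illustration. They are not the actual 2000 Census figures, which should be taken from the spreadsheet mentioned above.

```python
def age_standardized_rate(age_specific_rates, standard_population):
    """Direct age standardization.

    age_specific_rates  -- dict mapping age group -> rate in the study subgroup
    standard_population -- dict mapping age group -> count in the standard population
    """
    if set(age_specific_rates) != set(standard_population):
        raise ValueError("age groups must match between rates and standard")
    total = float(sum(standard_population.values()))
    # Weight each age-specific rate by that age group's share of the standard.
    return sum(
        age_specific_rates[group] * standard_population[group] / total
        for group in standard_population
    )


# Hypothetical standard population counts and subgroup rates.
standard = {"20-39": 40, "40-59": 35, "60+": 25}
rates = {"20-39": 0.10, "40-59": 0.30, "60+": 0.60}
adjusted = age_standardized_rate(rates, standard)  # 0.04 + 0.105 + 0.15 = 0.295
```

Applying the same standard population to each subgroup under comparison removes differences that are due only to age structure.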
Question 4. When do you recommend the use of population estimates?
Answer: We most frequently use prevalence rates to describe health outcomes or risk factors. These rates describe occurrences ranging from rare to common in a population at a given time, but it is hard to see the impact or magnitude of the issue just by looking at the prevalence rate. Population estimates allow researchers to look at the estimated total number of persons in the U.S. affected by a given condition, and thus to better describe the public health impact of an outcome or risk factor.
Question 5. How do you calculate population estimates for NHANES data?
Answer: Population estimates are calculated in three steps:
- Calculate the crude prevalence (as a percentage) for the age-, sex-, or race/ethnicity subgroups you are interested in reporting. Then output these results to a SAS file.
- Use population totals from the Current Population Survey (CPS) to determine population estimates in NHANES.
- Multiply the prevalence of the health condition of interest by the corresponding CPS-based population total to obtain an estimate of the number of non-institutionalized U.S. individuals with the condition.
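The final multiplication step can be illustrated with a short sketch. The function name and the numbers are hypothetical. Real CPS-based totals should come from the NCHS tables.

```python
def population_estimate(prevalence_percent, cps_total):
    """Estimated number of persons with a condition in one subgroup.

    prevalence_percent -- crude prevalence for the subgroup, as a percentage
    cps_total          -- CPS-based population total for the same subgroup
    """
    if not 0.0 <= prevalence_percent <= 100.0:
        raise ValueError("prevalence must be a percentage between 0 and 100")
    return prevalence_percent / 100.0 * cps_total


# Made-up example: 25% prevalence in a subgroup of 1,000,000 persons.
affected = population_estimate(25.0, 1000000)  # 250,000 persons
```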
Question 6. Where can we obtain CPS totals for continuous NHANES data?
Answer: CPS-based population tables for NHANES by race/ethnicity, gender and age are located on the respective survey cycle NHANES web page at: http://www.cdc.gov/nchs/nhanes/nhanes_cps_totals.htm and as SAS data files located on the Download Sample Code and Dataset page of our tutorial.
Question 7. Can you combine population totals across survey cycles, or for multiple age and gender or race/ethnic subgroups?
Answer: Yes. It is possible to combine NHANES survey cycles. For example, to combine two survey cycles (e.g., 2001-2002 and 2003-2004), you must use the midpoint of each cycle, and combine them as follows: ½ (NHANES 2001-2002 population totals) + ½ (NHANES 2003-2004 population totals) in order to get a population total for 2001-2004. Similarly, you would do this for each of the age-, sex-, or race/ethnicity groups you wanted to combine to get a population total for that group.
The only exception would be when combining NHANES 1999-2000 with 2001-2002 data. As stated in the weighting module, these survey years used a different reference population for sampling, so population totals for 1999-2002 are provided by NCHS.
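For cycles other than the 1999-2000 and 2001-2002 pair, the averaging described above can be sketched for several subgroups at once. The function name, subgroup labels, and totals are hypothetical placeholders.

```python
def combine_two_cycle_totals(totals_a, totals_b):
    """Average CPS-based population totals from two survey cycles, by subgroup.

    totals_a, totals_b -- dicts mapping subgroup label -> population total
    """
    if set(totals_a) != set(totals_b):
        raise ValueError("both cycles must cover the same subgroups")
    # One half of each cycle's total, as in the formula given in the answer.
    return {
        group: 0.5 * totals_a[group] + 0.5 * totals_b[group]
        for group in totals_a
    }


# Made-up totals for two cycles (e.g., 2001-2002 and 2003-2004).
cycle_1 = {"men 20-39": 100.0, "women 20-39": 120.0}
cycle_2 = {"men 20-39": 110.0, "women 20-39": 130.0}
combined = combine_two_cycle_totals(cycle_1, cycle_2)
```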
Question 8. Why can't you just sum the final sampling weights for the population totals?
Answer: Since the non-institutionalized CPS population totals are used to calculate the final sampling weights for the NHANES survey, you may wonder why you cannot just sum the final sampling weights for all sample persons with the health condition of interest, in order to arrive at population estimates for the health condition. For example, the total population estimate for a given health condition from the interviewed sample should equal the sum of the final interview weights for that health condition within the demographic domains among all interviewed persons. However, if there are a significant number of exclusions or missing data for a health condition, summing the weights will not produce an accurate population estimate. Therefore, using this method is NOT RECOMMENDED.
Question 1. When do you use linear regression for NHANES data?
Answer: A linear regression model is typically used to assess the association between independent variable(s) (Xi) and a continuous dependent variable (Y). In cross-sectional surveys such as NHANES, linear regression analyses can be used to examine associations between covariates and health outcomes. For instance, you can use a multiple linear regression model to assess the association between high density lipoprotein cholesterol (Y) and selected covariates (Xi) such as race/ethnicity, age, sex, body mass index (BMI), smoking status, and education level.
Question 2. Which test statistics would you recommend for regression analysis of NHANES data: Wald F, Satterthwaite adjusted F, or Satterthwaite adjusted chi-square?
Answer: For regression analyses, SUDAAN produces the Wald F, Satterthwaite adjusted F, and Satterthwaite adjusted chi-square statistics with their corresponding p-values. SAS Survey Procedures produce only the Wald F test with its corresponding p-value.
The current NHANES Analytic Guidelines do not make a recommendation about which test statistic is the "best." Generally speaking, the Satterthwaite adjusted F is the most conservative of the three statistics, meaning that it rejects the null hypothesis less often than the other two. However, we encourage analysts to examine all three statistics and the corresponding p-values for consistency. We also encourage analysts to compare the nominal degrees of freedom (i.e., the number of PSUs minus the number of strata containing observations) to the adjusted Satterthwaite degrees of freedom. Nominal degrees of freedom that are much larger than the adjusted Satterthwaite degrees of freedom may indicate model instability.
Question 3. How do you specify a multiple regression model in SUDAAN with both continuous and discrete independent variables plus interaction terms?
Answer: In SUDAAN, the association between the dependent and independent variables is expressed using the model statement in the proc regress procedure. The dependent variable must be a continuous variable and will always appear on the left hand side of the equation. The variables on the right hand side of the equation are the independent variables and may be discrete, continuous or both. Continuous variables are simply listed in the model. Discrete variables are specified using a subgroup or a class statement.
When interactions are included in the model, they are denoted with an asterisk, *, between the two variables. An interaction can occur between a discrete and a continuous variable, or between two discrete variables. An interaction term will always appear on the right hand side of an equation.
Question 4. Can you do multiple regression analysis in SAS?
Answer: You can conduct multiple regression analysis on NHANES using SAS Survey Procedures. However, you need to be mindful that version 9.1 of SAS Survey Procedures does not have a domain statement for subpopulation analyses. Therefore, you have to use a macro provided on the SAS website. You need to download that file, save it to your computer, and make sure to note the location, as you will use SAS code to refer to this file later.
In SAS version 9.2 and later versions, a domain statement will be added to proc surveyreg, so you will no longer need to use the SAS macro to deal with subpopulation analyses.
The model statement in SAS is very similar to the SUDAAN procedure: the dependent variable Y is continuous, and always appears on the left hand side of the equation. The variables on the right hand side of the equation are the independent variables and may be discrete or continuous. Interactions always appear on the right hand side of an equation, and are denoted with an asterisk, *, between the two variables.
Question 5. How do you select a reference category in a regression analysis?
Answer: In SUDAAN, the default reference category for a discrete variable is set to the last category. However, you can use the reflevel statement to change the reference level of a categorical variable to your desired category.
In SAS, the default reference category is the highest level of a discrete variable, and there is no option to change the reference category in the model. Therefore, you will need to recode the desired reference category as the highest level before specifying the model.
Question 1. What statistical software can I use for logistic regression analysis?
Answer: You can run logistic regression with stand-alone SUDAAN, SAS-callable SUDAAN, SAS Survey procedures, or Stata.
Question 2. Are these software packages very similar in programming languages?
Answer: Not really. For instance, each SAS or SUDAAN version has its own unique commands for executing logistic regression analysis. You need to use the correct command for the software that you are using. Please also note that different versions of SAS and SUDAAN use slightly different statements to specify categorical variables and reference groups. Make sure that you are using the correct commands for the version of software on your computer.
This tutorial module usually used SAS 9.1 and SUDAAN 9.0, and the commands are slightly different for SAS and SUDAAN. For example:
- in the stand-alone version of SUDAAN, the procedure is logistic
- in SAS-callable SUDAAN, the procedure is called rlogist
- in SAS Survey procedures, the procedure is surveylogistic
Question 3. How do you select weights for logistic regression models?
Answer: You always use the weight of the smallest common denominator for all variables in the model. For instance, if you have both household interview variables and MEC examination variables in the model, you will choose MEC examination weights, since not all respondents who were interviewed have participated in the MEC exam. Therefore, it is always important to check all the variables in the model, and identify the variable(s) with the smallest sample size.
Question 4. How do you code the dependent variable for event and non-event?
Answer: For SUDAAN and SAS, you usually code the dependent variable as 1 for an event, and 0 for a non-event.
Question 5. When I run both the SAS Survey and SUDAAN programs for the same logistic regression model, why do I sometimes get different results?
Answer: This is because there may be slight differences caused by missing data in any paired PSU or how each software program handles degrees of freedom. Specifically:
- The variance estimates and standard errors are identical if there are no missing data in any paired PSUs. They will be different if any one of the paired PSUs contains missing data, as SAS and SUDAAN handle stratum contribution from the missing cells differently.
- The confidence intervals are slightly different since SAS and SUDAAN handle degrees of freedom differently.