Key Concepts About Merging Data in NHANES I

NHANES I data files were released for public use as they were completed. The basic data files are organized by topic area, and each falls under one of 20 components. Combining data files from different components into a single dataset is called merging.



Please be aware that some of these files include data only for a detailed subsample or an augmented population, so take care to choose the appropriate files to merge for your analyses.


The first step in merging data is to sort each of the data files by a unique identifier. In NHANES I data, this unique identifier is the sequence number (SEQN). NHANES I uses SEQN to identify each sample person, so SEQN is the variable you must use to merge data files. Sorting every data file by SEQN ensures that observations are ordered the same way in each file. Use the SAS sort procedure (proc sort) to sort the data. Once the data files are sorted, you can merge them.
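The sort-then-merge sequence can be sketched in SAS as follows. This is a minimal illustration, not the tutorial's own program: the dataset names demo, exam, and merged are hypothetical stand-ins for two NHANES I files and their combined result. Only the SEQN variable comes from the text above.

```sas
/* Hypothetical dataset names: demo and exam stand in for two NHANES I files. */

/* Sort each file by the unique identifier SEQN. */
proc sort data=demo;
  by seqn;
run;

proc sort data=exam;
  by seqn;
run;

/* Match-merge the sorted files on SEQN. */
data merged;
  merge demo exam;
  by seqn;
run;
```

A merge with a by statement requires every input dataset to be sorted by the by variable, which is why both proc sort steps come first.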

After you have merged the data files, check the contents again to make sure that the files merged correctly. Use the proc contents procedure to list all variable names and labels; use the proc means procedure to check the number of observations for each variable, as well as the number of missing values and the minimum and maximum values.
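As a rough sketch of these checks, assuming the merged file is a dataset named merged (a hypothetical name):

```sas
/* List variable names and labels; varnum prints them in creation order. */
proc contents data=merged varnum;
run;

/* For every numeric variable: non-missing count, missing count, minimum, maximum. */
proc means data=merged n nmiss min max;
run;
```

Comparing the n and nmiss columns against the number of sample persons you expect in each source file is a quick way to spot a merge that dropped or duplicated records.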


WARNING

The master dataset prepared in the Keep & Merge Data Module contains all sample persons who have completed the household interview. In other words, the master dataset includes observations on participants who were interviewed only (but not MEC examined), plus those both interviewed and examined.

The reasons for including all sample persons are listed below:

For SUDAAN procedures, do not create a smaller subgroup in the SAS data step based on groups of interest that are unrelated to the sample weights (e.g., demographic, laboratory, or examination variables) before executing the SUDAAN procedure. Instead, create the subset of your sample population with the subpopn statement inside the SUDAAN procedure itself. SUDAAN procedures require that all observations in the dataset being read into a procedure have the same sample weight variable.
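A subpopn-based analysis in SAS-callable SUDAAN might look like the sketch below. All dataset and variable names (merged, stra, psu, examwt, sex, age, sbp) are hypothetical placeholders, not actual NHANES I variable names, and design=wr is an assumption about the variance estimation method; consult the NHANES I documentation for the real design variables.

```sas
/* All variable names below are hypothetical placeholders. */

/* SUDAAN expects the input file to be sorted by the nest variables. */
proc sort data=merged;
  by stra psu;
run;

proc descript data=merged filetype=sas design=wr;
  nest stra psu;               /* stratum and PSU variables               */
  weight examwt;               /* sample weight                           */
  subpopn sex=1 and age>=25;   /* subgroup chosen here, not in data step  */
  var sbp;                     /* analysis variable                       */
run;
```

The full dataset is read into the procedure, and the subpopn statement restricts estimation to the subgroup while the complete design information remains available for variance estimation.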

For SAS Survey Procedures, there is no subpopn statement. Instead, most SAS Survey procedures use a domain statement for domain analysis, also known as subgroup analysis or subpopulation analysis.
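A domain analysis can be sketched as follows. The domain statement takes a classification variable rather than a condition, so the sketch first flags the subgroup in a data step without deleting any observations. All names (merged, merged_flag, male25, stra, psu, examwt, sex, age, sbp) are hypothetical placeholders.

```sas
/* All variable names below are hypothetical placeholders. */

/* Flag the subgroup of interest; every observation is kept. */
data merged_flag;
  set merged;
  if sex = 1 and age >= 25 then male25 = 1;
  else male25 = 0;
run;

proc surveymeans data=merged_flag mean stderr;
  strata stra;      /* stratum variable                        */
  cluster psu;      /* PSU variable                            */
  weight examwt;    /* sample weight                           */
  domain male25;    /* estimates are reported for each level   */
  var sbp;          /* analysis variable                       */
run;
```

The output includes estimates for both levels of male25; the row for male25 = 1 is the subgroup of interest.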

One important reason not to pre-select your study population in the SAS data step (e.g., males aged 25 and over) is that, for software such as SUDAAN, you will have over-estimated the variance associated with any statistical tests you calculate because you will not have taken into account the full sample size of the survey. This may yield p-values that are greater than they should be.

It is worth pointing out that some analysts select the study population in the SAS data step and save a SAS data file containing only the observations that meet the selection criteria for the study, for instance, only those who completed MEC exams. Besides the reasons stated above, a disadvantage of pre-selection is that you can no longer examine the household interview items using interview weights, such as looking at interview questions on blood pressure by demographic variables. You will also be unable to examine non-response rates for the MEC items (i.e., the rate of those who were interviewed but not examined).

For these reasons, in this tutorial example you were instructed to include all sample persons in the dataset at the data step.

After merging the data, you will need to check the results. You should see that all your variables of interest were included and that any variables you renamed or recoded are correct and include all the years of data.


