Task 2c: How to Merge NHANES Data in Stata

Here are the steps to appending NHANES data in Stata:


Step 1: Sort data files by SEQN

The first step in merging data is to sort each of the data files by a unique identifier.  In NHANES data, this unique identifier is known as the sequence number (SEQN). NHANES uses SEQN to identify each sample person, so SEQN is the variable you must use to merge data files. To ensure that all observations are ordered in the same way in each data file, you need to sort each data file by the SEQN variable.

Use the sort command to sort the data and the clear option to replace data already in memory. The save command with the replace option replaces the unsorted dataset with the sorted dataset. The general format of the command is below.

use filename, [clear]
sort varname
save filename, [replace]


Sort the 4-year blood pressure examination data by SEQN.

use C:\Nhanes\Data\bpx_4yr, clear
sort seqn
save C:\Nhanes\Data\bpx_4yr, replace


Sort the 4-year blood pressure questionnaire data by SEQN.

use C:\Nhanes\Data\bpq_4yr, clear
sort seqn
save C:\Nhanes\Data\bpq_4yr, replace


Sort the 4-year medical conditions questionnaire data by SEQN.

use C:\Nhanes\Data\mcq_4yr, clear
sort seqn
save C:\Nhanes\Data\mcq_4yr, replace


Sort the 4-year laboratory data by SEQN.

use C:\Nhanes\Data\lab13_4yr, clear
sort seqn
save C:\Nhanes\Data\lab13_4yr, replace


Sort the 4-year demographic data by SEQN.

use C:\Nhanes\Data\demo_4yr, clear
sort seqn
save C:\Nhanes\Data\demo_4yr, replace



Step 2: Merge data by SEQN

After sorting the data files, you can continue merging the data to the master file (demo_4yr) using the merge command. Remember that you will always merge on a unique identifier in this case, the SEQN variable.  The general form of the merge command is as follows:

Merge [varlist] using filename [, options]


use "c:\NHANES\Data\Demo_4yr", clear
merge seqn using C:\Nhanes\Data\bpx_4yr C:\Nhanes\Data\bpq_4yr C:\Nhanes\Data\mcq_4yr C:\Nhanes\Data\lab13_4yr
save C:\Nhanes\Data\demo_bp, replace



Note that the datasets in the using list could have been sorted by using the option unique sort at the end of the merge command. The option unique will match variables uniquely identifying observations on both data in memory and in filename. The option sort will sort the master and using datasets by match variables before the merge.


Step 3: Check the results

After you have merged the data files, it is advisable that you check the contents again to make sure that the files merged correctly.  Use the describe command to list all variable names and labels. Use the tabulate command to check how the merge went. The merge command, used in the previous step, will create new variables _merge, _merge1., depending on the each dataset listed in the using list of the merge command. The variable _merge takes on one of three values: _merge=1, the observation is present on the master dataset; _merge=2, the observation is present in one of the using datasets (but not the master dataset); _merge=3, the observation is present in at least two datasets, either master or using. Finally, use the tabstat command to check the number of observations for each variable as well as minimum, and maximum values in the merged dataset (demo_bp). Use the nmissing command to list the number of missing observations.



The nmissing command can be installed from http://www.stata-journal.com/software/sj5-4/dm67_3/.


tab _merge
tabstat seqn sddsrvyr ridstatr ridpreg sdmvpsu sdmvstra wtmec4yr ///
riagendr ridageyr ridreth1 bpx* bpq* mcq*, stat(n min max)
nmissing bpx* bpq* mcq*


Highlighted results of this program:



close window icon Close Window