The frequency distribution can be presented in table or graphic format. In this task, you will learn how to use the standard Stata commands - summarize, histogram, graph box, and tabstat - to generate these representations of data distributions. The resulting statistics and graphs can also help you decide whether parametric tests (which assume a normal distribution) or non-parametric tests are appropriate for your analysis. As noted in the Clean & Recode Data module, it is advisable to check for extreme weights and outliers before starting any analysis.
There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.
The Stata command summarize generates descriptive and summary statistics that are useful in describing the characteristics of a distribution. Because the SVY series of commands does not include a summarize command, you will need to use the standard summarize command and tell Stata to incorporate weights. Below are instructions on how to write these commands and interpret the output.
This command has the general structure:
summarize varname [w=weightvar], detail
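Without the detail option, the output is limited to the number of observations, mean, standard deviation, minimum, and maximum. With the detail option, Stata also reports percentiles, variance, skewness, and kurtosis.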
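As a rough illustration of the difference, the sketch below runs summarize with and without detail. It uses a tiny made-up dataset in place of the NHANES file, so the values of lbxtc and wtmec4yr here are placeholders for illustration only, not real data.

```stata
* Toy data standing in for the NHANES variables (values are made up)
clear
input lbxtc wtmec4yr
180 1000
200 2000
220 1000
end

* Basic output; with [w = ...] summarize assumes analytic weights
summarize lbxtc [w = wtmec4yr]

* detail adds percentiles, variance, skewness, and kurtosis
summarize lbxtc [w = wtmec4yr], detail

* Results of the last summarize are stored in r() and can be reused
display "Weighted mean = " r(mean) ", median = " r(p50)
```

Typing return list after summarize shows everything that is stored in r().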
You can generate summary statistics for various population subsets (e.g. young men, young women, etc). The example below adds the by varname: prefix to the previous example to create this general format.
by var1 var2, sort: sum varname [w=weightvar] if (condition), detail
Here is the command to generate the summary statistics for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if statement, which restricts the analysis to participants aged 20 years and older with a non-missing age (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2).
by riagendr age, sort : sum lbxtc [w = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
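This section does not show how the three-category age variable is built. One possible way to create such a variable is sketched below with recode. The cut points (20-39, 40-59, 60 and older) and the toy ages are assumptions made for illustration; use the categories defined in your own recoding step.

```stata
* Toy ages standing in for ridageyr (values are made up)
clear
input ridageyr
18
25
45
70
.
end

* Hypothetical cut points: 20-39, 40-59, 60 and older.
* Ages under 20 and missing ages are set to missing.
recode ridageyr (20/39 = 1 "20-39") (40/59 = 2 "40-59") ///
    (60/max = 3 "60+") (else = .), generate(age)

* Check the result, including the missing category
tabulate age, missing
```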
Stata treats missing numeric values (".") as larger than any non-missing number, so a missing value would satisfy a condition such as ridageyr >=20. This is why the command above also includes ridageyr <. to exclude missing ages. Please see the Stata Tips page for more information.
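The effect of missing values on an if condition can be seen with a small sketch. The three ages below are made up for illustration; only the variable name ridageyr comes from the example above.

```stata
* Toy data: one age under 20, one age over 20, one missing age
clear
input ridageyr
18
25
.
end

* Counts 2 observations: age 25 and the missing value
count if ridageyr >= 20

* Counts 1 observation: the missing value is excluded
count if ridageyr >= 20 & ridageyr < .
```

The condition !missing(ridageyr) is an equivalent way to exclude missing values.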
Reviewing the output, notice that
To generate graphs of the distributions of a continuous variable, use the histogram and graph box commands.
In this example, the general structure of the histogram command is:
histogram varname if (condition), by(var1 var2) [options]
In this example, the general structure of the graph box command, including the medtype() option to specify how the median is indicated and the over() option to specify different subgroups, is:
graph box varname [w=weightvar] if (condition), medtype(line) over(var1) over(var2)
The commands to generate histograms and box plots for six population subsets defined by gender (riagendr) and three age categories (age) are below. The commands also include if statements, which restrict the analysis to participants aged 20 years and older with a non-missing age (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). In addition, the histogram command uses the normal option to add a normal density to the graph. Note that histogram accepts only frequency weights, so the histogram here is unweighted, while graph box uses the sample weight.
histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(riagendr age) normal
graph box lbxtc [pweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line) over(riagendr) over(age)
Reviewing the output of these commands, notice that:
In some instances, you may not need all of the statistics generated by summarize. You can use the tabstat command as a useful alternative to summarize because it allows specification of the statistics to be displayed.
The general structure of the tabstat command is very similar to that of the summarize command, except that you list the statistics you want in the statistics() option. The tabstat command also arranges the output in a table.
tabstat varname [w=weightvar], statistics(statname)
Here is the same cholesterol (lbxtc) analysis for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if statement, which restricts the analysis to participants aged 20 years and older with a non-missing age (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). The output now reports only the mean, 25th percentile (p25), median, and 75th percentile (p75).
by riagendr, sort: tabstat lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age) stat(mean p25 median p75)
Note that there are two tables, one for each gender, each with three age categories, and that only the statistics requested in the statistics option are displayed.
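The layout of the tabstat table can be adjusted further. The sketch below is one possible variation, run on a small made-up dataset in place of the NHANES file (all values are placeholders for illustration). It requests a different set of statistics, places each statistic in its own column, sets the display format, and suppresses the overall total row.

```stata
* Toy data standing in for the NHANES variables (values are made up)
clear
input riagendr age lbxtc wtmec4yr
1 1 180 1000
1 1 200 1000
1 2 210 2000
2 1 190 1500
2 2 220 1500
end

* One row per age category, one column per statistic
tabstat lbxtc [w = wtmec4yr], by(age) statistics(n mean sd median) ///
    columns(statistics) format(%9.1f) nototal
```

Adding the by riagendr, sort: prefix to this command would again produce one table per gender.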