The frequency distribution can be presented in table or graphic format.
In this task, you will learn how to use the standard Stata
commands *summarize*,
*histogram*, *graph box*, and *tabstat* to generate
these representations of data distributions. These statistics can also help you determine whether parametric tests (which assume a normal
distribution) or non-parametric tests are appropriate for your analysis.
As noted in the Clean & Recode Data module, it is advisable to check for extreme
weights and outliers before starting any analysis.

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

The Stata command *summarize* generates descriptive and
summary statistics that are useful in describing the
characteristics of a distribution. Because the SVY series of commands does not include a
*summarize* command,
you will need to use the standard *summarize* command and tell Stata
to incorporate weights. Below are instructions on how to write
these commands and interpret the output.

This command has the general structure:

__sum__marize varname [w=weightvar], detail

IMPORTANT NOTE

Without the *detail* option, you get only *obs*, *mean*, *std. dev.*, *minimum*, and *maximum*.
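As a minimal sketch of the difference, the commands below run *summarize* with and without *detail*. They assume the `auto` dataset that ships with Stata as a stand-in for NHANES, and they use its `weight` variable (vehicle weight) purely as an illustrative analytic weight; neither is part of the NHANES example.

```stata
* Stand-in data: auto ships with Stata; it is not NHANES
sysuse auto, clear

* Without detail: observations, sum of weights, mean, std. dev., min, max
summarize price [aweight = weight]

* With detail: adds percentiles, variance, skewness, and kurtosis
summarize price [aweight = weight], detail

* The most recent results are stored in r() for later use
display "weighted mean = " r(mean) "   median = " r(p50)
```

With real NHANES data you would replace `price` and `weight` with your analysis variable and the appropriate sample weight.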

You can generate summary statistics for various population subsets (e.g.,
young men, young women, etc.). The example below adds the *by varname:*
prefix to the previous example to create this general format.

by var1 var2, sort: sum varname [w=weightvar] if (condition), detail

Here is the command to
generate the summary statistics for six population subsets defined by gender (*riagendr*)
and three age categories (*age*). The command also includes an *if* statement, which further restricts the analysis to people aged 20 years and older
(*ridageyr >=20 & ridageyr <.*) who
have been both interviewed and examined (*ridstatr==2*).

by riagendr age, sort : sum lbxtc [w = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
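The same pattern can be tried on any dataset. The sketch below assumes the `auto` dataset shipped with Stata as a stand-in, with `foreign` playing the role of the grouping variable and `weight` used only as an illustrative weight; these variable names are not from NHANES.

```stata
* Stand-in data: auto ships with Stata; it is not NHANES
sysuse auto, clear

* One detailed summary per value of foreign, restricted by an if condition.
* The sort option sorts the data by foreign before the by-groups are formed.
by foreign, sort: summarize price [aweight = weight] if mpg >= 20 & mpg < ., detail
```

Listing a second variable after `by` would produce one block of output for every combination of the two variables, as in the NHANES command above.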

IMPORTANT NOTE

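Stata treats missing numeric values (".") as larger than any nonmissing number, so a missing value is always the highest value. This is why the condition above includes *ridageyr <.*: without it, *ridageyr >=20* would also select records with missing age. Please see the Stata Tips page for more information.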
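To see this behavior directly, here is a small sketch that again assumes the `auto` dataset shipped with Stata as a stand-in; its `rep78` variable has some missing values.

```stata
* Stand-in data: auto ships with Stata; it is not NHANES
sysuse auto, clear

* How many rows have a missing repair record?
count if missing(rep78)

* Missing sorts above every number, so this count includes the missing rows
count if rep78 >= 4

* Adding "& rep78 < ." keeps only the nonmissing values
count if rep78 >= 4 & rep78 < .
```

The second count should exceed the third by exactly the number of missing rows reported by the first.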

Reviewing the output, notice that

- The output is arranged by gender and age group so you can see the results for each combination.
- The standard deviation is a measure of the deviation of the observations from the mean.
- Kurtosis is a measure of the peakedness of the distribution. In Stata, the kurtosis of a normally distributed random variable is 3. A kurtosis greater than 3, as in this example, indicates excess values close to the mean and at the tails of the distribution.
- Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.
- The output also contains the four lowest and highest values, which are useful for review.
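If you want to work with these values rather than read them off the output, *summarize* with *detail* leaves them in `r()`. The sketch below assumes the `auto` dataset shipped with Stata as a stand-in for NHANES.

```stata
* Stand-in data: auto ships with Stata; it is not NHANES
sysuse auto, clear

* Run summarize silently, then read the stored results
quietly summarize price, detail
display "skewness        = " %6.3f r(skewness)
display "kurtosis        = " %6.3f r(kurtosis)

* Stata reports kurtosis on the scale where a normal distribution is 3,
* so subtracting 3 gives the excess kurtosis
display "excess kurtosis = " %6.3f (r(kurtosis) - 3)
```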

To generate graphs of the distribution of a continuous variable, use the
*histogram* and *graph box* commands.

In this example, the general structure of the *histogram* command is:

__hist__ogram varname if (condition), by(var1 var2) [options]

In this example, the general structure of the *graph box* command, including the *medtype()*
option to specify how the median is indicated and the *over()* option
to specify different subgroups, is:

__gra__ph box varname [w=weightvar] if (condition), medtype(line) over(var1) over(var2)

The commands to generate histograms and box plots for six population subsets
defined by gender (*riagendr*)
and three age categories (*age*) are below. The commands also include
*if* statements, which further restrict the analysis to people aged 20 years and older
(*ridageyr >=20 & ridageyr <.*) who
have been both interviewed and examined (*ridstatr==2*). In
addition, the histogram command uses the *normal* option to add a normal
density to the graph.

histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(riagendr age) normal

graph box lbxtc [pweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line) over(riagendr) over(age)

Reviewing the output of these commands, notice that:

- The histogram for a normally distributed
random variable is symmetric and bell-shaped.
**For variables based on data collected in a survey, such as NHANES 1999-2002, the distribution will deviate at least slightly from normality.**
- The box plot of the weighted total cholesterol data shows three outliers with values above 600 mg/dL.
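For readers without the NHANES files at hand, the two graph commands can be tried with a minimal sketch like the one below. It assumes the `auto` dataset shipped with Stata as a stand-in, and the graph names `hist1` and `box1` are arbitrary labels chosen for this illustration.

```stata
* Stand-in data: auto ships with Stata; it is not NHANES
sysuse auto, clear

* Histogram per group, restricted with if, with a normal density overlaid
histogram price if mpg < ., by(foreign) normal name(hist1, replace)

* Box plot per group, with the median drawn as a line
graph box price if mpg < ., medtype(line) over(foreign) name(box1, replace)
```

Naming the graphs keeps both in memory, so the box plot does not replace the histogram.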

In some instances, you may not need all of the statistics
generated by *summarize*. You can use the *tabstat*
command as a useful alternative to *summarize* because it
allows specification of the statistics to be displayed.

The general structure for the *tabstat* command is
very similar to that of the *summarize* command, but you can specify the statistics
you want. The *tabstat* command also arranges the output in a table.

tabstat varname [w=weightvar], statistics(statname)

Here is the same cholesterol (*lbxtc*) analysis for six population subsets defined by
gender (*riagendr*)
and three age categories (*age*). The command again includes an *if* statement, which restricts the analysis to people aged 20 years and older
(*ridageyr >=20 & ridageyr <.*) who
have been both interviewed and examined (*ridstatr==2*). This time, the output
reports only the mean, 25th percentile (*p25*), median, and 75th percentile
(*p75*).

by riagendr, sort: tabstat lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age) stat(mean p25 median p75)

Note that there are two tables, one for each gender with three age
categories, and that only the
statistics requested by the *statistics* option are displayed.
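The following sketch shows the same layout on a small scale. It assumes the `auto` dataset shipped with Stata as a stand-in for NHANES, with `foreign` and `rep78` standing in for the gender and age-category variables.

```stata
* Stand-in data: auto ships with Stata; it is not NHANES
sysuse auto, clear

* One table per value of foreign, with one row per repair category
by foreign, sort: tabstat price if rep78 < ., by(rep78) statistics(mean p25 median p75)

* A single table instead: one row per group, one column per statistic
tabstat price, by(foreign) statistics(n mean p25 median p75) columns(statistics)
```

The *by* prefix splits the output into separate tables, while the *by()* option adds rows within a table; combining them gives the two-level breakdown used in the NHANES command.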