## Task 1c: How to Check Frequency Distribution and Normality in Stata

A frequency distribution can be presented as a table or as a graph. In this task, you will learn how to use the standard Stata commands summarize, histogram, graph box, and tabstat to generate these representations of data distributions. The resulting statistics can also help you decide whether parametric tests (which assume a normal distribution) or non-parametric tests are appropriate for your analysis. As noted in the Clean & Recode Data module, it is advisable to check for extreme weights and outliers before starting any analysis.

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

### Step 1: Use the summarize command to generate weighted summary statistics for a population subset

The Stata command summarize generates descriptive and summary statistics that are useful in describing the characteristics of a distribution. Because the SVY series of commands does not include the summarize command, you will need to use the standard summarize command and tell Stata to incorporate weights. Below are instructions on how to write these commands and interpret the output.

This command has the general structure:

summarize varname [w=weightvar], detail

IMPORTANT NOTE

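Without the detail option, you get only the number of observations, mean, standard deviation, minimum, and maximum. The detail option adds percentiles, the variance, skewness, kurtosis, and the smallest and largest values.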

You can generate summary statistics for various population subsets (e.g. young men, young women, etc).  The example below adds the by varname: prefix to the previous example to create this general format.

by var1 var2, sort: sum varname [w=weightvar] if (condition), detail

Here is the command to generate the summary statistics for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if statement, which further restricts the analysis to people aged 20 years and older with a non-missing age (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2).

by riagendr age, sort : sum lbxtc [w = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail

IMPORTANT NOTE

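Stata represents missing numeric values (".") as large positive values, so a missing value is greater than any observed number. A condition such as ridageyr >=20 on its own would therefore also select observations with a missing age, which is why the example adds ridageyr <. to exclude them. Please see the Stata Tips page for more information.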
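
The effect of the missing-value rule can be checked directly with count. The lines below are a minimal sketch, assuming the NHANES analysis dataset used in this task is already in memory; they are not part of the original tutorial commands.

```stata
* Assumes the NHANES dataset from this task is in memory.
* Counts ages 20 and over, but also any observation with a missing age,
* because missing (.) is treated as larger than every number.
count if ridageyr >= 20

* Excludes missing ages explicitly. The two forms below select the
* same observations.
count if ridageyr >= 20 & ridageyr < .
count if ridageyr >= 20 & !missing(ridageyr)
```

If the first count is larger than the other two, the difference is the number of observations with a missing age.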

#### Portion of Output from Example Stata Statement

Reviewing the output, notice that

• The output is arranged by gender and age group so you can see the results for each combination.
• The standard deviation is a measure of the deviation of the observations from the mean.
• Kurtosis is a measure of the peakedness of the distribution. For Stata, the kurtosis of a normally distributed random variable is 3. A kurtosis greater than 3, as in this example, indicates excess values close to the mean and at the tails of the distribution.
• Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.
• The output also contains the four lowest and highest values, which are useful for review.
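
The skewness and kurtosis values discussed above can also be retrieved after the command runs, since summarize with the detail option leaves its results in r(). The following is a sketch for a single subgroup, assuming the NHANES dataset is in memory; the subgroup codes (riagendr==1 and age==1) are illustrative assumptions about how those variables are coded in your file.

```stata
* Assumes the NHANES dataset is in memory. The codes riagendr == 1 and
* age == 1 are illustrative; check your own recodes before relying on them.
quietly summarize lbxtc if riagendr == 1 & age == 1 & ridstatr == 2 [aw = wtmec4yr], detail

* Stored results from summarize, detail
display "Weighted mean: " %8.2f r(mean)
display "Std. dev.:     " %8.2f r(sd)
display "Skewness:      " %8.3f r(skewness) "  (0 for a normal distribution)"
display "Kurtosis:      " %8.3f r(kurtosis) "  (3 for a normal distribution)"
```

Writing [aw = ...] makes explicit the analytic weights that summarize assumes when you type [w = ...].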

### Step 2: Generate histograms and box plots

To generate graphs of the distributions of a continuous variable, use the histogram and graph box commands.

In this example, the general structure of the histogram command is:

histogram varname if (condition), by(var1 var2) [options]

In this example, the general structure of the graph box command, including the medtype() option to specify how the median is indicated and the over() option to specify different subgroups, is:

graph box varname [w=weightvar] if (condition), medtype(line) over(var1) over(var2)

The commands to generate histograms and box plots for six population subsets defined by gender (riagendr) and three age categories (age) are below. The commands also include if statements, which further restrict the analysis to people aged 20 years and older with a non-missing age (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). In addition, the histogram command uses the normal option to overlay a normal density curve on the graph.

histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(riagendr age) normal

graph box lbxtc [pweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line) over(riagendr) over(age)

#### Output from Graph Box statement

Reviewing the output of these commands, notice that:

• The histogram for a normally distributed random variable is symmetric and bell-shaped. For variables based on data collected in a survey, such as NHANES 1999-2002, the distribution will deviate at least slightly from normality.
• The box plot of the weighted total cholesterol data shows three outliers with values above 600 mg/dL.
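
To see which records lie behind the outliers in the box plot, you can list them. This is a sketch, assuming the NHANES dataset is in memory and that the respondent identifier is stored under the name seqn; the 600 mg/dL cutoff is taken from the plot described above.

```stata
* Assumes the NHANES dataset is in memory and the respondent ID is seqn.
* lbxtc < . keeps missing cholesterol values, which Stata treats as very
* large, out of the listing.
list seqn riagendr ridageyr lbxtc wtmec4yr if lbxtc > 600 & lbxtc < . & ridageyr >= 20 & ridageyr < . & ridstatr == 2
```

Including the weight variable in the listing makes it easier to judge how much influence each extreme value has on weighted estimates.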

### Step 3: Use tabstat to request selective statistics

In some instances, you may not need all of the statistics generated by summarize. You can use the tabstat command as a useful alternative to summarize because it allows specification of the statistics to be displayed.

The general structure for the tabstat command is very similar to the summarize command, but you can specify the statistics you want. Using the tabstat command also arranges the output in a table.

tabstat varname [w=weightvar], statistics(statname)

Here is the same cholesterol (lbxtc) analysis for six population subsets defined by gender (riagendr) and three age categories (age). The command again includes an if statement, which restricts the analysis to people aged 20 years and older with a non-missing age (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). This time, only the mean, 25th percentile (p25), median, and 75th percentile (p75) are reported.

by riagendr, sort: tabstat lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age) stat(mean p25 median p75)

#### Output from Example Tabstat Command

Note that there are two tables, one for each gender, each with three age categories, and that only the statistics requested in the statistics option are displayed.
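
Because tabstat lets you choose the statistics, it can also produce a compact normality check. The sketch below requests the count, mean, standard deviation, skewness, and kurtosis for the same subgroups, assuming the NHANES dataset is in memory; the format() and nototal options are additions for readability and are not part of the tutorial's example.

```stata
* Assumes the NHANES dataset is in memory.
* Requests distribution-shape statistics for each gender and age category.
* format(%9.2f) fixes the display to two decimals; nototal drops the
* overall row that tabstat otherwise adds below the by() categories.
by riagendr, sort: tabstat lbxtc [aw = wtmec4yr] if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2, by(age) statistics(n mean sd skewness kurtosis) format(%9.2f) nototal
```

Skewness near 0 and kurtosis near 3 in a subgroup are consistent with an approximately normal distribution, as described in Step 1.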