Task 1c: How to Check Frequency Distribution and Normality in Stata

The frequency distribution can be presented in table or graphic format. In this task, you will learn how to use the standard Stata commands summarize, histogram, graph box, and tabstat to generate these representations of data distributions. These statistics can also help you decide whether parametric tests (which assume a normal distribution) or non-parametric tests are appropriate for your analysis. As noted in the Clean & Recode Data module, it is advisable to check for extreme weights and outliers before starting any analysis.

 

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.


Step 1: Use the summarize command to generate weighted summary statistics for a population subset

The Stata command summarize generates descriptive and summary statistics that are useful in describing the characteristics of a distribution. Because the SVY series of commands does not include the summarize command, you will need to use the standard summarize command and tell Stata to incorporate weights. Below are instructions on how to write these commands and interpret the output.

 

This command has the general structure:

summarize varname [w=weightvar], detail

 

IMPORTANT NOTE

Without the detail option, summarize reports only the number of observations, mean, standard deviation, minimum, and maximum.
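As a minimal sketch of what the detail option adds, the lines below run summarize with and without it and then display a few stored results that are relevant when judging normality. The example uses the auto dataset that ships with Stata, an assumption made only so the lines run without NHANES data; price is a variable in that dataset, not in NHANES.

```stata
* Illustration only: auto.dta ships with Stata and stands in for NHANES data
sysuse auto, clear

* Basic output: observations, mean, standard deviation, minimum, maximum
summarize price

* detail adds percentiles, variance, skewness, and kurtosis
summarize price, detail

* After summarize with detail, the extra statistics are stored in r()
display "median   = " r(p50)
display "skewness = " r(skewness)
display "kurtosis = " r(kurtosis)
```

For a normal distribution, skewness is 0 and kurtosis (as Stata reports it) is 3, so values far from these suggest a non-normal shape.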

 

You can generate summary statistics for various population subsets (e.g., young men, young women). The general format below adds the by prefix and an if condition to the previous command.

by var1 var2, sort: sum varname [w=weightvar] if (condition), detail

 

Here is the command to generate the summary statistics for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if statement, which further restricts the analysis to people aged 20 years and older (ridageyr >= 20) who have been both interviewed and examined (ridstatr==2).

by riagendr age, sort : sum lbxtc [w = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
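Two details of this command can be seen in a small sketch. First, the by ..., sort: prefix can be abbreviated as bysort. Second, when the weight type is abbreviated as w, summarize assumes analytic weights (aweight); it accepts aweights, fweights, and iweights, but not pweights. The example uses the auto dataset shipped with Stata. Using the vehicle weight variable as a weight and rep78 as the restricting variable is purely illustrative and has no counterpart in NHANES.

```stata
* Illustration only: auto.dta stands in for NHANES data
sysuse auto, clear

* Same structure as the NHANES command: groups, weight, and an if condition.
* rep78 < . keeps observations with missing rep78 out of the subset.
by foreign, sort: summarize price [w = weight] if rep78 >= 3 & rep78 < ., detail

* Equivalent form: bysort prefix, with the weight type spelled out
bysort foreign: summarize price [aweight = weight] if rep78 >= 3 & rep78 < ., detail
```

Spelling out the weight type avoids relying on the command's default and makes the do-file easier to review.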

 

IMPORTANT NOTE

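Stata treats missing numeric values (".") as larger than any nonmissing number. Therefore, a condition such as ridageyr >= 20 is also true for observations where ridageyr is missing, and the additional condition ridageyr <. is needed to exclude them. Please see the Stata Tips page for more information.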
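The effect of missing values on an if condition can be checked with a few made-up observations. The sketch below enters five hypothetical ages by hand (one of them missing) and counts how many satisfy the condition with and without the guard; the values are invented for illustration and are not NHANES records.

```stata
* Five made-up observations; id 4 has a missing age
clear
input id ridageyr
1 18
2 25
3 47
4 .
5 63
end

* Counts 4: the missing value is treated as larger than 20
count if ridageyr >= 20

* Counts 3: the second condition excludes the missing value
count if ridageyr >= 20 & ridageyr < .

* Same result using the missing() function
count if ridageyr >= 20 & !missing(ridageyr)
```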

 

Portion of Output from Example Stata Statement

 

Reviewing the output, notice that

 

Step 2: Generate histograms and box plots

To generate graphs of the distributions of a continuous variable, use the histogram and graph box commands.

In this example, the general structure of the histogram command is:

histogram varname if (condition), by(var1 var2) [options]

 

In this example, the general structure of the graph box command, including the medtype() option to specify how the median is indicated and the over() option to specify different subgroups, is:

graph box varname [w=weightvar] if (condition), medtype(line) over(var1) over(var2)

 

The commands to generate histograms and box plots for six population subsets defined by gender (riagendr) and three age categories (age) are below. The commands also include if conditions, which further restrict the analysis to people aged 20 years and older (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). Note that the if condition must come before the comma that introduces the options. In addition, the histogram command uses the normal option to add a normal density to the graph.

histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(riagendr age) normal

graph box lbxtc [pweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line) over(riagendr) over(age)

 

Output from Histogram Statement

 

Output from Graph Box statement
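The graphs can be tried on a dataset that needs no download, and two further standard Stata commands can supplement the visual check of normality. The sketch below is an optional illustration, not part of the original task: it uses the auto dataset shipped with Stata, and the qnorm and sktest lines are unweighted, so they do not account for the NHANES survey weights.

```stata
* Illustration only: auto.dta stands in for NHANES data
sysuse auto, clear

* Histogram with a normal density overlay, one panel per group
histogram price if rep78 < ., by(foreign) normal

* Box plot with the median drawn as a line, one box per group
graph box price if rep78 < ., medtype(line) over(foreign)

* Quantile-normal plot: points far from the reference line suggest non-normality
qnorm price

* Skewness and kurtosis test for normality (unweighted)
sktest price
```

Each graph command replaces the previous graph in the Graph window, so run them one at a time when you want to inspect each plot.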

 

 

Reviewing the output of these commands, notice that:

Step 3: Use tabstat to request selective statistics

In some instances, you may not need all of the statistics generated by summarize. The tabstat command is a useful alternative because it lets you specify which statistics are displayed.

The general structure of the tabstat command is very similar to that of the summarize command, except that you list the statistics you want. tabstat also arranges the output in a table.

 

tabstat varname [w=weightvar], statistics(statname)

 

Here is the same cholesterol (lbxtc) analysis for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if condition, which further restricts the analysis to people aged 20 years and older (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). The output now reports only the mean, 25th percentile (p25), median, and 75th percentile (p75).

 

by riagendr, sort: tabstat lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age) stat(mean p25 median p75)

 

Output from Example Tabstat Command

Output from Example Tabstat Command

 

Note that there are two tables, one for each gender, each with three age categories, and that only the statistics requested in the statistics() option are displayed.
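Because tabstat lets you choose the statistics, it can also give a compact table of the quantities most relevant to distribution shape. The sketch below is a minimal illustration on the auto dataset shipped with Stata (an assumption so that it runs without NHANES data); it requests the count, mean, standard deviation, quartiles, skewness, and kurtosis for each group.

```stata
* Illustration only: auto.dta stands in for NHANES data
sysuse auto, clear

* by() reports each group and, by default, an overall Total
tabstat price, by(foreign) statistics(n mean sd p25 median p75 skewness kurtosis)
```

With NHANES data, the same statistics() list could be added to the cholesterol command, keeping its weight and if condition.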
