# Lesson 2: Summarizing Data

## Summary, References, and Instructions for Epi Info 6

Frequency distributions, measures of central location, and measures of spread are effective tools for summarizing numerical variables including:

- Physical characteristics such as height and diastolic blood pressure,
- Illness characteristics such as incubation period, and
- Behavioral characteristics such as number of lifetime sexual partners.

Some characteristics, such as IQ, follow a normal or symmetrical bell-shaped distribution in the population. Other characteristics have distributions that are skewed to the right (tail toward higher values) or skewed to the left (tail toward lower values). Some characteristics are mostly normally distributed, but have a few extreme values or outliers. Some characteristics, particularly laboratory dilution assays, follow a logarithmic pattern. Finally, other characteristics follow other patterns (such as a uniform distribution) or follow no apparent pattern at all. The distribution of the data is the most important factor in selecting an appropriate measure of central location and spread.

Measures of central location are single values that represent the center of the observed distribution of values. The different measures of central location represent the center in different ways. The arithmetic mean represents the balance point for all the data. The median represents the middle of the data, with half the observed values below the median and half the observed values above it. The mode represents the peak or most prevalent value. The geometric mean is comparable to the arithmetic mean on a logarithmic scale.
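The four measures described above can be computed with Python's standard `statistics` module. This is a minimal sketch, not part of the original lesson; the variable name `incubation_days` and its values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical incubation periods in days (illustrative values only).
incubation_days = [2, 3, 3, 4, 5, 6, 12]

arithmetic_mean = statistics.mean(incubation_days)           # balance point of the data
median = statistics.median(incubation_days)                  # middle observation
mode = statistics.mode(incubation_days)                      # most frequent value
geometric_mean = statistics.geometric_mean(incubation_days)  # mean taken on the log scale

print(f"mean={arithmetic_mean}, median={median}, mode={mode}, "
      f"geometric mean={geometric_mean:.2f}")
```

In this made-up data set the single large value (12) pulls the arithmetic mean (5) above the median (4), which is the behavior expected for data skewed to the right.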

Measures of spread describe the spread or variability of the observed distribution. The range measures the spread from the smallest to the largest value. The standard deviation, usually used in conjunction with the arithmetic mean, reflects how closely clustered the observed values are to the mean. For normally distributed data, 95% of the data fall within 1.96 standard deviations of the mean, that is, in the range from the mean − 1.96 standard deviations to the mean + 1.96 standard deviations. The interquartile range, used in conjunction with the median, includes data in the range from the 25th percentile to the 75th percentile, or approximately the middle 50% of the data.
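As a rough illustration of these measures of spread, the sketch below uses Python's standard `statistics` module on hypothetical diastolic blood pressure readings. The data, the variable names, and the choice of the sample (n − 1) standard deviation are assumptions made for this example. Quartile conventions also differ between software packages, so the interquartile range computed here may not match other tools exactly.

```python
import statistics

# Hypothetical diastolic blood pressure readings in mm Hg (illustrative only).
diastolic_bp = [70, 72, 75, 78, 80, 80, 82, 85, 88, 90]

data_range = max(diastolic_bp) - min(diastolic_bp)  # largest minus smallest value

mean = statistics.mean(diastolic_bp)
sd = statistics.stdev(diastolic_bp)                 # sample standard deviation (n - 1)

# If the data are approximately normal, about 95% of values lie in this interval.
lower_95 = mean - 1.96 * sd
upper_95 = mean + 1.96 * sd

# Quartiles using the module's default ("exclusive") interpolation method.
q1, q2, q3 = statistics.quantiles(diastolic_bp, n=4)
iqr = q3 - q1                                       # spread of the middle 50%

print(f"range={data_range}, mean={mean}, sd={sd:.2f}")
print(f"mean +/- 1.96 sd: {lower_95:.1f} to {upper_95:.1f}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```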

Data that are normally distributed are usually summarized with the arithmetic mean and standard deviation. Data that are skewed or have a few extreme values are usually summarized with the median and range, or with the median and interquartile range. Data that follow a logarithmic scale and data that span several orders of magnitude are usually summarized with the geometric mean.
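To see why the geometric mean suits dilution-type data, the sketch below computes it as the antilog of the arithmetic mean of the logarithms. The titer values and variable names are hypothetical, used only to illustrate the calculation.

```python
import math

# Hypothetical reciprocal antibody titers from a twofold dilution series.
titers = [4, 8, 16, 32, 64]

# Step 1: take the logarithm of each value.
log_values = [math.log(t) for t in titers]

# Step 2: average the logarithms (arithmetic mean on the log scale).
mean_log = sum(log_values) / len(log_values)

# Step 3: take the antilog to return to the original scale.
geometric_mean = math.exp(mean_log)

arithmetic_mean = sum(titers) / len(titers)

print(f"geometric mean = {geometric_mean:.1f}")
print(f"arithmetic mean = {arithmetic_mean:.1f}")
```

For these values the geometric mean is approximately 16, the middle dilution, while the arithmetic mean (24.8) is pulled toward the largest titer.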

## References

- Griffin S., Marcus A., Schulz T., Walker S. Calculating the interindividual geometric standard deviation for use in the integrated exposure uptake biokinetic model for lead in children. Environ Health Perspect 1999;107:481–7.

## Instructions for Epi Info 6 (DOS)

To download:

Go to https://www.cdc.gov/epiinfo/Epi6/ei6.htm and click on “Downloads.”

To get a complete installation package:

Download all three self-expanding, compressed files to a temporary directory, then run each one.

EPI604_1.EXE (File Size = 1,367,649 bytes)

EPI604_2.EXE (File Size = 1,341,995 bytes)

EPI604_3.EXE (File Size = 1,360,925 bytes)

Then run INSTALL.EXE to install the software.

To create a frequency distribution from a data set in Analysis Module:

EpiInfo6: >freq *variable*.
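For readers working outside Epi Info 6, a comparable frequency table can be sketched in Python with `collections.Counter`. This is only an analogue under assumed data; it does not reproduce the Epi Info output format, and the `observations` list is hypothetical.

```python
from collections import Counter

# Hypothetical values of one variable (illustrative only).
observations = [1, 2, 2, 3, 3, 3, 4, 4, 5]

counts = Counter(observations)  # value -> number of occurrences
total = len(observations)

# Build rows of (value, frequency, percent, cumulative percent).
rows = []
cumulative = 0
for value in sorted(counts):
    cumulative += counts[value]
    rows.append((value, counts[value],
                 100 * counts[value] / total,
                 100 * cumulative / total))

print(f"{'value':>5} {'freq':>5} {'percent':>8} {'cum %':>8}")
for value, freq, pct, cum_pct in rows:
    print(f"{value:>5} {freq:>5} {pct:>7.1f}% {cum_pct:>7.1f}%")
```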

To identify the mode from a data set in Analysis Module:

Epi Info does not have a Mode command. Thus, the best way to identify the mode is to create a histogram and look for the tallest column(s).

EpiInfo6: >histogram *variable*.
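The same idea (look for the tallest column or columns) can be sketched in Python. The example below is an assumption-laden illustration with hypothetical data: it draws a simple text histogram and uses `statistics.multimode`, which returns every value tied for the highest frequency.

```python
import statistics
from collections import Counter

# Hypothetical observations with two equally common values (illustrative only).
observations = [1, 2, 2, 3, 3, 4]

# Text histogram: one row per value, one '#' per observation.
for value, freq in sorted(Counter(observations).items()):
    print(f"{value:>3} | {'#' * freq}")

# All values sharing the highest frequency, in order of first appearance.
modes = statistics.multimode(observations)
print("mode(s):", modes)
```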

To identify the median from a data set in Analysis Module:

EpiInfo6: >means *variable*. Output indicates median.

To identify the mean from a data set in Analysis Module:

EpiInfo6: >means *variable*. Output indicates mean.

To calculate the standard deviation from a data set in Analysis Module:

EpiInfo6: >means *variable*. Output indicates standard deviation, abbreviated as Std Dev.