Describing Epidemiologic Data

Robert E. Fontaine

Introduction

As a field epidemiologist, you will collect and assess data from field investigations, surveillance systems, vital statistics, or other sources. This task, called descriptive epidemiology, answers the following questions about disease, injury, or environmental hazard occurrence:

What?
How much?
When?
Where?
Among whom?

The first question is answered with a description of the disease or health condition. “How much?” is expressed as counts or rates. The last three questions are assessed as patterns of these data in terms of time, place, and person. After the data are organized and displayed, descriptive epidemiology then involves interpreting these patterns, often by comparing them with expected patterns or norms (e.g., those based on historical counts, increased surveillance, or output from prevention and control programs). Through this process of organization, inspection, and interpretation of data, descriptive epidemiology serves multiple purposes (Box 6.1).

Box 6.1

Purposes of Descriptive Epidemiology

Descriptive epidemiology

Provides a systematic approach for dissecting a health problem into its component parts.
Ensures that you are fully versed in the basic dimensions of a health problem.
Identifies populations at increased risk for the health problem under investigation.
Provides timely information for decision-makers, the media, the public, and others about ongoing investigations.
Supports decisions for initiating or modifying control and prevention measures.
Measures the progress of control and prevention programs.
Enables generation of testable hypotheses regarding the etiology, exposure mode, control measure effectiveness, and other aspects of the health problem.
Helps validate the eventual incrimination of causes or risk factors.

Your analytic findings must explain the observed patterns by time, place, and person.

Organizing Epidemiologic Data

Organizing descriptive data into tables, graphs, diagrams, maps, or charts provides a rapid, objective, and coherent grasp of the data. Whether the tables or graphs help the investigator understand the data or explain the data in a report or to an audience, their organization should quickly reveal the principal patterns and the exceptions to those patterns. Tables, graphs, maps, and charts all have four elements in common: a title, data, footnotes, and text (Box 6.2). In this chapter, additional guidelines for preparing these data displays will appear where the specific data display type is first applied.

Characterizing The Cases (What?)

Tables are commonly used for characterizing disease cases or other health events and are ideal for displaying numeric values. In addition to the previously mentioned elements in common to all data displays (Box 6.2), tables have column and row headings that identify the data type and any units of measurement that apply to all data in that column or row. A well-structured analytical table that is organized to focus on comparisons will help you understand the data and explain the data to others. In arranging analytical tables, you should begin with the arrangement of the data space by following a simple set of guidelines (Box 6.3) (1).

Cases are customarily organized in a table called a line-listing (Table 6.1) (2). This arrangement facilitates sorting to reorganize cases by relevant characteristics. The line-listing in Table 6.1 has been sorted by days between vaccination and onset to reveal the pattern of this important time–event association. Commonly in descriptive epidemiology, you organize cases by frequency of clinical findings (Table 6.2) (3). If the disease cause is unknown, this arrangement can assist the epidemiologist in developing hypotheses regarding possible exposures. For example, initial respiratory symptoms might indicate exposure through the upper airways, as in Table 6.2.
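The sorting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the original investigation: the `Case` record, its field names, and the `sort_line_listing` helper are assumptions made for the example, and the rows are a subset of Table 6.1 entered deliberately out of order.

```python
from collections import namedtuple

# Hypothetical record layout mirroring the columns of Table 6.1.
Case = namedtuple("Case", ["state", "age_months", "sex", "days_to_onset", "dose"])

# A few rows from Table 6.1, entered out of order.
line_listing = [
    Case("California", 5, "M", 59, 1),
    Case("New York", 2, "M", 3, 1),
    Case("Pennsylvania", 3, "F", 7, 1),
    Case("Colorado", 4, "F", 4, 1),
    Case("Kansas", 2, "F", 5, 1),
]

def sort_line_listing(cases, key_fields):
    """Return the cases sorted by the named fields, in order of priority."""
    return sorted(cases, key=lambda c: tuple(getattr(c, f) for f in key_fields))

# Sort by the interval from vaccination to onset, then by age.
by_interval = sort_line_listing(line_listing, ["days_to_onset", "age_months"])
for case in by_interval:
    print(f"{case.state:<14}{case.age_months:>3}  {case.sex}  "
          f"{case.days_to_onset:>3}  {case.dose}")
```

Changing the list of fields passed to `sort_line_listing` reorders the same cases by another characteristic, such as state or dose.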

Box 6.2

Components of Statistical Data Displays

A statistical data display should include, at a minimum,

A title that includes the what, where, and when that identifies the data it introduces.
A data space where the data are organized and displayed to indicate patterns.
Footnotes that explain any abbreviations used, the data sources, units of measurement, and other necessary details or data.
Text that highlights the main patterns of the data (this text might appear within the table or graphic or in the body of the report).

Box 6.3

Guidelines for Arranging Data in Tables

Round data to two statistically significant or effective numbers.
Using three or more significant figures interferes with comparison and comprehension.
More precision is usually not needed for epidemiologic purposes.
Effective figures refers to numbers that contain additional, leading non-zero digits that do not vary (e.g., 123, 145, 168, or 177) or vary slightly (see BMI columns in Table 6.3) within a column or row.
Provide marginal averages, rates, totals, or other summary statistics for rows and columns whenever possible.
Use columns for most crucial data comparisons.
Numbers are more easily compared down a column than across a row.
Organize data by magnitude (sort) across rows and down columns.
Use the most important epidemiologic features on which to sort the data.
Organizing data columns and rows by the magnitude of the marginal summary statistics is often helpful.
When the row or column headings are numeric (e.g., age groups), they should govern the order of the data.
Use the table layout to guide the eye. For example,
Align columns of numbers on the decimal point (or ones column).
Place numbers close together, which might require using abbreviations in column headings.
Avoid using dividing lines, grids, and other embellishments within the data space.
Use alternating light shading of rows to assist readers in following data across a table.

Source: Adapted from Reference 1.
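As an illustration of the first guideline in Box 6.3, the following Python sketch rounds values to two significant figures. The helper name `round_sig` is an assumption made for this example. It handles significant figures only and does not attempt the judgment involved in choosing effective figures. The counts are taken from Table 6.2 (n = 68).

```python
import math

def round_sig(value, figures=2):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0
    # Position of the leading digit (0 for 1-9, 1 for 10-99, -1 for 0.1-0.9).
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, figures - 1 - magnitude)

n_staff = 68  # denominator from Table 6.2
symptom_counts = {"Watery eyes": 31, "Nasal problems": 28, "Asthma-like symptoms": 19}

for symptom, count in symptom_counts.items():
    percentage = round_sig(100 * count / n_staff)
    print(f"{symptom:<22}{count:>3}{percentage:>6.0f}")
```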

Table 6.1

Reported cases of intussusception among recipients of tetravalent rhesus-based rotavirus vaccine,^a by state—United States, 1998–1999
State	Age (mos.)	Sex	Days^b	Dose
New York	2	M	3	1
California	3	M	3	1
Pennsylvania	6	M	3	1
Pennsylvania	2	M	4	1
Colorado	4	F	4	1
California	7	M	4	2
Kansas	2	F	5	1
Colorado	3	M	5	1
New York	3	F	5	1
North Carolina	4	F	5	1
Missouri	11	M	5	1
Pennsylvania	3	F	7	1
California	4	F	14	2
Pennsylvania	2	M	29	1
California	5	M	59	1

F, female; M, male.
^aRotaShield®, Wyeth-Lederle, Collegeville, Pennsylvania
^bDays from vaccine dose to illness onset
Source: Adapted from Reference 2

Table 6.2

Prevalence of symptoms and work-related^a symptoms among hospital environmental services staff reporting use of a new disinfection product—Pennsylvania, August–September 2015 (n = 68)
Symptom	Number with Any	Number with Work-related^a	Percentage with Any	Percentage with Work-related^a
Watery eyes^b	31	20	46	29
Nasal problems^b	28	15	41	22
Asthma-like symptoms^c	19	10	28	15
Shortness of breath	11	5	16	7
Skin irritation^b	10	7	15	10
Wheeze^b	10	5	15	7
Chest tightness^b	9	2	13	3
Cough	3	1	4	1
Asthma attack^b	2	1	3	1

^aDefined as a symptom that improved while away from the facility, either on days off or on vacation.
^bDuring the previous 12 months.
^cDefined as current use of asthma medicine or one or more of the following symptoms during the previous 12 months: wheezing or whistling in the chest, awakening with a feeling of chest tightness, or attack of asthma.

Source: Adapted from Reference 3

Counts and Rates (How Much?)

Counts

A first and simple step in determining how much is to count the cases in the population of interest. Always check whether data sources are providing incident (new events among the population) or prevalent (an existing event at a specific point in time) cases. For incident cases, specify the period during which the cases occurred. This count of incident cases over time in a population is called incidence. Never mix incident with prevalent cases in epidemiologic analyses.

The counts of incident or prevalent cases can be compared with their historical norm or another expected or target value. These case counts are valid for epidemiologic comparisons only when they come from a population of the same or approximately the same size.

Rates, Ratios, and Alternative Denominators

Rates correct counts for differences among population sizes or study periods. Thus, incidence divided by an appropriate estimate of the population yields several versions of incidence rates. Similarly, prevalent case counts divided by the population from which they arose produce a proportion (termed prevalence). Strictly speaking, in computing rates, the disease or health events you have counted should have been derived from the specific population used as the denominator. However, sometimes the population is unknown, costly to determine, or even inappropriate. For example, a maternal mortality ratio and infant mortality rate use births in a calendar year as a denominator for deaths in the same calendar year, yet the deaths might be related to births in the previous calendar year. To assess adverse effects from a vaccine or pharmaceutical, consider using total doses distributed as the denominator. Another example is injuries from snowmobile use, which have been calculated both as ratios per registered vehicle and as ratios per crash incident (4). Returning now to counts, you can calculate expected case counts for a population by multiplying an expected rate (e.g., one based on historical counts, increased surveillance, or output from prevention and control programs) or a target rate by the population total. This expected or target case count is now corrected for population size and can be compared with the actual observed case counts.
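The arithmetic in this paragraph can be sketched as follows. All numbers are hypothetical and chosen only to show the calculation. The function names `rate_per` and `expected_cases` and the per-100,000 scaling are assumptions made for this example.

```python
def rate_per(cases, population, per=100_000):
    """Cases per `per` persons in the population."""
    return cases * per / population

def expected_cases(expected_rate, population, per=100_000):
    """Case count implied by a rate expressed per `per` persons."""
    return expected_rate * population / per

# Hypothetical values for illustration only.
observed_cases = 42
population = 250_000
historical_rate = 12.0  # assumed cases per 100,000 per year

observed_rate = rate_per(observed_cases, population)
expected = expected_cases(historical_rate, population)
ratio = observed_cases / expected

print(f"Observed rate: {observed_rate:.1f} per 100,000")
print(f"Expected cases: {expected:.0f}")
print(f"Observed/expected ratio: {ratio:.1f}")
```

The same functions apply to alternative denominators, such as doses distributed or registered vehicles, by passing that total in place of `population`.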

Figure 6.1

Mean, median, range, and interquartile range of body mass index measurements of 1,800 residents, by education level: Ajloun and Jerash Governorates, Jordan, 2012.

Source: Adapted from: Ajloun Non-Communicable Disease Project, Jordan, unpublished data, 2017.

Measurements on a Continuous Scale

Disease or unhealthy conditions also can be measured on a continuous scale rather than counted directly (e.g., body mass index [BMI], blood lead level, blood hemoglobin, blood sugar, or blood pressure). You can apply empirical cutoff points (e.g., BMI ≥26 for overweight) to classify persons; those meeting the cutoff can then be counted and rates calculated. However, a person’s measurements can fluctuate above or below these cutoff values. To calculate incidence, special care therefore is needed to avoid counting the same person every time a fluctuation occurs above or below the cutoff point. For prevalence, this fluctuation amplifies the statistical error. A more precise approach involves computing the average and dispersion of the individual measurements. These can then be compared among groups, against expected values, or against target values. The averages and dispersions can be displayed in a table or visualized in a box-and-whisker plot that indicates the median, mean, interquartile range, and outliers (Figure 6.1) (5).
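A minimal Python sketch of both approaches follows, using only the standard library. The BMI values are hypothetical, the `summarize` helper is an assumption made for this example, and the cutoff of 26 is the one quoted in the text. The quartiles use the "inclusive" method of `statistics.quantiles`, which is one of several conventions.

```python
import statistics

def summarize(values):
    """Average and dispersion statistics of the kind shown in a box-and-whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "min": min(values),
        "max": max(values),
    }

# Hypothetical BMI measurements for one group.
bmi = [21.4, 23.0, 24.6, 26.1, 27.2, 28.7, 29.6, 31.0, 35.2]

summary = summarize(bmi)
# Cutoff approach: proportion of persons with BMI >= 26.
share_at_or_above_cutoff = sum(v >= 26 for v in bmi) / len(bmi)

for name, value in summary.items():
    print(f"{name:>6}: {value:.1f}")
print(f"BMI >= 26: {100 * share_at_or_above_cutoff:.0f}%")
```

Running `summarize` once per group (for example, per education level) gives the values needed to compare groups side by side.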

Time (When?)

Time has special importance in interpreting epidemiologic data in that the initial exposure to a causative agent must precede disease. Often, this will follow a biologically determined interval. The disease or health condition onset time is the preferred statistic for studying time patterns. Onset might not always be available. In surveillance systems, you might have only the report date or another onset surrogate. Moreover, with slowly developing health conditions, a discernible onset might not exist. On the opposite end of the scale, injuries and acute poisonings have instantaneous and obvious onsets.

Similarly, times of suspected exposures vary in their precision. With acute infections, poisonings, and injuries, you will often have precise exposure times to different suspected agents. Contrast this with chronic diseases that can have exposures lasting for decades before development of overt disease. Other relevant events supplementing a chronologic framework of a health problem include underlying environmental conditions, changes in health policy, and application of control and prevention measures.

Relating disease with these events in time can support calculation of key characteristics of the disease or health event. If you know both time of onset and time of the presumed exposure, you can estimate the incubation or latency period. When the agent is unknown, the time interval between presumed exposures and onset of symptoms helps in hypothesizing the etiology. For example, the consistent time interval between rotavirus vaccination and onset of intussusception (Table 6.1) helped build the hypothesis that the vaccine precipitated the disease (2). Similarly, when the incubation period is known, you can estimate a time window of exposure and identify exposures to potential causative agents during that window.
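Both calculations can be sketched in Python. The intervals below are the "Days" column of Table 6.1. The `exposure_window` helper, the onset date, and the 2–10 day incubation range are hypothetical and serve only to show the date arithmetic.

```python
import statistics
from collections import Counter
from datetime import date, timedelta

# Days from vaccine dose to illness onset, from Table 6.1.
days_to_onset = [3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 7, 14, 29, 59]

median_interval = statistics.median(days_to_onset)
within_one_week = sum(d <= 7 for d in days_to_onset)
interval_counts = Counter(days_to_onset)

print(f"Median interval: {median_interval} days")
print(f"Onset within 7 days: {within_one_week} of {len(days_to_onset)} cases")

def exposure_window(onset, min_incubation_days, max_incubation_days):
    """Earliest and latest plausible exposure dates for one case."""
    earliest = onset - timedelta(days=max_incubation_days)
    latest = onset - timedelta(days=min_incubation_days)
    return earliest, latest

# Hypothetical case: onset on 20 March 2024, assumed incubation of 2-10 days.
earliest, latest = exposure_window(date(2024, 3, 20), 2, 10)
print(f"Exposure window: {earliest} to {latest}")
```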

Depicting Data by Time: Graphs

Graphs are most frequently used for displaying time associations and patterns in epidemiologic data. These graphs can include line graphs, histograms (epidemic curves), and scatter diagrams (see Box 6.4 for general guidelines in construction of epidemiologic graphs).

Box 6.4

Guidelines for Graphical Data Presentation

Take care in selecting a graph type in computer graphics programs. In Microsoft Excel (Microsoft Corporation, Redmond, WA), for example, you should use “scatter,” not “line” to produce numerically scaled line graphs.
Adhere to mathematical principles in plotting data and scaling axes.
On an arithmetic scale, represent equal numerical units with equal distances on an axis.
When using transformed data (e.g., logarithmic, normalized, or ranked), represent equal units of the transformed data with equal distances on the axis.
Represent dependent variables on the vertical scale and independent variables on the horizontal scale.
Use alternatives to joining data points with a line. Consider instead
- No line at all (use data markers only).
- A trend line of best fit underlying the data markers.
- A moving average line underlying the data markers.
Aspect ratios (data space width to height) of approximately 2:1 work well. Extreme aspect ratios distort data.
Scale the graph to fill the data space and to improve resolution. If this means that you must exclude the zero level, exclude it, but note for the reader that this has been done.
Do not insist on a zero level unless it is an integral feature of the data (e.g., an endpoint).
Use graphic designs that reveal the data from the broad overview to the fine detail.
To compare two lines, plot their difference directly.
Use visually prominent symbols to plot and emphasize the data.
Make sure overlapping plotting symbols are distinguishable.
When two or more data sets are plotted in the same data space
- Design point markers and lines for visual discrimination; and
- Differentiate them with labels, legends, or keys.
To avoid clutter and maintain undistorted comparisons, consider using two or more separate panels for different strata on the same graph.
When comparing two graphs of the same dependent variable, use scaling that improves comparison and resolution.
Clearly indicate scale divisions and scaling units.
Minimize frames, gridlines, and tick marks (6–10/axis is sufficient) to avoid interference with the data.
Use six or fewer tick mark labels on the axes. More than that becomes confusing clutter.
Keep keys, legends, markers, and other annotations out of the data space. Instead, put them just outside the data region.
Proofread your graphs.

Contact Diagrams

Contact diagrams are versatile tools for revealing relationships between individual cases in time. In contact diagrams (Figure 6.2, panel A) (5), which are commonly used for visualizing person-to-person transmission, different markers are used to indicate the different groups exposed or at risk.

Epidemic Curves

Epidemic curves (Box 6.5) are histograms of frequency distributions of incident cases of disease or other health events displayed by time intervals. Epidemic curves often have patterns that reveal likely transmission modes. The following sections describe certain kinds of epidemic situations that can be diagnosed by plotting cases on epidemic curves.

Box 6.5

Guidelines for Epidemic Curve Histograms

Time intervals are indicated on the x-axis and case counts on the y-axis.
Upright bars in each interval represent the case counts during that interval.
No gaps should exist between the bars.
Use time intervals of half an incubation or latency period or less.
Decrease the time interval size as case numbers increase.
Indicate an interval of 1–2 incubation periods before the outbreak increases from the background and after it returns to background levels.
Use separate, equally scaled epidemic curves to indicate different groups.
Do not stack columns for different groups atop one another in the same graph.
Use an overlaid line graph, labels, markers, and reference lines to indicate suspected exposures, interventions, special cases, or other key features.
Compare the association of cases during these pre- and post-epidemic periods with the main outbreak.
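The binning behind an epidemic curve can be sketched in Python as follows. The onset dates, the assumed 4-day incubation period, and the `epidemic_curve` helper are hypothetical. The sketch keeps empty intervals so that the bars remain contiguous, and it uses an interval of half the assumed incubation period. It does not add the pre- and post-outbreak margins that Box 6.5 recommends.

```python
from collections import Counter
from datetime import date, timedelta

def epidemic_curve(onsets, interval_days):
    """Count cases per fixed-width interval, keeping empty intervals."""
    start = min(onsets)
    counts = Counter((d - start).days // interval_days for d in onsets)
    n_bins = max(counts) + 1
    return [
        (start + timedelta(days=i * interval_days), counts.get(i, 0))
        for i in range(n_bins)
    ]

# Hypothetical onset dates.
onsets = [
    date(2024, 5, 1), date(2024, 5, 2), date(2024, 5, 2), date(2024, 5, 3),
    date(2024, 5, 3), date(2024, 5, 3), date(2024, 5, 4), date(2024, 5, 8),
]

assumed_incubation_days = 4
interval_days = assumed_incubation_days // 2  # half an incubation period

curve = epidemic_curve(onsets, interval_days)
for interval_start, count in curve:
    print(f"{interval_start}  {'#' * count}")
```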

Point Source

An epidemic curve with a tight clustering of cases in time (≤1.5 times the range of the incubation period, if the agent is known) and with a sharp upslope and a trailing downslope is consistent with a point source (Figure 6.3) (6). Variations in slopes (e.g., bimodal or a broader than expected peak) might suggest different explanations for how exposure to the source appeared, persisted, and disappeared. Of note, administration of antimicrobials, immunoglobulins, antitoxins, or other quickly acting drugs can lead to a shorter than expected outbreak with a curtailed downslope.

To approximate the time of exposure, count backward to the average incubation period before the peak, the minimum incubation period from the initial cases, and the maximum incubation period from the last cases. These three points should bracket the exposure period. If a rapidly acting intervention was taken early enough to prevent cases, discount the contribution of the last cases to this estimation.
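The clustering check and the three backward counts can be sketched in Python. The dates and the incubation values (minimum 2, average 4, maximum 7 days) are hypothetical, and both function names are assumptions made for this example. The check covers only the duration criterion and says nothing about the shape of the slopes.

```python
from datetime import date, timedelta

def consistent_with_point_source(first_onset, last_onset, min_inc, max_inc):
    """True if outbreak duration is at most 1.5 times the incubation range."""
    duration_days = (last_onset - first_onset).days
    return duration_days <= 1.5 * (max_inc - min_inc)

def estimate_exposure(first_onset, peak_onset, last_onset, min_inc, avg_inc, max_inc):
    """Count backward from the first cases, the peak, and the last cases."""
    return {
        "from_first": first_onset - timedelta(days=min_inc),
        "from_peak": peak_onset - timedelta(days=avg_inc),
        "from_last": last_onset - timedelta(days=max_inc),
    }

# Hypothetical outbreak and incubation period (days).
first, peak, last = date(2024, 5, 3), date(2024, 5, 5), date(2024, 5, 9)
min_inc, avg_inc, max_inc = 2, 4, 7

points = estimate_exposure(first, peak, last, min_inc, avg_inc, max_inc)
bracket = (min(points.values()), max(points.values()))

print("Consistent with point source:",
      consistent_with_point_source(first, last, min_inc, max_inc))
print(f"Estimated exposure period: {bracket[0]} to {bracket[1]}")
```

If a rapidly acting intervention curtailed the outbreak, the `from_last` estimate should be given less weight, as the text notes.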

Point source outbreaks result in infected persons who might have transmitted the agent directly or through a vehicle to others. These secondary cases might appear as a prominent wave after a point source by one incubation period, as observed after a point source hepatitis E outbreak that resulted from repairs on a broken water main (Figure 6.4) (7). With diseases of shorter incubation and lower rates of secondary spread, the secondary wave might appear only as a more prolonged downslope.

Continuing Common Source

Outbreaks can arise from common sources that continue over time. The continuing common source epidemic curve will increase sharply, similar to a point source. Rather than increase to a peak, however, this type of epidemic curve has a plateau. The downslope can be precipitous if the common source is removed or gradual if it exhausts itself. The rapid increase, plateau, and precipitous downslope all appeared with a salmonellosis outbreak from cheese distributed to multiple restaurants and then recalled (Figure 6.5).

Figure 6.2

Contact between severe acute respiratory syndrome (SARS) cases among a group of relatives and health care workers: Beijing, China, 2003.

Source: Adapted from Reference 5.

Age group (yrs)	Persons – M	Persons – F	Persons – All	BMI (SD) – M	BMI (SD) – F	BMI (SD) – All	BMI ≥ 26 (%) – M	BMI ≥ 26 (%) – F	BMI ≥ 26 (%) – All
18–29	189	251	440	24.7 (4.9)	24.6 (5.2)	24.6 (5.1)	30	32	31
30–39	242	249	491	27.2 (4.7)	29.1 (6.2)	28.2 (5.6)	54	70	62
40–49	198	182	380	28.4 (4.9)	30.8 (7.6)	29.6 (6.5)	66	77	72
50–59	84	119	203	28.5 (6.4)	30.7 (5.7)	29.8 (6.1)	67	82	76
60–74	90	110	200	29.0 (6.1)	30.9 (7.2)	30.1 (6.8)	71	78	75
75–99	43	43	86	26.9 (3.6)	29.6 (5.7)	28.3 (4.9)	58	70	64
All ages	846	954	1,800	27.2 (5.3)	28.7 (6.8)	28.0 (6.2)	55	64	60

Examining Rates by Time