Lesson 4: Displaying Public Health Data

Section 2: Tables

A table is a set of data arranged in rows and columns. Almost any quantitative information can be organized into a table. Tables are useful for demonstrating patterns, exceptions, differences, and other relationships. In addition, tables usually serve as the basis for preparing additional visual displays of data, such as graphs and charts, in which some of the details may be lost.

Tables designed to present data to others should be as simple as possible.(1) Two or three small tables, each focusing on a different aspect of the data, are easier to understand than a single large table that contains many details or variables.

A table in a printed publication should be self-explanatory. If a table is taken out of its original context, it should still convey all the information necessary for the reader to understand the data. To create a table that is self-explanatory, follow the guidelines below.

More About Constructing Tables

  • Use a clear and concise title that describes person, place and time — what, where, and when — of the data in the table. Precede the title with a table number.
  • Label each row and each column and include the units of measurement for the data (for example, years, mm Hg, mg/dl, rate per 100,000).
  • Show totals for rows and columns, where appropriate. If you show percentages (%), also give their total (always 100).
  • Identify missing or unknown data either within the table (for example, Table 4.11) or in a footnote below the table.
  • Explain any codes, abbreviations, or symbols in a footnote (for example, Syphilis P&S = primary and secondary syphilis).
  • Note exclusions in a footnote (e.g., 1 case and 2 controls with unknown family history were excluded from this analysis).
  • Note the source of the data below the table or in a footnote if the data are not original.

One-variable tables

In descriptive epidemiology, the most basic table is a simple frequency distribution with only one variable, such as Table 4.1a, which displays the number of reported syphilis cases in the United States in 2002 by age group.(2) (Frequency distributions are discussed in Lesson 2.) In this type of frequency distribution table, the first column shows the values or categories of the variable represented by the data, such as age or sex. The second column shows the number of persons or events that fall into each category. In constructing any table, the choice of columns follows from the interpretation to be made. In Table 4.1a, the point the analyst wishes to make is the role of age as a risk factor for syphilis. Thus, age group is chosen as column 1 and case count as column 2.

Epi Info

To create a frequency distribution from a data set in Analysis Module:

Select frequencies, then choose variable under Frequencies of.

(Since Epi Info 3 is the recommended version, only commands for this version are provided in the text; corresponding commands for Epi Info 6 are offered at the end of the lesson.)

Often, an additional column lists the percentage of persons or events in each category (see Table 4.1b). The percentages shown in Table 4.1b actually add up to 99.9% rather than 100.0% due to rounding to one decimal place. Rounding that results in totals of 99.9% or 100.1% is common in tables that show percentages. Nonetheless, the total percentage should be displayed as 100.0%, and a footnote explaining that the difference is due to rounding should be included.

The addition of percent to a table shows the relative burden of illness; for example, in Table 4.1b, we see that the largest contribution to illness for any single age category is from 35–39 year olds. The subsequent addition of cumulative percent (e.g., Table 4.1c) allows the public health analyst to illustrate the impact of a targeted intervention. Here, any intervention effective at preventing syphilis among young people and young adults (under age 35) would prevent almost half of the cases in this population.

The one-variable table can be further modified to show cumulative frequency and/or cumulative percentage, as in Table 4.1c. From this table, you can see at a glance that 46.7% of the primary and secondary syphilis cases occurred in persons younger than age 35 years, meaning that over half of the syphilis cases occurred in persons age 35 years or older. Note that the choice of age-groupings will affect the interpretation of your data.(3)

Table 4.1a Reported Cases of Primary and Secondary Syphilis by Age — United States, 2002

Age Group (years)   Number of Cases
≤14                 21
15–19               351
20–24               842
25–29               895
30–34               1,097
35–39               1,367
40–44               1,023
45–54               982
≥55                 284
Total               6,862

Data Source: Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2002. Atlanta: U.S. Department of Health and Human Services; 2003.

Table 4.1b Reported Cases of Primary and Secondary Syphilis by Age — United States, 2002

                    CASES
Age Group (years)   Number    Percent
Total               6,862     100.0*
≤14                 21        0.3
15–19               351       5.1
20–24               842       12.3
25–29               895       13.0
30–34               1,097     16.0
35–39               1,367     19.9
40–44               1,023     14.9
45–54               982       14.3
≥55                 284       4.1

* Actual total of percentages for this table is 99.9% and does not add to 100.0% due to rounding error.

Data Source: Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2002. Atlanta: U.S. Department of Health and Human Services; 2003.

Table 4.1c Reported Cases of Primary and Secondary Syphilis by Age — United States, 2002

                    CASES
Age Group (years)   Number    Percent   Cumulative Percent
Total               6,862     100.0*    100.0*
≤14                 21        0.3       0.3
15–19               351       5.1       5.4
20–24               842       12.3      17.7
25–29               895       13.0      30.7
30–34               1,097     16.0      46.7
35–39               1,367     19.9      66.6
40–44               1,023     14.9      81.6
45–54               982       14.3      95.9
≥55                 284       4.1       100.0

* Percentages do not add to 100.0% due to rounding error.

Data Source: Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2002. Atlanta: U.S. Department of Health and Human Services; 2003.
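
The percent and cumulative percent columns of Table 4.1c can be reproduced with a short script. The Python sketch below is only an illustration: the variable names are assumptions, the counts are copied from Table 4.1a, and the cumulative percentages are computed from cumulative counts (not by adding the rounded percentages), which appears to be how the published column was derived.

```python
# Counts of reported primary and secondary syphilis cases, copied from Table 4.1a.
age_groups = ["≤14", "15–19", "20–24", "25–29", "30–34",
              "35–39", "40–44", "45–54", "≥55"]
counts = [21, 351, 842, 895, 1097, 1367, 1023, 982, 284]

total = sum(counts)

# Percent of all cases in each age group, rounded to one decimal place.
percents = [round(100 * n / total, 1) for n in counts]

# Cumulative percent, computed from cumulative counts rather than by
# adding the already-rounded percentages.
cumulative_percents = []
running = 0
for n in counts:
    running += n
    cumulative_percents.append(round(100 * running / total, 1))

for group, n, pct, cum in zip(age_groups, counts, percents, cumulative_percents):
    print(f"{group:<6} {n:>6,} {pct:>6.1f} {cum:>6.1f}")

# Sum of the rounded percentages; this can differ slightly from 100.0.
rounded_sum = round(sum(percents), 1)
print("Sum of rounded percentages:", rounded_sum)
```

Because each percentage is rounded separately, the rounded values sum to 99.9 rather than 100.0, which is the situation the table footnote explains.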

Two- and three-variable tables

Tables 4.1a, 4.1b, and 4.1c show case counts (frequency) by a single variable, e.g., age. Data can also be cross-tabulated to show counts by an additional variable. Table 4.2 shows the number of syphilis cases cross-classified by both age group and sex of the patient.

Table 4.2 Reported Cases of Primary and Secondary Syphilis by Age and Sex — United States, 2002

                    NUMBER OF CASES
Age Group (years)   Male      Female    Total
Total               5,268     1,594     6,862
≤14                 9         12        21
15–19               135       216       351
20–24               533       309       842
25–29               668       227       895
30–34               877       220       1,097
35–39               1,121     246       1,367
40–44               845       178       1,023
45–54               825       157       982
≥55                 255       29        284

Data Source: Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2002. Atlanta: U.S. Department of Health and Human Services; 2003.

Epi Info

To create a two-variable table from a data set in Analysis Module:

Select frequencies, then choose variable under Frequencies of. Output shows table with row and column percentages, plus chi-square and p-value. For a two-by-two table, output also provides odds ratio, risk ratio, risk difference and confidence intervals. Note that for a cohort study, the row percentage in cells of ill patients is the attack proportion, sometimes called the attack rate.

A two-variable table with data categorized jointly by those two variables is known as a contingency table. Table 4.3 is an example of a special type of contingency table, in which each of the two variables has two categories. This type of table is called a two-by-two table and is a favorite among epidemiologists. Two-by-two tables are convenient for comparing persons with and without the exposure and those with and without the disease. From these data, epidemiologists can assess the relationship, if any, between the exposure and the disease. Table 4.3 is a two-by-two table that shows one of the key findings from an investigation of carbon monoxide poisoning following an ice storm and prolonged power failure in Maine.(4) In the table, the exposure variable, location of power generator, has two categories — inside or outside the home. Similarly the outcome variable, carbon monoxide poisoning, has two categories — cases (number of persons who became ill) and controls (number of persons who did not become ill).

Table 4.3 Generator Location and Risk of Carbon Monoxide Poisoning After an Ice Storm — Maine, 1998

                                        NUMBER OF
                                        Cases     Controls   Total
Total                                   27        162        189
Generator location
  Inside home or attached structure     23        23         46
  Outside home                          4         139        143

Data Source: Daley RW, Smith A, Paz-Argandona E, Mallilay J, McGeehin M. An outbreak of carbon monoxide poisoning after a major ice storm in Maine. J Emerg Med 2000;18:87–93.

Table 4.4 illustrates a generic format and standard notation for a two-by-two table. Disease status (e.g., ill versus well, sometimes denoted cases vs. controls if a case-control study) is usually designated along the top of the table, and exposure status (e.g., exposed versus not exposed) is designated along the side. The letters a, b, c, and d within the 4 cells of the two-by-two table refer to the number of persons with the disease status indicated above and the exposure status indicated to its left. For example, in Table 4.4, “c” represents the number of persons in the study who are ill but who did not have the exposure being studied. Note that the “Hi” represents horizontal totals; H1 and H0 represent the total number of exposed and unexposed persons, respectively. The “Vi” represents vertical totals; V1 and V0 represent the total number of ill and well persons (or cases and controls), respectively. The total number of subjects included in the two-by-two table is represented by the letter T (or N).

Table 4.4 General Format and Notation for a Two-by-Two Table

            Ill           Well          Total         Attack Rate (Risk)
Total       a + c = V1    b + d = V0    T             V1 / T
Exposed     a             b             a + b = H1    a / (a + b)
Unexposed   c             d             c + d = H0    c / (c + d)
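
As a rough illustration of this notation, the Python sketch below fills in a, b, c, and d from Table 4.3 and derives the marginal totals. It also computes the odds ratio as the cross-product (a × d) / (b × c); that standard formula is not stated in this section and is included here as an assumption about the measure a reader would use for a case-control study. Attack rates are left out because, in a case-control study, the investigators choose how many controls to enroll. The variable names are hypothetical.

```python
# Cell counts from Table 4.3, using the notation of Table 4.4.
a = 23   # cases, generator inside home or attached structure (exposed)
b = 23   # controls, exposed
c = 4    # cases, generator outside home (unexposed)
d = 139  # controls, unexposed

# Horizontal (row) totals: exposed and unexposed.
h1 = a + b
h0 = c + d

# Vertical (column) totals: ill (cases) and well (controls).
v1 = a + c
v0 = b + d

# Grand total.
t = h1 + h0

# Odds ratio (cross-product ratio), the usual measure for a case-control study.
odds_ratio = (a * d) / (b * c)

print(f"H1 = {h1}, H0 = {h0}, V1 = {v1}, V0 = {v0}, T = {t}")
print(f"Odds ratio = {odds_ratio:.2f}")
```

The marginal totals match the Total row and column of Table 4.3 (46, 143, 27, 162, and 189).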

When producing a table for display in print or by projection, it is generally best to limit the number of variables to one or two. One exception to this rule occurs when a third variable modifies the effect (technically, produces an interaction) of the first two. Table 4.5 is intended to convey the way in which race/ethnicity may modify the effect of age and sex on incidence of syphilis. Because three-way tables are often hard to understand, they should be used only when ample explanation and discussion are possible.

Table 4.5 Number of Reported Cases of Primary and Secondary Syphilis, by Race/Ethnicity, Age, and Sex — United States, 2002

Race/ethnicity                    Age Group (years)  Male    Female  Total
American Indian/Alaskan Native    ≤14                1       0       1
                                  15–19              0       1       1
                                  20–24              5       3       8
                                  25–29              3       1       4
                                  30–34              1       2       3
                                  35–39              3       5       8
                                  40–44              4       3       7
                                  45–54              8       8       16
                                  ≥55                2       1       3
                                  Total              27      24      51
Asian/Pacific Islander            ≤14                1       1       2
                                  15–19              0       2       2
                                  20–24              9       4       13
                                  25–29              16      1       17
                                  30–34              21      1       22
                                  35–39              14      1       15
                                  40–44              14      1       15
                                  45–54              8       0       8
                                  ≥55                0       0       0
                                  Total              83      11      94
Black, Non-Hispanic               ≤14                3       9       12
                                  15–19              89      164     253
                                  20–24              313     233     546
                                  25–29              322     163     485
                                  30–34              310     166     476
                                  35–39              385     183     568
                                  40–44              305     142     447
                                  45–54              370     112     482
                                  ≥55                129     23      152
                                  Total              2,226   1,195   3,421
Hispanic                          ≤14                1       1       2
                                  15–19              37      25      62
                                  20–24              117     29      146
                                  25–29              139     26      165
                                  30–34              172     20      192
                                  35–39              178     22      200
                                  40–44              93      9       102
                                  45–54              69      14      83
                                  ≥55                18      1       19
                                  Total              824     147     971
White, Non-Hispanic               ≤14                3       1       4
                                  15–19              9       24      33
                                  20–24              89      40      129
                                  25–29              188     36      224
                                  30–34              373     31      404
                                  35–39              541     35      576
                                  40–44              429     23      452
                                  45–54              370     23      393
                                  ≥55                106     4       110
                                  Total              2,108   217     2,325

Data Source: Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2002. Atlanta: U.S. Department of Health and Human Services; 2003. p. 118.

Exercise 4.1

The data in Table 4.6 describe characteristics of the 38 persons who ate food at or from a church supper in Texas in August 2001. Fifteen of these persons later developed botulism.(5)

  1. Construct a table of the illness (botulism) by age group. Use botulism status (yes/no) as the column labels and age groups as the row labels.
  2. Construct a two-by-two table of the illness (botulism) by exposure to chicken.
  3. Construct a two-by-two table of the illness (botulism) by exposure to chili.
  4. Construct a three-way table of illness (botulism) by exposure to chili and chili leftovers.

Check your answer.

Table 4.6 Line Listing for Exercise 4.1

ID  Age  Attended Supper  Case  Date of Onset  Case Status    Ate Any Food  Ate Chili  Ate Chicken  Ate Chili Leftovers
1   1    Y                N     -                             Y             Y          Y            N
2   3    Y                Y     8/27           Lab-confirmed  Y             Y          N            N
3   7    Y                Y     8/31           Lab-confirmed  Y             Y          N            N
4   7    Y                N     -                             Y             Y          Y            N
5   10   Y                N     -                             Y             Y          N            Y
6   17   Y                Y     8/28           Lab-confirmed  Y             Y          Y            N
7   21   Y                N     -                             N             N          N            N
8   23   Y                N     -                             Y             Y          N            N
9   25   Y                Y     8/26           Epi-linked     Y             Y          N            N
10  29   N                Y     8/28           Lab-confirmed  Y             Unk        Unk          Y
11  38   Y                N     -                             N             N          N            N
12  39   Y                N     -                             N             N          N            N
13  41   Y                N     -                             Y             Y          Y            N
14  41   Y                N     -                             N             N          N            N
15  42   Y                Y     8/26           Lab-confirmed  Y             Y          Unk          N
16  45   Y                Y     8/26           Lab-confirmed  Y             Y          Y            Y
17  45   Y                Y     8/27           Epi-linked     Y             Y          Y            N
18  46   Y                N     -                             Y             N          Y            N
19  47   Y                N     -                             Y             N          Y            N
20  48   Y                Y     9/1            Lab-confirmed  Y             Y          Unk          N
21  50   Y                Y     8/29           Epi-linked     Y             Y          N            N
22  50   Y                N     -                             Y             N          Y            N
23  50   Y                N     -                             Y             N          N            Y
24  52   Y                Y     8/28           Lab-confirmed  Y             Y          Y            N
25  52   Y                N     -                             N             N          N            N
26  53   Y                Y     8/27           Epi-linked     Y             Y          Y            N
27  53   Y                N     -                             Y             Y          Y            N
28  62   Y                Y     8/27           Epi-linked     Y             Y          Y            N
29  62   Y                N     -                             Y             N          Y            N
30  63   Y                N     -                             N             N          N            N
31  67   Y                N     -                             N             N          N            N
32  68   Y                N     -                             N             N          N            N
33  69   Y                N     -                             Y             Y          Y            N
34  71   Y                N     -                             Y             N          Y            N
35  72   Y                Y     8/27           Lab-confirmed  Y             Y          Y            N
36  74   Y                N     -                             Y             Y          N            N
37  74   Y                N     -                             Y             N          Y            N
38  78   Y                Y     8/25           Epi-linked     Y             Y          Y            N

Data Source: Kalluri P, Crowe C, Reller M, Gaul L, Hayslett J, Barth S, Eliasberg S, Ferreira J, Holt K, Bengston S, Hendricks K, Sobel J. An outbreak of foodborne botulism associated with food sold at a salvage store in Texas. Clin Infect Dis 2003;37:1490–5.

Tables of statistical measures other than frequency

Tables 4.1–4.5 show case counts (frequency). The cells of a table could also display averages, rates, relative risks, or other epidemiological measures. As with any table, the title and/or headings must clearly identify what data are presented. For example, the title of Table 4.7 indicates that the data for reported cases of primary and secondary syphilis are rates rather than numbers.

Table 4.7 Rate per 100,000 Population for Reported Cases of Primary and Secondary Syphilis, by Age and Race — United States, 2002

Age Group   Am. Indian/     Asian/        Black,         Hispanic   White,         Total
(years)     Alaska Native   Pacific Is.   Non-Hispanic              Non-Hispanic
10–14       0.0             0.1           0.3            0.1        0.0            0.1
15–19       0.5             0.2           8.6            1.9        0.3            1.7
20–24       5.0             1.5           20.7           4.3        1.1            4.4
25–29       2.7             1.6           19.1           4.9        1.8            4.6
30–34       2.0             2.2           18.2           6.1        3.0            5.4
35–39       4.8             1.6           20.1           7.1        3.6            6.0
40–44       4.5             1.6           16.6           4.4        2.8            4.6
45–54       6.1             0.6           11.8           2.7        1.4            2.6
55–64       1.4             0.0           4.6            0.6        0.5            0.9
65+         0.8             0.0           1.5            0.5        0.1            0.2
Totals      2.4             0.9           9.8            2.7        1.2            2.4

Data Source: Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2002. Atlanta: U.S. Department of Health and Human Services; 2003.

Composite tables

To conserve space in a report or manuscript, several tables are sometimes combined into one. For example, epidemiologists often create simple frequency distributions by age, sex, and other demographic variables as separate tables, but editors may combine them into one large composite table for publication. Table 4.8 is an example of a composite table from the investigation of carbon monoxide poisoning following the power failure in Maine.(4)

It is important to realize that this type of table should not be interpreted as a three-way table. The data in Table 4.8 have not been arrayed to indicate the interrelationship of sex, age, smoking, and disposition from medical care. Rather, several one-variable tables (each independently showing the number of cases by one of these variables) have simply been concatenated to save space. So this table would not help in assessing, for example, how smoking modifies the risk of illness by age. This difference also explains why portraying total values would be inappropriate and meaningless for Table 4.8.

Table 4.8 Number and Percentage of Confirmed Cases of Carbon Monoxide Poisoning Identified from Four Hospitals, by Selected Characteristics — Maine, January 1998

                          CASES
Characteristic            Number    Percent
Total cases               100       100
Sex (female)              59        59
Age (years)
  0–3                     5         5
  4–12                    17        17
  13–18                   9         9
  19–64                   52        52
  ≥65                     17        17
Smokers                   20        20
Disposition
  Released from ED *      83        83
  Admitted to hospital    11        11
  Transferred             5         5
  Died                    1         1

* ED = Emergency department

Data Source: Daley RW, Smith A, Paz-Argandona E, Mallilay J, McGeehin M. An outbreak of carbon monoxide poisoning after a major ice storm in Maine. J Emerg Med 2000;18:87–93.

Table shells

Although you cannot analyze data before you have collected them, epidemiologists anticipate and design their analyses in advance to delineate what the study is going to convey, and to expedite the analysis once the data are collected. In fact, most protocols, which are written before a study can be conducted, require a description of how the data will be analyzed. As part of the analysis plan, you can develop table shells that show how the data will be organized and displayed. Table shells are tables that are complete except for the data. They show titles, headings, and categories. In developing table shells that include continuous variables such as age, we create more categories than we may later use, in order to disclose any interesting patterns and quirks in the data.

The following table shells were designed before conducting a case-control study of fractures related to falls in community-dwelling elderly persons. The researchers were particularly interested in assessing whether vigorous and/or mild physical activity was associated with a lower risk of fall-related fractures.

Table shells of epidemiologic studies usually follow a standard sequence from descriptive to analytic. The first and second tables in the sequence usually cover clinical features of the health event and demographic characteristics of the subjects. Next, the analyst portrays the association of most interest to the researchers, in this case, the association between physical activity and fracture. Subsequent tables may present stratified or adjusted analyses, refinements, and subset analyses. Of course, once the data are available and used for these tables, additional analyses will come to mind and should be pursued.

This sequence of table shells provides a systematic and logical approach to the analysis. The first two tables (Table shells 4.9a and 4.9b), describing the health problem of interest and the population studied, provide the background a reader would need to put the analytic results in perspective.

Table Shell 4.9a Anatomic Site of Fall-related Fractures Sustained by Participants, SAFE Study — Miami, 1987–1989

Fracture Site                       Number  (Percent)
Skull                               ____    (    )
Spine                               ____    (    )
Clavicle (collarbone)               ____    (    )
Scapula (shoulderblade)             ____    (    )
Humerus (upper arm)                 ____    (    )
Radius / ulna (lower arm)           ____    (    )
Bones of the hand                   ____    (    )
Ribs, sternum                       ____    (    )
Pelvis                              ____    (    )
Neck of femur (hip)                 ____    (    )
Other parts of femur (upper leg)    ____    (    )
Patella (knee)                      ____    (    )
Tibia / fibula (lower leg)          ____    (    )
Ankle                               ____    (    )
Bones of the foot                   ____    (    )

Adapted from: Stevens, JA, Powell KE, Smith SM, Wingo PA, Sattin RW. Physical activity, functional limitations, and the risk of fall-related fractures in community-dwelling elderly. Annals of Epidemiology 1997;7:54–61.

Table Shell 4.9b Selected Characteristics of Case and Control Participants, SAFE Study — Miami, 1987–1989

                                              CASES               CONTROLS
                                              Number  (Percent)   Number  (Percent)
Age                           65–74           ____    (    )      ____    (    )
                              75–84           ____    (    )      ____    (    )
                              ≥85             ____    (    )      ____    (    )
Sex                           Male            ____    (    )      ____    (    )
                              Female          ____    (    )      ____    (    )
Race                          White           ____    (    )      ____    (    )
                              Black           ____    (    )      ____    (    )
                              Other           ____    (    )      ____    (    )
                              Unknown         ____    (    )      ____    (    )
Ethnicity                     Hispanic        ____    (    )      ____    (    )
                              Non-Hispanic    ____    (    )      ____    (    )
                              Unknown         ____    (    )      ____    (    )
Hours/day spent on feet       <1              ____    (    )      ____    (    )
                              2–4             ____    (    )      ____    (    )
                              5–7             ____    (    )      ____    (    )
                              ≥8              ____    (    )      ____    (    )
Smoking status                Never smoked    ____    (    )      ____    (    )
                              Former smoker   ____    (    )      ____    (    )
                              Current smoker  ____    (    )      ____    (    )
                              Unknown         ____    (    )      ____    (    )
Alcohol use (drinks / week)   None            ____    (    )      ____    (    )
                              <1              ____    (    )      ____    (    )
                              1–3             ____    (    )      ____    (    )
                              ≥4              ____    (    )      ____    (    )
                              Unknown         ____    (    )      ____    (    )

Adapted from: Stevens, JA, Powell KE, Smith SM, Wingo PA, Sattin RW. Physical activity, functional limitations, and the risk of fall-related fractures in community-dwelling elderly. Annals of Epidemiology 1997;7:54–61.

Now that the data in Table shells 4.9a and 4.9b have illustrated descriptive characteristics of cases and controls in this study, we are ready to refine the analysis by demonstrating the variability of the data as assessed by statistical confidence intervals. Because of the study design in this example, we have chosen the odds ratio to assess statistical differences (see Lesson 3). Table shell 4.9c illustrates a useful display for this information.

Table Shell 4.9c Relationship Between Physical Activity (Vigorous and Mild) and Fracture, SAFE Study — Miami, 1987–1989

                          CASES               CONTROLS            Odds Ratio
                          Number  (Percent)   Number  (Percent)   (95% Confidence Interval)
Vigorous Activity   Yes   ____    (    )      ____    (    )      ____ (____ - ____)
                    No    ____    (    )      ____    (    )
Mild Activity       Yes   ____    (    )      ____    (    )      ____ (____ - ____)
                    No    ____    (    )      ____    (    )

Adapted from: Stevens, JA, Powell KE, Smith SM, Wingo PA, Sattin RW. Physical activity, functional limitations, and the risk of fall-related fractures in community-dwelling elderly. Annals of Epidemiology 1997;7:54–61.

Creating class intervals

Conventional Rounding Rules

If a fraction is greater than .5, round it up (e.g., round 6.6 to 7).

If a fraction is less than .5, round it down (e.g., round 6.4 to 6).

If a fraction is exactly .5, it is recommended that you round it to the even value (e.g., round both 5.5 and 6.5 to 6). More common and also acceptable is to round it up (e.g., round 6.5 to 7).
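
The two conventions for an exact .5 can be compared directly in code. The Python sketch below uses the standard `decimal` module; the helper names `round_half_even` and `round_half_up` are hypothetical, and values are passed as strings so that fractions such as 6.5 are represented exactly.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_half_even(value: str) -> int:
    """Round to a whole number; exact halves go to the nearest even value."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

def round_half_up(value: str) -> int:
    """Round to a whole number; exact halves always go up."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# The examples from the rounding rules above.
for text in ["6.6", "6.4", "5.5", "6.5"]:
    print(text, round_half_even(text), round_half_up(text))
```

The two rules agree except when the fraction is exactly .5 and the whole-number part is even, as with 6.5. Whichever rule is chosen should be applied consistently throughout a table.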

If the epidemiologic hypothesis for the investigation involves variables such as “gender” or “exposure to a risk factor (yes/no),” the construction of tables as described thus far in this chapter should be straightforward. Often, however, the presumed risk factor may not be so conveniently packaged. We may need to investigate an infection acquired as a result of hospitalization and “days of hospitalization” may be relevant; for many chronic conditions, blood pressure is an important factor; if we are interested in the effect of alcohol consumption on health risk, number of drinks per week may be an important measurement. These examples illustrate relevant variables that have a broader range of possible responses than are easily handled by the methods described earlier in this chapter. One solution in this case is to create class intervals for your data, keeping the following guidelines in mind:

CDC's National Center for Health Statistics uses the following age categorizations:

<1 infants
1–4 toddlers
5–14 adolescents
15–24 teens and young adults
25–44 adults
45–64 older adults
≥65 elderly
  • Class intervals should be mutually exclusive and exhaustive. In plain language, that means that each individual in your data set should fit uniquely into one class interval, and all persons should fit into some class interval. So, for example, age ranges should not overlap. Most measures follow conventional rounding rules (see sidebar).

    A general tip is to use a large number of class intervals for the initial analysis to gain an appreciation for the variability of your data. You can combine your categories later.
  • Use principles of biologic plausibility when constructing categories. For example, when analyzing infant and childhood mortality, we might use categories of 0–12 months (since neonatal problems differ epidemiologically from other childhood problems), 1–5 years (since deaths in this group result primarily from causes outside of institutions), and 5–10 years (since these deaths may result from risks in school settings). Table 4.10 illustrates age groups that are sensible for the study of various behaviorally related health conditions.
  • A natural baseline group should be kept as a distinct category. Often the baseline group will include those who have not had an exposure, e.g., non-smokers (0 cigarettes per day).
  • If you wish to calculate rates to illustrate the relative risk of adverse health events by these categories of risk factors, be sure that the intervals you choose for the classes of your data are the same as the intervals for the denominators that you will find for readily available data. For example, to compute rates of infant mortality by maternal age, you must find data on the number of live-born infants to women; in determining age groupings, consider what categories are used by the United States Census Bureau.
  • Always consider a category for “unknown” or “not stated.”
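
The first and last guidelines above (mutually exclusive and exhaustive intervals, plus an "unknown" category) can be sketched in code. The Python example below is a minimal illustration, not a prescribed method: it uses the NCHS age groupings from the sidebar, treats the oldest group as age 65 and older (an assumption made so that every age has a category), and works on a hypothetical list of ages. The names `LOWER_LIMITS`, `LABELS`, and `age_category` are invented for this sketch.

```python
from bisect import bisect_right
from collections import Counter

# Lower limits of every class interval after the first (ages in whole years).
LOWER_LIMITS = [1, 5, 15, 25, 45, 65]
LABELS = ["<1", "1–4", "5–14", "15–24", "25–44", "45–64", "≥65"]

def age_category(age):
    """Return the single class interval that contains this age."""
    if age is None:
        return "Unknown"
    if age < 0:
        raise ValueError("age cannot be negative")
    # bisect_right counts how many lower limits are at or below the age,
    # which is the index of the interval the age belongs to.
    return LABELS[bisect_right(LOWER_LIMITS, age)]

# Hypothetical ages, including one missing value.
ages = [0, 3, 5, 14, 15, 44, 45, 64, 65, 90, None]
table = Counter(age_category(age) for age in ages)

for label in LABELS + ["Unknown"]:
    print(f"{label:<8} {table[label]}")
```

Because each age maps to exactly one label, the category counts always add up to the number of records, which is a quick check that the intervals neither overlap nor leave gaps.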

Table 4.10 Age Groupings Used for Different Conditions, as Reported in Surveillance Summaries, CDC, 2003

Overweight In   Traumatic Brain   Pregnancy-Related   HIV/AIDS(10)   Vaccine Adverse
Adults(7)       Injury(8)         Mortality(9)                       Events(11)
18–24 years     <4 years          ≤19 years           <13 years      <1 year
25–34           5–14              20–24               13–14          1–6
35–44           15–19             25–29               15–24          7–17
45–54           20–24             30–34               25–34          18–64
55–64           25–34             35–39               35–44          ≥65
65–74           35–44             ≥40                 45–54
≥75             45–64                                 55–64
                ≥65                                   ≥65
Total           Total             Total               Total          Total

In addition to these guidelines for creating class intervals, the analyst must decide how many intervals to portray. If no natural or standard class intervals are apparent, the strategies below may be helpful.

Strategy 1: Divide the data into groups of similar size

A particularly appropriate approach if you plan to create area maps (see later section on Maps) is to create a number of class intervals, each with the same number of observations. For example, to portray the rates of incidence of lung cancer by state (for men, 2001), one might group the rates into four class intervals, each with 10–12 observations:

Table 4.11 Rates of Lung Cancer in Men, 2001 by State (and the District of Columbia)

Rate            Number of States in the US   Cumulative Frequency
22.1–48.3       11                           11
48.4–53.3       11                           22
53.4–58.7       12                           34
58.8–73.3       10                           44
Missing data    7                            51

Data Source: U.S. Cancer Statistics Working Group. United States Cancer Statistics: 2002 Incidence and Mortality. Atlanta: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2005.

Strategy 2: Base intervals on mean and standard deviation

With this strategy, you can create three, four, or six class intervals. First, calculate the mean and standard deviation of the distribution of data. (Lesson 2 covers the calculation of these measures.) Then use the mean plus or minus different multiples of the standard deviation to establish the upper limits for the intervals. This strategy is most appropriate for large data sets. For example, let's suppose you are investigating a scoring system for preparedness of health departments to respond to emerging and urgent threats. You have devised a series of evaluation questions ranging from 0 to 100, with 100 being highest. You conduct a survey and find that the scores for health departments in your jurisdiction range from 19 to 82; the mean of the scores is 50, and the standard deviation is 10. Here, the strategy for establishing six intervals for these data specifies:

Upper limit of interval 6 = maximum value = 82
Upper limit of interval 5 = 50 + 20 = 70
Upper limit of interval 4 = 50 + 10 = 60
Upper limit of interval 3 = 50
Upper limit of interval 2 = 50 − 10 = 40
Upper limit of interval 1 = 50 − 20 = 30
Lower limit of interval 1 = 19

If you then select the obvious lower limit for each upper limit, you have the six intervals:

Interval 6 = 71–82
Interval 5 = 61–70
Interval 4 = 51–60
Interval 3 = 41–50
Interval 2 = 31–40
Interval 1 = 19–30

You can create three or four intervals by combining some of the adjacent six-interval limits.
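
The arithmetic for the six intervals can be written as a short function. The Python sketch below is illustrative only; the function name `sd_based_intervals` and its `step` parameter are assumptions, and it presumes that the minimum lies below the mean minus two standard deviations and the maximum lies above the mean plus two standard deviations, as in the example above.

```python
def sd_based_intervals(minimum, maximum, mean, sd, step=1):
    """Six class intervals with upper limits at mean - 2 SD, mean - 1 SD,
    mean, mean + 1 SD, mean + 2 SD, and the maximum value.

    `step` is the smallest difference between recorded values (1 for
    whole-number scores); each lower limit is one step above the
    previous upper limit, so the intervals do not overlap.
    """
    upper_limits = [mean - 2 * sd, mean - sd, mean,
                    mean + sd, mean + 2 * sd, maximum]
    intervals = []
    lower = minimum
    for upper in upper_limits:
        intervals.append((lower, upper))
        lower = upper + step
    return intervals

# Values from the health department preparedness example.
intervals = sd_based_intervals(minimum=19, maximum=82, mean=50, sd=10)
for number, (lower, upper) in enumerate(intervals, start=1):
    print(f"Interval {number} = {lower}–{upper}")
```

For rates reported to one decimal place, `step` would be 0.1 and the limits would need rounding to one decimal place to avoid floating-point artifacts.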

Strategy 3: Divide the range into equal class intervals

This method is the simplest and most commonly used, and is most readily adapted to graphs. The selection of groups or categories is often arbitrary, but must be consistent (for example, age groups by 5 or 10 years throughout the data set). To use equal class intervals, do the following:

  1. Find the range of the values in your data set. That is, find the difference between the maximum value (or some slightly larger convenient value) and zero (or the minimum value).
  2. Decide how many class intervals (groups or categories) you want to have. For tables, choose between four and eight class intervals. For graphs and maps, choose between three and six class intervals. The number will depend on what aspects of the data you want to highlight.
  3. Find what size of class interval to use by dividing the range by the number of class intervals you have decided on.
  4. Begin with the minimum value as the lower limit of your first interval and specify class intervals of whatever size you calculated until you reach the maximum value in your data.

For example, to display 52 observations, say the percentage of men over age 40 screened for prostate cancer within the past two years in 2004 by state (including Puerto Rico and the District of Columbia), you could create five categories, each containing the number of states with percentages of screened men in the given range.

Table 4.12 Percentage of Men Over Age 40 Screened for Prostate Cancer, by State (including Puerto Rico and the District of Columbia), 2004

Percentage    Number of States   Cumulative Frequency
40.0–44.9     3                  3
45.0–49.9     18                 21
50.0–54.9     25                 46
55.0–59.9     5                  51
60.0–64.9     1                  52

Data Source: Behavioral Risk Factor Surveillance System [Internet]. Atlanta: Centers for Disease Control and Prevention. Available from: http://www.cdc.gov/brfss.
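
The equal-interval steps can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the range is taken as the convenient round values 40.0 to 65.0, which yields the same five lower limits as Table 4.12, and the screening percentages are hypothetical, not the data behind that table. The function names are invented for this sketch.

```python
from itertools import accumulate

def equal_width_limits(lowest, highest, n_intervals):
    """Lower limits of equal class intervals, plus the final upper bound."""
    width = (highest - lowest) / n_intervals
    return [lowest + i * width for i in range(n_intervals + 1)]

def count_by_interval(values, limits):
    """Count the values that fall in each interval [limits[i], limits[i + 1])."""
    counts = [0] * (len(limits) - 1)
    for value in values:
        for i in range(len(counts)):
            if limits[i] <= value < limits[i + 1]:
                counts[i] += 1
                break
    return counts

# Range chosen as convenient round values: 40.0 up to (but not including) 65.0.
limits = equal_width_limits(40.0, 65.0, 5)

# Hypothetical screening percentages, not the data behind Table 4.12.
values = [41.2, 47.5, 49.9, 50.0, 52.3, 58.1, 63.4]
counts = count_by_interval(values, limits)
cumulative = list(accumulate(counts))

for i, (n, cum) in enumerate(zip(counts, cumulative)):
    print(f"{limits[i]:.1f} to <{limits[i + 1]:.1f}: {n} (cumulative {cum})")
```

Treating each interval as closed at the lower limit and open at the upper limit keeps the categories mutually exclusive; a value of exactly 50.0 is counted only in the 50.0 interval.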

EXAMPLE: Creating Class Interval Categories

Use each strategy to create four class interval categories by using the lung cancer mortality rates shown in Table 4.13.

Table 4.13 Age-adjusted Lung Cancer Death Rates per 100,000 population, in Rank Order by State — United States, 2000

Rank    State             Rate per 100,000
1       Kentucky          116.1
2       Mississippi       111.7
3       West Virginia     104.1
4       Tennessee         103.4
5       Alabama           100.8
6       Louisiana         99.2
7       Arkansas          99.1
8       North Carolina    94.6
9       Georgia           93.2
10      South Carolina    92.4
11      Indiana           91.6
12      Oklahoma          89.4
13      Missouri          88.5
14      Ohio              85.6
15      Virginia          83.0
16      Maine             80.2
17      Illinois          80.0
18      Texas             79.3
19      Maryland          79.2
20      Nevada            78.7
21      Delaware          78.2
22      Rhode Island      77.9
23      Iowa              77.0
24      Michigan          76.7
25      Pennsylvania      76.5
26      Florida           75.3
27      Kansas            74.5
28      Massachusetts     73.6
29      Alaska            72.9
30      Oregon            72.7
31      New Hampshire     71.2
32      New Jersey        71.2
33      Washington        71.2
34      Vermont           70.2
35      South Dakota      68.1
36      Wisconsin         67.0
37      Montana           66.5
38      Connecticut       66.4
39      New York          66.2
40      Nebraska          65.6
41      North Dakota      64.9
42      Wyoming           64.4
43      Arizona           62.0
44      Minnesota         60.7
45      California        60.1
46      Idaho             59.7
47      New Mexico        52.3
48      Colorado          52.1
49      Hawaii            49.8
50      Utah              39.7
Total   United States     76.9

Data Source: Stewart SL, King JB, Thompson TD, Friedman C, Wingo PA. Cancer Mortality–United States, 1990-2000. In: Surveillance Summaries, June 4, 2004. MMWR 2004;53 (No. SS-3):23–30.

Strategy 1: Divide the data into groups of similar size

(Note: If the states in Table 4.13 had been listed alphabetically rather than in rank order, the first step would have been to sort the data into rank order by rate. Fortunately, this has already been done.)

  1. Divide the list into four equal sized groups of places:

    50 states ⁄ 4 = 12.5 states per group. Because states can't be cut in half, use two groups of 12 states and two groups of 13 states. Missouri (#13) could go into either the first or second group and Connecticut (#38) could go into either third or fourth group. Arbitrarily putting Missouri in the second category and Connecticut into the third results in the following groups:
    1. Kentucky through Oklahoma (States 1–12)
    2. Missouri through Pennsylvania (States 13–25)
    3. Florida through Connecticut (States 26–38)
    4. New York through Utah (States 39–50)
  2. Identify the rate for the first and last state in each group:
    1. Oklahoma through Kentucky 89.4–116.1
    2. Pennsylvania through Missouri 76.5–88.5
    3. Connecticut through Florida 66.4–75.3
    4. Utah through New York 39.7–66.2


  3. Adjust the limits of each interval so that no gap exists between the end of one class interval and the beginning of the next. Deciding how to adjust the limits is somewhat arbitrary; you could split the difference or use a convenient round number.
    1. Oklahoma through Kentucky 89.0–116.1
    2. Pennsylvania through Missouri 76.0–88.9
    3. Connecticut through Florida 66.3–75.9
    4. Utah through New York 39.7–66.2
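
The first two steps of Strategy 1 can be sketched in a few lines of Python. This is only an illustration, not part of the original lesson: the variable names are made up for the example, and the group sizes of 12, 13, 13, and 12 simply reproduce the arbitrary placement of Missouri and Connecticut described above. The rates are those listed in Table 4.13, in rank order.

```python
# Lung cancer mortality rates per 100,000 from Table 4.13,
# already sorted in rank order (highest rate first).
rates = [
    116.1, 111.7, 104.1, 103.4, 100.8, 99.2, 99.1, 94.6, 93.2, 92.4,
    91.6, 89.4, 88.5, 85.6, 83.0, 80.2, 80.0, 79.3, 79.2, 78.7,
    78.2, 77.9, 77.0, 76.7, 76.5, 75.3, 74.5, 73.6, 72.9, 72.7,
    71.2, 71.2, 71.2, 70.2, 68.1, 67.0, 66.5, 66.4, 66.2, 65.6,
    64.9, 64.4, 62.0, 60.7, 60.1, 59.7, 52.3, 52.1, 49.8, 39.7,
]

# Assumption: 50 states cannot be split into four groups of 12.5,
# so use sizes 12, 13, 13, 12 (Missouri in group 2, Connecticut in group 3).
group_sizes = [12, 13, 13, 12]

# Step 1: cut the ranked list into consecutive groups of those sizes.
groups = []
start = 0
for size in group_sizes:
    groups.append(rates[start:start + size])
    start += size

# Step 2: lowest and highest rate in each group (gaps not yet closed).
ranges = [(min(group), max(group)) for group in groups]

for number, (low, high) in enumerate(ranges, start=1):
    print(f"Group {number}: {low}-{high}")
```

The gaps that remain between groups (for example, between 88.5 and 89.4) still have to be closed by hand, because that choice is a matter of judgment rather than calculation.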

Strategy 2: Base intervals on mean and standard deviation

  1. Calculate the mean and standard deviation (see Lesson 2 for instructions on calculating these measures):
    Mean = 77.1
    Standard deviation = 16.1
  2. Find the upper limits of four intervals
    1. Upper limit of interval 4 = maximum value = 116.1
    2. Upper limit of interval 3 = mean + 1 standard deviation = 77.1 + 16.1 = 93.2
    3. Upper limit of interval 2 = mean = 77.1
    4. Upper limit of interval 1 = mean − 1 standard deviation = 77.1 − 16.1 = 61.0
    5. Lower limit of interval 1 = minimum value = 39.7
  3. Select the lower limit for each upper limit to define four full intervals. Specify the states that fall into each interval. (Note: To place the states with the highest rates first, reverse the order of the intervals):
    1. North Carolina through Kentucky (8 states) 93.3–116.1
    2. Rhode Island through Georgia (14 states) 77.2–93.2
    3. Arizona through Iowa (21 states) 61.1–77.1
    4. Utah through Minnesota (7 states) 39.7–61.0
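
Strategy 2 can be checked with a short Python sketch. It assumes the sample standard deviation (the n − 1 denominator, as computed by `statistics.stdev`), which reproduces the 16.1 shown above; the variable names are illustrative only. Intervals are built in ascending order here, so the counts come out lowest interval first.

```python
import statistics
from bisect import bisect_left

# Lung cancer mortality rates per 100,000 from Table 4.13.
rates = [
    116.1, 111.7, 104.1, 103.4, 100.8, 99.2, 99.1, 94.6, 93.2, 92.4,
    91.6, 89.4, 88.5, 85.6, 83.0, 80.2, 80.0, 79.3, 79.2, 78.7,
    78.2, 77.9, 77.0, 76.7, 76.5, 75.3, 74.5, 73.6, 72.9, 72.7,
    71.2, 71.2, 71.2, 70.2, 68.1, 67.0, 66.5, 66.4, 66.2, 65.6,
    64.9, 64.4, 62.0, 60.7, 60.1, 59.7, 52.3, 52.1, 49.8, 39.7,
]

# Step 1: mean and (sample) standard deviation, rounded to one decimal.
mean = round(statistics.mean(rates), 1)
sd = round(statistics.stdev(rates), 1)

# Step 2: upper limits of intervals 1, 2, and 3.
# Interval 4 ends at the maximum value, so it needs no cutpoint.
upper_limits = [round(mean - sd, 1), mean, round(mean + sd, 1)]

# Step 3: count the states in each interval. bisect_left places a rate
# equal to an upper limit in the lower interval (for example, 93.2
# falls in interval 3, not interval 4).
counts = [0, 0, 0, 0]
for rate in rates:
    counts[bisect_left(upper_limits, rate)] += 1

print("Mean:", mean, "Standard deviation:", sd)
print("Upper limits of intervals 1-3:", upper_limits)
print("States per interval (lowest interval first):", counts)
```

Unlike Strategy 1, this approach does not try to balance the number of states per group, so the middle intervals hold many more states than the two extremes.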

Strategy 3: Divide the range into equal class intervals

  1. Divide the range from zero (or the minimum value) to the maximum by 4:
    (116.1 − 39.7) ⁄ 4 = 76.4 ⁄ 4 = 19.1
  2. Use multiples of 19.1 to create four categories, starting with 39.7:
    1. 39.7 through (39.7 + 19.1) = 39.7 through 58.8
    2. 58.9 through (39.7 + [2 × 19.1]) = 58.9 through 77.9
    3. 78.0 through (39.7 + [3 × 19.1]) = 78.0 through 97.0
    4. 97.1 through (39.7 + [4 × 19.1]) = 97.1 through 116.1
  3. Final categories:
    1. Arkansas through Kentucky (7 states) 97.1–116.1
    2. Delaware through North Carolina (14 states) 78.0–97.0
    3. Idaho through Rhode Island (25 states) 58.9–77.9
    4. Utah through New Mexico (4 states) 39.7–58.8
  4. Alternatively, because 19.1 is close to 20, intervals 20 units wide might be used to create four categories that look cleaner. For example, the final categories could be:
    1. Arkansas through Kentucky (7 states) 97.0–116.9
    2. Iowa through North Carolina (16 states) 77.0–96.9
    3. Idaho through Michigan (23 states) 57.0–76.9
    4. Utah through New Mexico (4 states) 37.0–56.9
    OR
    1. Alabama through Kentucky (5 states) 100.0–119.9
    2. Illinois through Louisiana (12 states) 80.0–99.9
    3. California through Texas (28 states) 60.0–79.9
    4. Utah through Idaho (5 states) 39.7–59.9
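
The equal-width calculation in Strategy 3 (steps 1 through 3) can be sketched as follows. This is an illustrative example rather than part of the original lesson. One assumption worth flagging: the rates are converted to whole tenths before the arithmetic so that boundary values such as 77.9 are compared exactly instead of being subject to floating-point rounding.

```python
from bisect import bisect_left

# Lung cancer mortality rates per 100,000 from Table 4.13.
rates = [
    116.1, 111.7, 104.1, 103.4, 100.8, 99.2, 99.1, 94.6, 93.2, 92.4,
    91.6, 89.4, 88.5, 85.6, 83.0, 80.2, 80.0, 79.3, 79.2, 78.7,
    78.2, 77.9, 77.0, 76.7, 76.5, 75.3, 74.5, 73.6, 72.9, 72.7,
    71.2, 71.2, 71.2, 70.2, 68.1, 67.0, 66.5, 66.4, 66.2, 65.6,
    64.9, 64.4, 62.0, 60.7, 60.1, 59.7, 52.3, 52.1, 49.8, 39.7,
]

# Work in integer tenths (39.7 -> 397) to keep boundary comparisons exact.
tenths = [round(rate * 10) for rate in rates]
lowest, highest = min(tenths), max(tenths)

# Step 1: divide the range by the number of intervals.
n_intervals = 4
width = (highest - lowest) / n_intervals  # 764 / 4 = 191 tenths, i.e. 19.1

# Step 2: upper limit of each interval, starting from the minimum.
upper_limits = [lowest + width * k for k in range(1, n_intervals + 1)]

# Step 3: count the states in each interval. A rate equal to an upper
# limit stays in the lower interval (58.8 would belong to interval 1).
counts = [0] * n_intervals
for value in tenths:
    counts[bisect_left(upper_limits, value)] += 1

print("Interval width:", width / 10)
print("Upper limits:", [limit / 10 for limit in upper_limits])
print("States per interval (lowest interval first):", counts)
```

Replacing `width` with a round number such as 200 tenths (20.0) and choosing a convenient starting point gives the alternative categories in step 4.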

Exercise 4.2

With the data on lung cancer mortality rates presented in Table 4.13, use each strategy to create three class intervals for the rates.

Check your answer.

References (This Section)

  1. Koschat MA. A case for simple tables. The American Statistician 2005;59:31–40.
  2. Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance, 2002. Atlanta, GA: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, September 2003.
  3. Pierchala C. The choice of age groupings may affect the quality of tabular presentations. ASA Proceedings of the Joint Statistical Meetings; 2002; Alexandria, VA: American Statistical Association; 2002:2697–702.
  4. Daley RW, Smith A, Paz-Argandona E, Mallilay J, McGeehin M. An outbreak of carbon monoxide poisoning after a major ice storm in Maine. J Emerg Med 2000;18:87–93.
  5. Kalluri P, Crowe C, Reller M, Gaul L, Hayslett J, Barth S, Eliasberg S, Ferreira J, Holt K, Bengston S, Hendricks K, Sobel J. An outbreak of foodborne botulism associated with food sold at a salvage store in Texas. Clin Infect Dis 2003;37:1490–5.
  6. Stevens JA, Powell KE, Smith SM, Wingo PA, Sattin RW. Physical activity, functional limitations, and the risk of fall-related fractures in community-dwelling elderly. Ann Epidemiol 1997;7:54–61.
  7. Ahluwalia IB, Mack K, Murphy W, Mokdad AH, Bales VH. State-specific prevalence of selected chronic disease-related characteristics–Behavioral Risk Factor Surveillance System, 2001. In: Surveillance Summaries, August 22, 2003. MMWR 2003;52(No. SS-08):1–80.
  8. Langlois JA, Kegler SR, Butler JA, Gotsch KE, Johnson RL, Reichard AA, et al. Traumatic brain injury-related hospital discharges: results from a 14-state surveillance system. In: Surveillance Summaries, June 27, 2003. MMWR 2003;52(No. SS-04):1–18.
  9. Chang J, Elam-Evans LD, Berg CJ, Herndon J, Flowers L, Seed KA, Syverson CJ. Pregnancy-related mortality surveillance–United States, 1991-1999. In: Surveillance Summaries, February 22, 2003. MMWR 2003;52(No. SS-02):1–8.
  10. Centers for Disease Control and Prevention. HIV/AIDS Surveillance Report, 2003 (Vol. 15). Atlanta, Georgia: US Department of Health and Human Services;2004:1–46.
  11. Zhou W, Pool V, Iskander JK, English-Bullard R, Ball R, Wise RP, et al. Surveillance for safety after immunization: Vaccine Adverse Event Reporting System (VAERS)–1991–2001. In: Surveillance Summaries, January 24, 2003. MMWR 2003;52(No. SS-01):1–24.
