This website is archived for historical purposes and is no longer being maintained or updated.

“The findings and conclusions in this book chapter are those of the author(s) and do not necessarily represent the views of the funding agency.”


by Marta Gwinn, MD, MPH
Muin J. Khoury, MD, PhD


News from the Human Genome Project has captivated scientists and the public, creating high expectations that the findings will yield future health benefits. However, a great deal of work remains before practicing physicians can use information from the Human Genome Project to offer patients individualized estimates of risk and interventions based on their genotypes. Estimates of relative and absolute risk for individual patients, such as in the “hypothetical clinical encounter in 2010” described by Francis Collins (1999), will come only from population-based studies conducted as part of an expanded, interdisciplinary research agenda that includes epidemiology as well as other laboratory, clinical, and social sciences.

Epidemiology is the study of the distribution and determinants of health conditions in human populations. Originally concerned with the investigation and control of infectious diseases, epidemiology has evolved into a powerful tool for analyzing many kinds of health events. Methods developed for population surveys, risk factor studies, and intervention trials also lend themselves to studies in which genetic information is collected and evaluated. Epidemiologic studies can be used to investigate the prevalence of genetic traits, gene-disease associations, and gene-environment interactions, and to evaluate genetic tests.

The traditional domain of genetic epidemiology (Khoury et al, 1993) is gene discovery, a field that has grown rapidly because of advances in statistical genetics and technology driven by the Human Genome Project. New laboratory methods for measuring environmental exposures and biological processes have also expanded the scope of molecular epidemiology (Schulte and Perera, 1993), in which population-based studies are used to measure biologic and environmental exposures and characterize the associated relative, absolute and attributable risks for disease. Furthermore, as more genetic tests are developed and marketed for use in public health and healthcare settings, the methods of applied epidemiology (Brownson and Petitti, 1998) will be important for evaluating the value added by genetic testing. Does genetic information make a difference in intervention, management, or outcome? Is genetic testing cost-effective? Answers to these questions will be important in preventing the misuse of genetic testing, as well as in realizing its potential benefits.

We recently introduced the term human genome epidemiology, or "HuGE" (Khoury and Dorman, 1998), to define the set of epidemiologic methods for describing the population prevalence of genetic variants and their associations with health and disease, the role of gene-gene and gene-environment interactions, and the clinical validity and utility of genetic tests. Table 1 describes the continuum of HuGE investigations from traditional gene discovery to applications in medicine and public health.

Many epidemiologic studies are designed to evaluate the association of a particular disease with specific environmental exposures, personal characteristics, or behaviors. When such associations are found, these exposures, characteristics, or behaviors are called “risk factors” for the disease. Epidemiologic studies may also identify particular genetic variants as disease risk factors. The disease case definition in such studies must be based on explicit, observable criteria, and is analogous to the genetic concept of phenotype. The case definition also specifies a priori criteria used to select study participants. These criteria define the study's target population (e.g., geographic area or ethnicity) and are essential for drawing causal inferences.

The most common designs for epidemiologic studies of risk factor-disease associations are:

  • cross-sectional studies, in which persons are assessed simultaneously for risk factors and disease;
  • cohort studies, in which persons with and without risk factors are followed up for disease; and
  • case-control studies, in which persons with and without disease are studied for risk factors.

Only cohort studies provide a direct estimate of absolute risk, which is the probability of developing disease during a given time period. In cohort studies that evaluate genetic variants as potential risk factors, absolute risk is analogous to the genetic concept of penetrance.

The most basic analysis of risk factor-disease associations in cross-sectional, cohort and case-control studies can be represented as a 2x2 table in which a given risk factor and disease are either present or absent (Table 2). Both the risk ratio (in cohort studies) and the prevalence ratio (in cross-sectional studies) are called measures of relative risk. In case-control studies, the odds ratio (i.e., odds of exposure to the risk factor among cases, divided by the odds of exposure to the risk factor among controls) is identical to the odds ratio for disease in cohort studies. In addition, the odds ratio is a good approximation of the relative risk for rare diseases.
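As a minimal sketch of these calculations, the following Python function computes the risk (or prevalence) ratio and the odds ratio from the four cells of a 2x2 table laid out as in Table 2. The function name and all counts are hypothetical and serve only to illustrate that the odds ratio stays close to the risk ratio when disease is rare and drifts away from it when disease is common.

```python
def two_by_two_measures(a, b, c, d):
    """Return (risk_ratio, odds_ratio) for a 2x2 table laid out as in Table 2.

    a: risk factor present, disease present
    b: risk factor present, disease absent
    c: risk factor absent, disease present
    d: risk factor absent, disease absent
    """
    risk_exposed = a / (a + b)      # E in Table 2
    risk_unexposed = c / (c + d)    # U in Table 2
    risk_ratio = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)
    return risk_ratio, odds_ratio


# Hypothetical counts, not taken from any study.
rare_rr, rare_or = two_by_two_measures(a=20, b=9980, c=10, d=9990)
common_rr, common_or = two_by_two_measures(a=400, b=600, c=200, d=800)

print(f"rare disease:   RR = {rare_rr:.2f}, OR = {rare_or:.2f}")
print(f"common disease: RR = {common_rr:.2f}, OR = {common_or:.2f}")
```

With these made-up counts the risk ratio is 2 in both tables, while the odds ratio is about 2.0 for the rare disease and about 2.7 for the common one.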

Attributable fraction is a measure of the overall contribution of a particular risk factor (e.g., genotype) to the occurrence of disease in a given population. It refers to the fraction of cases that would not have occurred within a certain time period, if the risk factor had been absent. Several formulas for calculating attributable fraction have been proposed; one common formula (Miettinen, 1974) is:

attributable fraction = fc (RR - 1) / RR

where fc is the fraction of cases with the risk factor and RR is the measure of relative risk (or odds ratio, for rare diseases).
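The formula above can be sketched in a few lines of Python. The function name and the input values are hypothetical; they are chosen only to show that a common risk factor with a modest relative risk can account for a larger share of cases than a rare risk factor with a very high relative risk.

```python
def attributable_fraction(fraction_cases_exposed, relative_risk):
    """Attributable fraction = fc * (RR - 1) / RR."""
    if not 0.0 <= fraction_cases_exposed <= 1.0:
        raise ValueError("fraction_cases_exposed must lie between 0 and 1")
    if relative_risk <= 0:
        raise ValueError("relative_risk must be positive")
    return fraction_cases_exposed * (relative_risk - 1) / relative_risk


# Hypothetical inputs, for illustration only.
common_weak = attributable_fraction(fraction_cases_exposed=0.10, relative_risk=2.0)
rare_strong = attributable_fraction(fraction_cases_exposed=0.005, relative_risk=50.0)

print(f"common, weak risk factor: {common_weak:.4f}")
print(f"rare, strong risk factor: {rare_strong:.4f}")
```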

Measures of absolute risk (penetrance), relative risk, and attributable fraction each provide different information about the role of a gene in disease (Khoury et al, 1993). Table 3 provides an example contrasting these measures for the group of APC gene variants implicated in colorectal cancer associated with familial adenomatous polyposis (FAP), and for a distinct APC variant (I1307K) associated with colorectal cancer in Ashkenazi Jews. The FAP-associated variants are nearly completely penetrant, i.e., the absolute risk for colorectal cancer in persons with these variants approaches 1.0. Furthermore, persons with FAP variants have a 30-fold higher relative risk for colorectal cancer compared with persons without them. However, if the APC I1307K variant is causally related to colorectal cancer, we would expect it to have a much higher attributable fraction among Ashkenazi Jews than do the FAP variants (6% vs. 0.4%) because it is much more prevalent.

The same laboratory techniques used to sequence the human genome can be used to create a DNA probe specific to virtually any targeted genetic sequence. Thus, discovery of a new gene variant is all too often followed by a rush to commercialize a genetic test, without careful assessment of the potential benefits and risks. For example, discovery of the APC I1307K variant associated with colon cancer (Laken, 1997) received considerable media attention, and within months one news organization reported: “Researchers have found a new genetic defect present in one of every 17 American Jews that doubles a person's colon cancer risk. The good news is that scientists have developed a blood test, available for $200, that can detect this genetic defect. The test is advisable for everyone in the Ashkenazim population, whether they have a family history of colon cancer or not” (Cancer Research Foundation of America, 1998).

Clearly, this advice is not justified on the basis of the epidemiologic data.

Anticipating that the Human Genome Project will stimulate development of many new genetic tests, the U.S. Department of Health and Human Services has established the Secretary's Advisory Committee on Genetic Testing to address the benefits and challenges of genetic knowledge and genetic testing, and to recommend actions for monitoring the accuracy and effectiveness of genetic tests. Three types of data will be needed to evaluate tests before they are introduced in clinical practice:

Analytic validity: sensitivity, specificity, and predictive value in relation to genotype
Clinical validity: sensitivity, specificity, and predictive value in relation to phenotype
Clinical utility: benefits and risks accruing from both positive and negative tests

In addition, genetic tests in widespread use will require ongoing, post-marketing assessment to determine patterns of use and the population impact of testing.

Sensitivity, specificity, and predictive value are familiar parameters to laboratorians and clinicians who perform or interpret screening tests. They are used to compare a test result with a “gold standard.” The calculation of sensitivity, specificity and predictive value is summarized in Table 4.
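These four parameters can be sketched in Python as follows, using the formulas given with Table 4. The function name, the cell labels spelled out in the docstring, and the counts are assumptions made for illustration; the counts do not come from any real test evaluation.

```python
def screening_measures(a, b, c, d):
    """Sensitivity, specificity, and predictive values from a 2x2 table.

    a: test positive, condition present by the gold standard
    b: test positive, condition absent by the gold standard
    c: test negative, condition present by the gold standard
    d: test negative, condition absent by the gold standard
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
    }


# Hypothetical counts, for illustration only.
measures = screening_measures(a=90, b=30, c=10, d=870)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Whether the result describes analytic or clinical validity depends entirely on what serves as the gold standard: a known genotype in the first case, a clinically defined phenotype in the second.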

The distinction between analytic validity and clinical validity underscores the critical importance of defining a “gold standard” for evaluating test results. Analytic validity for genetic tests is established by examining test results for samples having known genotypes (e.g., by genetic sequencing). Clinical validity is established by examining test results for patients having known phenotypes. Thus the “gold standard” for analytic validity is the result of another laboratory test or series of tests, while the “gold standard” for clinical validity is based on clinical criteria. For DNA-based tests, analytic sensitivity, specificity, and predictive value are extraordinarily high, although one can never assume they are perfect. These parameters are of greatest concern to the people who are responsible for developing, performing, and monitoring the quality of laboratory tests.

Healthcare providers and patients are most interested in clinical validity, particularly clinical predictive value, which addresses the likelihood that a person will develop disease, given that he or she has a positive (or negative) test result. In general, the ability to predict outcomes for individuals depends on an understanding of genotype-phenotype correlation at the population level. Population-based studies are particularly critical for estimating the predictive value of tests for susceptibility genes in complex diseases; however, they are also important for defining the clinical spectrum of so-called single-gene disorders. The use of DNA-based tests is now revealing such disorders as cystic fibrosis to be far more variable than previously understood (Mickle and Cutting, 2000).

Clinical sensitivity is another important consideration for interpreting genetic test results, especially in an ethnically diverse population like that of the United States. The same disease can result from several different variants of the same gene (allelic diversity) or of different genes (locus heterogeneity). With current technology, genetic tests cannot always detect all disease-related variants, especially if there are many of them. The failure to detect all disease-related variants reduces a test's clinical sensitivity. For example, Caucasians have the highest incidence of cystic fibrosis in the United States, and most cases result from the ΔF508 variant of the CFTR gene; however, a study of 148 CFTR alleles from African-Americans with cystic fibrosis found that only 48% contained the ΔF508 variant, while 23% had one of eight other variants not commonly found in Caucasians. Thus, a test for the ΔF508 variant alone most likely would have reduced clinical sensitivity for cystic fibrosis in an African-American population (Macek et al, 1997). In all, more than 800 CFTR variants have been described, many in only one or two families. Genetic tests offered for clinical use may include just a few or dozens of CFTR variants (e.g., see Genetests). The clinical sensitivity of these tests in a given population may be the same or different, depending on the population distribution of CFTR variants.
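To illustrate how the fraction of disease alleles covered by a test panel translates into clinical sensitivity for a recessive disorder, the sketch below uses the allele fractions quoted above (48% for one variant, 48% + 23% for that variant plus eight others). It rests on a simplifying assumption that is not part of the cited study: the two disease alleles carried by an affected person are treated as independent draws from the pool of disease alleles. The function name is hypothetical.

```python
def two_allele_detection(allele_detection_rate):
    """Probabilities that a panel detects both, one, or neither disease allele.

    Assumes, for illustration only, that an affected person's two disease
    alleles are independent draws from the population of disease alleles.
    """
    f = allele_detection_rate
    if not 0.0 <= f <= 1.0:
        raise ValueError("allele_detection_rate must lie between 0 and 1")
    return {"both": f * f, "one": 2 * f * (1 - f), "neither": (1 - f) ** 2}


# Allele fractions quoted in the text for the African-American sample.
single_variant = two_allele_detection(0.48)
nine_variants = two_allele_detection(0.48 + 0.23)

print(f"one-variant test, both alleles found:  {single_variant['both']:.2f}")
print(f"nine-variant test, both alleles found: {nine_variants['both']:.2f}")
```

Under this assumption, a one-variant test would identify both alleles in roughly a quarter of affected persons, and the nine-variant panel in roughly half.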

Population geneticists are accustomed to describing genetic variation in terms of allele frequencies, which can be either measured directly or estimated from observed genotype frequencies using the Hardy-Weinberg principle. This principle expresses the concept that both allele and genotype frequencies will remain constant from generation to generation in an infinitely large, interbreeding population without selection, migration, or mutation. If we consider only a single pair of alleles (A and a) occurring with frequencies p and q respectively, expected frequencies of the three possible genotypes are p² (AA), 2pq (Aa), and q² (aa). However, in clinical and epidemiologic studies, the association of interest is between disease and individual genotype. Therefore, results of such studies should be reported by genotype.
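The Hardy-Weinberg expectations can be sketched in Python as shown below. The function names and the genotype counts are hypothetical. The first function estimates the frequency of allele A by counting alleles in observed genotypes; the second returns the expected genotype frequencies for a given p.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Frequency of allele A estimated by counting alleles in observed genotypes."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_alleles


def hardy_weinberg_frequencies(p):
    """Expected genotype frequencies for alleles A (p) and a (q = 1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


# Hypothetical genotype counts, for illustration only.
p = allele_frequency(n_AA=810, n_Aa=180, n_aa=10)
expected = hardy_weinberg_frequencies(p)

print(f"p = {p:.2f}")
for genotype, frequency in expected.items():
    print(f"{genotype}: {frequency:.4f}")
```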

In general, human disease results from interaction between genetic and environmental factors. Even “genetic” diseases have an environmental component; for example, the mental retardation associated with phenylketonuria occurs only in the presence of dietary phenylalanine. However, genetic variation also contributes to diseases strongly associated with environmental exposures; for example, polymorphisms in the cytochrome P-450 system, involved in oxidative metabolism, influence the risk for lung cancer in tobacco smokers. Teasing out genetic factors in the etiology of common infectious and chronic diseases is a rapidly growing research area.

Epidemiologic studies that measure both genetic and environmental risk factors can estimate their independent effects as well as gene-environment interaction. One approach to this type of analysis is based on a 2x4 table that stratifies the data by both environmental exposure and genotype (Yang and Khoury, 1997). Table 5 presents an example, in which the results of a case-control study of venous thrombosis in young women are stratified by both oral contraceptive use and factor V genotype. In this study, both oral contraceptive use and factor V Leiden (a gene variant associated with hypercoagulability) increased the relative risk for venous thrombosis; however, women with factor V Leiden who also used oral contraceptives had by far the highest relative risk. Because this was a population-based study, the authors could also estimate incidence (absolute risk) of venous thrombosis in women in each group.
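A minimal sketch of the 2x4 analysis follows. Each stratum defined by exposure and genotype is compared with the doubly unexposed reference stratum, and the joint odds ratio is then set against what simple additive and multiplicative combinations of the two separate odds ratios would predict. All counts are hypothetical and are not the data from the venous thrombosis study; the function and variable names are likewise assumptions.

```python
def stratum_odds_ratios(strata, reference):
    """Odds ratio of each stratum relative to the reference stratum.

    strata maps a label to a (cases, controls) pair.
    """
    ref_cases, ref_controls = strata[reference]
    return {
        label: (cases * ref_controls) / (controls * ref_cases)
        for label, (cases, controls) in strata.items()
    }


# Hypothetical case-control counts: (cases, controls) per stratum.
# E = environmental exposure, G = genotype; "+" present, "-" absent.
counts = {
    "E-G-": (40, 100),   # reference stratum
    "E-G+": (8, 4),      # genotype only
    "E+G-": (80, 50),    # exposure only
    "E+G+": (24, 2),     # both
}

odds_ratios = stratum_odds_ratios(counts, reference="E-G-")
additive_expectation = odds_ratios["E-G+"] + odds_ratios["E+G-"] - 1
multiplicative_expectation = odds_ratios["E-G+"] * odds_ratios["E+G-"]

for label, value in odds_ratios.items():
    print(f"{label}: OR = {value:.1f}")
print(f"expected joint OR, additive model:       {additive_expectation:.1f}")
print(f"expected joint OR, multiplicative model: {multiplicative_expectation:.1f}")
```

In this made-up example the joint odds ratio (30) exceeds both the additive (8) and multiplicative (20) expectations, the pattern one would describe as a more than multiplicative interaction.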

Epidemiologic studies incorporating genetic data are subject to all of the usual sources of bias, in addition to potential biases introduced by genotyping. Particular concerns include confounding, misclassification, and type I and type II errors.

  • Confounding exists when a statistical association between genotype and disease is due to underlying associations of both genotype and disease with some other, possibly unmeasured factor in the population, such as ethnicity; in genetic studies, this is sometimes referred to as population stratification.
  • Misclassification is due to measurement error of genotype or other exposures, and is related to the performance characteristics of laboratory methods used (analytic validity).
  • Type I errors are a particular concern in analyses involving multiple comparisons, for example, those evaluating multiple genes in association with a disease outcome.
  • Type II errors may occur in small studies with insufficient power to identify a true association.
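The concern about type I errors in multiple comparisons can be made concrete with a short sketch. It assumes the tests are independent, which is a simplification, and it uses the Bonferroni correction as one common (and conservative) remedy; the text above does not prescribe any particular method. The function names and the choice of 20 tests are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the familywise rate near alpha."""
    return alpha / n_tests


# Hypothetical scenario: 20 gene variants each tested at the 0.05 level.
n_tests = 20
uncorrected = familywise_error_rate(0.05, n_tests)
per_test = bonferroni_threshold(0.05, n_tests)
corrected = familywise_error_rate(per_test, n_tests)

print(f"familywise error rate without correction: {uncorrected:.3f}")
print(f"Bonferroni per-test threshold:            {per_test:.4f}")
print(f"familywise error rate with correction:    {corrected:.3f}")
```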

An important way to avoid erroneous inferences from epidemiologic studies of genetic factors in human disease is to demand replication in other studies and populations. By integrating genetic information from carefully designed epidemiologic studies with data from other disciplines, we can hope to gain new insights into disease etiology and new avenues for intervention (Collins, 2001).

  1. Brownson RC, Petitti DB. Applied epidemiology: theory to practice. New York: Oxford University Press, 1998.
  2. Cancer Research Foundation of America Web site.
  3. Collins FS. Shattuck Lecture. Medical and societal consequences of the Human Genome Project. New Engl J Med 1999;341:28-37.
  4. Collins FS, McKusick VA. Implications of the Human Genome Project for medical science. JAMA. 2001;285:540-544.
  5. Khoury MJ, Beaty TH, Cohen BH. Scope and strategies of genetic epidemiology: analysis of articles published in Genetic Epidemiology, 1984-1991. Genet Epidemiol 1993;10:321-9.
  6. Khoury MJ, Dorman JS. The Human Genome Epidemiology Network (HuGE Net). Am J Epidemiol 1998;148:1-3.
  7. Laken SJ, Petersen GM, Gruber SB, et al. Familial colorectal cancer in Ashkenazim due to a hypermutable tract in APC. Nat Genet 1997;17:79-83.
  8. Macek M Jr, Mackova A, Hamosh A, et al. Identification of common cystic fibrosis mutations in African-Americans with cystic fibrosis increases the detection rate to 75%. Am J Hum Genet 1997;60:1122-7.
  9. Mickle JE, Cutting GR. Genotype-phenotype relationships in cystic fibrosis. Med Clin North Am 2000;84:597-607.
  10. Miettinen OS. Proportion of disease caused or prevented by a given exposure, trait or intervention. Am J Epidemiol 1974; 99:325-32.
  11. Schulte PA, Perera FP (eds.) Molecular epidemiology: principles and practice. New York: Academic Press, 1993.
  12. Yang Q, Khoury MJ. Evolving methods in genetic epidemiology. III. Gene-environment interaction in epidemiologic research. Epidemiol Rev 1997; 19:33-43.


Table 1: The continuum of human genome epidemiology (HuGE)

Field                  | Application              | Examples
Genetic epidemiology   | gene discovery           | linkage analysis, family-based association studies
Molecular epidemiology | gene characterization    | population studies to characterize gene prevalence, relative risks, attributable risks, and gene-environment interaction
Applied epidemiology   | health impact evaluation | studies to evaluate clinical validity and utility of genetic tests

Table 2: Measures of association in a 2x2 table

            |        Disease        |
Risk factor | Present | Absent      | Total
Present     | A       | B           | A+B
Absent      | C       | D           | C+D
Total       | A+C     | B+D         | A+B+C+D

E = proportion of persons exposed to risk factor who have disease = A/(A+B)

U = proportion of persons unexposed to risk factor who have disease = C/(C+D)

Risk ratio or prevalence ratio: RR = E / U = [A/(A+B)] / [C/(C+D)]

Odds ratio: OR = [E/(1 - E)] / [U/(1 - U)] = AD / BC

Table 3. Absolute risk, relative risk, and attributable fraction for variants of the APC gene associated with colorectal cancer.

Allele                | FAP mutations* | APC I1307K†
Genotype prevalence   | 1 / 8,000      | 1 / 15
Absolute risk         |                |
Relative risk         |                |
Attributable fraction |                |

* Bodmer W. Familial adenomatous polyposis (FAP) and its gene, APC. Cytogenet Cell Genet 1999;86:99-104.

† In Ashkenazi Jews only. Woodage T, King SM, Wacholder S, et al. The APC I1307K allele and cancer risk in a community-based study of Ashkenazi Jews. Nat Genet 1998;20:62-5.

Table 4: Sensitivity, specificity, and predictive value in a 2x2 table

            |   “Gold Standard”   |
Test result | Present | Absent    | Total
Positive    | A       | B         | A+B
Negative    | C       | D         | C+D
Total       | A+C     | B+D       | A+B+C+D

Sensitivity = A / (A+C)

Specificity = D / (B+D)

Positive predictive value = A / (A+B)

Negative predictive value = D / (C+D)

Table 5: Gene-environment interaction: a 2x4 table for a study of venous thrombosis in association with oral contraceptive (OC) use and factor V Leiden (FVL)*


             | Venous thrombosis |
OC use | FVL | Case | Control    | Odds ratio† | Incidence per 10,000 PY††
       |     |      |            | 1 (ref)     |

*Vandenbroucke JP, Koster T, Briet E, et al. Increased risk of venous thrombosis in oral-contraceptive users who are carriers of factor V Leiden mutation. Lancet 1994; 344:1453-1457.

† Estimated relative risk

†† Estimated incidence in strata defined by genotype and oral contraceptive use, PY=person-years

Address correspondence to:
Dr. Marta Gwinn
Office of Genomics and Disease Prevention
Centers for Disease Control and Prevention
4770 Buford Hwy, Mail Stop K28
Atlanta, Georgia 30341-3724
Phone: 770-488-3261
FAX: 770-488-3235