Human Genome Epidemiology (2nd ed.): Building the evidence for using genetic information to improve health and prevent disease
Marta Gwinn and Wei Yu
Twentieth-century developments in biology and statistics established genetics as a science and led to the discovery of causal loci for many single-gene disorders. In the 1960s, Dr. Victor A. McKusick began compiling a continuously updated catalog of genes and diseases; first published in book form, the catalog went online in 1987 as Online Mendelian Inheritance in Man, or OMIM. By the 1990s, OMIM was adding more than 150 disease-related genetic variants per year, nearly all of them rare mutations discovered in families (1). Since then, the declining costs and increasing efficiency of new technologies (especially automation and microarrays) have prompted an unprecedented outpouring of genomic data that has been compared with a “tsunami” for its potential to overwhelm capacity for data management and analysis (2,3).
The development of computational technology and methods to organize, archive, visualize, and share genomic data gave rise to the field of bioinformatics (4). In 1988, the National Library of Medicine (a component of the National Institutes of Health) created the National Center for Biotechnology Information (NCBI) to provide “an integrated, one-stop, genomic information infrastructure for biomedical researchers from around the world.” NCBI has become a central repository for genomic sequence data in humans and other species and has developed many other public databases, such as dbSNP (for single nucleotide polymorphisms, or SNPs) and Entrez Gene (for genes) (5,6). Perhaps the most prominent and widely used NCBI database is PubMed, a continuously updated, public database of more than 18 million citations for biomedical literature. Entrez is the search engine that allows searching across all NCBI databases.
The international Human Genome Project’s early commitment to data sharing helped stimulate the construction of other, online genomic data repositories and tools for use by researchers and the public. For example, the UCSC Human Genome Browser, launched in 2002, created a framework for displaying multiply annotated sequence data at any scale throughout the genome (7). The UCSC Genome Browser Database has continued to evolve, adding many web-based applications for viewing, manipulating, and analyzing the data (8).
The Human Genome Organization (HUGO) was founded in 1988 to foster coordination among large-scale human genome mapping and sequencing projects around the world. The HUGO Gene Nomenclature Committee maintains a database of approved, unique gene names and symbols, which currently includes more than 28,000 genes (http://www.genenames.org) (9). The Human Genome Variation Society (HGVS) has begun a grass-roots effort to compile a list of locus-specific databases (LSDBs), which are curated collections of mutations, often reported with associated phenotypic information (10). Recently, NCBI embraced these efforts by allowing users to search, annotate, and submit human genome sequence variants to the dbSNP database by using HGVS standard nomenclature (11).
Human Genome Epidemiology
Genomic data are relevant to public health to the extent that they can be translated into knowledge useful for prevention, prediction, diagnosis, and treatment of disease. Human genome epidemiology is the basic science for translating genomic research, relating genetic variation with variability in health status among well-defined groups of people. Analyzing these data in terms of measured individual and group characteristics is a complex, multidimensional problem.
During the past several years, the Human Genome Epidemiology Network (HuGENet™) has laid out a process for knowledge synthesis and evaluation in human genome epidemiology. The underlying framework for this process is a “network of networks”: a collection of formal and informal collaborations organized according to location, funding source, or research interests (12) (see Chapter 7). The HuGENet “road map” for knowledge synthesis and evaluation defines a cycle that begins with reporting of research results and continues through systematic review and synthesis, grading of evidence, and feedback to research investigators and sponsors (13).
Since 2001, HuGENet has maintained an online knowledge base in human genome epidemiology known as HuGE Navigator (14). The core data are extracted from PubMed weekly by a combination of automated and manual processes. A single curator selects relevant abstracts and indexes them by gene, study type (observational, meta-analysis, pooled analysis, clinical trial, genome-wide association), and category (genotype prevalence, gene–disease association, gene–environment interaction, pharmacogenomics, and evaluation of genetic tests).
Human genome epidemiology accounts for only a small fraction of the published scientific literature in human genetics or genomics. Identifying relevant articles is a “needle in a haystack” problem that requires maximizing both sensitivity and specificity. In 2001, about 2,500 (5%) of nearly 50,000 PubMed citations on human genetics or genomics were included in the HuGE Navigator database. In 2007, PubMed added more than 67,000 new articles on human genetics or genomics and more than 5,000 (8%) met HuGE Navigator inclusion criteria (Figure 4.1). The rapid growth of this literature threatened to overwhelm the sole database curator; furthermore, an evaluation of sensitivity found that as many as 20% of relevant articles were being missed (15).
In 2006, HuGE Navigator introduced a new search strategy based on data and text mining algorithms; this approach reduced by 90% the number of citations reviewed by the curator each week, while increasing recall (sensitivity) to 97.5% (16). To make the database more accessible and useful to interdisciplinary researchers, HuGE Navigator added a user interface and an integrated set of new applications for exploring genetic associations, candidate gene selection, and investigator networks (17). Some of these applications are described in the following text.
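The two screening figures quoted above rest on simple ratios. The following is a minimal sketch of those definitions; the weekly counts are hypothetical and were chosen only to reproduce the percentages in the text, not taken from the HuGE Navigator evaluation.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall (sensitivity): share of truly relevant articles that were retrieved."""
    return true_positives / (true_positives + false_negatives)


def workload_reduction(reviewed_before: int, reviewed_after: int) -> float:
    """Fractional drop in the number of citations the curator must review."""
    return 1 - reviewed_after / reviewed_before


if __name__ == "__main__":
    # Hypothetical counts: 195 relevant articles found, 5 missed.
    print(f"recall = {recall(195, 5):.3f}")
    # Hypothetical counts: 1,000 citations per week before, 100 after.
    print(f"workload reduction = {workload_reduction(1000, 100):.2f}")
```

Reporting both numbers together matters because a filter can trivially cut the workload by discarding relevant articles; the recall figure shows that it did not.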
HuGE Literature Finder is the core application of HuGE Navigator (18). The use of nonstandard terminology in published literature is a major obstacle to efficient searching and synthesis of information in human genome epidemiology. To address this problem, HuGE Navigator uses Unified Medical Language System (UMLS) concept unique identifiers (CUIs) to index PubMed abstracts in the database. Medical subject headings (MeSH) constitute one of the controlled vocabularies in UMLS; HuGE Navigator converts MeSH terms assigned by PubMed staff to UMLS CUIs. To index genes, HuGE Navigator uses HUGO gene symbols as well as Entrez Gene identifiers and gene aliases to supplement the content-rich UMLS metathesaurus. HuGE Navigator thus allows users to perform free-text queries, which enhances search sensitivity and makes more information available to the user. A filtering feature allows users to stratify query results by indexing terms (disease, gene, study type, category), as well as by author, journal, year, and country of publication. Genome-wide association studies (GWAS) are flagged and linked to the National Human Genome Research Institute’s (NHGRI) Catalog of Published Genome-Wide Association Studies.
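The role of gene aliases in free-text searching can be pictured as a lookup that maps whatever term a user types to one approved symbol. The sketch below is illustrative only: the alias table is a small hand-written assumption (a real table would be derived from HUGO and Entrez Gene data), and the function names are hypothetical rather than part of HuGE Navigator.

```python
from typing import Optional

# Illustrative alias table (an assumption for this sketch); keys are upper-cased.
ALIAS_TO_SYMBOL = {
    "APOE": "APOE",
    "AD2": "APOE",
    "APOLIPOPROTEIN E": "APOE",
    "MTHFR": "MTHFR",
    "METHYLENETETRAHYDROFOLATE REDUCTASE": "MTHFR",
}


def normalize_gene(term: str) -> Optional[str]:
    """Map a free-text gene term to an approved symbol, or None if unknown."""
    key = " ".join(term.split()).upper()  # collapse whitespace, ignore case
    return ALIAS_TO_SYMBOL.get(key)


if __name__ == "__main__":
    for query in ["apoe", "Apolipoprotein  E", "ad2", "not-a-gene"]:
        print(query, "->", normalize_gene(query))
```

Indexing every record under the normalized symbol means that a query for any alias retrieves the same set of abstracts, which is the gain in sensitivity the text describes.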
Phenopedia and Genopedia provide summary views of the HuGE Literature database by disease (MeSH term) and gene (HUGO gene symbol). Disease term definitions and gene-centered data are accessible from either view. Phenopedia is disease-centered, displaying a frequency table of association studies, meta-analyses, and GWAS by gene. Phenopedia is a springboard for an important goal of the HuGENet roadmap: to develop an online encyclopedia containing disease-specific summaries of existing knowledge about genetic factors (see Chapter 20). Phenopedia also provides links to Web sites for disease-specific research consortia, databases, and other resources.
Genopedia is gene-centered, displaying a frequency table of association studies, meta-analyses, and GWAS by disease. Genopedia links at the gene level to other databases containing detailed sequence data, as well as relevant information on molecular pathways, genetic variation, genotype prevalence, genetic associations, gene expression, and genetic testing.
HuGE Investigator Browser creates domain-specific investigator networks by automatically parsing author affiliation data in PubMed records (17). This example of data mining provides a new way to explore and build investigator networks that are crucial to the HuGENet strategy. Nevertheless, it is only a starting point because the information available from PubMed is limited to first authors, and ambiguity in author names and affiliations cannot be completely resolved.
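To make the idea of parsing affiliations concrete, the sketch below splits an affiliation string on commas and groups first authors by institution. This is a deliberately naive heuristic assumed for illustration, not the published method (17); the record layout, keyword list, and example names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical keywords used to guess which comma-separated part names the institution.
INSTITUTION_HINTS = ("University", "Institute", "Hospital", "Centers")


def parse_affiliation(affiliation: str) -> dict:
    """Split an affiliation string into a guessed institution and country."""
    parts = [p.strip() for p in affiliation.rstrip(". ").split(",") if p.strip()]
    if not parts:
        return {"institution": "", "country": ""}
    institution = next(
        (p for p in parts if any(h in p for h in INSTITUTION_HINTS)), parts[0]
    )
    return {"institution": institution, "country": parts[-1]}


def group_by_institution(records: list) -> dict:
    """Group first authors by the institution guessed from their affiliation."""
    groups = defaultdict(list)
    for rec in records:
        groups[parse_affiliation(rec["affiliation"])["institution"]].append(rec["author"])
    return {inst: sorted(authors) for inst, authors in groups.items()}


if __name__ == "__main__":
    records = [
        {"author": "Smith J", "affiliation": "Department of Epidemiology, Example University, Atlanta, USA."},
        {"author": "Lee K", "affiliation": "School of Public Health, Example University, Atlanta, USA."},
    ]
    print(group_by_institution(records))
```

The heuristic also shows why the text calls the approach a starting point: two spellings of the same institution, or two authors named "Smith J", would be merged or split incorrectly without further disambiguation.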
Gene Prospector ranks genes in order of available evidence for association with diseases or potential interactions with environmental risk factors. Published GWAS findings and meta-analyses are weighted more than individual association studies and availability of animal data is used to break ties (19).
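The ranking rule described above (heavier weights for GWAS and meta-analyses, animal data as a tie-breaker) can be sketched as a weighted score with a two-part sort key. The weights and field names below are assumptions made for illustration; the actual Gene Prospector scoring formula is given in the cited paper (19) and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneEvidence:
    symbol: str
    gwas_hits: int
    meta_analyses: int
    association_studies: int
    has_animal_data: bool


# Hypothetical weights: GWAS findings and meta-analyses count more than single studies.
W_GWAS, W_META, W_ASSOC = 3.0, 3.0, 1.0


def score(g: GeneEvidence) -> float:
    """Weighted count of published evidence for one gene."""
    return W_GWAS * g.gwas_hits + W_META * g.meta_analyses + W_ASSOC * g.association_studies


def rank(genes: list) -> list:
    """Sort by score, highest first; animal data breaks ties between equal scores."""
    return sorted(genes, key=lambda g: (score(g), g.has_animal_data), reverse=True)


if __name__ == "__main__":
    genes = [
        GeneEvidence("GENE_A", 1, 0, 2, False),
        GeneEvidence("GENE_B", 0, 1, 2, True),
        GeneEvidence("GENE_C", 0, 0, 4, False),
    ]
    for g in rank(genes):
        print(g.symbol, score(g), g.has_animal_data)
```

In this example GENE_A and GENE_B have the same score, so GENE_B ranks first only because animal data are available for it.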
Variant Name Mapper is an example of HuGE Navigator applications that assist users in conducting analyses, such as systematic reviews of genetic associations (20). Variant Name Mapper maps common names for genetic variants to their corresponding rs numbers (assigned by dbSNP). In the absence of a universal nomenclature for genetic variants, rs numbers provide a key for comparison, especially with results of commercial chips for GWAS.
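The mapping step can be illustrated with a small lookup keyed on gene symbol and a normalized variant name. The sketch below is not the Variant Name Mapper implementation: the three-entry table is a hand-picked illustration of well-known variants, and the normalization rule and function names are assumptions.

```python
import re
from typing import Optional

# Tiny illustrative table; a real mapping would be built from dbSNP and the literature.
COMMON_NAME_TO_RS = {
    ("MTHFR", "C677T"): "rs1801133",
    ("MTHFR", "A1298C"): "rs1801131",
    ("F5", "R506Q"): "rs6025",  # factor V Leiden
}


def normalize_variant(name: str) -> str:
    """Normalize spellings such as 'c677t' or '677C>T' to the form 'C677T'."""
    s = re.sub(r"\s+", "", name).upper()
    m = re.fullmatch(r"(\d+)([ACGT])>([ACGT])", s)
    if m:
        pos, ref, alt = m.groups()
        return f"{ref}{pos}{alt}"
    return s


def lookup_rs(gene: str, variant: str) -> Optional[str]:
    """Return the rs number for a common variant name, or None if it is not in the table."""
    return COMMON_NAME_TO_RS.get((gene.strip().upper(), normalize_variant(variant)))


if __name__ == "__main__":
    print(lookup_rs("mthfr", "677C>T"))
    print(lookup_rs("F5", "R506Q"))
    print(lookup_rs("F5", "unknown"))
```

Because several spellings collapse to one rs number, results reported under different common names can be lined up for comparison in a systematic review.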
HuGE Watch offers a general overview of publication trends in human genome epidemiology by year, by country, and by journal. Even these minimal data can offer useful information (21). For example, results of HuGE Watch queries show that although the number of published gene–disease association studies more than tripled from 2001 through 2008, the number examining gene–environment interactions remained small (Figure 4.2).
HuGE Navigator can be used to generate summary impressions of research activity in human genome epidemiology, as well as in such specialized subdomains as meta-analyses, clinical trials, and evaluations of genetic tests. HuGE Navigator can also serve as a starting point for systematic reviews and meta-analyses of gene–disease associations, providing a quick orientation to the literature captured by PubMed. For example, examining frequently studied gene–disease associations can suggest which ones lack a recent meta-analysis (22). Although PubMed is the largest single database of biomedical publications, it does not include all journals. As outlined in the HuGE Review Handbook, a comprehensive review requires searching other publication databases (such as Science Citation Index, EMBASE, and BIOSIS) and other data sources (23). Inevitably, the reviewer must do the work to collect the articles, conduct hand searches, and abstract and analyze the data.
Accelerated production of genomic data has prompted the proliferation of databases. Since 1996, the Nucleic Acids Research journal has published an annual genomic database issue and compiled an online directory; in 2008, the cumulative number of databases topped 1,000 for the first time (24). Reporting on the “annual stamp collecting edition,” science blogger Duncan Hull asked, “As we pass the one thousand databases mark . . . I wonder what proportion of these databases will never be used?” (25). Simply capturing and storing data online—without the capacity to process or analyze it—does little to transform it into useful information. In a special issue dedicated to “big data,” Nature magazine editorialized, “Researchers need to adapt their institutions and practices in response to torrents of new data—and need to complement smart science with smart searching” (26).
Although vastly challenging, assembly of the first human genome sequence was essentially a linear puzzle. Through annotation, analysis, and knowledge synthesis, the genome sequence is now just one dimension in a complex, multidimensional system of relationships at many levels. Understanding these relationships requires an interconnected data system, as well as tools for navigation. For example, NCBI’s Entrez search engine connects NCBI databases that extend from the level of SNPs to phenotypes. The network of links among NCBI databases can be explored visually online at http://www.ncbi.nlm.nih.gov/Database/datamodel/.
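The links among NCBI databases can also be followed programmatically. The sketch below only builds a request URL for the ELink utility of NCBI's E-utilities and sends nothing over the network; it assumes the current E-utilities URL layout, and the helper name and the choice of example gene (Entrez Gene ID 348, APOE) are illustrative.

```python
from urllib.parse import urlencode

# Base URL of NCBI E-utilities (assumed current at the time of reading).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom: str, db: str, uid: str) -> str:
    """Build an ELink URL asking for records in `db` linked to `uid` in `dbfrom`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"


if __name__ == "__main__":
    # PubMed citations linked to one Entrez Gene record.
    print(elink_url("gene", "pubmed", "348"))
```

Swapping the `db` argument for another NCBI database name follows a different edge of the same link network that the data-model diagram displays.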
The HuGE Navigator exploits existing knowledge infrastructures, including HUGO, UMLS, and especially NCBI databases. The HuGE Literature database is compiled from PubMed abstracts; HuGE Navigator’s controlled vocabulary includes MeSH terms (as part of UMLS); and HuGE Navigator further mines PubMed data for author and journal information. In turn, Entrez Gene links to the HuGE Navigator, which also supplies citations for Entrez Gene’s GeneRIFs (References Into Function) annotated bibliography (27).
Standardization of gene names and identifiers is now widely accepted, allowing HuGE Navigator to link to many other gene-centered databases. For epidemiology and downstream translation, however, disease-centered data are more important. Unfortunately, existing disease ontologies are far more intricate and less precise than those for genes, and no single, best system prevails. To define phenotypes, the HuGE Navigator employs MeSH terms, which are assigned by expert coders when publication abstracts are entered in PubMed; however, many other controlled vocabularies in the Unified Medical Language System (UMLS) have been developed for medical and biomedical research purposes. For example, the Systematized Nomenclature of Medicine—Clinical Terms (SNOMED CT) was designed to capture clinical care and research data; its use requires a license, which in the United States is provided by the National Library of Medicine. The International Classification of Diseases (ICD) developed by the World Health Organization is used worldwide for reporting morbidity and mortality statistics. Each of these vocabularies has advantages and disadvantages, and the mapping from one to another (e.g., via UMLS) is not always straightforward.
Currently, various disease-specific summaries of genetic associations can be found scattered throughout the published literature and across many domain-specific Web sites, such as the PDQ Cancer Information Summaries: Genetics. Two well-recognized, online resources for disease-oriented summaries are OMIM and GeneReviews; both of these focus largely on uncommon, single-gene disorders.
The HuGENet collaboration aspires to develop an updated, online encyclopedia containing disease-specific summaries of existing knowledge about genetic factors, including genotype–phenotype associations, gene–gene and gene–environment interactions, and available genetic tests. This ambition faces many fundamental obstacles that are intrinsic to the way that research in this area is currently funded, conducted, published, and evaluated; nevertheless, some possible prototypes exist.
An instructive example is the AlzGene knowledge base, which is a component of the Alzheimer Research Forum Web site. AlzGene was the basis for a comprehensive systematic review and meta-analysis of Alzheimer disease genetic association studies published in 2007 (28). The database is continuously updated with primary research data abstracted from articles captured in PubMed. Users can search the database by gene and polymorphism, as well as by study, to obtain tables that summarize studied populations and results. An alternative view of the data includes allele and genotype frequencies stratified by race and ethnicity, along with meta-analysis results displayed as a forest plot. Other components of the Alzheimer Research Forum Web site include a bibliography, a research news digest, a conference calendar, and information on disease management and drug development. Now 10 years old, the Alzheimer Research Forum calls itself a “thriving scientific web community,” which promises to evolve further via informatics as a resource for sharing “richly contextualized information” among researchers, practitioners, and affected families (29).
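The summary estimate shown at the bottom of a forest plot is a weighted average of study results. The sketch below uses a fixed-effect, inverse-variance model on the log odds ratio scale as one simple way to compute it; it is not presented as the method used by AlzGene, and the study values are invented for illustration.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval


def pooled_odds_ratio(studies):
    """Fixed-effect pooled OR with 95% CI.

    `studies` is an iterable of (odds_ratio, ci_lower, ci_upper) tuples, where
    the interval is a 95% CI assumed symmetric on the log scale.
    """
    num = den = 0.0
    for or_, lo, hi in studies:
        log_or = math.log(or_)
        se = (math.log(hi) - math.log(lo)) / (2 * Z_95)  # back out the standard error
        weight = 1.0 / se ** 2                            # inverse-variance weight
        num += weight * log_or
        den += weight
    pooled = num / den
    se_pooled = math.sqrt(1.0 / den)
    return (
        math.exp(pooled),
        math.exp(pooled - Z_95 * se_pooled),
        math.exp(pooled + Z_95 * se_pooled),
    )


if __name__ == "__main__":
    # Hypothetical studies: (OR, lower 95% limit, upper 95% limit).
    studies = [(1.5, 1.0, 2.25), (1.2, 0.9, 1.6), (1.8, 1.1, 2.95)]
    or_, lo, hi = pooled_odds_ratio(studies)
    print(f"pooled OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

A random-effects model would widen the interval when the studies disagree more than chance allows; the fixed-effect version is shown only because it is the shortest correct illustration of the weighting.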
Building the knowledge base in human genome epidemiology involves organizing, sharing, mining, interpreting, and evaluating the results of genomic research from a population perspective. This effort faces many technical, scientific, and social challenges, which can be met only by unprecedented levels of interaction across multiple levels of the research enterprise and cooperation among individual scientists, research groups, institutions, and agencies.
Controlled vocabularies and ontologies (which specify terms, concepts, and relationships) have become fundamental devices for organizing and sharing information within specific domains (30), and are particularly important for human genome epidemiology, which is concerned with integrating heterogeneous types of information (e.g., on genetic variants, individual traits, population characteristics) and the quantitative relationships among them. Naming all the elements in these domains and consistently modeling the relationships between them is a challenge of daunting scale and complexity. Human genome epidemiology should encourage the consistent use of interoperable ontologies for human phenotypes to permit collection, sharing, analysis, and synthesis of information by humans and computers (31).
Describing human genetic variation presents technical challenges. The HUGO system of unique gene names and symbols has become a widely accepted standard; however, development of a nomenclature for genetic variants is still evolving (32). Systematic review and synthesis of gene–disease associations require specific data at the level of genetic variants. As a central repository for SNPs and other genetic variants, dbSNP assigns each variant a unique accession number (rs number). Consistent use of rs numbers in abstracts that report genetic associations would substantially enhance capacity for data mining and knowledge synthesis in this field.
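The data-mining benefit of rs numbers is easy to demonstrate: unlike free-form variant names, they follow a fixed pattern that a single regular expression can find. The sketch below is a minimal illustration with an invented abstract sentence; the function name is hypothetical.

```python
import re

# "rs" followed by digits, as a whole word; case-insensitive to catch "RS123".
RS_PATTERN = re.compile(r"\brs\d+\b", re.IGNORECASE)


def extract_rs_numbers(abstract: str) -> list:
    """Return the distinct rs numbers in an abstract, lower-cased, in order of first mention."""
    found = []
    for match in RS_PATTERN.findall(abstract):
        rs = match.lower()
        if rs not in found:
            found.append(rs)
    return found


if __name__ == "__main__":
    text = (
        "The MTHFR variant rs1801133 (C677T) and RS1801131 were genotyped; "
        "rs1801133 was associated with the outcome."
    )
    print(extract_rs_numbers(text))
```

The same sentence also contains "C677T", which the pattern ignores; recovering variants reported only by common name requires a curated mapping rather than a pattern match.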
During the last decade, the Internet has become the preeminent infrastructure for building scientific knowledge through dissemination, annotation, and synthesis. Technical innovations such as XML (Extensible Markup Language) have enhanced the basis for data mining, and open access scientific journals have helped create a rich substrate (33). Overall, the trend in biomedical research is toward development of a “cyberinfrastructure” that integrates databases, network protocols, and computational tools together across research domains (34,35). The Cancer Biomedical Informatics Grid (caBIG) is a well-established model, dedicated to managing knowledge and supporting collaboration in cancer research.
Only recently have advances in genotyping technology permitted large-scale epidemiologic studies of gene–disease association and gene–environment interaction. NHGRI has sponsored a number of such studies through two large initiatives, the Genetic Association Information Network (GAIN) and the Genes, Environment, and Health Initiative (GEI). An integral component of these initiatives is an online data repository, dbGaP (the database of Genotypes and Phenotypes), developed in collaboration with NCBI. In addition, NHGRI maintains a summary online “catalog” of published novel and statistically significant results of these studies, which are also indexed by HuGE Navigator. None of these resources, however, can capture all of the measured genetic associations (regardless of prior probability, size, or statistical significance) in a format amenable to knowledge synthesis.
In principle, online data repositories could be equipped with tools for summary analysis and meta-analysis of gene–disease associations. Recently, however, concern that even simple prevalence data for a sufficient number of SNPs could be matched with other genotype data to identify individual persons led several large data repositories to modify their public data access policies and to remove summary genotype prevalence data from public view (36,37). Together, current developments in genomics and informatics technologies are challenging traditional approaches to maintaining confidentiality of research data (38). Reliance on routine electronic data safeguards (such as removing personally identifying information) is clearly inadequate (39). Because privacy and confidentiality are socially defined concepts, their meaning and value must be considered from humanistic as well as scientific perspectives.
Genetics and epidemiology have grown from “cottage industries” to “big science,” built on large-scale research collaborations and consortia (40,41). Although technology makes big science possible, it is still a thoroughly human enterprise shaped by social priorities, incentives, and expectations. Appropriate policies and norms for collecting, curating, publishing, and sharing data will have major implications for the developing knowledge base in genomics, including human genome epidemiology (42). In the “big data” issue of Nature, a group of authors from diverse fields, fifteen different institutions, and four countries wrote of the need to organize research output and recognize the role of knowledge management in the biological sciences:
Biocuration, the activity of organizing, representing and making biological information accessible to both humans and computers, has become an essential part of biological discovery and biomedical research. But curation increasingly lags behind data generation in funding, development and recognition. (43)
They further observed that “As publication has become a mainly digital endeavor . . . , publications and biological databases are becoming increasingly similar” and recommended the use of reporting-structure standards to improve cross-referencing and indexing, and thus to increase the visibility and value of scientific research findings. For human genome epidemiology, an initial step in this direction is an extension of the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) statement to genetic association studies (44) (see Chapter 10). As an interdisciplinary effort to integrate information from many domains across many dimensions, human genome epidemiology can take a leading role in coordinated efforts to improve knowledge synthesis.
- Peltonen L, McKusick V. Genomics and medicine. Dissecting human disease in the postgenomic era. Science. 2001;291(5507):1224–1229.
- National Cancer Institute. caBIG: the launch of a bioinformatics community. NCI Cancer Bulletin. 2004;1(9):5–6.
- Attwood TK. The quest to deduce protein function from sequence: the role of pattern databases. Int J Biochem Cell Biol. 2000;32(2):139–155.
- Sherry ST, Ward MH, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–311.
- Maglott D, Ostell J, Pruitt KD, et al. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005;33:D54–D58.
- Kent WJ, Sugnet CW, Furey TS, et al. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.
- Karolchik D, Kuhn RM, Baertsch R, et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008;36:D773–D779.
- Bruford EA, Lush MJ, Wright MW, et al. The HGNC Database in 2008: a resource for the human genome. Nucleic Acids Res. 2008;36:D445–D448.
- Horaitis O, Talbot CC, Jr, Phommarinh M, et al. A database of locus-specific databases. Nat Genet. 2007;39(4):425.
- den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat. 2000;15(1):7–12.
- Seminara D, Khoury MJ, O’Brien TR, et al. The emergence of networks in human genome epidemiology: challenges and opportunities. Epidemiology. 2007;18(1):1–8.
- Ioannidis JP, Gwinn M, Little J, et al. A road map for efficient and reliable human genome epidemiology. Nat Genet. 2006;38(1):3–5.
- Yu W, Gwinn M, Clyne M, et al. A navigator for human genome epidemiology. Nat Genet. 2008;40(2):124–125.
- Lin BK, Clyne M, Walsh M, et al. Tracking the epidemiology of human genes in the literature: the HuGE Published Literature database. Am J Epidemiol. 2006;164(1):1–4.
- Yu W, Clyne M, Dolan SM, et al. GAPscreener: an automatic tool for screening human genetic association literature in PubMed using the support vector machine technique. BMC Bioinformatics. 2008;9:205.
- Yu W, Yesupriya A, Wulf A, et al. An automatic method to generate domain-specific investigator networks using PubMed abstracts. BMC Med Inform Decis Mak. 2007;7:17.
- Yu W, Yesupriya A, Wulf A, et al. An open source infrastructure for managing knowledge and finding potential collaborators in a domain-specific subset of PubMed, with an example from human genome epidemiology. BMC Bioinformatics. 2007;8:436.
- Yu W, Wulf A, Liu T, et al. Gene Prospector: an evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases. BMC Bioinformatics. 2008;9:528.
- Yu W, Ned R, Wulf A, et al. The need for genetic variant naming standards in published abstracts of human genetic association studies. BMC Res Notes. 2009;2:56.
- Yu W, Wulf A, Yesupriya A, et al. HuGE Watch: tracking trends and patterns of published studies of genetic association and human genome epidemiology in near-real time. Eur J Hum Genet. 2008;16:1155–1158.
- Yesupriya A, Yu W, Clyne M, et al. The continued need to synthesize the results of genetic associations across multiple studies. Genet Med. 2008;10:633–635.
- Frodsham AJ, Higgins JP. Online genetic databases informing human genome epidemiology. BMC Med Res Methodol. 2007;7:31.
- Galperin MY. The Molecular Biology Database Collection: 2008 update. Nucleic Acids Res. 2008;36(Database issue):D2–D4.
- Hull D. One thousand databases high and rising.
- Anonymous. Community cleverness required. Nature. 2008;455(7209):1.
- Mitchell JA, Aronson AR, Mork JG, et al. Gene indexing: characterization and analysis of NLM’s GeneRIFs. AMIA Annu Symp Proc. 2003;2003:460–464.
- Bertram L, McQueen MB, Mullin K, et al. Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat Genet. 2007;39(1):17–23.
- Clark T, Kinoshita J. Alzforum and SWAN: the present and future of scientific web communities. Brief Bioinform. 2007;8(3):163–171.
- Bodenreider O, Stevens R. Bio-ontologies: current trends and future directions. Brief Bioinform. 2006;7(3):256–274.
- Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25:1251–1255.
- den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat. 2000;15(1):7–12.
- Brown PO, Eisen MB, Varmus HE, et al. Why PLoS became a publisher. PLoS Biol. 2003;1(1):E36.
- Stein LD. Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet. 2008;9(9):678–688.
- Buetow KH. Cyberinfrastructure: empowering a “third way” in biomedical research. Science. 2005;308(5723):821–824.
- Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4(8):e1000167.
- Zerhouni EA, Nabel EG. Protecting aggregate genomic data. Science. 2008;322(5898):44.
- Lunshof JE, Chadwick R, Vorhaus DB, et al. From genetic privacy to open consent. Nat Rev Genet. 2008;9(5):406–411.
- McGuire AL, Gibbs RA. Genetics. No longer de-identified. Science. 2006;312(5772):370–371.
- Hoover RN. The evolution of epidemiologic research: from cottage industry to “big” science. Epidemiology. 2007;18(1):13–17.
- Kreeger K. Consortia, “big science” part of a paradigm shift for genetic epidemiology. J Natl Cancer Inst. 2003;95(9):640–641.
- Foster MW, Sharp RR. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nat Rev Genet. 2007;8(8):633–639.
- Howe D, Costanzo M, Fey P, et al. Big data: The future of biocuration. Nature. 2008;455(7209):47–50.
- von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med. 2007;4(10):e296.