Human Genome Epidemiology Information (HuGE) on the Internet
Current Resources and Future Prospects
Adam D. Marks¹ and Paula W. Yoon²
¹ ORISE fellow, Molecular Biology Branch, National Center for Environmental Health, CDC
² Epidemiologist, Office of Genomics and Disease Prevention, CDC
In February 2001, years ahead of schedule, scientists announced that a first draft of the human genome sequence had been completed. Many researchers heralded the creation of a rough draft of the human genome as a milestone in the advancement of genetic technology (1). While this is doubtlessly true, genetic researchers had already made great strides in identifying genetic causes for many diseases. At the time of the announcement, more than one thousand genes associated with diseases had been described in the scientific literature (2). Although most of this work has involved single-gene or single-locus disorders, increasingly genes are being identified that are associated with common chronic diseases such as cardiovascular disease, obesity, and diabetes (3). Despite the continuing advances in human genetics spurred on by the Human Genome Project, numerous gaps exist in the amount and quality of population-level information for most of the newly discovered genes. Human genome epidemiology (HuGE) studies are needed to measure the prevalence of gene variants in populations, identify gene-gene and gene-environment interactions, quantify the impact of gene variants on the risk for disease, and evaluate and monitor the increasing use of genetic tests.
Genetic epidemiology encompasses the continuum from gene discovery to risk characterization and evaluation of genetic tests and services. Accomplishing this broad research agenda will require access to a wealth of ever-increasing genetic information and a forum in which to share new insights and discoveries about genes and disease. Traditionally, scientists have communicated their findings through presentations at professional meetings and publications in scientific journals. In the age of the World-Wide Web, the Internet and Internet-based databases are also becoming important sources of genetic information (4). The Internet provides a dynamic repository for the most up-to-date and comprehensive collections of information in this rapidly evolving field. However, specific information is often hard to find, given the Internet's lack of organization and the number of databases and information sites scattered across the World-Wide Web (5). In this article, we discuss genomic resources on the Web and the need for a comprehensive site dedicated to the organization and maintenance of HuGE information.
Genomic Resources on the Internet
Since the 1970s, scientists and politicians alike have agreed that the results of efforts to map and sequence the human genome should be made freely available to the public (6). Genetic databases in the public domain have existed for decades, starting with the Human Gene Mapping (HGM) library of the late 1970s, which could be accessed from computers on the Yale campus (6). Since then, biology and the Internet have enjoyed a happy and productive partnership, and almost every major genetic database now exists in an online form (7). Several hundred public genetic databases exist today, including sequence databases (e.g., GenBank and the International Nucleotide Sequence Database Collaboration); structural and mapping information (e.g., GenAtlas and the Unified Database for Human Genome Mapping [UDB]); and protein and gene function data (e.g., SWISS-PROT and GeneCards). In addition, a number of commercial companies maintain private genetic databases that can be accessed for a fee. The best-known example is the Celera genetic database, which claims to offer greater computing power and a higher level of annotation than many public databases (8). These and other databases are used for various genetic and biochemical studies and are vital to the discovery and characterization of new genes (7). In addition to the research-oriented databases, numerous clinical and medical genetic information sources exist on the Web, as described below, as well as many educational Web sites geared toward professionals and the public.
Despite the large amount of genetic information that is accessible on the Internet, the availability of genetic epidemiology information is limited. This is due partly to the fact that the field of genetic epidemiology is relatively young. The study of diseases that result from interactions between multiple genetic variants and environmental factors is complex, and population-based data are sparse. Where information does exist, it is scattered across the Web on sites for government agencies, universities, private companies, nonprofit organizations, and others. No focal point or central repository exists today for data and resources on genetic epidemiology. However, several major Web sites function as hubs for genetic information.
The National Center for Biotechnology Information (NCBI) was established at the National Institutes of Health (NIH) in 1988 as a resource for molecular biology information. NCBI has a Web site dedicated to human genome resources, which serves as a clearinghouse for genomic information. The site includes access to sequence and mapping databases, as well as to LocusLink, a query interface to descriptive information about genes and gene loci. One of the newer features of the site is a "Genes and Disease" page aimed at students and the public. This Web page provides information on about 70 genetic diseases, with links to related databases and allied resources. Although most of the diseases outlined are associated with a single gene, this site will serve as a resource for emerging genetic epidemiology and related clinical information for all diseases, including those that have more complex patterns of inheritance.
Another database that serves as a central repository for genetic information is the Human Genome Variation database (HGVbase). HGVbase is the product of a European consortium involving the Karolinska Institute (Sweden), the European Bioinformatics Institute (United Kingdom), and the European Molecular Biology Laboratory (Germany). The database summarizes all known sequence variations in the human genome and was created to facilitate research into how genotypes affect common diseases, drug responses, and other complex phenotypes. The database can be searched using text or DNA sequence strings. A search results in information about sequence variations with details of their physical and functional relation to their closest neighboring gene. A short paragraph is often provided that lists diseases that have been associated with the gene variants of interest.
Another project with a more targeted focus on gene-disease association is the Environmental Genome Project (EGP). EGP is an endeavor coordinated by the National Institute of Environmental Health Sciences to study genes that interact with environmental agents to cause disease. One goal of the project is to facilitate epidemiologic studies of gene-environment interactions by resequencing selected environmental response genes. A catalogue of polymorphisms for these genes is available to researchers through a centralized database called GeneSNPs. The database includes detailed maps of single nucleotide polymorphisms (SNPs) associated with genes, particularly within expressed and regulatory regions of genes. Information is also available on allele and genotype frequencies in select populations for a subset of the genes.
Through the Internet, the research community has created several foci for gene discovery data that are accurate, comprehensive, and useful as research tools to help define the genetic components of human phenotypic variation. Collaborations like NCBI, the HGVbase, and EGP promote the advancement of science through the timely sharing of data and information while at the same time promoting the quality of data by providing opportunities for comparison, replication, and peer review. The genetic epidemiology community would benefit greatly from a clearinghouse-type Web site where gene-disease association data could be compiled and made accessible through a centralized database. The database could include published studies as well as data from studies with negative associations because these are rarely found in the literature and are important for guiding future research. In the meantime, some human genetic epidemiology information can be found in various places on the Internet, mainly on sites dedicated to medical genetics or disease-specific databases.
Medical Genetics Resources on the Internet
Perhaps the best-known online resource for medical genetics is the Online Mendelian Inheritance in Man (OMIM), a guide to genes and inherited disorders that is maintained by The Johns Hopkins University and collaborators (5). OMIM provides information on more than 10,000 genetic conditions and is continually updated with the latest findings published in peer-reviewed journals in genetics, molecular biology, and related disciplines (9). Because the majority of conditions listed in OMIM are rare, the research cited is usually based on case reports and family studies. Epidemiologic data from population-based studies are limited but growing. For example, an OMIM search of "asthma" produced information from several collaborative and case-control studies, as well as the usual gene-finding reports and research from animal models. Although OMIM entries are periodically updated, new information and summaries of studies are usually appended to the end of the disease entry; little effort is made to integrate the new information into an overall summary of the gene-disease association. Despite this shortcoming, OMIM is a valuable resource for information about the genetic basis of most diseases. Now that OMIM is integrated into the NCBI genomic information Web site, it is linked to other relevant information and databases.
Another online database that contains some genetic epidemiologic information is GeneTests, which has recently combined with GeneClinics to form the GeneTests-GeneClinics Web site. The GeneTests-GeneClinics Web site is aimed at healthcare providers and researchers and includes a laboratory and clinical directory, educational materials, and disease reviews. The site currently has reviews for about 140 inherited disorders. The GeneReviews are written by experts and include information on diagnosis, clinical description, disease management, molecular genetics, testing, and genetic counseling. The type of genetic epidemiology information that can sometimes be found in these reviews includes disease and gene prevalence data, evidence for the association of the disease with other risk factors, estimates of risk associated with specific genetic mutations, and information about genetic testing. Another source of quality, updated information, specifically for cancer, is the National Cancer Institute's Web site. This site has a wealth of information about treatments, prevention, testing, clinical trials, statistics, and health education materials for a large number of specific cancers. Genetic epidemiology information available at this site includes cancer incidence, mortality, and survival, as well as information on the risk for disease associated with genes, family history, and other factors.
Dozens of other Web sites contain information on medical genetics, including those maintained by consumer organizations (e.g., the Genetic Alliance), genetic professional groups (e.g., American College of Medical Genetics), genetic companies (e.g., GeneSage), governmental organizations (e.g., NIH, Office of Rare Diseases), foundations and nonprofit groups (e.g., Cystic Fibrosis Foundation), and commercial Web-based companies (e.g., WebMD Health). An extensive listing of Web sites concerned with medical genetics can be found on the National Human Genome Research Institute (NHGRI) Web site.
Disease-Specific Resources on the Internet
The rapid proliferation of information on gene-disease associations has made linkages and linkage exclusions difficult to describe and catalog for many different phenotypes and different populations. Printed summaries are often not available because the vast amount of evolving information is difficult to compile and unlikely to be timely. To overcome this problem, groups of scientists interested in the same disease have worked together to create specific databases that follow some common standards and are publicly accessible on the Internet. These databases often contain a great deal of specific information about certain mutations, genes, or gene families. For example, the Asthma and Allergy Gene Database, which is based in Munich, Germany, provides linkage and mutation tables; a general statistics overview; gene expression data; gene therapy trials; links to related literature on family, segregation, twin, and adoption studies; and articles concerning the ethics of asthma genetic research.
The Asthma and Allergy Gene Database is one of the few online genetic databases dedicated to the genetics of a complex trait. Examples of other disease-specific databases include the Albinism database, the Hereditary Hearing Loss Homepage, and the Mutation Database of Inherited Peripheral Neuropathies. There are also numerous gene-specific databases, including a p53 database supported by the Institut Curie in France, the database of the Human Cytochrome P450 (CYP) Allele Nomenclature Committee, and the Blood Group Antigen Gene Mutation Database. The HUGO Mutation Database Initiative (MDI) has compiled a fairly detailed list of disease- and locus-specific databases, which is available on its Web site. This list includes links to several hundred online genetic databases as well as to Web sites for educational materials.
The quality and extent of information contained in gene- and disease-specific databases vary greatly, with most lacking genetic epidemiology information. Databases for common or chronic conditions, such as breast cancer or asthma, are more likely to contain information about the epidemiology of the gene-disease association than are databases for rare diseases. Information about gene-disease associations for rare diseases is usually based on family and linkage studies, not population-based studies. The availability of Medline through the Internet has probably had the greatest impact on making genetic epidemiology information accessible and relatively timely. The availability of full-text articles through some journals' Web sites will continue to improve accessibility and timeliness. However, the synthesis of the information that links gene discovery to the epidemiology of gene-disease association, and eventually to the translation of what that means for disease control and prevention, is still missing.
Public Health Genetic Resources on the Web
The Public Health Genetics Unit of the United Kingdom National Health Service has a Web site that provides news and information about advances in genetics and their impact on public health and disease prevention. The site includes links to documents concerning the development of health service policy in genetics; lists of genetic testing and counseling services in the United Kingdom; information on the ethical, legal, and social implications of genomics; and short summaries of the genetic basis for selected diseases. For example, a review of Alzheimer disease summarizes current knowledge about the genetics of the disease, including data on gene-environment interactions, and discusses predictive and diagnostic testing. The summaries are not intended to provide clinical information for patients; rather, they discuss the healthcare implications of new genetic knowledge from a public health perspective. The summaries included on the Web site cover both rare conditions, such as fragile X, and common multifactorial conditions, such as colorectal cancer.
Another Web site that focuses on the genetic basis of disease from a public health perspective is that of the Centers for Disease Control and Prevention's Office of Public Health Genomics (NOPHG). In the mid-1990s, the NOPHG established the Human Genome Epidemiology Network (HuGENet™) to 1) promote global collaboration in the development and dissemination of peer-reviewed epidemiological information on human genes, 2) develop an updated and accessible knowledge base on the World-Wide Web, and 3) promote the use of this knowledge base for making decisions involving the use of genetic tests and services for disease prevention and health promotion. Some of the HuGENet™ products available on the Web site include reviews, fact sheets, case studies, e-journal club discussions, and a database of published HuGE literature. The HuGE Reviews focus on gene-disease associations and describe what is known about a gene's allelic variants and their frequencies in different populations, the magnitude of the risks due to the gene and associated factors, and the validity and utility of genetic tests for the disease. The HuGE Fact Sheets summarize key information from the HuGE Reviews in a one- to two-page document. The case studies examine specific gene-disease associations with questions, problem solving, and discussion that are useful for teaching the HuGE concepts. The two case studies available on the NOPHG Web site are NOD2 and Crohn disease, and Factor V Leiden and venous thrombosis. The e-journal club is an electronic discussion forum where new HuGE findings published in the scientific literature are abstracted, summarized, and discussed. The purpose of these discussions is to determine the public health impact of new findings related to gene-disease associations, and in particular, whether or not genetic testing is likely to result from these findings.
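Two of the quantities that a HuGE Review reports, the frequency of an allelic variant in a population and the magnitude of the associated risk, can be illustrated with a short calculation. The following is a minimal sketch in Python, not a method taken from HuGENet™: the function names are hypothetical, the genotype and case-control counts are invented for illustration, and the odds ratio with a Wald 95% confidence interval is only one of several measures a review might use.

```python
from math import exp, log, sqrt


def allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Variant allele frequency from genotype counts.

    Each person carries two alleles: heterozygotes contribute one
    variant allele and variant homozygotes contribute two.
    """
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_var)
    return (n_het + 2 * n_hom_var) / total_alleles


def odds_ratio(carrier_cases, noncarrier_cases,
               carrier_controls, noncarrier_controls):
    """Odds ratio and Wald 95% confidence interval from a 2x2 table."""
    estimate = (carrier_cases * noncarrier_controls) / (
        noncarrier_cases * carrier_controls)
    # Standard error of the log odds ratio.
    se = sqrt(1 / carrier_cases + 1 / noncarrier_cases
              + 1 / carrier_controls + 1 / noncarrier_controls)
    lower = exp(log(estimate) - 1.96 * se)
    upper = exp(log(estimate) + 1.96 * se)
    return estimate, lower, upper


# Hypothetical counts, invented purely for illustration.
freq = allele_frequency(n_hom_ref=810, n_het=180, n_hom_var=10)
estimate, lower, upper = odds_ratio(40, 60, 20, 80)

print(f"variant allele frequency: {freq:.3f}")
print(f"odds ratio: {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these invented counts, 200 of 2,000 alleles are variant, and carriers have about 2.7 times the odds of disease compared with noncarriers. Real HuGE analyses would also need to consider study design, confounding, and gene-environment interaction, which this sketch ignores.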
The HuGE Published Literature database is a collection of HuGE articles published in peer-reviewed literature starting in October of 2000. The database, which is updated weekly, is accessible on the Internet, and users can search the database by gene, health outcome, or environmental factor. Key information about each study is presented, along with a direct link to PubMed's abstract of the article.
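The kind of search described here, by gene, health outcome, or environmental factor, with each hit linking to a PubMed abstract, can be sketched in a few lines of Python. This is an illustrative model only and does not reflect how the HuGE Published Literature database is actually implemented: the `HugeRecord` class, the `search` function, the PubMed URL pattern, and the placeholder identifiers are all assumptions, and the two records simply reuse the gene-disease pairs named in the case studies above.

```python
from dataclasses import dataclass

# Assumed URL pattern for a PubMed abstract; not taken from the article.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


@dataclass(frozen=True)
class HugeRecord:
    """One indexed article (hypothetical schema)."""
    gene: str
    outcome: str
    environmental_factor: str  # empty string if none was studied
    pmid: str                  # placeholder identifier in this sketch

    @property
    def pubmed_link(self):
        return PUBMED_URL.format(pmid=self.pmid)


def search(records, gene=None, outcome=None, environmental_factor=None):
    """Return records that match every criterion supplied.

    Matching is a case-insensitive substring test; a criterion left
    as None is ignored.
    """
    def matches(value, wanted):
        return wanted is None or wanted.lower() in value.lower()

    return [
        r for r in records
        if matches(r.gene, gene)
        and matches(r.outcome, outcome)
        and matches(r.environmental_factor, environmental_factor)
    ]


# Placeholder entries; the identifiers are not real PubMed IDs.
RECORDS = [
    HugeRecord("NOD2", "Crohn disease", "", "PLACEHOLDER-1"),
    HugeRecord("Factor V Leiden", "venous thrombosis",
               "hypothetical exposure", "PLACEHOLDER-2"),
]

for hit in search(RECORDS, outcome="thrombosis"):
    print(hit.gene, "|", hit.outcome, "|", hit.pubmed_link)
```

Storing one record per article with separate gene, outcome, and exposure fields is what allows the three search routes to be combined freely in a single query.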
The Public Health Perspective Series that is available through the NOPHG Web site serves as a model for examining genomic-related topics for their impact on disease prevention and health promotion. Each edition of the series focuses on a genomic topic and includes concept papers, fact sheets, published literature, and links to pertinent Web sites. For example, the focus on hemochromatosis included a concept paper that discussed screening for hemochromatosis, video clips from a conference on hemochromatosis held at the University of North Carolina, fact sheets, numerous published articles and studies, education materials for patients and providers, and links to consumer organizations and other relevant Web sites.
NOPHG is building upon the concept of the Public Health Perspective Series by developing a Genomics and Disease Prevention Information System (GDPInfo) that collects, organizes, and provides access to information on the impact of human genetic variation and its interaction with the environment on health and disease. The goal of GDPInfo is to make information available that will guide the development of research, policy, and practice on the use of genetics for improving health and preventing disease. GDPInfo includes information available from CDC and other government agencies and provides links to many of the other genetic databases described in this article. The Web interface for GDPInfo includes a query tool that builds a search based on gene name, disease, environmental factor, or selected topics such as pharmacogenomics or newborn screening. A tool such as GDPInfo can improve access to information that is needed for translating gene discoveries into medical programs and services that will have an impact on preventing disease and promoting health. GDPInfo is intended to serve as a focus for HuGE-related information, thereby making such information easily accessible to researchers, policy makers, and practitioners around the globe.
In the past decade, the amount of genetic information available to researchers has increased exponentially, with the majority of this information accessible on the Internet. Although the availability of genetic data on the World-Wide Web has been of great value to researchers, efficiently sifting through all the relevant information on a given topic can be difficult. For this reason, central databases such as the NCBI database and the HGVbase are important tools. They provide researchers with a one-stop Web site where up-to-date information can be obtained on a given topic. Such focal points allow for important genetic information to quickly reach the research community as a whole, and for maximum research efficiency.
Likewise, to maximize the impact of genetics research on disease prevention and health promotion, relevant data must be brought together for convenient access by epidemiologists and public health practitioners. Although some HuGE-related information can be found in assorted Web sites, no central database exists today through which scientists, policy makers, and other public health officials can easily access the many different types of information needed for HuGE research and the ensuing development of public health applications. The creation and upkeep of such a database is critical for the continued growth of the HuGE field and the translation of genetic progress into concrete public health practice. The effort of the NOPHG to create such a database will greatly aid the efforts to centralize HuGE data on the World-Wide Web.
References
1. Semple CA. Bases and Spaces: Resources on the Web for Accessing the Draft Human Genome - after publication of the Draft. Genome Biology 2001 June; 2(6): 1-7.
2. Peltonen L, McKusick VA. Dissecting Human Disease in the Postgenomic Era. Science 2001 February; 291: 1224-1229.
3. Omenn GS. Public Health Genetics: An Emerging Interdisciplinary Field for the Post-Genomic Era. Annual Review of Public Health 2000; 21: 1-13.
4. Guttmacher AE. Human Genetics on the Web. Annual Review of Genomics and Human Genetics 2001; 2: 213-233.
5. Norman F. Genetic Information Resources: A New Field for Medical Librarians. Health Libraries Review 1999; 16: 15-28.
6. Pearson PL. Genome mapping databases: data acquisition, storage and access. Current Science 1991; 1: 119-123.
7. Skupski MP, Booker M, Farmer A, et al. The Genome Sequence Database: towards an integrated functional genomics resource. Nucleic Acids Research 1998 November; 27(1): 35-38.
8. Butler D, Smaglik P. Celera genome licensing terms spark concerns over monopoly. Nature 2000 January; 403: 231.
9. Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a Knowledgebase of Human Genes and Genetic Disorders. Nucleic Acids Research 2002; 30(1): 52-55.
- Human Genome Database
- Human Genome Project Working Draft
- LocusLink
- HUGO mutation database
- GenBank (BLAST tools)
- Pharmacogenetics Knowledge Base (PharmGKB)
List of useful HuGE Web sites:
- Breast Cancer Gene Database allows searches by gene name or topic. For each gene, population frequency data are provided as well as the magnitude of disease risk (if known). For some genes there is also information about gene-environment interactions.
- Online Mendelian Inheritance in Man (OMIM) contains a wide range of information on thousands of genetic conditions. Summaries of genetic conditions and mutations may include information about the population frequency of alleles, gene-environment interactions, and gene-gene interactions.
- GeneTests-GeneClinics contains mostly clinical information, but there is also information on the prevalence of alleles, the disease risk associated with certain alleles, and information on genetic testing.
- Office of Public Health Genomics focuses on the genetic basis of disease from a public health perspective. HuGENet™ resources include reviews, fact sheets, case studies, e-journal club discussions, and a database of HuGE published literature.
- Cancer.gov contains updated information on a variety of cancer-related topics, including cancer genetics. HuGE information found on this Web site includes cancer incidence, mortality and survival, as well as information about the risk of disease from genes, family history and environmental factors.
- Public Health Genetics Unit, UK provides news and information about advances in genetics and their impact on public health and disease prevention. Includes links to documents on health service policy in genetics; lists of genetic services in the UK; information on ethical, legal, and social issues in genetics; and summaries of the genetic basis for selected diseases.
- GeneCards is a database of human genes, their products, and their involvement in diseases. Most of the information is clinical, though there is information about gene prevalence and disease risk associated with genes. The database also has convenient links to other Web sites such as OMIM, GDB, and PubMed.