Chapter 12 – Genome-wide association studies, field synopses, and the development of the knowledge base on genetic variation and human diseases

Human Genome Epidemiology (2nd ed.): Building the evidence for using genetic information to improve health and prevent disease

“The findings and conclusions in this book are those of the author(s) and do not
necessarily represent the views of the funding agency.”

These chapters were published with modifications by Oxford University Press (2010)

Muin J. Khoury, Lars Bertram, Paolo Boffetta, Adam S. Butterworth, Stephen J. Chanock, Siobhan M. Dolan, Isabel Fortier, Montserrat Garcia-Closas, Marta Gwinn, Julian P. T. Higgins, A. Cecile J. W. Janssens, James M. Ostell, Ryan P. Owen, Roberta A. Pagon, Timothy R. Rebbeck, Nathaniel Rothman, Jonine L. Bernstein, Paul R. Burton, Harry Campbell, Anand P. Chokkalingam, Helena Furberg, Julian Little, Thomas R. O’Brien, Daniela Seminara, Paolo Vineis, Deborah M. Winn, Wei Yu, and John P. A. Ioannidis

Reprinted with permission from the American Journal of Epidemiology, 2009;170:269-279.

The rapid growth in published genetic association studies (1) and the more recent successes of genome-wide association studies (GWAS) in finding disease susceptibility loci for several common diseases (2) present a major challenge for knowledge synthesis and dissemination. Knowledge synthesis is needed to guide further research, drug discovery efforts (3), and translational efforts for personalized risk assessment and therapy. The recent trend for direct-to-consumer advertising of whole genome analysis by several companies underscores the importance of a credible process for data synthesis and evaluation of the validity and utility of claims related to genetic prediction of disease risks (4–7).

In 2008, over 7,000 original articles were published on human genome epidemiology and the annual number has been rising rapidly (Table 12.1) (8). Furthermore, the published literature represents only a fraction of the data actually collected and analyzed. In addition, until recently, most studies have targeted one or a few gene variants (the candidate gene approach), but many new articles report the results of GWAS, which are expected to become increasingly common. More than 400 GWAS have been published in total, and the pace of publication has accelerated since 2007 (8). Only a few of these studies, however, have been deposited into accessible online databases such as the Database of Genotypes and Phenotypes (dbGaP) at the National Library of Medicine (9), the Cancer Genetic Markers of Susceptibility database (CGEMS) at the National Cancer Institute (10), and the Wellcome Trust Case Control Consortium (11). This number is expected to increase under new policies governing data sharing for GWAS (12), although type of access and confidentiality issues may continue to need careful consideration (13).

Despite a massive amount of primary data, the conclusions of genetic association studies are not always clear, requiring an evidence-based synthesis that takes into account the amount of evidence, the extent of replication, and protection from bias. Although approximately 1,000 systematic reviews and meta-analyses have been published since 2001, most have addressed only one or a few specific gene–disease associations at a time (8). Moreover, the amount of accumulated data that needs to be integrated continues to grow rapidly, with high-throughput genotyping platforms raising the challenge exponentially.

As part of ongoing efforts in this field, we report here findings and recommendations from a multidisciplinary workshop, including geneticists, epidemiologists, journal editors, and bioinformatics experts, that was sponsored by the Human Genome Epidemiology Network (HuGENet) and held in Atlanta on January 24–25, 2008. The meeting was convened to discuss synthesis and appraisal of cumulative evidence on genetic associations and to develop a strategy for an online encyclopedia on genetic variation and common human diseases.

Progress in the HuGENet Road Map

HuGENet (14,15) is an informal global collaboration of individuals and organizations interested in accelerating the development of the knowledge base on genetic variation and human health. HuGENet has developed a “road map” (16) with several components: (i) working with genetic epidemiology study platforms (primarily consortia and networks) to improve the execution and output of these groups under the rubric of “Network of Networks” (17); (ii) promoting the publication of methodologically sound genetic association studies with transparent reporting of their methods (STrengthening the REporting of Genetic Associations or STREGA) (18) and avoidance of selective reporting; (iii) developing methods for synthesis and meta-analysis of the literature on genetic associations (the HuGE Review Handbook, version 1.0) (18); and (iv) developing “field synopses” (19) with an online encyclopedia summarizing what we know and what we do not know about genetic associations through a systematic assessment of their cumulative evidence. Such field synopses were also called for in a Nature Genetics 2006 editorial (20).

Field Synopses: Assessing Cumulative Evidence for Genetic Associations

An initial meeting of the Network of Networks in 2005 led to the formation of a working group on methods for assessing cumulative evidence. A workshop organized in Venice, Italy in 2006 (21) generated interim guidelines for grading the cumulative evidence in genetic associations based on three criteria: (i) the amount of evidence; (ii) the extent of replication; and (iii) protection from bias (22). The proposed scheme allows for three categories of descending credibility (A, B, C) for each of these criteria and also for a composite assessment of “strong,” “moderate,” or “weak” credibility (see Appendix and Reference 20 for more details). Briefly, an overall “strong” rating is reserved for an AAA rating, while an overall “weak” rating is reserved for associations with one or more C ratings. The rest are labeled as “moderate.” We note that these ratings could change over time with data accruing from additional studies. The panel also discussed issues of biological and other experimental evidence and of the clinical importance of genetic associations. Pilot studies were planned in selected fields to assess cumulative evidence on gene–disease associations, calibrate the proposed guidelines, and integrate the findings into comprehensive field synopses. As of August 2009, pilot field synopses have been conducted for several diseases including Alzheimer disease, bladder cancer, schizophrenia, preterm birth, and coronary heart disease, as well as DNA repair genes and cancer phenotypes.
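The composite rule described above (AAA is “strong,” any C is “weak,” everything else is “moderate”) can be written down in a few lines. The following is only an illustrative sketch; the function name `composite_credibility` is hypothetical and is not part of the Venice guidelines or of any HuGENet tool.

```python
def composite_credibility(amount: str, replication: str, bias: str) -> str:
    """Combine three Venice grades (each 'A', 'B', or 'C') into an overall rating.

    Hypothetical helper: it encodes only the composite rule stated in the text.
    """
    grades = (amount.upper(), replication.upper(), bias.upper())
    if any(g not in {"A", "B", "C"} for g in grades):
        raise ValueError(f"each grade must be A, B, or C; got {grades}")
    if "C" in grades:
        return "weak"       # one or more C ratings
    if grades == ("A", "A", "A"):
        return "strong"     # reserved for AAA
    return "moderate"       # all remaining combinations


print(composite_credibility("A", "A", "A"))  # strong
print(composite_credibility("A", "B", "A"))  # moderate
print(composite_credibility("A", "A", "C"))  # weak
```

Because the grades can change as data accrue, such a function would be re-applied each time a synopsis is updated.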

A field synopsis is a regularly updated snapshot of the current state of knowledge about genetic associations in a particular field of research defined by a disease (e.g., Alzheimer disease), phenotype (e.g., body mass index), or family of genes (e.g., DNA repair genes). The ideal attributes of a field synopsis are that it (a) is freely available; (b) uses online databases that are curated by researchers to develop regularly updated “online tables” on the volume of the evidence and magnitude of the associations between the disease and all genetic variants investigated; (c) uses objective and transparent criteria for grading the credibility of cumulative evidence; (d) summarizes the information in peer-reviewed articles; and (e) updates information on a regular basis. The first field synopsis, the source of AlzGene, the Alzheimer disease genetic association database (23), was developed by Bertram et al. and published in January 2007. This was followed by the publication of a field synopsis on schizophrenia (24) and one on DNA repair genes (25), while three other synopses are under development or peer review.

Experience with Field Synopses to Date

At the HuGENet workshop, several teams presented findings and experiences in developing field synopses, and on grading the epidemiologic evidence according to the interim Venice guidelines (22). Key features of these efforts are summarized in Table 12.2. All synopses include multiple meta-analyses involving large numbers of data sets, except for preterm birth, where evidence is sparse. Researchers performing synopses have used different thresholds or trigger points for conducting a meta-analysis. For example, in the coronary heart disease field synopsis, investigators have considered only those associations for which at least one previous effort has been made to perform a meta-analysis. Data from GWAS have been incorporated in synopses on Alzheimer disease, schizophrenia, DNA repair genes, and bladder cancer. The preterm birth field synopsis points out the need for further research on the genetic contribution to this major public health challenge.

Many associations in the Alzheimer, schizophrenia, and two cancer-related field synopses yielded formally statistically significant results at the p < 0.05 level (Table 12.2). Nevertheless, only a few associations met the designation of “strong” evidence according to the Venice criteria. Similarly, in several synopses, none of the probed associations attained the status of “strong” evidence. Finally, so far, field synopses have examined only one main phenotype, except in the case of DNA repair genes. In addition to main effects, synopses have investigated genetic effects according to different genetic models, and for subgroups—for example, subgroups based on exposure, ethnic group, participant characteristics, or phenotypic subgroups. Decisions to undertake additional analyses need to be made on the basis of data availability. For example, in most field synopses, investigators were able to assess different genetic models. Often, available epidemiologic evidence may be stronger for one genetic model than for another. By contrast, there have been relatively fewer subgroup analyses based on exposures and participant characteristics, because of suboptimal reporting of these factors in genetic epidemiology studies, a deficiency that the STREGA guidance aims to address (18).

Insights from Current Field Synopses

The pilot field synopses provided detailed insight about the grading process in the three specified areas: amount of evidence, replication, and protection from bias. They identified limitations that will help refine the current approach.

Amount of Evidence

For amount of evidence, synopses have used a classification scheme based on the sample size of the minor genetic group (participants or alleles, depending on the genetic model). This is a simple measure that is readily available and has a close connection to power, Bayes factors, or false discovery rate (22). For candidate-gene variants, several postulated associations fail to reach grade A evidence (see Table 12.3). Currently, with large collaborative efforts stemming from GWAS and subsequent replication studies, this is likely to be less of a problem at least for common variants with a frequency greater than 5%–10%. For variants with lower frequency, very large sample sizes may be required. Nevertheless, some consortia have the potential of reaching even sample sizes exceeding 100,000, which means more than 1,000 for the minor allele, even for variants that occur in 0.5% of the general population. For example, the international consortium on osteoporosis (Genetic Factors for Osteoporosis, GEFOS) funded by the European Commission includes 61 studies with 133,333 participants, and for at least 14 of these studies, investigators have already conducted or plan to conduct GWAS. We may need to revisit the criteria on amount of evidence once we have a better sense of the effect sizes regularly encountered for more rare variants.
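The arithmetic behind the 100,000-participant example can be sketched as follows. This is a rough illustration under stated assumptions: the 0.5% figure is read as a minor allele frequency, each participant contributes two alleles, and the carrier count assumes Hardy-Weinberg proportions. Both function names are hypothetical.

```python
def expected_minor_allele_copies(participants: int, minor_allele_freq: float) -> float:
    """Expected number of minor-allele copies (two alleles per diploid participant)."""
    return 2 * participants * minor_allele_freq


def expected_carriers(participants: int, minor_allele_freq: float) -> float:
    """Expected number of participants carrying at least one minor allele.

    Assumes Hardy-Weinberg proportions: P(carrier) = 1 - (1 - q)^2.
    """
    return participants * (1 - (1 - minor_allele_freq) ** 2)


# About 1,000 minor-allele copies in 100,000 participants at a frequency of 0.5%;
# a consortium exceeding 100,000 participants would therefore exceed 1,000.
print(round(expected_minor_allele_copies(100_000, 0.005)))
# Slightly fewer than 1,000 carriers, since a few participants carry two copies.
print(round(expected_carriers(100_000, 0.005), 1))
```

Which count is relevant (alleles or participants) depends on the genetic model, as noted above.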


Replication

Field synopses have used I² to assign grades for inconsistency (amount of heterogeneity) across studies (i.e., A for <25%, B for 25%–50%, C for >50%) (23–25). One-third to one-half of the formally significant associations have moderate or large I² values. However, I² often has large uncertainty when there are only a few studies (26). Moreover, qualitative epidemiologic considerations about the presence of and potential explanation for heterogeneity would need to be taken into account in judging replication. For example, the association between N-acetyltransferase type 2 (NAT2) variants and bladder cancer risk is expected to be exposure-specific; thus, heterogeneity may readily be expected between populations with different exposures (e.g., different types of tobacco in European populations versus other populations) (27,28).
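As a minimal sketch of how such a grade might be computed, the code below derives the inconsistency statistic from Cochran's Q using the standard definition, I² = max(0, (Q − df)/Q) × 100, and then applies the thresholds quoted above. The inputs (log odds ratios and their standard errors) and the function names are assumptions made for illustration; the synopses themselves may have used different software and conventions.

```python
def i_squared(effects, std_errors):
    """I-squared (percent) from study effects and standard errors.

    Uses inverse-variance weights and Cochran's Q:
    I^2 = max(0, (Q - df) / Q) * 100, with df = number of studies - 1.
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def grade_inconsistency(i2_percent):
    """Apply the thresholds from the text: A for <25%, B for 25%-50%, C for >50%."""
    if i2_percent < 25:
        return "A"
    if i2_percent <= 50:
        return "B"
    return "C"


# Hypothetical log odds ratios from four studies with similar precision.
log_ors = [0.10, 0.12, 0.30, 0.05]
ses = [0.05, 0.06, 0.05, 0.07]
i2 = i_squared(log_ors, ses)
print(round(i2, 1), grade_inconsistency(i2))
```

With only a handful of studies the point estimate of I² is unstable, so a grade derived this way should be read alongside its uncertainty.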

Another consideration is whether I² reflects heterogeneity of estimates around the null value, or heterogeneity in the magnitude of association. The former would question the presence of an association, whereas the latter would question the strength of the association. For instance, even for a consistent association such as the glutathione S-transferase M1 (GSTM1) null genotype and bladder cancer risk, there is some evidence for heterogeneity in the magnitude of the association across studies (28).

However, such epidemiologic insight must be considered with caution, to avoid introducing subjective, speculative processes in the grading. At a minimum, considerations for upgrading or downgrading should be explicit. It may be reasonable to grade as A on this criterion associations that show moderate or high heterogeneity but have an extensive replication record. Such a record would include a p-value for the summary effect (excluding the discovery data set) of p < 10⁻⁷ even in random-effects models that account for between-study heterogeneity, a false-positive report probability less than 10%, or a Bayes factor less than 10⁻⁵.
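To make the false-positive report probability (FPRP) criterion concrete, the sketch below uses one common formulation, FPRP = α(1 − π) / [α(1 − π) + (1 − β)π], where α is the significance threshold, 1 − β is the power, and π is the prior probability that the association is real. It is an assumption that this is the formulation intended here, and the prior and power values in the example are purely hypothetical.

```python
def false_positive_report_probability(alpha: float, power: float, prior: float) -> float:
    """FPRP = alpha*(1 - prior) / (alpha*(1 - prior) + power*prior).

    alpha: significance threshold met by the finding
    power: probability of detecting a true association at that threshold (1 - beta)
    prior: prior probability that the association is real (a judgment call)
    """
    for name, value in (("alpha", alpha), ("power", power), ("prior", prior)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    false_pos = alpha * (1.0 - prior)
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)


# Hypothetical scenario: prior of 1 in 10,000 and 80% power.
# At p < 0.05 nearly every "significant" finding would be a false positive;
# at p < 1e-7 the FPRP falls well below the 10% mark mentioned in the text.
print(false_positive_report_probability(0.05, 0.8, 1e-4))
print(false_positive_report_probability(1e-7, 0.8, 1e-4))
```

The result is sensitive to the assumed prior, which is why any such threshold needs to be stated together with the prior used.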

For example, the apparent heterogeneity in the effect of NAT2 slow acetylation on bladder cancer risk can be explained by differences in the pattern of tobacco smoking across study populations (28). However, the presence of heterogeneity would reflect even in these cases the possibility that, bias set aside, one would need to identify the sources of heterogeneity in subsequent studies. These could include not only differential effects under different exposures, but also the possibility that the association is with a correlated phenotype and not the one tested (e.g., the fat mass and obesity-associated gene, diabetes, and obesity) (29), the impact of the different ascertainment schemes used in different studies (30), genotype misclassification (especially in isolated candidate gene studies), or a marker polymorphism that is in variable linkage disequilibrium with the causative variant across the populations (31). The latter scenario could become common in associations that emerge out of “agnostic” GWAS, where it is unlikely that the causal variant will be directly identified. In the setting of GWAS, it is easy to check whether linkage disequilibrium structures are different in different populations; in the presence of similar linkage disequilibrium structure, a cause of heterogeneity can be quickly excluded. It has been demonstrated that beyond a given threshold of inconsistency, no matter how large the studies we conduct, we may never have enough power to replicate an association (nonreplicability threshold) (32).

Another issue is the ability of the cumulative evidence to exclude an association based on lack of replication. It is notable that the Venice criteria include, under “replication C,” also the possibility of “no association and failed replication,” based on traditional nonsignificant results for the meta-analysis. Minute effects can never be excluded, and in fact, in GWAS, many true associations yield modest results that do not cross genome-wide association p-value thresholds or have equivalently low false report probability rates. Many true findings do not rise to the top of the single nucleotide polymorphism p-value ranks in phase 1 of a genome-wide association study (33). Despite extremely large sample sizes and cumulative meta-analyses of many GWAS, many associations may remain undiscovered and/or inconclusive. The Venice criteria should not be used to conclude that there is strong evidence for a null association.

Protection from Bias

A research finding cannot reach sufficient credibility (>50%) unless the probability of a false-positive association is less than the prestudy odds of an association’s being true (34). The Venice criteria include an extensive checklist for sources of biases in different settings. The checklist has different considerations depending on whether the evidence comes from retrospective meta-analyses of published data or prospective GWAS and replication studies from collaborative consortia with harmonization of data collection and analysis.

Bias checks that have been adopted in these synopses for retrospective meta-analysis include automated checks that can be readily applied to all meta-analyses of published data. These are shown in Table 12.4, along with a list of issues that need to be considered. General checks (that can be applied automatically to all fields) have the advantage of being objective and unambiguous, but they cannot provide definitive proof for the presence or absence of bias. For instance, a small effect size (e.g., odds ratio < 1.15) could be explained by bias, but many of the confirmed associations between single nucleotide polymorphisms and chronic diseases are of this order of magnitude. Therefore, small effect sizes, if seen consistently across many studies and with no evidence for publication bias, should not be automatically penalized. For prospective evidence, such as data accumulated from one or more GWAS with prospective replication across several teams in a consortium or prospective meta-analysis of many GWAS from collaborative studies (35), the considerations are quite different. Here, the small magnitude of effect size should not be invoked as evidence of lack of protection from bias, and similarly small-study effect bias or an excess of single studies with significant findings is not an issue, provided there is no selective reporting of results (there is no reason for such selective reporting in a consortium).

For example, in the schizophrenia synopsis (24), of the 24 associations with nominal statistical significance, 9 associations were graded as “A” and 15 as “C” for “protection from bias.” The main reasons for low grades were a small summary odds ratio (odds ratio < 1.15) in what are retrospective meta-analyses of published data (n = 6 associations), and loss of significance after excluding the initial study (n = 6). Less common reasons were loss of significance after excluding studies that violated Hardy-Weinberg equilibrium and significant differences in effect between small and larger studies.

Issues to Consider for Moving Forward

Defining Thresholds for Evaluating Credibility

The threshold for considering an association for further assessment must be defined in each synopsis, but it may be difficult to reach full consensus on this issue. Given that current synopses have used a large amount of evidence from candidate gene studies, most have considered for grading all probed associations that pass very lenient levels of statistical significance in meta-analysis (typically, p < 0.05 inferred from random-effects calculations). However, experience to date indicates that associations with grade A for the amount of evidence but p-values just below 0.05 have either very small effects (and get a C for protection from bias if a retrospective meta-analysis) or moderate/large heterogeneity (and thus get a B or C for replication consistency). Even for such associations that stem from the candidate gene era, it is uncommon to get a rating of “strong” epidemiologic evidence grading (AAA), unless the p-value for the summary effect is substantially lower. Associations that arise out of GWAS require an even more demanding threshold. Thresholds may be set based either on p-value criteria for genome-wide significance or using Bayesian approaches, of which there are several variants (36–39).

In view of the potential multiplicity of phenotypes examined and analyses performed, some authors believe that the rigorous criteria for statistical significance used in GWAS should be applied to candidate gene-derived associations. If so, p-values of 10⁻⁷ or lower would be required for a locus to be considered “confirmed” (40,41). Figure 12.1a shows the distribution of p-values of the loci identified by GWAS for binary outcome phenotypes and which have been included in the National Human Genome Research Institute (NHGRI) GWAS catalog as of October 14, 2008 (42,43). Of the 466 entries in the catalog, after excluding those pertaining to studies that did not reach any hits with p < 10⁻⁵ and those that had nonbinary outcomes, 223 loci are included here. As shown, fewer than two-thirds of them (142/223) have a p-value < 10⁻⁷ and only 39% (87/223) have a p-value < 10⁻¹⁰. When several studies and data sets are combined in genome-wide investigations, typically researchers have used pooled, stratified, or simple fixed effects analyses; random effects or other approaches that also take into account the heterogeneity between data sets often would have yielded even more conservative p-values (44). This suggests that the majority of signals emerging from current GWAS and early replication efforts do not yet cross stringent levels of “genome-wide significance.” This further highlights the need to include far more data from additional GWAS and replication data sets, and this can be routinely accomplished in the setting of field synopses collating all of this information.

Bayesian approaches offer the advantage of allowing different prior probabilities for an association being present based on external evidence (thus bridging agnostic and candidate approaches) (36–39). These methods also allow consideration of the impact of different assumptions about the genetic effect sizes. Empirical evidence from GWAS can offer insight about typical discovered effects. Figure 12.1b shows the distribution of the odds ratios (typically per allele, as reported in the NHGRI catalog) (42,43) in the 223 GWAS-discovered loci. As shown, the median effect corresponds to an odds ratio of 1.28, and the same median is seen for the 142 associations with p < 10⁻⁷ (Figure 12.1c). These estimates may be inflated compared to the true effects, due to the “winner’s curse” phenomenon (inflation of effects selected based on significance thresholds) (45,46). A median true odds ratio of 1.1–1.2 is therefore reasonable for these associations, and some effects may be even smaller, while exceptions of large odds ratios are probably uncommon. Nevertheless, one should acknowledge that the effect of the causal factor that is in the neighborhood of the tagging polymorphism may be larger, and we cannot yet exclude the possibility of considerably larger odds ratios for low frequency variants (47). Such variants were not assessed in the first wave of GWAS, but they are being increasingly targeted in current and future efforts (48,49).
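The size of the winner's curse can be illustrated with a simple model. The sketch below assumes that the estimated log odds ratio is normally distributed around the true value with a known standard error, and that only estimates exceeding a one-sided z threshold are reported. Under those assumptions the expected reported estimate follows from the mean of a truncated normal distribution. The model, the function names, and the numerical inputs are all illustrative assumptions, not the method used in references 45 and 46.

```python
import math


def _normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _normal_upper_tail(x: float) -> float:
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def expected_selected_log_or(true_log_or: float, se: float, z_crit: float) -> float:
    """Mean of the estimated log OR given that estimate / se exceeds z_crit.

    Truncated-normal mean: mu + se * pdf(a) / upper_tail(a),
    where a = (z_crit * se - mu) / se.
    """
    cutoff = z_crit * se
    a = (cutoff - true_log_or) / se
    return true_log_or + se * _normal_pdf(a) / _normal_upper_tail(a)


# Hypothetical inputs: true OR of 1.15, standard error of 0.04 on the log scale,
# and z_crit = 5.33 (approximately a two-sided p-value of 1e-7).
true_or = 1.15
reported = expected_selected_log_or(math.log(true_or), 0.04, 5.33)
print(f"true OR {true_or:.2f}, expected reported OR {math.exp(reported):.2f}")
```

In this toy setting a modest true effect is reported, on average, as a noticeably larger one, which is qualitatively consistent with the suggestion that true effects are smaller than the median reported odds ratio.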

As more synopses accrue, we can examine the stability of the Venice grading for various associations. This will help us understand whether some types of associations can change from having weak credibility to having strong credibility (and vice versa). As is described below, gathering empirical evidence into field synopsis databases will allow greater insight in the assessment of cumulative evidence on genetic associations.

Defining Conglomerate Evidence

It is already established practice for hypotheses about specific postulated associations to be tested using data from combinations of prospective consortia analyses stemming from GWAS and their meta-analyses and replication studies; possibly several consortia working on the same disease and phenotypes; additional scattered studies by teams that are not included in any of the consortia; and even retrospective meta-analyses encompassing some/many/all of these sources of data. Such “conglomerate evidence” from various sources of data may appear in various time sequences. The Venice criteria suggested that one should consider the highest possible level of evidence when data come from disparate sources. Perhaps the best currently available source is a well-designed prospective consortium analysis including several teams that have performed GWAS and replications. The results of such an analysis should have a much greater weight than the results of scattered smaller studies. If the consortium evidence results in “strong” evidence, it would not be reasonable to underrate this evidence because of a few small, scattered, inconclusive studies. However, the challenge will become more serious when many consortia with one or more genome-wide platforms are available, and when the scattered or retrospectively meta-analyzed data are much larger in amount than the original consortium-level data on which the reported association was based. Dealing efficiently with this situation requires transparent and comprehensive availability of the evidence from these diverse studies as discussed below.

Global Collaboration: From Data to Knowledge

After reviewing pilot field synopses, participants in the HuGENet workshop discussed how to link emerging data on genetic associations with other sources of information on the biology of genes and gene–disease relations. Clearly, the advent of GWAS in large-scale collaborative studies involving networks and consortia is a crucial first step toward the generation of large-scale data sets. Furthermore, the deposition of these data in accessible public databases can help to address the problem of publication bias commonly seen in candidate gene association studies. Nevertheless, additional efforts are needed to transform data into a knowledge base. Systematic reviews and meta-analyses represent a crucial step in building the knowledge base on genetic variation and human health. Such efforts need to be transparent and their results made available in online databases and publications. The willingness of journal editors to contribute to these efforts is critical, as investigators and systematic reviewers struggle to gain academic recognition for their work, which is often part of multinational, multiple investigator studies. Finally, the National Library of Medicine has a leading role in linking genetic association studies with other existing databases on gene sequences, products, and linkages to disease processes (50).

At the HuGENet workshop, a vision emerged of collaboration to create a sustainable, credible knowledge base on genetic variation and human diseases. As shown in Figure 12.2, the collaboration involves research investigators, systematic reviewers, online publishers, and database developers with variable degrees of overlap among the groups. For example, investigators who are part of research consortia have their own informatics tools and databases, and they can conduct systematic reviews of their own field based on their own data or also including data from teams external to the consortium. In addition, other reviewers could contribute to these efforts, as evidenced by many previous efforts in meta-analyses and Human Genome Epidemiology (HuGE) reviews. Figure 12.2 shows the flow from generation of new data to systematic appraisal and synthesis and to online dissemination via journals and databases.

A successful example of collaboration already exists in the field of type 2 diabetes. Investigators from diverse consortia have combined efforts to conduct comprehensive meta-analyses of all GWAS and replication studies. A first meta-analysis combined three GWAS with a total of over 10,000 samples; this was followed by a second stage of replication of the most interesting signals in over 22,000 independent samples and a subsequent third stage of replication on over 57,000 samples, with data being combined by means of formal meta-analysis methods (51). Similar meta-analyses are being designed and carried out by collaborating consortia in several other fields (for example, the Psychiatric GWAS Consortium, which is conducting meta-analyses within and between five psychiatric disorders).

Several global collaborations focused on genotype-phenotype correlations can help support fields where large-scale studies are still in the making. For example, HuGENet sponsors the HuGE Navigator (5), a knowledge base with online tools for capturing and organizing the most up-to-date information on genetic associations and other related information. The Human Variome Project (HVP) is focused on the production and synthesis of gene- and gene-variant-centered databases with linked phenotypic outcomes (52). The Public Population Project in Genomics (P3G) (53) aims to harmonize data collected from large-scale cohort studies and biobanks around the world. Cross links among HuGENet, P3G, HVP, and other groups are crucial to convene and facilitate collective efforts in developing the knowledge base on genetic variation and human diseases. Efforts in coordinating these global collaborations are already under way through cross-linking of these enterprises. For example, P3G has an international working group in epidemiology and biostatistics that is closely related to the HuGENet movement. Another, more specialized online knowledge base development effort that can be synergistic is PharmGKB (the Pharmacogenomics Knowledge Base) (54). In addition, GeneReviews are expert-authored, peer-reviewed disease descriptions focused on the use of genetic testing in the diagnosis, management, and genetic counseling of patients and their families. GeneReviews are part of the GeneTests Web site, which also includes international directories of genetics clinics and genetics laboratories (55,56). Finally, it is important for epidemiologic efforts to be linked with biological efforts, including experimental work, assessment of endophenotypes, and functional studies in different model systems.

Schizophrenia: Field Synopsis and Example of Development of a Knowledge Base

As an example of the collaboration among primary investigators, systematic reviewers, and online publishers, Bertram et al. provide a model approach to a distributed knowledge base of genetic variants that features collaboration among the three groups outlined above. They have synthesized primary research on genetic associations in schizophrenia and developed a regularly updated online database, SzGene, that collects and curates published results in this area. A peer-reviewed field synopsis summarizes the cumulative evidence and evaluates it according to the Venice criteria. The field synopsis is regularly updated online with updated cumulative meta-analyses. Bertram et al. have developed similar resources for Alzheimer disease.

The HuGE Navigator Web site serves to link field-specific efforts like SzGene with other online databases through the HuGEpedia. The HuGEpedia can be accessed by using either a phenotype (Phenopedia) or a gene (Genopedia) as the starting point. For example, searching the Phenopedia for schizophrenia leads users to a page that provides an up-to-date summary of genes studied for association with schizophrenia, links to abstracts of the original publications in PubMed, meta-analyses and HuGE reviews, and abstracted meta tables. The HuGE Navigator can also be searched to locate investigators in the field and to display geographic and temporal trends in the published literature. Finally, HuGE Navigator attempts to identify and link to all published GWAS in the field, as well as to data sets deposited and available through the National Center for Biotechnology Information’s (NCBI) dbGaP. Although HuGE Navigator is not a comprehensive data repository, it serves as a first stop for orientation and links to more authoritative data sources and field synopses. The highest level of data integration in this example occurs through links with NCBI databases (such as PubMed, Entrez Gene, and dbGaP). The NCBI online book, Genes and Diseases (57), could also expand to accommodate the most current synopses in individual fields.

Concluding Remarks

This is a crucial time in human genomics research, when advances in genome-wide analysis platforms coupled with declining costs are producing an unprecedented outpouring of replicated genetic associations with common diseases. To make the most of the research enterprise and to promote reliable and timely knowledge synthesis, the multidisciplinary working group offers the following recommendations.

First, data from GWAS should be made available to interested researchers to avoid selective reporting of positive but spurious associations and to facilitate meta-analyses of particular associations. Involvement of the primary investigators of the GWAS in collaborative projects and meta-analyses should be encouraged, because errors and misconceptions may be introduced if the primary investigators, who are intimately familiar with the data, are not involved. Second, researchers and research networks should develop field synopses that use meta-analysis to integrate published and unpublished data and evaluate the cumulative evidence. The Venice guidelines offer interim guidance, and further empirical research is needed to assess the stability and implementation of these guidelines. Third, we encourage the development of field-specific databases, such as the SzGene database discussed above. Fourth, we encourage journal editors to publish field synopses with regular updates, as called for by Nature Genetics in 2006 (20). Fifth, we recommend that journals and online publishers develop and make widely available databases that include standardized and systematically collected information from original research for research synthesis. The HuGE Navigator is one approach presented here, but others could emerge in the future. The rapidity of data accumulation necessitates such a systematic approach as a starting point for evaluating the gaps in our knowledge base. To succeed, these efforts depend on collaboration fueled by the availability of funding, not only for generating original research data but also for efforts in research synthesis and dissemination. Finally, we need to ensure that the epidemiologic research synthesis discussed here is accompanied by critical appraisal and synthesis of biologic research. The combination of epidemiology and biology is crucial to enhance the credibility of genetic associations and to accelerate their applications in clinical medicine and population health.

  1. Lin B, Clyne M, Walsh M, et al. Tracking the epidemiology of human genes in the literature: the HuGE published literature database. Am J Epidemiol. 2006;164:1–4.
  2. Topol E, Murray SS, Frazer KA. The genomics gold rush. JAMA. 2007;298:218–221.
  3. Kingsmore SF, Lindquist IE, Mudge J, et al. Genome-wide association studies: progress and potential for drug discovery and development. Nat Rev Drug Discov. 2008;7:221–230.
  4. Hunter DJ, Khoury MJ, Drazen JM. Letting the genome out of the bottle: will we get our wish? N Engl J Med. 2008;358:105–107.
  5. Editorial. Risky business. Nat Genet. 2007;39:1415.
  6. Editorial. Positively disruptive. Nat Genet. 2008;40:119.
  7. NCI-NHGRI Working Group on Replication in Association Studies. Replicating genotype-phenotype associations. Nature. 2007;447:655–660.
  8. Yu W, Gwinn M, Clyne M, et al. A navigator for human genome epidemiology. Nat Genet. 2008;40:124–125. Also available at
  9. National Library of Medicine. Database on genotypes and phenotypes (dbGaP).
  10. National Cancer Institute. Cancer genetic markers of susceptibility.
  11. Wellcome Trust Case-Control Consortium.
  12. National Institutes of Health. Policy for sharing data obtained in NIH supported or conducted genome-wide association studies (GWAS).
  13. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4:e1000167.
  14. Khoury MJ, Dorman JS. The Human Genome Epidemiology Network. Am J Epidemiol. 1998;148:1–3.
  15. Centers for Disease Control and Prevention. The Human Genome Epidemiology Network (HuGENet).
  16. Ioannidis JP, Gwinn M, Little J, et al. A road map for efficient and reliable human genome epidemiology. Nat Genet. 2006;38:3–5.
  17. Ioannidis JP, Bernstein J, Boffetta P, et al. A network of investigator networks in human genome epidemiology. Am J Epidemiol. 2005;162:302–304.
  18. HuGENet workshop. (STREGA).
  19. Little J, Higgins JPT, eds. The HuGENet™ HuGE Review Handbook, version 1.0.
  20. Editorial. Embracing risk. Nat Genet. 2006;38:1.
  21. .
  22. Ioannidis JP, Boffetta P, Little J, et al. Assessment of cumulative evidence on genetic associations: interim guidelines. Int J Epidemiol. 2008;37:120–132.
  23. Bertram L, McQueen MB, Mullin K, et al. Systematic meta-analyses of Alzheimer genetic association studies: the AlzGene database. Nat Genet. 2007;39:17–23.
  24. Allen NC, Bagade S, McQueen MB, et al. Systematic meta-analyses and field synopsis of genetic association studies in schizophrenia: the SzGene Database. Nat Genet. 2008;40:827–834.
  25. Vineis P, Manuguerra M, Kavvoura FK, et al. A field synopsis on low-penetrance variants in DNA repair genes and cancer susceptibility. J Natl Cancer Inst. 2009;101(1):24–36.
  26. Ioannidis JP, Patsopoulos NA, Evangelou E. Uncertainty in heterogeneity estimates in meta-analyses. BMJ. 2007;335:914–916.
  27. García-Closas M, Malats N, Silverman D, et al. NAT2 slow acetylation and GSTM1 null genotypes increase bladder cancer risk: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet. 2005;366:649–659.
  28. Rothman N, Garcia-Closas M, Hein DW. Commentary: reflections on G. M. Lower and colleagues’ 1979 study associating slow acetylator phenotype with urinary bladder cancer: meta-analysis, historical refinements of the hypothesis, and lessons learned. Int J Epidemiol. 2007;36:23–28.
  29. Frayling TM, Timpson NJ, Weedon MN, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316:889–894.
  30. Burton PR, Palmer LJ, Jacobs K, et al. Ascertainment adjustment: where does it take us? Am J Hum Genet. 2000;67:1505–1514.
  31. Ioannidis JP. Non-replication and inconsistency in the genome-wide association setting. Hum Hered. 2007;64:203–213.
  32. Moonesinghe R, Khoury MJ, Liu T, et al. Required sample size and nonreplicability thresholds for heterogeneous genetic associations. Proc Natl Acad Sci USA. 2008;105:617–622.
  33. Thomas G, Jacobs KB, Yeager M, et al. Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet. 2008;40:310–315.
  34. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2:e124.
  35. Zeggini E, Weedon MN, Lindgren CM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science. 2007;316:1336–1341.
  36. Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007;81:208–227.
  37. Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol. 2009;33:79–86.
  38. Wacholder S, Chanock S, Garcia-Closas M, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst. 2004;96:434–442.
  39. Ioannidis JP. Calibration of credibility of agnostic genome-wide associations. Am J Med Genet B Neuropsychiatr Genet. 2008;147B:964–972.
  40. Hoggart CJ, Clark TG, De Iorio M, et al. Genome-wide significance for dense SNP and resequencing data. Genet Epidemiol. 2008;32:179–185.
  41. Pe’er I, Yelensky R, Altshuler D, Daly MJ. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genet Epidemiol. 2008;32:381–385.
  42. Hindorff LA, Junkins HA, Manolio TA. A Catalog of Published Genome-Wide Association Studies.
  43. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J Clin Invest. 2008;118:1590–1605.
  44. Ioannidis JP, Patsopoulos NA, Evangelou E. Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE. 2007;2:e841.
  45. Zollner S, Pritchard JK. Overcoming the winner’s curse: estimating penetrance parameters from case-control data. Am J Hum Genet. 2007;80:605–615.
  46. Ioannidis JP. Why most discovered true associations are inflated. Epidemiology. 2008;19:640–648.
  47. Wright AF, Charlesworth B, Rudan I, et al. A polygenic basis for late-onset disease. Trends Genet. 2003;19:97–106.
  48. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008;40:695–701.
  49. Walsh T, McClellan JM, McCarthy SE, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008;320:539–543.
  50. National Center for Biotechnology Information Databases.
  51. Zeggini E, Scott LJ, Saxena R, et al. Meta-analysis of genome-wide association data and large-scale replication identifies several additional susceptibility loci for type 2 diabetes. Nat Genet. 2008;40:638–645.
  52. Cotton RG, Appelbe W, Auerbach AD, et al. Recommendations of the 2006 Human Variome Project meeting. Nat Genet. 2007;39:433–436.
  53. Public Population Project in Genomics (P3G).
  54. Klein TE, Chang JT, Cho MK, et al. Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. Pharmacogenomics J. 2001;1:167–170.
  55. GeneReviews (GeneTests).
  56. Pagon RA. GeneTests: an online genetic information resource for healthcare providers. J Med Libr Assoc. 2006;94:343–348.
  57. National Center for Biotechnology Information Genes and Diseases.
