Chapter 10 - STrengthening the REporting of Genetic Association studies (STREGA)—an extension of the STROBE statement Tables
Human Genome Epidemiology (2nd ed.): Building the evidence for using genetic information to improve health and prevent disease
Julian Little, Julian P. T. Higgins, John P. A. Ioannidis, David Moher, France Gagnon, Erik von Elm, Muin J. Khoury, Barbara Cohen, George Davey Smith, Jeremy Grimshaw, Paul Scheet, Marta Gwinn, Robin E. Williamson, Guang Yong Zou, Kimberley Hutchings, Candice Y. Johnson, Valerie Tait, Miriam Wiens, Jean Golding, Cornelia M. van Duijn, John McLaughlin, Andrew Paterson, George Wells, Isabel Fortier, Matthew Freedman, Maja Zecevic, Richard A. King, Claire Infante-Rivard, Alexandre Stewart, and Nick Birkett
|Item||Item number||STROBE Guideline||Extension for Genetic Association Studies (STREGA)|
|Title and Abstract||1||(a) Indicate the study’s design with a commonly used term in the title or the abstract.
(b) Provide in the abstract an informative and balanced summary of what was done and what was found.
|Background rationale||2||Explain the scientific background and rationale for the investigation being reported.|
|Objectives||3||State specific objectives, including any pre-specified hypotheses.||State if the study is the first report of a genetic association, a replication effort, or both.|
|Study design||4||Present key elements of study design early in the paper.|
|Setting||5||Describe the setting, locations and relevant dates, including periods of recruitment, exposure, follow-up, and data collection.|
|Participants||6||(a) Cohort study—Give the eligibility criteria, and the sources and methods of selection of participants. Describe methods of follow-up.
Case-control study—Give the eligibility criteria, and the sources and methods of case ascertainment and control selection. Give the rationale for the choice of cases and controls.
Cross-sectional study—Give the eligibility criteria, and the sources and methods of selection of participants.
(b) Cohort study—For matched studies, give matching criteria and number of exposed and unexposed.
Case-control study—For matched studies, give matching criteria and the number of controls per case.
|Give information on the criteria and methods for selection of subsets of participants from a larger study, when relevant.|
|Variables||7||(a) Clearly define all outcomes, exposures, predictors, potential confounders, and effect modifiers. Give diagnostic criteria, if applicable.||(b) Clearly define genetic exposures (genetic variants) using a widely used nomenclature system. Identify variables likely to be associated with population stratification (confounding by ethnic origin).|
|Data sources/measurement||8*||(a) For each variable of interest, give sources of data and details of methods of assessment (measurement). Describe comparability of assessment methods if there is more than one group.||(b) Describe laboratory methods, including source and storage of DNA, genotyping methods and platforms (including the allele calling algorithm used, and its version), error rates and call rates. State the laboratory/center where genotyping was done. Describe comparability of laboratory methods if there is more than one group. Specify whether genotypes were assigned using all of the data from the study simultaneously or in smaller batches.|
|Bias||9||(a) Describe any efforts to address potential sources of bias.||(b) For quantitative outcome variables, specify if any investigation of potential bias resulting from pharmacotherapy was undertaken. If relevant, describe the nature and magnitude of the potential bias, and explain what approach was used to deal with this.|
|Study size||10||Explain how the study size was arrived at.|
|Quantitative variables||11||Explain how quantitative variables were handled in the analyses. If applicable, describe which groupings were chosen, and why.||If applicable, describe how effects of treatment were dealt with.|
|Statistical methods||12||(a) Describe all statistical methods, including those used to control for confounding.
(b) Describe any methods used to examine subgroups and interactions.
(c) Explain how missing data were addressed.
(d) Cohort study—If applicable, explain how loss to follow-up was addressed.
Case-control study—If applicable, explain how matching of cases and controls was addressed.
Cross-sectional study—If applicable, describe analytical methods taking account of sampling strategy.
(e) Describe any sensitivity analyses.
|State software version used and options (or settings) chosen.|
|(f) State whether Hardy–Weinberg equilibrium was considered and, if so, how.
(g) Describe any methods used for inferring genotypes or haplotypes.
(h) Describe any methods used to assess or address population stratification.
(i) Describe any methods used to address multiple comparisons or to control risk of false-positive findings.
(j) Describe any methods used to address and correct for relatedness among subjects.
|Participants||13*||(a) Report the numbers of individuals at each stage of the study—e.g., numbers potentially eligible, examined for eligibility, confirmed eligible, included in the study, completing follow-up, and analyzed.
(b) Give reasons for nonparticipation at each stage.
(c) Consider use of a flow diagram.
|Report numbers of individuals in whom genotyping was attempted and numbers of individuals in whom genotyping was successful.|
|Descriptive data||14*||(a) Give characteristics of study participants (e.g., demographic, clinical, social) and information on exposures and potential confounders.
(b) Indicate the number of participants with missing data for each variable of interest.
(c) Cohort study—Summarize follow-up time (e.g., average and total amount).
|Consider giving information by genotype.|
|Outcome data||15*||Cohort study—Report numbers of outcome events or summary measures over time.
Case-control study—Report numbers in each exposure category, or summary measures of exposure.
Cross-sectional study—Report numbers of outcome events or summary measures.
|Cohort study – Report outcomes (phenotypes) for each genotype category over time.
Case-control study – Report numbers in each genotype category.
Cross-sectional study – Report outcomes (phenotypes) for each genotype category.|
|Main results||16||(a) Give unadjusted estimates and, if applicable, confounder-adjusted estimates and their precision (e.g., 95% confidence intervals). Make clear which confounders were adjusted for and why they were included.
(b) Report category boundaries when continuous variables were categorized.
(c) If relevant, consider translating estimates of relative risk into absolute risk for a meaningful time period.
|(d) Report results of any adjustments for multiple comparisons.|
|Other analyses||17||(a) Report other analyses done—e.g., analyses of subgroups and interactions, and sensitivity analyses.|
|(b) If numerous genetic exposures (genetic variants) were examined, summarize results from all analyses undertaken.
(c) If detailed results are available elsewhere, state how they can be accessed.
|Key results||18||Summarize key results with reference to study objectives.|
|Limitations||19||Discuss limitations of the study, taking into account sources of potential bias or imprecision. Discuss both direction and magnitude of any potential bias.|
|Interpretation||20||Give a cautious overall interpretation of results considering objectives, limitations, multiplicity of analyses, results from similar studies, and other relevant evidence.|
|Generalizability||21||Discuss the generalizability (external validity) of the study results.|
|Funding||22||Give the source of funding and the role of the funders for the present study and, if applicable, for the original study on which the present article is based.|
STREGA = STrengthening the REporting of Genetic Association studies; STROBE = STrengthening the REporting of OBservational studies in Epidemiology.
* Give information separately for cases and controls in case-control studies and, if applicable, for exposed and unexposed groups in cohort and cross-sectional studies.
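Item 12(f) asks authors to state whether Hardy–Weinberg equilibrium was considered and, if so, how. As an illustration only, the following minimal sketch shows one common way such a check is made: a chi-square goodness-of-fit test with one degree of freedom on the three genotype counts at a biallelic variant. The function name and the example counts are hypothetical and are not taken from STREGA; exact tests are often preferred when genotype counts are small, and whichever test is used should be named in the report.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    Takes observed counts of the three genotypes (AA, AB, BB) at a
    biallelic variant and returns (chi-square statistic, p-value).
    Hypothetical helper for illustration; uses 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")

    # Estimate allele frequencies from the observed genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("variant is monomorphic; test is undefined")

    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts among controls.
    chi2, p_value = hwe_chi_square(30, 40, 30)
    print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

A report following item 12(f) would state the test used, the group in which it was applied (for example, controls only), and any threshold used to flag or exclude variants.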
|Specific Issue in Genetic Association Studies||Rationale for Inclusion in STREGA||Item(s) in STREGA||Specific Suggestions for Reporting|
|Main areas of special interest (see also main text).|
|Genotyping errors (misclassification of exposure)||Non-differential genotyping errors will usually bias associations towards the null [65,66]. When there are systematic differences in genotyping according to outcome status (differential error), bias in any direction may occur.||8(b): Describe laboratory methods, including source and storage of DNA, genotyping methods and platforms (including the allele calling algorithm used, and its version), error rates and call rates. State the laboratory/centre where genotyping was done. Describe comparability of laboratory methods if there is more than one group. Specify whether genotypes were assigned using all of the data from the study simultaneously or in smaller batches.
13(a): Report numbers of individuals in whom genotyping was attempted and numbers of individuals in whom genotyping was successful.
|Factors affecting the potential extent of misclassification (information bias) of genotype include the types and quality of samples, timing of collection, and the method used for genotyping [18,61,67].
When high throughput platforms are used, it is important to report not only the platform used but also the allele calling algorithm and its version. Different calling algorithms have different strengths and weaknesses. For example, some of the currently used algorithms are notably less accurate in assigning genotypes to single nucleotide polymorphisms with low minor allele frequencies (<0.10) than to single nucleotide polymorphisms with higher minor allele frequencies. Algorithms are continually being improved. Reporting the allele calling algorithm and its version will help readers to interpret reported results, and it is critical for reproducing the results of the study given the same intermediate output files summarizing intensity of hybridization.
For some high throughput platforms, the user may choose to assign genotypes using all of the data from the study simultaneously, or in smaller batches, such as by plate [71,72]. This choice can affect both the overall call rate and the robustness of the calls.
For case-control studies, whether genotyping was done blind to case-control status should be reported, along with the reason for this decision.
|Population stratification (confounding by ethnic origin)||When study sub-populations differ both in allele (or genotype) frequencies and disease risks, then confounding will occur if these sub-populations are unevenly distributed across exposure groups (or between cases and controls).||12(h): Describe any methods used to assess or address population stratification.||In view of the debate about the potential implications of population stratification for the validity of genetic association studies, transparent reporting of the methods used, or stating that none was used, to address this potential problem is important for allowing the empirical evidence to accrue.
Ethnicity information should be presented (for example, Winker), as should genetic markers or other variables likely to be associated with population stratification. Details of case-family control designs should be provided if they are used.
As several methods of adjusting for population stratification have been proposed, explicit documentation of the methods is needed.
|Modelling haplotype variation||In designs considered in this article, haplotypes have to be inferred because of lack of available family information. There are diverse methods for inferring haplotypes.||12(g): Describe any methods used for inferring genotypes or haplotypes.||When discrete “windows” are used to summarize haplotypes, variation in the definition of these may complicate comparisons across studies, as results may be sensitive to choice of windows. Related “imputation” strategies are also in use [69,76,77].
It is important to give details on haplotype inference and, when possible, uncertainty. Additional considerations for reporting include the strategy for dealing with rare haplotypes, window size and construction (if used), and choice of software.
|Hardy-Weinberg equilibrium (HWE)||Departure from Hardy-Weinberg equilibrium may indicate errors or peculiarities in the data. Empirical assessments have found that 20% to 69% of genetic associations were reported with some indication about conformity with Hardy-Weinberg equilibrium, and that among some of these, there were limitations or errors in its assessment.||12(f): State whether Hardy-Weinberg equilibrium was considered and, if so, how.||Any statistical tests or measures should be described, as should any procedure to allow for deviations from Hardy-Weinberg equilibrium in evaluating genetic associations.|
|Replication||Publications that present and synthesize data from several studies in a single report are becoming more common.||3: State if the study is the first report of a genetic association, a replication effort, or both.||The selected criteria for claiming successful replication should also be explicitly documented.|
|Selection of participants||Selection bias may occur if
(i) genetic associations are investigated in one or more subsets of participants (sub-samples) from a particular study; or
(ii) there is differential nonparticipation in groups being compared; or
(iii) there are differential genotyping call rates in groups being compared.
|6(a): Give information on the criteria and methods for selection of subsets of participants from a larger study, when relevant.
13(a): Report numbers of individuals in whom genotyping was attempted and numbers of individuals in whom genotyping was successful.
|Inclusion and exclusion criteria, sources and methods of selection of sub-samples should be specified, stating whether these were based on a priori or post hoc considerations.|
|Rationale for choice of genes and variants investigated||Without an explicit rationale, it is difficult to judge the potential for selective reporting of study results. There is strong empirical evidence from randomised controlled trials that reporting of trial outcomes is frequently incomplete and biased in favour of statistically significant findings [79-81]. Some evidence is also available in pharmacogenetics.||7(b): Clearly define genetic exposures (genetic variants) using a widely used nomenclature system. Identify variables likely to be associated with population stratification (confounding by ethnic origin).||The scientific background and rationale for investigating the genes and variants should be reported.
For genome-wide association studies, it is important to specify what initial testing platforms were used and how gene variants are selected for further testing in subsequent stages. This may involve statistical considerations (for example, selection of P value threshold), functional or other biological considerations, fine mapping choices, or other approaches that need to be specified.
Guidelines for human gene nomenclature have been published by the Human Gene Nomenclature Committee [83,84]. Standard reference numbers for nucleotide sequence variations, largely but not only SNPs, are provided in dbSNP, the National Center for Biotechnology Information’s database of genetic variation. For variations not listed in dbSNP that can be described relative to a specified version, guidelines have been proposed [86,87].
|Treatment effects in studies of quantitative traits||A study of a quantitative variable may be compromised when the trait is subjected to the effects of a treatment (for example, the study of a lipid-related trait for which several individuals are taking lipid-lowering medication). Without appropriate correction, this can lead to bias in estimating the effect and loss of power.||9(b): For quantitative outcome variables, specify if any investigation of potential bias resulting from pharmacotherapy was undertaken. If relevant, describe the nature and magnitude of the potential bias, and explain what approach was used to deal with this.
11: If applicable, describe how effects of treatment were dealt with.
|Several methods of adjusting for treatment effects have been proposed. As the approach to deal with treatment effects may have an important impact on both the power of the study and the interpretation of the results, explicit documentation of the selected strategy is needed.|
|Statistical methods||Analysis methods should be transparent and replicable, and genetic association studies are often performed using specialized software.||12(a): State software version used and options (or settings) chosen.|
|Relatedness||The methods of analysis used in family-based studies are different from those used in studies that are based on unrelated cases and controls. Moreover, even in the studies that are based on apparently unrelated cases and controls, some individuals may have some connection and may be (distant) relatives, and this is particularly common in small, isolated populations, for example, Iceland. This may need to be probed with appropriate methods and adjusted for in the analysis of the data.||12(j): Describe any methods used to address and correct for relatedness among subjects.||For the great majority of studies in which samples are drawn from large, non-isolated populations, relatedness is typically negligible and results would not be altered depending on whether relatedness is taken into account. This may not be the case in isolated populations or those with considerable inbreeding. If investigators have assessed for relatedness, they should state the method used [89-91] and how the results are corrected for identified relatedness.|
|Reporting of descriptive and outcome data||The synthesis of findings across studies depends on the availability of sufficiently detailed data.||14(a): Consider giving information by genotype.
15: Cohort study – Report outcomes (phenotypes) for each genotype category over time.
Case-control study – Report numbers in each genotype category.
Cross-sectional study – Report outcomes (phenotypes) for each genotype category.|
|Volume of data||The key problem is of possible false-positive results and selective reporting of these. Type I errors are particularly relevant to the conduct of genome-wide association studies. A large search among hundreds of thousands of genetic variants can be expected by chance alone to find thousands of false positive results (odds ratios significantly different from 1.0).||12(i): Describe any methods used to address multiple comparisons or to control risk of false positive findings.
16(d): Report results of any adjustments for multiple comparisons.
17(b): If numerous genetic exposures (genetic variants) were examined, summarize results from all analyses undertaken.
17(c): If detailed results are available elsewhere, state how they can be accessed.
|Genome-wide association studies collect information on a very large number of genetic variants concomitantly. Initiatives to make the entire database transparent and available online may supply a definitive solution to the problem of selective reporting.
Availability of raw data may help interested investigators reproduce the published analyses and also pursue additional analyses. A potential drawback of public data availability is that investigators using the data second-hand may not be aware of limitations or other problems that were originally encountered, unless these are also transparently reported. In this regard, collaboration of the data users with the original investigators may be beneficial. Issues of consent and confidentiality [92,93] may also complicate what data can be shared, and how. It would be useful for published reports to specify not only what data can be accessed and where, but also briefly mention the procedure. For articles that have used publicly available data, it would be useful to clarify whether the original investigators were also involved and if so, how.
The volume of data analyzed should also be considered in the interpretation of findings.
Examples of methods of summarizing results include giving distribution of P values (frequentist statistics), distribution of effect sizes, and specifying false discovery rates.
Source: Reprinted from (94) with permission of the Annals of Internal Medicine; the European Journal of Epidemiology; the European Journal of Clinical Investigation; Genetic Epidemiology; Human Genetics; the Journal of Clinical Epidemiology, and PLoS Medicine.
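The "Volume of data" row notes that a search among hundreds of thousands of variants can be expected to yield thousands of false positive results by chance alone, and items 12(i) and 16(d) ask authors to describe and report any adjustment for multiple comparisons. The sketch below is illustrative only and assumes independent tests: it works through that arithmetic and shows two widely used adjustments, a Bonferroni threshold and the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The function names, the number of variants, and the P values are hypothetical and are not part of STREGA.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests with P < alpha when every null is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the sorted indices (positions in p_values) of the tests
    declared significant at the requested false discovery rate.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest P value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= k * fdr / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is declared significant.
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    n_variants = 500_000  # hypothetical genome-wide panel size
    print("Expected false positives at alpha = 0.05:",
          round(expected_false_positives(n_variants, 0.05)))
    print("Bonferroni per-test threshold:",
          bonferroni_threshold(n_variants))

    # Hypothetical P values for four variants.
    p_values = [0.5, 0.001, 0.3, 0.008]
    print("Significant at FDR 0.05 (indices):",
          benjamini_hochberg(p_values))
```

Whichever method is chosen, the report should name it, give the threshold or false discovery rate applied, and state how many variants were tested, so that readers can judge the findings against the volume of data analyzed.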