This work is licensed under a Creative Commons Attribution 4.0 International License
J Psychiatry Brain Sci. 2016;1(1):1; https://doi.org/10.20900/jpbs.20160001
1 State Key Laboratory of Medical Genetics, Central South University, Changsha, China
2 Department of Psychiatry, University of Illinois at Chicago, Chicago, USA
3 Department of Psychiatry and Behavioral Neuroscience, The University of Chicago, Chicago, USA
Correspondence: Dr. Chunyu Liu, Department of Psychiatry, University of Illinois at Chicago
After a brief review of major factors that influence results of genetic studies, we summarize key considerations in designing genetic studies of psychiatric disorders. A sufficient sample size is needed, along with an appropriate threshold for statistical significance. Rigorous quality assessment, control of covariates and population structure, and replication of results in a second data set are critical to avoid spurious findings. Data sharing should be promoted to enable powerful big data analyses. Gradually, functional studies will be incorporated into genetic studies to improve their quality and the biological interpretation of statistical associations.
Incorporating a genetic component can be appealing to an investigator designing a new study of a neuropsychiatric disorder. After all, it is well known that most neuropsychiatric disorders are heritable to one degree or another. Incorporating a genetic component into a study may seem straightforward: draw blood from subjects, submit it to a service lab, get the genotypes back, and test them for an association with the phenotype(s) being studied. In reality, it is considerably more complicated than that. Putting aside the complexity of choosing the proper platforms for genotyping and phenotyping, there are a number of issues that must be addressed in the initial study design if the results are to be replicable and relevant to the current state of knowledge of the disorder in question. The three major issues are effect sizes, Type I error rates, and genetic and phenotypic heterogeneity.
The effect size of a particular gene variant can come as a bit of a surprise to researchers unfamiliar with the genetics of complex traits or common disorders. It is a major feature that sets common disorders apart from rare, Mendelian disorders. For a disease with a prevalence of roughly 1%, a measure of effect size is the ratio of variant frequency in patients to that in controls. The single gene variants for rare diseases such as Huntington’s disease typically have effect sizes greater than 1000, but almost all discovered common variant associations for psychiatric disorders have effect sizes between 1.1 and 1.4. Since sample size and effect size are the primary determinants of statistical power, genome-wide association studies (GWASs) of complex disorders, including neuropsychiatric disorders, require many thousands of samples in a case-control design to produce any genome-wide significant associations.
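As a rough illustration of the effect-size measures mentioned above, the sketch below computes the case-to-control allele frequency ratio and the allelic odds ratio with a Wald 95% confidence interval from a 2x2 table of allele counts. The function names and the example counts are hypothetical, chosen only to land in the 1.1 to 1.4 range discussed in the text, and the calculation assumes that alleles can be treated as independent observations.

```python
import math

def allele_frequency_ratio(case_risk, case_other, control_risk, control_other):
    """Ratio of risk-allele frequency in cases to that in controls."""
    case_freq = case_risk / (case_risk + case_other)
    control_freq = control_risk / (control_risk + control_other)
    return case_freq / control_freq

def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Allelic odds ratio with a Wald 95% confidence interval (log scale)."""
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) for a 2x2 table of counts.
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    z_95 = 1.959963984540054  # two-sided 95% normal quantile
    lower = math.exp(math.log(odds_ratio) - z_95 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z_95 * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical allele counts: 5,000 cases and 5,000 controls (10,000 alleles each).
counts = dict(case_risk=3300, case_other=6700, control_risk=3000, control_other=7000)
print(allele_frequency_ratio(**counts))
print(allelic_odds_ratio(**counts))
```

For a common allele the frequency ratio and the odds ratio differ slightly, so it helps to state which one a reported effect size refers to.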
Here, “genome-wide significant associations” indicates that the association p-value has been corrected for multiple testing, a correction which controls for Type I error. When performing as many tests as are required for a genome-wide association test, the multiple testing burden becomes a major problem. For example, testing 10,000 loci for association with a phenotype at the standard p-value cutoff of 0.05 is expected to produce about 500 false positives, or Type I errors, even if no locus is truly associated, unless some type of statistical correction is applied. Type I error has been the primary concern in small sample studies with low allele frequency variants, which tend to give spurious associations.
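The arithmetic behind the multiple testing burden can be sketched in a few lines. The functions below are illustrative helpers, not part of any package, and they assume that all tests are independent and that every null hypothesis is true. The figure of one million tests is an assumption used only to show how a genome-wide threshold of the order of 5e-8 can arise.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of Type I errors when every null hypothesis is true."""
    return n_tests * alpha

def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one Type I error across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests

# The example from the text: 10,000 loci tested at p < 0.05.
print(expected_false_positives(10_000, 0.05))   # about 500
print(prob_any_false_positive(10_000, 0.05))    # essentially 1
# Assumed one million independent tests across the genome.
print(bonferroni_threshold(1_000_000))          # about 5e-8
```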
Heterogeneity refers to diversity among samples. In neuropsychiatric studies, diversity among the disorder-associated traits present in the case subjects is a particular concern. Clinicians well understand that a single psychiatric diagnosis can cover a broad range of cognitive function or mood changes; problems with psychiatric nosology have been discussed for over a hundred years. This phenotypic heterogeneity among cases is likely one of the reasons that the results of psychiatric case-control studies are often inconclusive or hard to replicate. Efforts are being made to bring clinical diagnostic categories into line with the underlying biology, via new phenotyping constructs such as those in the NIMH’s Research Domain Criteria (RDoC). Any screening that can reduce sample heterogeneity would lead to better power in detecting genetic factors. There has been little success so far, however, so the problem of phenotypic heterogeneity, either within diagnostic categories or across traits like behavior or imaging, must be kept in mind when designing neuropsychiatric association studies.
Case samples can also be, and probably are, genetically heterogeneous, meaning that what appears to be a single phenotype can actually be associated with any one of multiple genetic variants. For example, in psychiatry, we know that almost 30% of adults with the rare 22q11.2 deletion are diagnosed with schizophrenia. However, the vast majority (99%) of schizophrenia patients do not have this deletion, so their disease must be attributable to other genetic and environmental factors (although there could also be common effects of genes within the fairly large 22q11.2 deletion). Schizophrenia is therefore a genetically heterogeneous disorder. Different mutations of different genes can lead to similar clinical phenotypes and diagnoses. Meanwhile, the same mutation could present different clinical features in different individuals, likely due to genetic background differences, which brings us back to the problem of phenotypic heterogeneity.
Genetic heterogeneity can also refer to population stratification, which is a problem in GWASs in general, not just for neuropsychiatric conditions. Population stratification can produce spurious association results when there are allele frequency differences between ethnic groups that are unrelated to disease, so a statistical correction must be incorporated into association study designs.
What follows is a checklist for taking the above issues into account when designing or performing a psychiatric genetic study:
1. Make Sure You Have the Right Samples for the Kind of Study You Want to Perform. There are many choices in designing genetic studies, from family-based to case-control, from genome-wide to candidate gene, and from linkage to association. Selection of methods depends on the specific hypothesis being tested. Detection of rare de novo mutations requires nuclear families. Association tests of common variants work better with large case-control cohorts. Private large effect mutations may be detected through large familial samples. Each type of study requires a particular statistical method, each of which comes with its own null hypothesis and basic assumptions.
2. Perform a Power Analysis. The number of subjects recruited into the study should be based on the expected effect sizes for the corresponding genetic variants and phenotypes in the study (as explained above, they will probably be quite small). One should be prepared to have 80% power to detect the effect at a given Type I error rate, taking the multiple testing correction into account. It should be noted that rare copy number variants and functional de novo mutations are predicted to have larger effect sizes; therefore, studies with smaller sample sizes can be useful as well. Strong effect rare events can be detected with smaller sample sizes than are needed for GWAS by using algorithms that collapse the different events into summary statistics, like gene-wide or region-wide tests.
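A power analysis of this kind can be approximated with the standard normal-approximation sample size formula for comparing two proportions, applied here to allele frequencies in cases and controls. This is a minimal sketch under stated assumptions, not a replacement for a dedicated power calculator: it assumes an allelic test, equal numbers of cases and controls, two alleles per subject treated as independent, and a known control allele frequency and allelic odds ratio. The function names and the example inputs (control frequency 0.3, odds ratios 1.2 and 1.4) are hypothetical.

```python
import math
from statistics import NormalDist

def case_allele_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by the allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)

def subjects_per_group(control_freq, odds_ratio, alpha=5e-8, power=0.80):
    """Approximate cases (and equally many controls) needed for an allelic test."""
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    std_normal = NormalDist()
    z_alpha = -std_normal.inv_cdf(alpha / 2.0)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_power * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    # Each subject contributes two alleles.
    return math.ceil(alleles_per_group / 2.0)

for odds_ratio in (1.2, 1.4):
    print(odds_ratio, subjects_per_group(0.3, odds_ratio))
```

Under these assumptions an odds ratio of 1.2 calls for several thousand cases and as many controls at a genome-wide threshold, and the requirement grows further for rarer alleles or smaller effects.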
3. Select the Appropriate Threshold of Significance. Because of the multiple testing burden, GWASs typically use a p-value of 2e-8 or 5e-8 as a threshold for genome-wide significance. If a study focuses on some candidate variants or genes, the number of linkage disequilibrium (LD)-independent tests could be used to adjust that threshold. False discovery rate (FDR), Bonferroni correction, and permutation are a few commonly used methods for correcting multiple testing inflation of Type I error rates.
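Two of the corrections named above can be sketched directly. The code below applies a Bonferroni correction and the Benjamini-Hochberg step-up procedure, used here as one common way of controlling the FDR, to a list of p-values. The function names and the ten example p-values are made up for illustration, and the tests are assumed to be independent.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg_reject(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR at level q)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * q / m:
            largest_k = rank
    reject = [False] * m
    for index in order[:largest_k]:
        reject[index] = True
    return reject

# Hypothetical p-values from ten candidate variants.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni_reject(pvals)))          # 1 rejection
print(sum(benjamini_hochberg_reject(pvals)))  # 2 rejections
```

The FDR procedure is less conservative than Bonferroni, which is why it rejects one more hypothesis in this toy example.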
4. Perform Rigorous Quality Control (QC) Measures. Insufficient quality control can lead to both false positives and false negatives. Both sample and measurement quality should be carefully assessed. Different experimental platforms have their own specific quality evaluation procedures; for example, microarray-based genotype data should be evaluated for call rates for each sample and each SNP. Read-depth and sequence quality should be assessed for sequencing-based genotype data. Proper filtering should be applied to remove low-quality data before association tests are performed. The Hardy-Weinberg equilibrium test is frequently used to identify some potential genotyping errors. Because of the potential for sample handling errors, identity checks and sex checks are also frequently used to capture errors.
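Two of these checks, call rate and the Hardy-Weinberg equilibrium test, are simple enough to sketch. The code below assumes genotypes coded as 0, 1, or 2 copies of an allele with `None` for a missing call, and uses a one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium. The function names and the cutoffs in `passes_qc` are hypothetical placeholders, not recommended values, and an exact test is often preferred when genotype counts are small.

```python
import math

def call_rate(genotypes):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: no departure can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(genotypes, min_call_rate=0.98, min_hwe_p=1e-6):
    """Apply illustrative (hypothetical) SNP-level QC cutoffs."""
    called = [g for g in genotypes if g is not None]
    counts = [called.count(k) for k in (2, 1, 0)]
    return call_rate(genotypes) >= min_call_rate and hwe_chi2_pvalue(*counts) >= min_hwe_p

snp = [0] * 25 + [1] * 50 + [2] * 25
print(call_rate(snp), hwe_chi2_pvalue(25, 50, 25), passes_qc(snp))
```

In case-control data the Hardy-Weinberg filter is commonly applied to controls only, since a true association can itself cause departures in cases.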
5. Control for the Effects of Co-variates. When the effects of co-variates are not controlled for statistically, they can skew the association results to produce false positives or false negatives. Many biological or environmental factors can be confounded with case-control status, such as sex, age, and drug exposure. There are also many technical artifacts that can interfere with results, such as experimental batch effects; these are the differences that arise between experimental batches due to anything from different reagent lots to different technicians running the experiments to variation in atmospheric pressure on the days the experiments were run. There are programs designed to statistically remove these effects from data, such as ComBat, which removes batch effects from gene expression microarray data. These programs are not a cure-all, so a careful plan for randomized placement of samples in experiments is still required.
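One simple way to plan randomized placement is to shuffle samples within each diagnostic group and then deal them across batches, so that no batch is dominated by cases or by controls. The sketch below is one possible scheme under that assumption; the function name, the sample identifiers, and the fixed seed are hypothetical, and a real design might also balance sex, age, or collection site.

```python
import random
from collections import Counter

def assign_batches(samples, n_batches, seed=0):
    """Randomly place samples in batches, balanced by status.

    samples: list of (sample_id, status) pairs, e.g. ("S001", "case").
    Returns a dict mapping sample_id to a batch index in range(n_batches).
    """
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    by_status = {}
    for sample_id, status in samples:
        by_status.setdefault(status, []).append(sample_id)

    assignment = {}
    offset = 0
    for status in sorted(by_status):
        ids = by_status[status]
        rng.shuffle(ids)  # random order within each status group
        for position, sample_id in enumerate(ids):
            # Deal round-robin; the offset keeps total batch sizes even.
            assignment[sample_id] = (position + offset) % n_batches
        offset += len(ids)
    return assignment

samples = [(f"case{i}", "case") for i in range(24)]
samples += [(f"ctrl{i}", "control") for i in range(24)]
layout = assign_batches(samples, n_batches=4)
print(Counter((layout[sid], status) for sid, status in samples))
```

With batch membership balanced in this way, batch can still be included as a covariate in the association model without being confounded with case-control status.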
6. Account for Population Stratification in Case-control Studies. Ensuring that the case and control groups have comparable population composition is necessary to avoid false positives. Software like STRUCTURE or methods like principal component analysis (PCA) are frequently used to address the problem using unlinked or ancestry informative markers (AIMs) that represent different geographical origins. These methods for controlling population stratification are better used when the stratification is mild. When samples come from very different populations, like from different continental groups, it may be better to study the different population samples separately, followed by a combined statistical analysis.
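The toy example below shows how stratification can create a spurious association and how analysing populations separately and then combining them removes it. It uses a Mantel-Haenszel stratified analysis, which is one standard way to combine strata and is swapped in here for illustration; it is not the STRUCTURE or PCA approach named above. The allele counts are invented: the allele has no effect within either population, but the populations differ in both allele frequency and the proportion of cases. Alleles are treated as independent observations.

```python
import math

def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def pooled_test(tables):
    """Ignore population labels: sum the 2x2 tables and test once."""
    a, b, c, d = (sum(t[i] for t in tables) for i in range(4))
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d) / (b * c), chi2_1df_pvalue(chi2)

def mantel_haenszel_test(tables):
    """Combine per-population tables (Mantel-Haenszel odds ratio and test)."""
    numerator = denominator = deviation = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
        deviation += a - (a + b) * (a + c) / n  # observed minus expected
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    chi2 = deviation ** 2 / variance
    return numerator / denominator, chi2_1df_pvalue(chi2)

# Each table: (case_risk, case_other, control_risk, control_other) allele counts.
population_a = (40, 160, 160, 640)   # allele frequency 0.2 in cases and controls
population_b = (480, 320, 120, 80)   # allele frequency 0.6 in cases and controls
tables = [population_a, population_b]

print(pooled_test(tables))           # inflated odds ratio, very small p-value
print(mantel_haenszel_test(tables))  # odds ratio of 1, no evidence of association
```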
7. Include Validations, i.e., Replication of Results, in Study Design and Budget. Validation of any positive results is essential. It is not enough to observe a significant result in only one data set. Ideally, an appropriate second data set will be available with which to re-run the analysis and confirm the findings. The significance level of the initial findings can be an indicator of the probability that the results will be reproduced. However, the results may not be reproduced in a second data set, due to either false positives in the initial study or sample heterogeneity problems. Note that the significance threshold for replication can be much less stringent than in the initial discovery, which is typically genome-wide, because only the loci with significant associations in the initial data will be tested in the replicate data.
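Both points in this item can be made concrete with a small calculation. The sketch below sets the replication threshold by a Bonferroni correction over only the loci carried forward, and gives a rough estimate of replication power from the discovery p-value. The power estimate rests on strong assumptions that should be read as such: the true effect is taken to equal the observed discovery effect (which ignores the upward bias of significant findings, often called the winner's curse), the test statistic is treated as normal, and its noncentrality is scaled by the square root of the sample size ratio. All function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def replication_threshold(n_loci, family_alpha=0.05):
    """Bonferroni threshold over only the loci taken forward to replication."""
    return family_alpha / n_loci

def replication_power(discovery_p, n_discovery, n_replication, alpha_replication):
    """Rough two-sided power to replicate, assuming true effect = observed effect."""
    z_discovery = -_STD_NORMAL.inv_cdf(discovery_p / 2.0)
    # Noncentrality scales with the square root of the sample size.
    noncentrality = z_discovery * math.sqrt(n_replication / n_discovery)
    z_critical = -_STD_NORMAL.inv_cdf(alpha_replication / 2.0)
    return (_STD_NORMAL.cdf(noncentrality - z_critical)
            + _STD_NORMAL.cdf(-noncentrality - z_critical))

alpha_rep = replication_threshold(n_loci=10)
print(alpha_rep)
for p in (5e-8, 1e-3):
    print(p, replication_power(p, n_discovery=5000, n_replication=5000,
                               alpha_replication=alpha_rep))
```

Under these assumptions, a stronger discovery signal translates into a higher chance of replication in a sample of the same size, which matches the qualitative statement in the text.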
8. Share Data. Since sample size and data quality are critical for the success of genetic studies, data sharing is becoming the trend in the research community, due in part to support from funding agencies and publishers. The NIH now mandates data sharing for large grants with genetics and genomics data. The Psychiatric Genomics Consortium (PGC, http://www.med.unc.edu/pgc/) and Enhancing Neuro Imaging Genetics Through Meta Analysis (ENIGMA, http://enigma.ini.usc.edu/) have set good examples.
9. Remember that Proof of Biological Function is Needed to Substantiate Genetic Statistical Associations. Much, though not all, of the low-hanging fruit in genetic association studies has already been plucked. The field has moved past looking for simple Mendelian factors, and towards identifying functional aspects of the molecular changes underlying challenging complex disorders. Today, pathway analysis, network analysis [15, 16] and in silico prediction of functional variation produced by coding variants are common practice. Functional analysis of non-coding variants is a major challenge facing the field. Expression quantitative trait loci (eQTL) mapping is one way to address the issue, since it identifies variants with potential roles in regulating gene expression. Association with other phenotypes like brain imaging traits is another way of investigating possible functional effects of genetic variants [19, 20]. As more genomic and epigenomic data accumulate and more effort is put into studies of non-coding variants, particularly for human brains, we should acquire more and more evidence as to the functional effects of the genetic variants identified by genetic association studies. Converting statistical associations into biological causalities will be the major task after genetic studies.
To summarize, investigators must be aware that heterogeneity is a particular problem in genetic studies of neuropsychiatric diseases: what is referred to as a single diagnosis can actually represent a range of phenotypic traits, while on the flip side a single phenotypic trait can be produced by multiple alleles. Study design must include sufficient power to detect very small effect sizes, proper quality and covariate control procedures, a validation step, and a significance threshold that will control for Type I error. Ideally, studies will produce functional evidence to support identified statistical associations between genetic variants and phenotypes. Data sharing should be promoted as a common practice of genetic studies.
Liu C, Grennan KS, Gershon ES. Current Practices in Genetics Research of Psychiatric Disorders. J Psychiatry Brain Sci. 2016;1(1):1; https://doi.org/10.20900/jpbs.20160001