The Post-GWAS Era: From Association to Function

After the discovery of the structure of DNA in 1953 and the deciphering of the genetic code in the 1960s, the field of human genetics focused largely on understanding the structure and function of protein-coding genes, and on how rare mutations in these genes might cause disease or increase the risk of disease. Furthermore, the central dogma of molecular biology held that “genes are first transcribed into messenger RNA (mRNA), after which the mRNA is translated into protein.” Because of the straightforward nature of the genetic code, it SEEMED easy to predict how alterations of the underlying DNA sequence would change the gene product (the amino-acid sequence of the resulting protein). In addition, it was clear from Mendelian genetics that diseases “that run in families in predictable patterns” are caused by mutations in a single gene. Beginning with the mapping of the genetic causes of disorders such as sickle-cell anemia and the neurodegenerative disorder Huntington disease, the causative mutations underlying many Mendelian diseases were elucidated by positional cloning, and an important hurdle had been cleared in our understanding of the genetic bases of human disease.

However, many of the most common and (financially and emotionally) burdensome diseases, such as cardiovascular disease, cancer, Alzheimer disease, Parkinson disease, and type-2 diabetes, are typically not caused by single mutations. Such “multifactorial traits” are instead influenced by a combination of multiple genetic, epigenetic, and environmental risk factors, and thus do not follow “simple” Mendelian inheritance patterns. The departure from a “one-gene, one-mutation, one-outcome” model posed a formidable challenge to elucidating the biology of these diseases. Multifactorial traits, by definition, are influenced by many genes (i.e., they are polygenic). Human height, for example, appears to be affected by genetic variation at hundreds if not thousands of loci across the genome. These genetic loci may act additively, or they may interact in non-additive ways (i.e., epistasis, or gene-gene interactions).

While it may not always be necessary to understand the cause of a disease in order to successfully treat it, such a mechanistic understanding certainly increases the likelihood that a successful therapeutic intervention will be achieved. The attached review summarizes what has happened since the first genome-wide association studies (GWAS) of the 2002-2006 era. These studies scan genetic variation across the genome to identify loci that harbor genetic variants [typically single-nucleotide variants (SNVs) or polymorphisms (SNPs)] associated with risk for complex diseases and quantitative traits. The two earliest GWAS that I can find are: the lymphotoxin-α gene (LTA) linked to myocardial infarction (2002) and the complement factor H gene (CFH) linked to age-related macular degeneration (2005). Today, the GWAS era can be considered successful in the sense that thousands of loci have been associated, with statistical significance, with risk for diseases and traits, and a notable number of these loci are well-replicated, suggesting that they are true associations.

Several factors have made it difficult, however, to bridge the gap between the statistical associations linking locus and trait and a functional understanding of the biology underlying disease risk. First, the association of a DNA locus with disease does not specify which variant (or variants) at that locus is actually causing the association (the “causal variant”), nor which gene (or genes) is affected by the causal variant (the “target gene”). The former problem arises because there are often many co-inherited variants in strong linkage disequilibrium (LD; the non-random association of alleles at different loci on the same chromosome in a given population) with the most significant (or “sentinel”) disease-associated variant, together forming a haplotype. Within the haplotype, genetic variants in strong LD often have statistically indistinguishable associations with disease risk; as a consequence, empirical validation might be needed to determine which of the linked variants are functional. Second, more than 90% of disease-associated SNVs are located in non-protein-coding regions of the genome, and many of them are far away from the nearest known gene.
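
To make the LD problem concrete, here is a minimal sketch of how LD between two biallelic variants is commonly quantified, using the coefficient D and the squared correlation r². It is an illustration only, not a method taken from the review: the function name `ld_stats`, the 0/1 allele coding, and the toy haplotype counts are all assumptions, and real analyses work with phased genotype data from a reference population.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    `haplotypes` is a list of (allele_at_locus_A, allele_at_locus_B) pairs,
    one pair per phased chromosome, with alleles coded 0 or 1.
    This input format is an assumption made for the illustration.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # 1-1 haplotype frequency

    # D measures how far the observed haplotype frequency departs from
    # what independent inheritance of the two alleles would predict.
    d = p_ab - p_a * p_b

    # r2 rescales D by the allele frequencies; it is undefined if either
    # locus is monomorphic in the sample.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Toy data (invented for illustration): 100 chromosomes in each scenario.
perfect_ld = [(1, 1)] * 50 + [(0, 0)] * 50
partial_ld = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
no_ld = [(1, 1)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25 + [(0, 0)] * 25

for name, haps in [("perfect", perfect_ld), ("partial", partial_ld), ("none", no_ld)]:
    d, r2 = ld_stats(haps)
    print(f"{name}: D = {d:.3f}, r2 = {r2:.3f}")
```

In the "perfect" scenario the two variants always travel together (r² of 1), so any association test gives them the same signal. That is the situation in which statistics alone cannot say which variant is causal.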

Am J Hum Genet 3 May 2018; 102: 717–730
