Analyzing our ability to understand each individual human genome

The effort of a large consortium is underscored in the attached exciting articles. What can we infer from the differences between each person’s genetic code, with regard to their individual development and health? Several factors have hampered researchers’ ability to answer this question. First, understanding genetic variation requires analyzing a very large number of sequences, because humans carry many rare variants; while most of these have no effect, a few rare variants do cause genetic diseases. Second, most of our understanding of genetic variation has come from studying single-nucleotide variants (SNVs), but structural DNA variants (SVs; i.e. those that are more than 50 nucleotides long) can have a larger impact on physiological traits, and can be major contributors to disease. Third, we lack an understanding of variation outside protein-coding sequences. Here (see the four attached papers plus editorial), the Genome Aggregation Database (gnomAD) consortium has set out to increase our knowledge in these areas.

The gnomAD project is the successor to the game-changing Exome Aggregation Consortium (ExAC) project, which catalogued genetic variation in the protein-coding parts of the genome (called exomes) from >60,000 people. ExAC set a new standard for harmonized analysis (bringing in data from diverse projects for reanalysis in a common pipeline) and for data-sharing. The ExAC database has had a profound impact on how researchers, physicians and genetic counselors interpret genomes of people with genetic diseases.

In the first of the four [attached] papers, authors describe the gnomAD consortium’s collection of 125,748 exomes and 15,708 whole genomes. The plan to sequence whole genomes is particularly exciting, because analysis of non-coding sequences provides information about both structural variation and variation in DNA sequences that regulate gene expression — described in the companion papers. The gnomAD resource includes sequences from diverse populations — including individuals from Africa and Asia; however, representation from even more diverse populations will be needed to obtain the full spectrum of human variation and to capture more population-specific variation.

In the second of the four [attached] papers, authors investigated why genes that seem intolerant to predicted loss-of-function (pLoF) variants can sometimes carry these variants with (apparently) little consequence. Genes can be transcribed in different ways, with some protein-coding regions (i.e. exons) expressed only in a limited fashion. Authors showed that, when an individual carries a pLoF variant in an ‘intolerant’ gene, the variant is often in an exon that exhibits this restricted expression, thus limiting its effect.

In the third of the four [attached] papers, authors assessed how the pLoF database might improve our ability to identify genetic targets for drugs (i.e. ‘druggable’ targets). Identification of individuals who carry two pLoF variants in a given gene is desirable in drug discovery (i.e. if these individuals also show a change in a particular trait, it provides evidence that the gene could be a good drug target). Authors demonstrated that there are still many errors when identifying pLoF variants; that quality control is needed when identifying these variants; and that instances of an individual carrying two pLoF variants in the same gene are sufficiently rare that we will need cohorts roughly 1,000 times bigger than gnomAD to gather definitive evidence of their existence in most genes.
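
To see roughly why such large cohorts are needed, here is a back-of-envelope sketch in Python. It assumes Hardy-Weinberg proportions (random mating), under which the chance of carrying a pLoF variant on both copies of a gene is about the square of that gene's combined pLoF allele frequency. That assumption, the function name, and the frequency of 0.001 are illustrative choices, not values or methods taken from the papers; only the exome and genome counts come from the text above.

```python
def expected_two_hit_carriers(cohort_size: int, plof_allele_freq: float) -> float:
    """Expected number of people with a pLoF variant on both copies of a gene.

    Assumes Hardy-Weinberg proportions: P(both copies hit) = q**2,
    where q is the gene's combined pLoF allele frequency.
    """
    return cohort_size * plof_allele_freq ** 2


# Cohort size taken from the post: 125,748 exomes + 15,708 whole genomes.
GNOMAD_SIZE = 125_748 + 15_708

# Hypothetical combined pLoF allele frequency for one gene (illustrative only).
ASSUMED_FREQ = 0.001

at_gnomad_scale = expected_two_hit_carriers(GNOMAD_SIZE, ASSUMED_FREQ)
at_1000x_scale = expected_two_hit_carriers(GNOMAD_SIZE * 1000, ASSUMED_FREQ)

print(f"gnomAD-sized cohort: {at_gnomad_scale:.2f} expected carriers")
print(f"1,000x larger cohort: {at_1000x_scale:.2f} expected carriers")
```

Under these assumed numbers, a gnomAD-sized cohort would be expected to contain fewer than one such individual for the gene, whereas a cohort 1,000 times larger would contain on the order of a hundred. Genes with rarer pLoF variants would need even more people, since the expected count falls with the square of the allele frequency.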

In the last of the four [attached] papers, authors produced a catalogue of structural variants. There have been excellent efforts at cataloguing structural variants — using long-read sequencing technology. However, sample sizes have been small, owing to the expense and lack of standardized analysis pipelines for this approach (although this situation is expected to improve in the near future). In contrast, identifying structural variants in short-read sequences is technically challenging, because the variants are often larger than a typical short-sequence read, and they can arise through a variety of mutational mechanisms, resulting in many variant types (e.g. duplication, deletion or inversion of DNA), which each leave different footprints in the genome. This has led to the development of many tools for identifying structural variants from short reads, but no ‘standard’ pipeline to date.

The gnomAD resource, like ExAC before it, will change how we interpret individual genomes. This consortium’s work has revealed how much information about human variation we had been missing and has provided tools that help us to better understand the genome at both the population and individual level. 😊 😊


Nature 28 May 2020; 581: 434-443, 444-451, 452-458, 459-464 & editorial pp 385-386
