Assembly of a pan-genome from 910 people of African descent: ~10% more DNA than the current human reference genome!

Since the first publication of the “complete human genome sequence” in February 2001 (which was hardly “complete”, and whose champagne celebration at the National Library of Medicine in Bethesda I, fortunately or unfortunately, attended), the human genome “consensus sequence” has undergone continual improvement, aimed at filling in all the “gaps” and correcting errors. The latest release, GRCh38, spans 3.1 gigabases (Gb; billion bases), with “just” 875 remaining gaps. The ongoing effort to improve the human reference genome, led by the international Genome Reference Consortium (GRC), has in recent years added alternative loci for genomic regions where variation cannot be captured by single-nucleotide variants (SNVs) or small insertions and deletions (indels). These alternative loci, which comprise 261 scaffolds in GRCh38, capture a small amount of population variation and improve read-mapping for “some” datasets.

Despite these efforts, the current human reference genome is derived primarily from a single individual, which limits its usefulness for genetic studies, especially among admixed populations such as those representing the African diaspora (human migrations out of Africa). In recent years, a growing number of researchers have emphasized the importance of capturing sequencing data from diverse populations and incorporating these data into the reference genome. The alternative loci in GRCh38 offer one possible way to add such diversity, although it is unclear whether this solution is sustainable as more and more distinct ethnic populations are sequenced.

The lack of diversity in the reference genome poses many challenges when analyzing individuals whose genetic background does not match the reference. This problem may be partly addressed by using large databases of known SNVs, but that solution only covers SNV differences and small indels; it is not adequate for larger variants (e.g. copy number variants (CNVs) and large insertions and deletions of hundreds or thousands of bases). Findings from the 1000 Genomes Project indicate that differences between populations are quite large: examination of 26 populations across five continents revealed that 86% of discovered variants were present in only one continental group. In that study, the five African populations examined (which have existed for the longest period of time on this planet) had the highest number of variant sites, compared with the remaining 21 populations.

One way to address the limitations of a single reference genome is to sequence and assemble reference genomes for other subpopulations. The 1000 Genomes Project, Genome in a Bottle, and other projects have assembled draft genomes from various populations, including Chinese, Korean, and Ashkenazi individuals. Other groups have used highly homogeneous populations (e.g. Danish, Dutch, or Icelandic), together with assembly-based approaches, to discover SNVs and structural variants, including up to several megabases of non-reference sequence common to these populations. The authors [see attached article] used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that are present in these individuals but missing from the human reference genome.

The authors aligned 1.19 trillion reads from the 910 individuals with the GRCh38 reference genome, collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). They then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome that are missing from the reference genome. Their analysis revealed 296,485,284 bp present in populations of African descent but absent from the reference, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome! Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic (i.e. in between protein-coding genes).
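
For readers who want to check the "~10%" figure, here is a minimal arithmetic sketch in Python. It uses only numbers quoted in this post; the rounded 3.1 Gb reference length and the function names are my own assumptions for illustration, not anything taken from the article's methods.

```python
# Figures quoted in the post above. REFERENCE_BP is the rounded 3.1 Gb
# span of GRCh38 (an approximation, not the exact assembly length).
NOVEL_BP = 296_485_284          # non-reference sequence reported for the 910 individuals
REFERENCE_BP = 3_100_000_000    # GRCh38, rounded to 3.1 gigabases
TOTAL_READS = 1.19e12           # reads aligned against GRCh38
INDIVIDUALS = 910


def percent_extra(novel_bp: int, reference_bp: int) -> float:
    """Return the novel sequence as a percentage of the reference length."""
    return 100.0 * novel_bp / reference_bp


def reads_per_individual(total_reads: float, individuals: int) -> float:
    """Return the average number of reads contributed by each individual."""
    return total_reads / individuals


print(f"Extra DNA relative to reference: {percent_extra(NOVEL_BP, REFERENCE_BP):.2f}%")
print(f"Average reads per individual: {reads_per_individual(TOTAL_READS, INDIVIDUALS):.3e}")
```

With these inputs the extra sequence works out to roughly 9.6% of the reference length, which is where the rounded "~10%" comes from, and each individual contributed on average about 1.3 billion reads.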


Nat Genet Jan 2019; 51: 30–35
