The ENCODE Project, Round 3 (awesome!)

Less than 2% of the human genome (more than 3 billion base pairs) encodes proteins. A major challenge for genomics has been mapping the functional elements in the remaining 98% of our DNA: the regions that determine how strongly genes are expressed, at what times, and in which cell types. The Encyclopedia of DNA Elements (ENCODE) project, among other large collaborative efforts, was established in 2003 to create a catalog of these functional elements and to outline their roles in regulating gene expression. In nine papers in the 30 Jul 2020 issue of Nature, the ENCODE consortium delivers the third phase of this valuable project.

ENCODE 1 (2007), the pilot phase of the project, searched for functional elements in 1% of the genome in a few human cell lines. The consortium catalogued two types of elements: [a] DNA regions that are transcribed into RNA (both protein-coding and non-protein-coding); and [b] DNA regions that regulate gene transcription, known as cis-regulatory elements (CREs). CREs can be identified by their accessibility to DNA-cleaving enzymes such as DNase I, by their binding to proteins such as transcription factors, or by specific molecular modifications on the histone proteins to which DNA is bound in a complex called chromatin.

ENCODE 2 (2012), the second phase of the project, extended the search for these functional elements to the whole genome in more human cell lines, laying a solid foundation for the encyclopedia. This included analysis of cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy of DNA segments by transcription factors and RNA-binding proteins. Similar efforts were extended to the mouse genome in 2014, deepening our understanding of these elements from an evolutionary perspective.

In the current third phase of ENCODE [see attached article, plus Perspectives on pp 693-698 and seven other articles starting on pages 711, 720, 729, 737, 744, 752 & 760], the consortium moved from cell lines to cells taken directly from human and mouse tissues, which provides a more biologically relevant encyclopedia. ENCODE 3 also introduced assays to investigate broader aspects of functional elements [e.g. to characterize elements embedded in RNAs, or to analyze chromatin looping, which brings separate CREs into close proximity to enable gene regulation (illustrated in Fig. 1 of the editorial)].

In the flagship article [attached], The ENCODE Project Consortium provides a bird’s-eye view of the updated encyclopedia, which contains newly added data sets from 6,000 experiments performed on ~1,300 samples. By integrating these data sets, the consortium has created an online registry of candidate CREs. Most are classified as promoters or enhancers (i.e. CREs located, respectively, near or at some distance from the genomic site at which transcription of a gene is initiated). The consortium tracked the activity of each candidate CRE (along with the proteins that bind to it) in many different samples from various tissues. The authors used chromatin-looping data to link enhancers to the genes that they most likely regulate. This online registry marks a true milestone, turning an overwhelming amount of genomic information into a searchable, filterable and retrievable encyclopedia of DNA elements, freely accessible at https://screen.encodeproject.org.

The Perspectives and other companion articles, plus papers in several of Nature’s sister journals [see go.nature.com/encode], delve deep into the biology behind this project. These studies leveraged the scale and variety of the ENCODE data sets to reveal the principles that govern how functional elements work. Together, all these papers demonstrate the value of large-scale data production in biology. By integrating selected data types associated with gene regulation, the authors have developed a registry of 926,535 human and 339,815 mouse candidate CREs, covering 7.9% (human) and 3.4% (mouse) of their respective genomes. There are enough data here to provide any single person with a year of reading and studying. The authors constructed a web-based server, SCREEN [http://screen.encodeproject.org], to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an EXPANSIVE resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes. [whew] 😊

DwN
Nature 30 Jul 2020; 583: 699-710 & Editorial pp 685-686

COMMENT:
Hey Dan, I agree this is a great step forward. I have a couple of comments. In para 1 of your email summary, you say “~2% of the human genome is coding,” leaving ~98% uncharacterized. Now in para 7 you state that “7.9% of the genome is now ascribed a function.” That leaves 92.1% still uncharacterized.

The ENCODE project in phase 2 was criticized because it claimed that a very high percentage of the genome was functional (about 82%, if I recall correctly). I think the problem was that they counted repeat elements as functional elements. Repeat elements make up ~78% of the genome; this was attacked as a misrepresentation.

The question now is: how much can be accounted for, and how much is terra incognita? If 82% + 7.9% is reasonable, that covers very near to 90%, with some assigned role as a repeat, or as a functional element. I think there are about 1,500 transcription factors; I do not know how many they have assayed by ChIP-seq. Presumably there is room for more discovery. Again, this whole project is fantastic. It is also good to know what remains still to be discovered!

This entry was posted in Center for Environmental Genetics.