Upon completion of the first rough draft of the human genome, it was realized that the approximately 20,000 protein-coding genes comprise only about 1.5% of the genome; this portion is now called “the exome”. Only one-fifth of transcription across the human genome is associated with protein-coding genes, indicating that there are at least four times more long non-coding than coding RNA sequences. Large-scale complementary DNA (cDNA) sequencing projects such as FANTOM (Functional Annotation of Mammalian cDNA) reveal the complexity of this transcription. The FANTOM Project began as a consortium in 2000 and had expanded into four Phases (cf. http://www.osc.riken.jp/english/contents/fantom/ for details). With this 2017 publication [attached], FANTOM Phase 5 data have now arrived.
The FANTOM3 Phase had identified ~35,000 non-coding transcripts from ~10,000 distinct loci that bear many signatures of mRNAs, including 5′ capping, splicing, and poly-adenylation, yet have little or no open reading frame (ORF). Unambiguously identifying non-coding RNAs (ncRNAs) within these cDNA libraries is challenging, because it can be difficult to distinguish protein-coding transcripts from non-coding transcripts. Long non-coding RNAs (lncRNAs) are largely heterogeneous and functionally uncharacterized, although they are recognized as pivotal regulators, at numerous levels, of many critical life functions in virtually all tissues. Testis and neural tissues appear to express the greatest abundance of lncRNAs, compared with all other tissue types examined.
In the attached landmark publication, using FANTOM5 cap analysis of gene expression (CAGE) data, the authors integrated multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes, with high-confidence 5′ ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classification of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters.
Incorporating genetic and expression data, the authors show that lncRNAs that overlap trait-associated single-nucleotide variants are specifically expressed in cell types that are relevant to the traits, implicating these lncRNAs in multiple diseases. The authors also demonstrate that lncRNAs that overlap expression quantitative trait loci (eQTL)-associated single-nucleotide variants of messenger RNAs (mRNAs) are co-expressed with the corresponding mRNAs, suggesting potential roles in transcriptional regulation. Combining these findings with conservation data, the authors have identified 19,175 potentially functional lncRNAs in the human genome.
Nature 9 Mar 2017; 543: 199-204