High-throughput annotation of full-length long noncoding RNAs (lncRNAs) — success using RNA Capture Long Seq (CLS)

Long noncoding RNAs (lncRNAs), formerly called “long intergenic noncoding RNAs” (lincRNAs), represent a vast and relatively unexplored component of the mammalian genome. They are defined as transcripts of “>200 nucleotides, and up to many thousands of nucleotides, that are transcribed into RNA but not translated into a protein product.” LncRNAs have been associated with certain human complex diseases (e.g. schizophrenia, autism spectrum disorder, and cancers) and are therefore relevant to “gene-environment interactions,” because lncRNAs are yet another form of GENOTYPE that influences the PHENOTYPE (multifactorial trait).

Assignment of lncRNA functions depends on the availability of high-quality annotations of the transcriptome (the full set of RNAs, coding and noncoding, transcribed from DNA). At present, such annotations are still rudimentary: we have little idea of the total number of lncRNAs, and for those that have been identified, transcript structures remain largely incomplete. Projects using diverse approaches have helped to increase both the number and size of available lncRNA annotations. Early gene sets, derived from a mixture of FANTOM cDNA sequencing efforts and public databases, were combined with lncRNA sets discovered through chromatin signatures. More recently, researchers have applied transcript-reconstruction software, but annotation efforts continue to face a necessary compromise between throughput and quality. Hence, there is growing divergence between large automated annotations of uncertain quality and smaller, manually curated annotations of higher confidence (e.g. 101,700 genes for NONCODE versus 15,767 genes for GENCODE version 25).

Annotation incompleteness takes two forms. First, genes may be entirely missing from an annotation; many genomic regions are suspected to transcribe RNA but contain no annotation, including ‘orphan’ small RNAs with presumed long precursors, enhancers, and ultraconserved elements. Second, annotated lncRNAs may represent partial gene structures. Start- and end-sites frequently lack independent supporting evidence, and annotated lncRNAs are shorter and have fewer exons than mRNAs. To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. The authors [attached article] present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues, which resulted in novel transcript models for 3,574 and 561 gene loci, respectively.

CLS approximately doubled the annotated complexity of the targeted loci, outperforming existing short-read techniques. The full-length transcript models produced by CLS enabled the authors to characterize definitively the genomic features of lncRNAs, including promoter and gene structure and protein-coding potential. CLS can therefore remove a long-standing bottleneck in transcriptome annotation by generating manual-quality, full-length transcript models at high-throughput scales.

Nat Genet Dec 2017; 49: 1731–1740
