Recurrent functional misinterpretation of RNA-Seq data caused by sample-specific gene length bias

These days, researchers can easiy analyze profiles of entire tissue, or even one-cell, transcriptomes (sum total of all messenger RNA molecules expressed from all genes in the sample). This used to be carried out by expression microarrays, but now RNA-sequencing (RNA-Seq) is a method that has transformed biological research over the last decade — from a formidable task to one that is readily accessible to most small laboratories. Consequently, RNA-Seq is one of the most widely used techniques in biological and medical research and is routinely applied for multiple goals — including elucidation of key transcriptional networks that drive different biological processes, and identification of diagnostic and prognostic expression signatures for many diseases, to monitoring the effects of a drug or environmental toxicant on individual tissues or cell-types.

Data normalization is critical in RNA-Seq studies, allowing for accurate estimation and detection of differential expression. The aim of normalization — is to remove any artifactual effects that occur in the data (so as to ensure that technical bias has minimal impact on the results). Numerous normalization methods are now standardized for RNA-Seq data — including reads per kilobase of transcript length per million reads (RPKM), edgeR’s Trimmed Mean of M values (TMM), DESeq’s relative log expression (RLE), and upper-quartile (UQ) normalization.

A well-known inherent technical effect in RNA-Seq experiments relates to gene length and stems from the fact that in standard RNA-seq protocols, RNA (or cDNA) molecules are fragmented prior to sequencing — in such a way that longer transcripts are sheared into more fragments than shorter ones are. Therefore, the number of reads for a given transcript is proportional not only to its expression level but also to its length. Thus, RPKM divides gene counts by gene length (in addition to library size), aiming to adjust expression estimates for this length effect. A well-known consequence — [of the fact that longer genes tend to get more counts than equally expressed shorter genes] — is over-representation of long genes among the ones that pass statistical tests for differential expression (termed “length bias”), because of their increased statistical power.

This stochastic sample-specific effect is not corrected by common normalization methods. Authors [see attached article] show that this bias causes recurrent false-positive calls (i.e. calling something significant when it actually is not) by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent misinterpretation of data. Gene sets characterized by markedly short genes (e.g. ribosomal protein genes) or long genes (e.g. extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow integration of gene length as a sample-specific covariate.

Consequently, authors [see attached article] show that, using these normalization methods, their studies led to substantial reduction in GSEA false results while retaining true ones. Further, authors found that application of GSEA tests that take into account gene–gene correlations attenuates false-positive rates caused by gene length bias, but statistical power is decreased, as well. These results [a] emphasize the inspection and correction of sample-specific length biases as default steps in RNA-Seq analysis pipelines and [b] reiterate the need to account for intergene correlations when performing gene-set enrichment tests to lessen false interpretation of transcriptomic data.

This ongoing discussion needs one more opinion — from someone knowledgeable and working “in the trenches” — which should now bring closure to this topic. 😉


PLoS Biol Nov 2019; 17: e3000481

This entry was posted in Center for Environmental Genetics. Bookmark the permalink.