"GC-Content Normalization for RNA-Seq Data" by Davide Risso, Katja Schwartz et al.

U.C. Berkeley Division of Biostatistics Working Paper Series

Title

GC-Content Normalization for RNA-Seq Data

Authors

Davide Risso, Department of Statistical Sciences, Università degli Studi di Padova
Katja Schwartz, Department of Genetics, Stanford University
Gavin Sherlock, Department of Genetics, Stanford University
Sandrine Dudoit, Division of Biostatistics and Department of Statistics, University of California, Berkeley

Abstract

Background: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof.

Results: We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq.

Conclusions: Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.
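The abstract does not spell out the three within-lane approaches, so the following is only a rough sketch of the general idea of within-lane followed by between-lane normalization, not the authors' method and not the EDASeq API. It assumes one simple scheme: genes in a lane are grouped into equal-size bins by GC-content and each bin is rescaled to a common median, after which lanes are rescaled to a common median. The function names, the `n_bins` parameter, and the toy data are all hypothetical.

```python
from statistics import median


def within_lane_gc_median(counts, gc, n_bins=4):
    """Rescale one lane's gene counts so every GC bin has the same median.

    counts: per-gene read counts for a single lane (hypothetical input)
    gc:     per-gene GC fraction in [0, 1], in the same gene order
    """
    # Order genes by GC-content and split them into roughly equal-size bins.
    order = sorted(range(len(counts)), key=lambda i: gc[i])
    size = -(-len(order) // n_bins)  # ceiling division
    bins = [order[k:k + size] for k in range(0, len(order), size)]

    # Common target: the median of the per-bin medians.
    bin_medians = [median(counts[i] for i in b) for b in bins]
    target = median(bin_medians)

    out = [float(c) for c in counts]
    for b, m in zip(bins, bin_medians):
        if m > 0:  # leave bins with a zero median untouched
            for i in b:
                out[i] = counts[i] * target / m
    return out


def between_lane_median(lanes):
    """Rescale each lane so that all lanes share the same median count."""
    lane_medians = [median(lane) for lane in lanes]
    target = median(lane_medians)
    return [
        [x * target / m for x in lane] if m > 0 else list(lane)
        for lane, m in zip(lanes, lane_medians)
    ]


# Toy lane in which counts rise with GC-content (an artificial GC effect).
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
lane_a = [10, 12, 20, 24, 40, 48, 80, 96]
lane_b = [5, 6, 10, 12, 20, 24, 40, 48]

# Within-lane step first, then the between-lane step, as in the conclusions.
within = [within_lane_gc_median(lane, gc) for lane in (lane_a, lane_b)]
normalized = between_lane_median(within)
```

In this toy example the within-lane step removes the artificial trend of counts against GC-content inside each lane, and the between-lane step then puts the two lanes on a common scale. Real data would call for the methods and diagnostics described in the paper itself.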

Disciplines

Bioinformatics | Computational Biology

Suggested Citation

Risso, Davide; Schwartz, Katja; Sherlock, Gavin; and Dudoit, Sandrine, "GC-Content Normalization for RNA-Seq Data" (August 2011). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 291.
https://biostats.bepress.com/ucbbiostat/paper291

Included in

Bioinformatics Commons, Computational Biology Commons

Collection of Biostatistics Research Archive
