"MODEL-BASED QUALITY ASSESSMENT AND BASE-CALLING FOR SECOND-GENERATION SEQUENCING DATA" by Rafael A. Irizarry and Hector Corrada Bravo

Johns Hopkins University, Dept. of Biostatistics Working Papers

Title

MODEL-BASED QUALITY ASSESSMENT AND BASE-CALLING FOR SECOND-GENERATION SEQUENCING DATA

Authors

Rafael A. Irizarry, Johns Hopkins University, Bloomberg School of Public Health, Department of Biostatistics
Hector Corrada Bravo, Post-doctoral Fellow, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

Abstract

Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, and is capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1,200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achievable within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T, between 30 and 100 characters long), which are the result of base-calling: the complex processing of noisy continuous fluorescence intensity measurements. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this paper we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.

Disciplines

Bioinformatics | Computational Biology

Suggested Citation

Irizarry, Rafael A. and Bravo, Hector Corrada, "MODEL-BASED QUALITY ASSESSMENT AND BASE-CALLING FOR SECOND-GENERATION SEQUENCING DATA" (September 2009). Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 184.
https://biostats.bepress.com/jhubiostat/paper184

Included in

Bioinformatics Commons, Computational Biology Commons

Collection of Biostatistics Research Archive