"Supervised Detection of Conserved Motifs in DNA Sequences with cosmo" by Oliver Bembom, Sunduz Keles et al.

U.C. Berkeley Division of Biostatistics Working Paper Series

Title

Supervised Detection of Conserved Motifs in DNA Sequences with cosmo

Authors

Oliver Bembom, Division of Biostatistics, School of Public Health, University of California, Berkeley
Sunduz Keles, Dept. of Statistics & Dept. of Biostatistics & Medical Informatics, University of Wisconsin, Madison
Mark J. van der Laan, Division of Biostatistics, School of Public Health, University of California, Berkeley

Abstract

A number of computational methods have been proposed for identifying transcription factor binding sites from a set of unaligned sequences that are thought to share the motif in question. We here introduce an algorithm, called cosmo, that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question. The algorithm is based on the same two-component multinomial mixture model used by MEME, with stronger reliance, however, on the likelihood principle instead of more ad-hoc criteria like the E-value. The intensity parameter in the ZOOPS and TCM models, for instance, is estimated based on a profile-likelihood approach, and the width of the unknown motif is selected based on BIC. These changes allow cosmo to outperform MEME even in the absence of any constraints, as evidenced by 2- to 3-fold greater sensitivity in some simulation studies. Additional improvements in performance can be achieved by selecting the model type (OOPS, ZOOPS, or TCM) data-adaptively or by supplying correctly specified constraints, especially if the motif appears only as a weak signal in the data. The algorithm can data-adaptively choose between working in a given constrained model or in the completely unconstrained model, guarding against the risk of supplying mis-specified constraints. Simulation studies suggest that this approach can offer 3 to 3.5 times greater sensitivity than MEME. The algorithm has been implemented in the form of a stand-alone C program as well as a web application that can be accessed at http://cosmoweb.berkeley.edu. An R package is available through Bioconductor (http://bioconductor.org).
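To make the modeling ideas in the abstract concrete, the following is a minimal sketch of the log-likelihood under the OOPS (one occurrence per sequence) two-component multinomial mixture, together with a BIC score that could be used to compare candidate motif widths. It is not cosmo's implementation. The function names, the zero-order background model, the fixed (rather than EM-estimated) position weight matrix, the parameter count of 3 free parameters per motif column, and the use of the number of sequences as the sample size in BIC are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def oops_log_likelihood(seqs, pwm, background):
    """Log-likelihood of sequences under a hypothetical OOPS mixture.

    Assumptions (not taken from the paper): each sequence contains exactly
    one motif occurrence, its start position is uniform over all valid
    starts, and non-motif positions follow a zero-order background.
    pwm[k][b] is the probability of base b at motif position k; all
    probabilities must be strictly positive.
    """
    width = len(pwm)
    total = 0.0
    for seq in seqs:
        idx = [ALPHABET.index(c) for c in seq]
        n_starts = len(idx) - width + 1
        if n_starts < 1:
            raise ValueError("sequence shorter than motif width")
        # Log-probability of the whole sequence under background alone.
        bg_ll = sum(math.log(background[i]) for i in idx)
        # Log-odds (motif vs. background) of the window at each start.
        log_ratios = [
            sum(
                math.log(pwm[k][idx[s + k]] / background[idx[s + k]])
                for k in range(width)
            )
            for s in range(n_starts)
        ]
        # Average over start positions, using log-sum-exp for stability.
        m = max(log_ratios)
        log_mean = m + math.log(
            sum(math.exp(r - m) for r in log_ratios) / n_starts
        )
        total += bg_ll + log_mean
    return total


def bic(log_lik, width, n_obs):
    """BIC with an assumed 3 free parameters per motif column."""
    n_params = 3 * width
    return -2.0 * log_lik + n_params * math.log(n_obs)
```

Under these assumptions, one would refit the position weight matrix for each candidate width (for example by EM), compute `bic(oops_log_likelihood(seqs, pwm, background), width, len(seqs))` for each fit, and prefer the width with the lowest score.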

Disciplines

Laboratory and Basic Science Research

Suggested Citation

Bembom, Oliver; Keles, Sunduz; and van der Laan, Mark J., "Supervised Detection of Conserved Motifs in DNA Sequences with cosmo" (July 2006). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 209.
https://biostats.bepress.com/ucbbiostat/paper209

Included in

Laboratory and Basic Science Research Commons

Collection of Biostatistics Research Archive
