Gene Expression Barcodes Based on Data from 8,277 Microarrays

Matthew N. McCall, Johns Hopkins Bloomberg School of Public Health
Michael J. Zilliox, Emory University
Rafael A. Irizarry, Johns Hopkins Bloomberg School of Public Health

Abstract

The ability to measure gene expression based on a single microarray hybridization is necessary for microarrays to be a useful clinical tool. In its simplest form, this amounts to estimating whether or not each gene is expressed in a given sample. Surprisingly, this problem is quite challenging and has largely been disregarded in favor of estimating relative expression. We propose addressing this problem by: (1) using the distribution of observed log2 intensities across a wide variety of tissues to estimate an expressed and an unexpressed distribution for each gene, and (2) for each gene in a sample, denoting it as expressed if its observed log2 intensity is more likely under the expressed distribution than under the unexpressed distribution, and as unexpressed otherwise. The first step is accomplished by fitting a hierarchical mixture model to the large body of publicly available data. To guarantee that each gene will be unexpressed in at least one tissue, we hybridized yeast samples to human microarrays and included these arrays when estimating the distributions. The output of our algorithm is a vector of ones and zeros denoting which genes are estimated to be expressed (ones) and unexpressed (zeros). We call this a gene expression barcode.
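The decision rule in step (2) can be illustrated with a minimal sketch. It is not the barcode package's implementation: it assumes, purely for illustration, that each gene's expressed and unexpressed distributions are normal with already-estimated parameters, and the function names and numeric values below are hypothetical. Step (1), the hierarchical mixture fit, is not shown.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def barcode(log2_intensities, unexpressed_params, expressed_params):
    """Return a list of 0/1 calls, one per gene.

    Each element of the two parameter lists is a (mean, sd) pair for that
    gene. Normality is an assumption of this sketch, not a claim about the
    distributions estimated in the paper.
    """
    calls = []
    for x, (mu_u, sd_u), (mu_e, sd_e) in zip(
        log2_intensities, unexpressed_params, expressed_params
    ):
        loglik_expressed = normal_logpdf(x, mu_e, sd_e)
        loglik_unexpressed = normal_logpdf(x, mu_u, sd_u)
        # Call the gene expressed (1) only if the observed intensity is
        # more likely under the expressed distribution; otherwise 0.
        calls.append(1 if loglik_expressed > loglik_unexpressed else 0)
    return calls


# Hypothetical log2 intensities and per-gene parameters for three genes.
sample = [4.1, 9.8, 6.0]
unexpressed = [(4.0, 0.5), (4.5, 0.5), (5.0, 0.5)]
expressed = [(8.0, 1.0), (9.5, 1.0), (9.0, 1.0)]

print(barcode(sample, unexpressed, expressed))  # prints [0, 1, 0]
```

Because every call depends only on the single sample's intensities and the fixed per-gene parameters, the rule can be applied to one array at a time, which is the property the single-hybridization setting requires.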

To investigate the performance of the barcode algorithm, we use 8,277 publicly available microarrays from Affymetrix’s HGU133a platform. We illustrate the agreement of our algorithm with the results from a controlled experiment and an alternative technology and demonstrate its utility by predicting sample types in two difficult scenarios. The methods described here are implemented in the R package barcode and are currently available for download at http://biostat.jhsph.edu/~mmccall/software/.