Gene Expression Analysis with the Parametric Bootstrap


Published in Biostatistics (2001), 2(3), pp. 1-17.


Recent developments in microarray technology make it possible to capture the gene expression profiles for thousands of genes at once. One very important use of such data is the identification of groups of genes with similar (and interesting) patterns of expression. Currently, cluster analysis and related techniques are being employed; although useful, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Consequently, the analyst obtains neither a significance level for features in the data nor a theoretical basis for purposeful experimental design. These two issues are particularly crucial when dealing with the high-dimensional data structures and, all too often, relatively small samples presented by microarray experiments. In this paper, we propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. This target subset is itself a parameter of interest, which can be estimated from microarray data by applying the rule to sample statistics. We focus on rules that operate on the mean and covariance; we also employ the output of a cluster analysis methodology ("partitioning around medoids" or PAM) to further refine the subset. The parametric bootstrap (based on a multivariate normal model) is used to estimate the distribution of these estimated subsets; relevant summary measures of this distribution are proposed. We prove consistency of the subset estimates and asymptotic validity of this parametric bootstrap under the assumption that the sample size converges to infinity faster than the logarithm of the number of genes. We also prove a conservative sample size formula guaranteeing that the sample mean and sample covariance matrix are uniformly within a specified distance of the population mean and covariance. The practical performance of the method is illustrated with a simulation study. The method is also used to analyze a publicly available data set.
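
The general idea described above (apply a rule to sample statistics, then use a multivariate normal parametric bootstrap to see how stable the selected subset is) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the rule `select_subset` (keep genes whose absolute mean exceeds a threshold), the function names, the threshold parameter, and the per-gene reappearance proportion used as a summary measure are all hypothetical choices. The rule here uses the mean only and omits the covariance-based rules and the PAM refinement.

```python
import numpy as np


def select_subset(mean, threshold):
    """Hypothetical deterministic rule: genes whose absolute mean exceeds threshold."""
    return set(np.flatnonzero(np.abs(mean) > threshold).tolist())


def parametric_bootstrap_subsets(data, threshold, n_boot=200, seed=0):
    """Estimate the subset and how often each gene reappears under resampling.

    data: array of shape (n_samples, n_genes).
    Returns (estimated_subset, reappearance_proportions).
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape

    # Sample statistics that parameterize the multivariate normal model.
    mean_hat = data.mean(axis=0)
    cov_hat = np.cov(data, rowvar=False)

    # Subset estimate: the rule applied to the sample mean.
    estimated = select_subset(mean_hat, threshold)

    # Parametric bootstrap: draw n samples from N(mean_hat, cov_hat),
    # re-apply the rule, and count how often each gene is selected.
    counts = np.zeros(p)
    for _ in range(n_boot):
        resample = rng.multivariate_normal(mean_hat, cov_hat, size=n)
        for j in select_subset(resample.mean(axis=0), threshold):
            counts[j] += 1

    return estimated, counts / n_boot
```

A gene whose reappearance proportion is close to 1 is selected in almost every bootstrap replicate, while a proportion near 0.5 suggests that its membership in the estimated subset is not well determined at the given sample size.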


Bioinformatics | Computational Biology | Design of Experiments and Sample Surveys | Genetics | Microarrays | Multivariate Analysis | Statistical Methodology | Statistical Theory
