Hybrid Clustering of Gene Expression Data with Visualization and the Bootstrap


Paper copy available from biostat@berkeley.edu. Include a surface mail address with your request.


Large-scale gene expression studies are becoming increasingly common as new technologies make it possible to capture expression profiles for thousands of genes at once. One important goal with these high-dimensional data structures is to find biologically important subsets and clusters of genes. In this paper, we propose a hybrid clustering method, Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH), which builds a hierarchical tree of clusters. The methodology combines the strengths of both partitioning (or divisive) and agglomerative clustering methods. At each node, a cluster is split into two or more smaller clusters with an enforced ordering of the clusters. We propose to visualize the clusters at any level of the tree by plotting the distance matrix with its rows and columns arranged according to the ordering of the clusters and an ordering of the genes within the clusters. A collapsing step that unites the two closest clusters into one cluster can be used to correct for errors in the number of clusters. A final ordered list of genes is obtained by running down the tree completely, possibly interleaved with collapsing steps. Visual comparison of the distance matrix for different levels of the tree with the final distance matrix typically identifies the main clustering structure. After the clusters have been identified, the bootstrap can be used to establish their reproducibility and the overall variability of the procedure followed. The power of the methodology is illustrated with simulated and publicly available data sets consisting of cell lines from a variety of tumors.
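
As a rough illustration of two of the steps described above, the collapsing step and the ordered distance matrix, the following Python sketch may help. It is not the HOPACH implementation. The function names, the use of Euclidean distance, the choice of mean between-cluster distance to define the "closest" clusters, and the ordering of genes by distance to the cluster medoid are all assumptions made for this example, and clusters are simply ordered by label rather than by the enforced ordering the method uses.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows (genes) of X (assumed metric)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def collapse_closest(dist, labels):
    """Merge the two clusters with the smallest mean between-cluster distance.

    The linkage used here is an assumption; the abstract only says that the
    two closest clusters are united.
    """
    ks = np.unique(labels)
    best = None
    for i, a in enumerate(ks):
        for b in ks[i + 1:]:
            d = dist[np.ix_(labels == a, labels == b)].mean()
            if best is None or d < best[0]:
                best = (d, a, b)
    merged = labels.copy()
    if best is not None:
        merged[merged == best[2]] = best[1]  # relabel cluster b as cluster a
    return merged

def order_genes(dist, labels):
    """Order genes by cluster label, then by distance to the cluster medoid."""
    order = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        sub = dist[np.ix_(members, members)]
        medoid = sub.sum(axis=1).argmin()  # member with smallest total distance
        order.extend(members[np.argsort(sub[medoid], kind="stable")])
    return np.array(order)

# Toy data: six "genes" in three initial clusters, two of which lie close together.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0],
              [1.2, 0.0], [10.0, 0.0], [10.2, 0.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

dist = pairwise_distances(X)
merged = collapse_closest(dist, labels)    # clusters 0 and 1 are united
order = order_genes(dist, merged)
ordered_dist = dist[np.ix_(order, order)]  # matrix to plot as an image
```

Plotting `ordered_dist` as a heat map (for example with matplotlib's `imshow`) should show the clusters as blocks of small distances along the diagonal, which is the kind of picture the visual comparison across tree levels relies on.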


Bioinformatics | Computational Biology | Multivariate Analysis
