"Cluster Analysis of Genomic Data with Applications in R" by Katherine S. Pollard and Mark J. van der Laan

U.C. Berkeley Division of Biostatistics Working Paper Series

Title

Cluster Analysis of Genomic Data with Applications in R

Authors

Katherine S. Pollard, Center for Molecular Science and Engineering, University of California, Santa Cruz
Mark J. van der Laan, Division of Biostatistics, School of Public Health, University of California, Berkeley

Abstract

In this paper, we provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods in choosing the number of clusters, the choice of clustering algorithm, and the choice of dissimilarity matrix. In particular, we illustrate how the bootstrap can be employed as a statistical method in cluster analysis to establish the reproducibility of the clusters and the overall variability of the followed procedure. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. We present a new R package, hopach, which implements the hybrid clustering method, Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH). The methodology combines the strengths of both partitioning and agglomerative hierarchical clustering methods. At each node, a cluster is split into two or more smaller clusters with an enforced ordering of the clusters. Collapsing steps uniting the two closest clusters into one cluster are used to correct for errors made in the partitioning steps. The hopach function uses the median split silhouette (MSS) criterion to automatically choose (i) the number of children at each node, (ii) which clusters to collapse, and (iii) the main clusters (pruning the tree to produce a partition of homogeneous clusters). The methodology is illustrated with gene expression data.
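
As a rough illustration of the kind of criterion the abstract refers to, the sketch below computes ordinary silhouette widths from a dissimilarity matrix and summarizes a partition by their median. It is not the MSS criterion itself, nor the hopach implementation: MSS applies the silhouette idea to splits within clusters, whereas this shows only the plain silhouette. The function names `silhouettes` and `median_silhouette` and the toy data are assumptions made for this example.

```python
from statistics import median


def silhouettes(dist, labels):
    """Silhouette width of each element, given a square dissimilarity
    matrix (list of lists) and one cluster label per element."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean dissimilarity from element i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to or len(mean_to) < 2:
            # Singleton cluster, or only one cluster overall: use s = 0 by convention.
            widths.append(0.0)
            continue
        a = mean_to[own]                                        # within-cluster mean
        b = min(v for c, v in mean_to.items() if c != own)      # nearest other cluster
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def median_silhouette(dist, labels):
    """Median silhouette width, used here as a simple score for a partition."""
    return median(silhouettes(dist, labels))


# Toy data (hypothetical): four points on a line, dissimilarity = absolute difference.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = [0, 0, 1, 1]   # groups nearby points together
bad = [0, 1, 0, 1]    # mixes distant points

print(median_silhouette(dist, good), median_silhouette(dist, bad))
```

A partition that groups nearby elements scores closer to 1, while one that mixes distant elements scores lower or negative, so comparing such scores across candidate partitions is one simple way to choose among them.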

Suggested Citation

Pollard, Katherine S. and van der Laan, Mark J., "Cluster Analysis of Genomic Data with Applications in R" (January 2005). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 167.
https://biostats.bepress.com/ucbbiostat/paper167

Included in

Bioinformatics Commons, Computational Biology Commons, Genetics Commons, Microarrays Commons, Multivariate Analysis Commons, Numerical Analysis and Computation Commons

Collection of Biostatistics Research Archive
