In this paper, we provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods for choosing the number of clusters, the clustering algorithm, and the dissimilarity matrix. In particular, we illustrate how the bootstrap can be employed as a statistical method in cluster analysis to establish the reproducibility of the clusters and the overall variability of the clustering procedure. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. We present a new R package, hopach, which implements a hybrid clustering method, Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH). The methodology combines the strengths of both partitioning and agglomerative hierarchical clustering methods. At each node, a cluster is split into two or more smaller clusters with an enforced ordering of the clusters. Collapsing steps, which unite the two closest clusters into one, correct for errors made in the partitioning steps. The hopach function uses the median split silhouette (MSS) criterion to automatically choose (i) the number of children at each node, (ii) which clusters to collapse, and (iii) the main clusters (pruning the tree to produce a partition of homogeneous clusters). The methodology is illustrated with gene expression data.
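As its name suggests, the MSS criterion is built on silhouettes. The following is a minimal sketch of the ordinary silhouette width computed from a dissimilarity matrix, written in Python rather than R so that it depends only on the standard library. It is not the hopach implementation and does not compute the MSS itself. The function name `silhouettes`, the toy points, and the convention of assigning 0 to an element in a singleton cluster are assumptions made for illustration.

```python
from statistics import mean, median


def silhouettes(dist, labels):
    """Return the silhouette width of each element.

    dist   -- symmetric matrix (list of lists) of pairwise dissimilarities
    labels -- cluster label of each element, in the same order as dist
    """
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean dissimilarity from element i to each cluster, excluding i itself.
        avg = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                avg[c] = mean(dist[i][j] for j in members)
        own = labels[i]
        others = [v for c, v in avg.items() if c != own]
        if own not in avg or not others:
            # Assumed convention: singleton cluster, or only one cluster overall.
            widths.append(0.0)
            continue
        a = avg[own]      # cohesion: mean dissimilarity within own cluster
        b = min(others)   # separation: mean dissimilarity to nearest other cluster
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy data (hypothetical): four points on a line forming two obvious groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouettes(dist, [0, 0, 1, 1])  # labels that match the groups
bad = silhouettes(dist, [0, 1, 0, 1])   # labels that cut across the groups

print("median silhouette, matching labels:", median(good))
print("median silhouette, crossing labels:", median(bad))
```

Widths close to 1 indicate elements that sit well inside their own cluster, while negative widths indicate elements that are closer on average to another cluster. Summarizing the widths by their median, as done above, mirrors the "median" in MSS only loosely.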
Bioinformatics | Computational Biology | Genetics | Microarrays | Multivariate Analysis | Numerical Analysis and Computation
Pollard, Katherine S. and van der Laan, Mark J., "Cluster Analysis of Genomic Data with Applications in R" (January 2005). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 167.