Abstract
In this paper, we provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods for choosing the number of clusters, the clustering algorithm, and the dissimilarity matrix. In particular, we illustrate how the bootstrap can be employed as a statistical method in cluster analysis to establish the reproducibility of the clusters and the overall variability of the procedure followed. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. We present a new R package, hopach, which implements the hybrid clustering method Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH). The methodology combines the strengths of both partitioning and agglomerative hierarchical clustering methods. At each node, a cluster is split into two or more smaller clusters with an enforced ordering of the clusters. Collapsing steps, which unite the two closest clusters into one, are used to correct for errors made in the partitioning steps. The hopach function uses the median split silhouette (MSS) criterion to automatically choose (i) the number of children at each node, (ii) which clusters to collapse, and (iii) the main clusters (pruning the tree to produce a partition of homogeneous clusters). The methodology is illustrated with gene expression data.
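As a rough illustration of the silhouette idea that underlies the MSS criterion, the sketch below computes ordinary silhouette widths from a dissimilarity matrix and a set of cluster labels. It is written in Python with NumPy only, does not use or reproduce the hopach package, and computes the plain average silhouette rather than the median split silhouette itself. The function name `silhouette_widths` and the toy data are assumptions made for this example.

```python
import numpy as np


def silhouette_widths(dist, labels):
    """Silhouette width of each element, given a dissimilarity matrix and labels.

    Hypothetical helper for illustration; not part of the hopach package.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("silhouettes need at least two clusters")

    widths = np.zeros(labels.size)
    for i in range(labels.size):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster gets silhouette 0
        # a: mean dissimilarity to the other members of the element's own cluster
        # (dist[i, i] is 0, so dividing the sum by n_own - 1 excludes the element itself)
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        if denom > 0:
            widths[i] = (b - a) / denom
    return widths


# Toy data (assumed for this example): four points on a line, two obvious groups.
points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])

good = np.array([0, 0, 1, 1])  # groups nearby points together
bad = np.array([0, 1, 0, 1])   # mixes the two groups

good_score = silhouette_widths(dist, good).mean()
bad_score = silhouette_widths(dist, bad).mean()
print(good_score > bad_score)
```

A labelling that keeps nearby elements together scores a higher average silhouette than one that mixes the groups, which is the property a silhouette-based criterion relies on when comparing candidate numbers of clusters.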
Disciplines
Bioinformatics | Computational Biology | Genetics | Microarrays | Multivariate Analysis | Numerical Analysis and Computation
Suggested Citation
Pollard, Katherine S. and van der Laan, Mark J., "Cluster Analysis of Genomic Data with Applications in R" (January 2005). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 167.
https://biostats.bepress.com/ucbbiostat/paper167