Bioinformatics and Biostatistics
Biostatistics with a concentration in Decision Science, MS
Cluster analysis; Biology--Mathematical models; Algorithms; Biomathematics
Determining the best clustering algorithm and the ideal number of clusters for a particular dataset is a fundamental difficulty in unsupervised clustering analysis. In biological research, data generated by Next Generation Sequencing technology and microarray gene expression data are becoming increasingly common, so new tools and resources are needed to group such high-dimensional data using clustering analysis. Different clustering algorithms can group the same data very differently. There is therefore a need to determine the best groupings in a given dataset using the clustering algorithm most suitable for that data. This paper presents the R package optCluster as an efficient way for users to evaluate up to ten clustering algorithms, ultimately determining the optimal algorithm and optimal number of clusters for a given set of data. The selected clustering algorithms are evaluated by as many as nine validation measures classified as “biological”, “internal”, or “stability”, and the final result is obtained through a weighted rank aggregation algorithm based on the calculated validation scores. Two examples using this package are presented, one with a microarray dataset and the other with an RNA-Seq dataset. These two examples highlight the capabilities of the optCluster package and demonstrate its usefulness as a tool in cluster analysis.
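The rank aggregation step mentioned in the abstract can be illustrated with a small sketch. The Python code below is not the optCluster implementation. It assumes each validation measure has already produced an ordered list of candidate "algorithm-number of clusters" pairs (best first), and it finds the consensus ordering by exhaustive search over permutations, minimizing an importance-weighted sum of Spearman footrule distances. The function names, the candidate labels, and the rankings are hypothetical, and the weights here apply per list rather than being derived from the validation scores as the abstract describes.

```python
from itertools import permutations


def footrule(candidate, ranked_list):
    """Spearman footrule distance: total displacement of items between two orderings."""
    position = {item: i for i, item in enumerate(ranked_list)}
    return sum(abs(i - position[item]) for i, item in enumerate(candidate))


def aggregate(ranked_lists, weights=None):
    """Return the ordering that minimizes the weighted sum of footrule distances.

    Exhaustive search over all permutations, so only practical for a handful
    of candidates; an assumption made to keep the sketch short.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    items = sorted(ranked_lists[0])
    best = min(
        permutations(items),
        key=lambda cand: sum(
            w * footrule(cand, lst) for w, lst in zip(weights, ranked_lists)
        ),
    )
    return list(best)


# Hypothetical rankings (best first), one list per validation measure.
rankings = [
    ["hierarchical-3", "kmeans-4", "pam-3", "kmeans-3"],
    ["hierarchical-3", "pam-3", "kmeans-4", "kmeans-3"],
    ["kmeans-4", "hierarchical-3", "pam-3", "kmeans-3"],
]

consensus = aggregate(rankings)
print(consensus[0])  # top-ranked algorithm / cluster-count pair
```

The first element of the consensus list plays the role of the "optimal algorithm and optimal number of clusters". With ten algorithms and several cluster counts, exhaustive search becomes infeasible, so a stochastic or heuristic search over orderings would be needed in practice.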
Sekula, Michael N., "OptCluster : an R package for determining the optimal clustering algorithm and optimal number of clusters." (2015). Electronic Theses and Dissertations. Paper 2147.