Date on Master's Thesis/Doctoral Dissertation


Document Type

Master's Thesis

Degree Name



Bioinformatics and Biostatistics

Degree Program

Biostatistics with a concentration in Decision Science, MS

Committee Chair

Datta, Susmita

Committee Co-Chair (if applicable)

Datta, Somnath

Committee Member

Gill, Ryan


Cluster analysis; Biology--Mathematical models; Algorithms; Biomathematics


Determining the best clustering algorithm and ideal number of clusters for a particular dataset is a fundamental difficulty in unsupervised clustering analysis. In biological research, data generated from Next Generation Sequencing technology and microarray gene expression data are becoming more and more common, so new tools and resources are needed to group such high dimensional data using clustering analysis. Different clustering algorithms can group data very differently. Therefore, there is a need to determine the best groupings in a given dataset using the most suitable clustering algorithm for that data. This paper presents the R package optCluster as an efficient way for users to evaluate up to ten clustering algorithms, ultimately determining the optimal algorithm and optimal number of clusters for a given set of data. The selected clustering algorithms are evaluated by as many as nine validation measures classified as “biological”, “internal”, or “stability”, and the final result is obtained through a weighted rank aggregation algorithm based on the calculated validation scores. Two examples using this package are presented, one with a microarray dataset and the other with an RNA-Seq dataset. These two examples highlight the capabilities the optCluster package and demonstrate its usefulness as a tool in cluster analysis.