Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.


Interdisciplinary and Graduate Studies

Degree Program

Interdisciplinary Studies with a specialization in Bioinformatics, PhD

Committee Chair

Rai, Shesh

Committee Member

Rouchka, Eric

Committee Member

McClain, Craig

Committee Member

Merchant, Michael

Committee Member

Pal, Subhadip

Author's Keywords

Gene set analysis; QTL; gene expression; statistical approach; microarrays; scRNA-seq; differential expression


Recently, gene set analysis has become the first choice for gaining insights into the underlying complex biology of diseases through high-throughput genomic studies, such as Microarrays, bulk RNA-Sequencing, single cell RNA-Sequencing, etc. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Further, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. Hence, a comprehensive overview of the available gene set analysis approaches used for different high-throughput genomic studies is provided. The analysis of gene sets is usually carried out based on gene ontology terms, known biological pathways, etc., which may not establish any formal relation between genotype and trait specific phenotype. Further, in plant biology and breeding, gene set analysis with trait specific Quantitative Trait Loci data are considered to be a great source for biological knowledge discovery. Therefore, innovative statistical approaches are developed for analyzing, and interpreting gene expression data from Microarrays, RNA-sequencing studies in the context of gene sets with trait specific Quantitative Trait Loci. The utility of the developed approaches is studied on multiple real gene expression datasets obtained from various Microarrays and RNA-sequencing studies. The selection of gene sets through differential expression analysis is the primary step of gene set analysis, and which can be achieved through using gene selection methods. The existing methods for such analysis in high-throughput studies, such as Microarrays, RNA-sequencing studies, suffer from serious limitations. For instance, in Microarrays, most of the available methods are either based on relevancy or redundancy measures. Through these methods, the ranking of genes is done on single Microarray expression data, which leads to the selection of spuriously associated, and redundant gene sets. Therefore, newer, and innovative differential expression analytical methods have been developed for Microarrays, and single-cell RNA-sequencing studies for identification of gene sets to successfully carry out the gene set and other downstream analyses. Furthermore, several methods specifically designed for single-cell data have been developed in the literature for the differential expression analysis. To provide guidance on choosing an appropriate tool or developing a new one, it is necessary to review the performance of the existing methods. Hence, a comprehensive overview, classification, and comparative study of the available single-cell methods is hereby undertaken to study their unique features, underlying statistical models and their shortcomings on real applications. Moreover, to address one of the shortcomings (i.e., higher dropout events due to lower cell capture rates), an improved statistical method for downstream analysis of single-cell data has been developed. From the users’ point of view, the different developed statistical methods are implemented in various software tools and made publicly available. These methods and tools will help the experimental biologists and genome researchers to analyze their experimental data more objectively and efficiently. Moreover, the limitations and shortcomings of the available methods are reported in this study, and these need to be addressed by statisticians and biologists collectively to develop efficient approaches. These new approaches will be able to analyze high-throughput genomic data more efficiently to better understand the biological systems and increase the specificity, sensitivity, utility, and relevance of high-throughput genomic studies.