Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.


Interdisciplinary and Graduate Studies

Degree Program

Interdisciplinary Studies with a specialization in Bioinformatics, PhD

Committee Chair

Rai, Shesh N.

Committee Co-Chair (if applicable)

Rouchka, Eric C.

Committee Member

Rouchka, Eric C.

Committee Member

Kalbfleisch, Ted

Committee Member

Merchant, Michael L.

Committee Member

Mitra, Riten

Author's Keywords

Tissue storage; tissue extraction; technical variability; imputation; negative binomial distribution; generalized linear model


The designing and determination of sample size are important for conducting high-throughput biological experiments such as proteomics experiments and RNA-Seq expression studies, thus leading to better understanding of complex mechanisms underlying various biological processes. The variations in the biological data or technical approaches to data collection lead to heterogeneity for the samples under study. We critically worked on the issues of technical and biological heterogeneity. The quantitative measurements based on liquid chromatography (LC) coupled with mass spectrometry (MS) often suffer from the problem of missing values (MVs) and data heterogeneity. We considered a proteomics data set generated from human kidney biopsy material to investigate the technical effects of sample preparation and the quantitative MS. We studied the effect of tissue storage methods (TSMs) and tissue extraction methods (TEMs) on data analysis. There are two TSMs: frozen (FR) and FFPE (formalin-fixed paraffin embedded); and three TEMs: MAX, TX followed by MAX and SDS followed by MAX. We assessed the impact of different strategies to analyze the data while considering heterogeneity and MVs. We found that the FFPE is better than that of FR for tissue storage. We also found that the one-step TEM (MAX) is better than those of two-steps TEMs. Furthermore, we found the imputation method is a better approach than excluding the proteins with MVs or using unbalanced design. We introduce a web application, PWST (Proteomics Workflow Standardization Tool) to standardize the proteomics workflow. The tool will be helpful in deciding the most suitable choice for each step and studying the variability associated with technical steps as well as the effects of continuous variables. We have used the special cases of general linear model - ANCOVA and ANOVA with fixed effects to study the effects due to various sources of variability. We introduce an interactive tool, “SATP: Statistical Analysis Tool for Proteomics”, for analyzing proteomics expression data that is scalable to large clinical proteomic studies. The user can perform differential expression analysis of proteomics data either at the protein or peptide level using multiple approaches. We have developed statistical approaches for calculating sample size for proteomics experiments under allocation and cost constraints. We have developed R programs and a shiny app “SSCP: Sample Size Calculator for Proteomics Experiment” for computing sample sizes. We have proposed statistical approaches for calculating sample size for RNA-Seq experiments considering allocation and cost. We have developed R programs and shiny apps to calculate sample size for conducting RNA-Seq experiments.