Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.


Bioinformatics and Biostatistics

Degree Program

Biostatistics, PhD

Committee Chair

Rai, Shesh N.

Committee Co-Chair (if applicable)

Wu, Dongfeng

Committee Member

Brock, Guy N. (Co-Advisor)

Committee Member

Rouchka, Eric

Committee Member

Gill, Ryan S.

Committee Member

Gaskins, Jeremy

Author's Keywords

sample sizes; RNA-seq; normalization methods; power; differentially expressed genes (DEGs)


High-throughput RNA sequencing (RNA-seq) has become the preferred choice for transcriptomics and gene expression studies. With the rapid growth of RNA-seq applications, sample size calculation methods for RNA-seq experiment design and data normalization methods for differentially expressed gene (DEG) analysis are important issues to be explored and discussed. The underlying theme of this dissertation is the development of novel sample size calculation methods for RNA-seq experiment design using test statistics. I also propose two novel normalization methods for the analysis of RNA-seq data. In chapter 1, I present test statistics, including Wald’s test, the log-transformed Wald’s test, and the likelihood ratio test, for RNA-seq data with a negative binomial distribution. Following the test statistics, I present five sample size calculation methods based on a one-sided test. I compare my five methods with an existing method by calculating the sample sizes and the simulated power in different scenarios. Because of the limitations of these methods, in chapter 2 I further derive two explicit sample size calculation methods based on a generalized linear model with a negative binomial distribution for RNA-seq data. These two methods, based on a two-sided Wald’s test, are presented under a wide range of settings, including imbalanced designs and unequal read depth, and are therefore applicable in many situations. In chapter 3, I review the existing normalization methods and describe the challenge of choosing an optimal normalization method, given that multiple factors contribute to read count variability and affect overall sensitivity and specificity. I then present my two proposed normalization methods, Med-pgQ2 and UQ-pgQ2, and evaluate their performance alongside the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ).
The results from the MAQC2 data show that my proposed Med-pgQ2 and UQ-pgQ2 methods may be better choices for differential gene analysis of RNA-seq data, improving specificity while maintaining good detection power at a given nominal FDR level. Finally, in chapter 4, I focus on the analysis of RNA-seq data using three normalization methods and two test statistic methods with the aid of the DESeq2 and edgeR packages. Through within-group analysis of these real RNA-seq data, I find that my normalization method, UQ-pgQ2, performs best, with a lower false positive rate while maintaining good detection power. Thus, in this work, I derive explicit sample size calculation methods, which are a useful tool for researchers to quickly estimate sample sizes in experiment design. Furthermore, my two normalization methods can improve the performance of differential gene analysis of RNA-seq data by controlling false positives for high read count genes.
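
For readers unfamiliar with the global scaling methods named in the abstract (TC, Med and UQ), the following is a minimal illustrative sketch in Python, not the dissertation's implementation. It assumes a genes-by-samples matrix of raw counts, computes each sample's statistic over its nonzero counts (total, median or upper quartile), and rescales the factors to have mean 1. The function names `scaling_factors` and `normalize`, the nonzero-count convention and the mean-1 rescaling are all assumptions made for illustration. The proposed Med-pgQ2 and UQ-pgQ2 methods are not reproduced here.

```python
import numpy as np


def scaling_factors(counts, method="UQ"):
    """Per-sample global scaling factors (illustrative sketch).

    counts: genes x samples array of raw read counts.
    method: "TC" (total count), "Med" (median) or "UQ" (upper quartile).
    Assumption: Med and UQ are taken over each sample's nonzero counts.
    """
    counts = np.asarray(counts, dtype=float)
    if method == "TC":
        stat = counts.sum(axis=0)
    elif method == "Med":
        stat = np.array([np.median(col[col > 0]) for col in counts.T])
    elif method == "UQ":
        stat = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    else:
        raise ValueError(f"unknown method: {method}")
    # Assumption: rescale so the factors average to 1 across samples.
    return stat / stat.mean()


def normalize(counts, method="UQ"):
    """Divide each sample (column) by its scaling factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / scaling_factors(counts, method)


if __name__ == "__main__":
    # Toy data: sample 2 was sequenced at twice the depth of sample 1.
    toy = np.array([[10, 20], [20, 40], [30, 60], [0, 0]])
    for m in ("TC", "Med", "UQ"):
        print(m, scaling_factors(toy, m))
```

In this toy example the second sample is the first at twice the depth, so each of the three methods brings the two columns onto the same scale. A sample with no nonzero counts would produce an undefined factor and would need to be handled separately.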

Included in

Biostatistics Commons