Date on Master's Thesis/Doctoral Dissertation


Document Type

Master's Thesis

Degree Name


Department

Bioinformatics and Biostatistics

Committee Chair

Brock, Guy

Author's Keywords

Microarray; Empirical Bayes; Differentially expressed; Consistency; T-test; SAM

Subject

DNA microarrays--Statistical methods; Gene expression--Statistical methods

Abstract

Data derived from gene expression microarrays are frequently used to identify candidate genes that characterize and distinguish between two biological phenotypes. A key step in this process is the selection of an appropriate test statistic to identify which genes are differentially expressed between the two tissues. Although many methods have been developed explicitly for this purpose, the traditional t-test remains a popular choice. In this study, we evaluate the empirical impact of the choice of test statistic on the resulting list of differentially expressed genes, particularly when the available sample size is small. We evaluated three methods for detecting differentially expressed genes (the t-test, empirical Bayes, and SAM) using ten publicly available data sets. First, we obtained gene lists from the full data using each method. Then, we drew subsamples from the full data and obtained gene lists from these subsamples. The consistency between the subsample-based and full-data gene lists was quantified using several scores. Factors evaluated in the empirical study included the size of the subsample and the length of the differentially expressed gene list. We found that when the subsample size is small, the gene list based on the t-test has very low consistency, while empirical Bayes and SAM have much higher consistency. This result is particularly evident when considering only the top-ranked genes. When sample sizes are larger, all three methods perform comparably. We recommend that investigators use the moderated methods (empirical Bayes or SAM) in lieu of the t-test when the sample size is small.
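
The subsampling procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code and rests on several assumptions: the data are simulated rather than taken from the ten public data sets; consistency is measured by a simple top-k overlap fraction (the abstract does not name the scores used); and the moderated statistic is a SAM-style variant that adds a constant s0, here the median standard error, to every denominator, which is simpler than the way SAM or empirical Bayes actually estimate their moderation. All function names (top_genes, overlap, subsample, demo) are hypothetical.

```python
import random
import statistics


def top_genes(expr_a, expr_b, k, moderated):
    """Return the indices of the k top-ranked genes as a set.

    expr_a, expr_b: one list of sample values per gene, for each group.
    moderated=False ranks by the absolute equal-variance t-statistic.
    moderated=True adds a constant s0 to every denominator (SAM-style).
    """
    diffs, ses = [], []
    for a, b in zip(expr_a, expr_b):
        na, nb = len(a), len(b)
        pooled = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
        diffs.append(statistics.mean(a) - statistics.mean(b))
        ses.append((pooled * (1 / na + 1 / nb)) ** 0.5)
    # Assumption: s0 is the median standard error; SAM tunes s0 differently.
    s0 = statistics.median(ses) if moderated else 0.0
    scores = [abs(d) / (se + s0) if se + s0 > 0 else 0.0
              for d, se in zip(diffs, ses)]
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return set(ranked[:k])


def overlap(full_list, sub_list):
    """Fraction of the full-data gene list recovered by the subsample list."""
    return len(full_list & sub_list) / len(full_list)


def subsample(expr, idx):
    """Keep only the samples at positions idx for every gene."""
    return [[gene[i] for i in idx] for gene in expr]


def demo(n_genes=100, n_true=10, n_full=20, n_sub=3, k=10, reps=20, seed=1):
    rng = random.Random(seed)
    # Simulated data: the first n_true genes are shifted upward in group A.
    expr_a = [[rng.gauss(2.0 if g < n_true else 0.0, 1.0)
               for _ in range(n_full)] for g in range(n_genes)]
    expr_b = [[rng.gauss(0.0, 1.0) for _ in range(n_full)]
              for _ in range(n_genes)]
    for moderated in (False, True):
        full_list = top_genes(expr_a, expr_b, k, moderated)
        total = 0.0
        for _ in range(reps):
            ia = rng.sample(range(n_full), n_sub)
            ib = rng.sample(range(n_full), n_sub)
            sub_list = top_genes(subsample(expr_a, ia),
                                 subsample(expr_b, ib), k, moderated)
            total += overlap(full_list, sub_list)
        label = "moderated" if moderated else "t-statistic"
        print(f"{label}: mean overlap = {total / reps:.2f}")


if __name__ == "__main__":
    demo()
```

Varying n_sub and k in demo corresponds to the two factors named in the abstract, the subsample size and the length of the gene list. The numbers printed for simulated data are only illustrative and say nothing about the results reported in the thesis.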