Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.


Bioinformatics and Biostatistics

Degree Program

Biostatistics, PhD

Committee Chair

Gaskins, Jeremy

Committee Member

Kong, Maying

Committee Member

Mitra, Riten

Committee Member

Pal, Subhadip

Committee Member

Gill, Ryan

Author's Keywords

Variable screening; mixture models; shrinkage; Bayesian analysis; variable selection


In this work, we seek to develop a variable screening and selection method for Bayesian mixture models with longitudinal data. To develop this method, we consider data from the Health and Retirement Study (HRS) conducted by the University of Michigan. Taking yearly out-of-pocket expenditures as the longitudinal response variable, we consider a Bayesian mixture model with $K$ components. The data include a large collection of demographic, financial, and health-related baseline characteristics, and we wish to find a subset of these that affects cluster membership. An initial mixture model without any cluster-level predictors is fit to the data through an MCMC algorithm, and a variable screening step then finds a set of candidate predictors that may be associated with the cluster configurations found in the initial fit. For each predictor, we choose a discrepancy measure, such as one based on a frequentist hypothesis test, that quantifies the differences in the predictor's values across clusters. A large discrepancy provides evidence that the clusters (and the corresponding response trajectories) differ with respect to that baseline characteristic, and these discrepancies are used to choose a small set of predictors to include in a multinomial logit model for cluster membership. A stepwise logit model, along with other choices, is considered as a multivariate variable screening approach. The performance of this methodology is explored in both simulations and real data.

Additionally, we consider the problem of variable selection in the baseline categorical logit model for categorical regression. While a number of studies consider variable selection in the regression paradigm with a numerical response, research is limited for a categorical response variable.
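As a rough illustration of the screening step described above, the following minimal sketch ranks baseline predictors by how strongly their values differ across cluster labels. It rests on several assumptions that are not taken from the dissertation: the discrepancy measure is a one-way ANOVA F statistic, the cluster labels are a fixed toy assignment rather than output from an MCMC fit, and the function names (`anova_f`, `screen`), the predictor names, and the "keep the top few" rule are all hypothetical.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by cluster `labels`."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    means = {g: mean(vs) for g, vs in groups.items()}
    # Between-cluster and within-cluster sums of squares.
    between = sum(len(vs) * (means[g] - grand) ** 2 for g, vs in groups.items())
    within = sum((v - means[g]) ** 2 for g, vs in groups.items() for v in vs)
    return (between / (k - 1)) / (within / (n - k))


def screen(predictors, labels, keep):
    """Rank predictors by discrepancy; return the top `keep` names and all scores."""
    scores = {name: anova_f(vals, labels) for name, vals in predictors.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Toy data (assumed): 3 clusters, one informative predictor, two noise predictors.
rng = random.Random(0)
labels = [i % 3 for i in range(300)]
predictors = {
    "informative": [rng.gauss(2.0 * g, 1.0) for g in labels],
    "noise_a": [rng.gauss(0.0, 1.0) for _ in labels],
    "noise_b": [rng.gauss(0.0, 1.0) for _ in labels],
}
selected, scores = screen(predictors, labels, keep=1)
print(selected)
```

In a fuller workflow, the predictors retained by such a screen would then enter a multinomial logit model for cluster membership; categorical predictors would need a different discrepancy measure, such as a chi-square statistic.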
The main goal of this project is to leverage the global-local shrinkage framework to improve variable selection in baseline categorical logistic regression by introducing new shrinkage priors that encourage similar predictors to be selected across the models for different response levels. To that end, the proposed shrinkage priors share information across response models through local parameters that favor similar levels of shrinkage for all coefficients (log odds ratios) of a predictor. We explore different shrinkage approaches using the horseshoe and normal-gamma priors within our setting and compare them to a spike-and-slab setup and to other shrinkage priors that fail to share information across models. We explore the performance of our approach in both simulations and a real data application.
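To make the idea of sharing local parameters concrete, here is one simple horseshoe-type construction, offered only as a sketch. The notation is assumed rather than taken from the dissertation: response levels $c = 1, \dots, C$ with level $C$ as the baseline, predictors $j = 1, \dots, p$, a local scale $\lambda_j$ shared by all coefficients of predictor $j$, and a global scale $\tau_c$ for each response model. The priors actually proposed in the dissertation may take a different form, for example by linking the local scales hierarchically rather than making them identical.

```latex
\begin{align*}
\log \frac{\Pr(y_i = c \mid x_i)}{\Pr(y_i = C \mid x_i)}
  &= x_i^\top \beta_c, \qquad c = 1, \dots, C - 1, \\
\beta_{jc} \mid \lambda_j, \tau_c
  &\sim \mathcal{N}\!\left(0, \lambda_j^2 \tau_c^2\right), \\
\lambda_j &\sim \mathrm{C}^{+}(0, 1), \qquad \tau_c \sim \mathrm{C}^{+}(0, 1).
\end{align*}
```

Under this sketch, a small $\lambda_j$ shrinks the log odds ratios $\beta_{j1}, \dots, \beta_{j,C-1}$ of predictor $j$ toward zero together, whereas a prior with a separate local scale $\lambda_{jc}$ for each coefficient would not share that information across response models.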