Date on Master's Thesis/Doctoral Dissertation
5-2025
Document Type
Doctoral Dissertation
Degree Name
Ph. D.
Department
Bioinformatics and Biostatistics
Degree Program
Biostatistics, PhD
Committee Chair
Gaskins, Jeremy
Committee Member
Wu, Dongfeng
Committee Member
Sekula, Michael
Committee Member
Huang, Shih-Ting
Committee Member
Souza, Grace De
Author's Keywords
Bayesian methodology; heterogeneity; Gaussian copula; approximate Bayesian computation; simultaneous autoregressive model
Abstract
Longitudinal data in real-world settings are frequently found to be heterogeneous and exhibit intricate spatio-temporal dependence structures. Analyzing such complex data to obtain reliable estimation while quantifying uncertainty necessitates using sophisticated Bayesian methodology. In this work, we present novel Bayesian methods developed to address these challenges. We often observe heterogeneity in longitudinal data, where the mean and variance for certain profiles meaningfully differ from the rest. Some profiles may also exhibit outliers at a limited number of measurements. Using a standard mixed effects model, which assumes homogeneity, can lead to overestimating the residual variance and inefficient estimation. In the first project, we identify and account for three sources of heterogeneity in longitudinal data: incompatible mean trajectories, increased residual variance, and outliers at individual measurements. Our Bayesian mixture model incorporates binary indicators of heterogeneity for each of these features, modeled through logistic regression using covariates. We perform statistical inference using Markov chain Monte Carlo and implement model selection to evaluate the inclusion of various heterogeneous components. Simulations demonstrate that our model can accurately identify heterogeneity and produce efficient estimates of the fixed effects parameters. We further validate our approach using the CD4 data and DHEAS hormone data from the SWAN study.
In the second project of this dissertation, we develop a new longitudinal count data regression model that accounts for zero-inflation and spatio-temporal correlation across responses. This project is motivated by an analysis of Iowa Fluoride Study (IFS) data, a longitudinal cohort study with data on caries (cavity) experience scores measured for each tooth across five time points. To that end, we use a hurdle model for zero-inflation with two parts: the presence model, which indicates through logistic regression whether a count is non-zero, and the severity model, which describes the non-zero counts through a shifted Negative Binomial distribution that allows overdispersion. To incorporate dependence across measurement occasions and teeth, these marginal models are embedded within a Gaussian copula that introduces spatio-temporal correlations. A distinct advantage of this formulation is that it allows us to determine covariate effects with population-level (marginal) interpretations, in contrast to mixed model choices. Standard Bayesian sampling from such a model is infeasible, so we use approximate Bayesian computation for inference. This approach is applied to the IFS data to gain insight into the risk factors for dental caries and the correlation structure across teeth and time.
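As a rough illustration of the two-part hurdle structure described above, the marginal probability mass function for a single count can be sketched as follows. This is only a sketch under stated assumptions, not the dissertation's implementation: the function names, the mean/size parametrization of the Negative Binomial, and the use of a single logit value in place of a covariate regression are all hypothetical, and the Gaussian copula that links teeth and time points is omitted.

```python
import math


def sigmoid(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def neg_binomial_pmf(k, mean, size):
    """Negative Binomial pmf on k = 0, 1, 2, ... (assumed mean/size form).

    Variance is mean + mean**2 / size, so smaller `size` means more
    overdispersion. Both `mean` and `size` must be positive.
    """
    if k < 0:
        return 0.0
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_pmf)


def hurdle_pmf(y, presence_logit, severity_mean, size):
    """Marginal pmf of a hurdle count (hypothetical helper).

    Presence part: P(Y > 0) = sigmoid(presence_logit).
    Severity part: given Y > 0, Y - 1 follows a Negative Binomial, i.e.
    the non-zero counts follow a Negative Binomial shifted to start at 1.
    `severity_mean` is the mean of Y - 1 given Y > 0.
    """
    p_nonzero = sigmoid(presence_logit)
    if y == 0:
        return 1.0 - p_nonzero
    return p_nonzero * neg_binomial_pmf(y - 1, severity_mean, size)


# Illustrative values only; these are not estimates from the IFS data.
probabilities = [hurdle_pmf(y, -0.5, 2.0, 1.5) for y in range(200)]
total_mass = sum(probabilities)
```

In a regression setting, `presence_logit` and the logarithm of `severity_mean` would each be replaced by a linear combination of covariates, which is what gives the coefficients their population-level interpretation.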
Recommended Citation
Mukherjee, Anish, "New Bayesian methods for longitudinal data analysis with complex dependence structures." (2025). Electronic Theses and Dissertations. Paper 4572.
Retrieved from https://ir.library.louisville.edu/etd/4572
Included in
Biostatistics Commons, Dental Public Health and Education Commons, Longitudinal Data Analysis and Time Series Commons, Multivariate Analysis Commons