Date on Master's Thesis/Doctoral Dissertation

5-2025

Document Type

Doctoral Dissertation

Degree Name

Ph. D.

Department

Bioinformatics and Biostatistics

Degree Program

Biostatistics, PhD

Committee Chair

Gaskins, Jeremy

Committee Member

Wu, Dongfeng

Committee Member

Sekula, Michael

Committee Member

Huang, Shih-Ting

Committee Member

Souza, Grace De

Author's Keywords

Bayesian methodology; heterogeneity; Gaussian copula; approximate Bayesian computation; simultaneous autoregressive model

Abstract

Longitudinal data in real-world settings are frequently heterogeneous and exhibit intricate spatio-temporal dependence structures. Analyzing such complex data to obtain reliable estimates while quantifying uncertainty requires sophisticated Bayesian methodology. In this work, we present novel Bayesian methods developed to address these challenges.
We often observe heterogeneity in longitudinal data, where the mean and variance for certain profiles meaningfully differ from the rest. Some profiles may also exhibit outliers at a limited number of measurements. Using a standard mixed effects model, which assumes homogeneity, can lead to overestimating the residual variance and to inefficient estimation. In the first project of this dissertation, we identify and account for three sources of heterogeneity in longitudinal data: incompatible mean trajectories, increased residual variance, and outliers at individual measurements. Our Bayesian mixture model incorporates binary indicators of heterogeneity for each of these features, modeled through logistic regression using covariates. We perform statistical inference using Markov chain Monte Carlo and implement model selection to evaluate the inclusion of the various heterogeneous components. Simulations demonstrate that our model can accurately identify heterogeneity and produce efficient estimates of the fixed-effects parameters. We further validate our approach using the CD4 data and DHEAS hormone data from the SWAN study.
In the second project of this dissertation, we develop a new longitudinal count data regression model that accounts for zero-inflation and spatio-temporal correlation across responses. This project is motivated by an analysis of Iowa Fluoride Study (IFS) data, a longitudinal cohort study with data on caries (cavity) experience scores measured for each tooth across five time points. To handle zero-inflation, we use a hurdle model with two parts: a presence model, which indicates through logistic regression whether a count is non-zero, and a severity model, which describes the non-zero counts through a shifted Negative Binomial distribution that allows for overdispersion. To incorporate dependence across measurement occasions and teeth, these marginal models are embedded within a Gaussian copula that introduces spatio-temporal correlations. A distinct advantage of this formulation is that it allows us to determine covariate effects with population-level (marginal) interpretations, in contrast to mixed-model alternatives. Standard Bayesian sampling from such a model is infeasible, so we use approximate Bayesian computation for inference. This approach is applied to the IFS data to gain insight into the risk factors for dental caries and the correlation structure across teeth and time.
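As a rough illustration of the two-part hurdle structure described in the abstract, the marginal distribution of a single count can be sketched as follows. This is a minimal sketch and not the dissertation's implementation: the function names, the mean/size parameterization of the Negative Binomial, and the use of fixed values for `mu` and `size` (rather than covariate-dependent ones) are all assumptions made for illustration, and the Gaussian copula and approximate Bayesian computation steps are not shown.

```python
import math


def presence_prob(x, beta):
    """Presence part: logistic-regression probability that the count is non-zero.

    x and beta are hypothetical covariate and coefficient vectors.
    """
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def neg_binom_pmf(k, mu, size):
    """Negative Binomial pmf with mean mu > 0 and dispersion size > 0.

    Under this (assumed) parameterization the variance is mu + mu**2 / size,
    so smaller values of size give more overdispersion.
    """
    if k < 0:
        return 0.0
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def hurdle_pmf(y, x, beta, mu, size):
    """Marginal pmf of a count under a hurdle model.

    P(Y = 0) = 1 - p, and for y >= 1, P(Y = y) = p * NB(y - 1; mu, size),
    i.e. the severity part is a Negative Binomial shifted to start at 1.
    """
    p = presence_prob(x, beta)
    if y == 0:
        return 1.0 - p
    return p * neg_binom_pmf(y - 1, mu, size)
```

Because the severity part is shifted to the positive integers, the probability of a zero count is governed entirely by the presence part, which is what lets the two sets of covariate effects be interpreted separately.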
