Date on Master's Thesis/Doctoral Dissertation

12-2022

Document Type

Doctoral Dissertation

Degree Name

Ph. D.

Department

Bioinformatics and Biostatistics

Degree Program

Biostatistics, PhD

Committee Chair

Gaskins, Jeremy

Committee Co-Chair (if applicable)

Mitra, Riten

Committee Member

Kong, Maiying

Committee Member

Gill, Ryan

Committee Member

Sekula, Michael

Author's Keywords

Bayesian; Gaussian graphical models; sparse graph estimation; partial correlation; covariance selection; joint estimation; breast cancer; TNBC; network estimation

Abstract

Graphical models determine associations between variables through the notion of conditional independence. Gaussian graphical models are a widely used class of such models, where the relationships are formalized by the nonzero entries of the precision matrix. However, in high-dimensional cases, covariance estimates are typically unstable. Moreover, in many realistic applications it is natural to expect only a few significant associations. This necessitates building sparsity into the estimation method. Classical frequentist methods, like GLASSO, use penalization for this purpose. Fully Bayesian methods, in contrast, are slow because they require iteratively sampling a quadratic number of parameters in a space constrained by positive definiteness. In the second chapter, we propose a Bayesian graph estimation method based on an ensemble of Bayesian neighborhood regressions. An attractive feature of our method is that it parallelizes easily across graphical neighborhoods, yielding computational efficiency far greater than that of most existing methods. Our strategy induces sparsity with a horseshoe shrinkage prior, followed by a novel variable selection step based on the marginal likelihood from the predictor ranks. We also consider variable selection under an alternative shrinking and diffusing coefficient prior (BASAD) and the hyper-g prior. Translating the covariance selection problem into a sequence of regression problems opens the door to a broad class of Bayesian neighborhood methods that can be combined with almost any sparsity prior. Finally, our method appropriately combines the estimated regression coefficients to produce a graph estimate, as well as a matrix of estimated partial correlations. Performance of the various methods was assessed using measures such as the False Discovery Rate, True Positive Rate, and Matthews correlation coefficient (MCC).
Extensive simulations demonstrate that our method performs competitively across a variety of cases. Finally, we applied all these methods to learn the dependence structure across genes in women with triple-negative breast cancer (TNBC) and present our findings.

Biostatistical research problems involving unknown networks often call for joint inference on multiple related graphs. Complex biological networks across related disease subcategories, or other relevant groupings targeted by the same drug, can naturally be assumed to share some common characteristics. Danaher et al. (2014) and Guo et al. (2011) introduced this idea in frequentist models that estimate multiple related Gaussian graphical models (GGMs) for observations belonging to distinct classes. In the Bayesian context, such multiple-graph problems have received attention only recently. In the third chapter of this dissertation, we extend the method discussed in the first project to a two-group problem. We frame this as a neighborhood selection approach in which individual Bayesian regressions are later combined to estimate the graph along with the partial correlation matrices. The problem is motivated by the breast cancer (BC) data used in the first project, which include patients with TNBC and patients with receptor-positive breast cancer, also called Luminal-A. Luminal-A breast cancers are likely to benefit from hormone therapy and may also benefit from chemotherapy. Luminal-A cancers tend to grow at a much slower rate than TNBC and have a better prognosis, higher median survival, and lower recurrence rates compared to TNBC. Obtaining two separate graphs for these two groups of patients is possible, but we expect the two graphs to be very similar in nature. Statistically speaking, the graphs should be similar because the two covariance matrices share sparsity patterns, and our plan is to exploit this similarity across groups.
To that end, we build a nodewise regression framework that predicts each outcome conditionally on the remaining variables with group-determined interaction effects. Our method appropriately combines these estimated regression coefficients to produce a graph estimate, as well as a matrix of estimated partial correlations. Performance of the various methods was assessed using measures such as False Discovery Rate, True Positive Rate, MCC, and Hamming distance. Extensive simulations demonstrate the competitiveness of our approach across a variety of cases. Finally, we applied our method to learn the dependence structure across genes in the two groups of women mentioned above, those with triple-negative breast cancer and those with receptor-positive (Luminal-A) breast cancer.
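The step of combining nodewise regression coefficients into partial correlations can be illustrated with a minimal sketch. This is not the dissertation's procedure: ordinary least squares stands in for the Bayesian horseshoe, BASAD, and hyper-g regressions, a hypothetical fixed threshold stands in for the variable selection step, and the function names and simulated chain graph are assumptions made only for this example. The combination rule used here is the standard identity that the regression coefficient of variable k in the model for variable j equals -ω_jk/ω_jj (with ω the precision matrix), so the partial correlation is sign(B_jk) * sqrt(B_jk * B_kj).

```python
import numpy as np


def nodewise_coefficients(X):
    """Regress each column of X on all the others by ordinary least squares.

    Returns B with B[j, k] = coefficient of variable k when predicting
    variable j (diagonal left at zero). Each regression is independent of
    the others, so the loop could be run in parallel across nodes.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)  # center the data so no intercept is needed
    B = np.zeros((p, p))
    for j in range(p):
        others = [k for k in range(p) if k != j]
        coef, *_ = np.linalg.lstsq(Xc[:, others], Xc[:, j], rcond=None)
        B[j, others] = coef
    return B


def partial_correlations(B):
    """Combine the two coefficients for each pair into one partial correlation.

    Pairs whose two coefficients disagree in sign are set to zero.
    """
    prod = np.clip(B * B.T, 0.0, None)
    R = np.sign(B) * np.sqrt(prod)
    np.fill_diagonal(R, 1.0)
    return R


def estimated_graph(R, threshold):
    """Boolean adjacency matrix: an edge where |partial correlation| > threshold.

    The threshold is a hypothetical stand-in for a proper selection step.
    """
    A = np.abs(R) > threshold
    np.fill_diagonal(A, False)
    return A


# Simulated example (assumption): a 5-node chain graph with a tridiagonal
# precision matrix, so only neighboring variables are conditionally dependent.
rng = np.random.default_rng(0)
p, n = 5, 2000
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

R = partial_correlations(nodewise_coefficients(X))
A = estimated_graph(R, threshold=0.15)
```

With least squares and more observations than variables, `R` coincides with the partial correlations obtained by inverting the sample covariance matrix. The neighborhood formulation becomes useful once each regression is replaced by a sparse or shrinkage fit, where the two coefficients for a pair no longer agree automatically and must be reconciled.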
