Date on Master's Thesis/Doctoral Dissertation


Document Type

Master's Thesis

Degree Name



Bioinformatics and Biostatistics

Committee Chair

Kim, Seongho

Committee Member

Wu, Dongfeng

Committee Member

Zhang, Xiang

Author's Keywords

Compound identification; MS data; Penalized regression; Dot product; Metabolomics


Chemical detectors; Regression analysis; Ridge regression (Statistics)


In this study, we propose a new method for compound identification using penalized linear regression. Compound identification is often achieved by matching an experimental mass spectrum against the mass spectra stored in a reference library on the basis of mass spectral similarity. In the linear regression formulation, the response variable is the experimental mass spectrum (i.e., the query) and the compounds in the reference library are the independent variables. However, the number of compounds in the reference library is much larger than the number of m/z values in the measured range, so the data are high dimensional and the ordinary least-squares problem is singular. For this reason, we use penalized linear regression methods such as ridge regression and the Lasso. We also propose two-step approaches that use the dot product and Pearson’s correlation together with penalized linear regression.
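
As a rough illustration of the two-step idea, the following minimal Python sketch first ranks library spectra by dot-product (cosine) similarity to the query and then fits a ridge regression of the query on the top-ranked candidates. It is not the thesis implementation. The ordering of the two steps, the toy library, the compound names, the function names, and the parameters `top_k` and `lam` are all assumptions made for illustration, and the regression is fitted without an intercept or any intensity weighting.

```python
import math


def cosine(a, b):
    """Dot-product (cosine) similarity between two intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge(columns, y, lam):
    """Ridge coefficients from (X'X + lam*I) beta = X'y.

    `columns` holds the columns of X (one reference spectrum each).
    The penalty lam > 0 keeps the system solvable even when X'X is singular.
    """
    k = len(columns)
    gram = [
        [
            sum(a * b for a, b in zip(columns[i], columns[j]))
            + (lam if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    rhs = [sum(a * b for a, b in zip(columns[i], y)) for i in range(k)]
    return solve(gram, rhs)


def identify(query, library, top_k=3, lam=0.1):
    """Step 1: keep the top_k library spectra by cosine similarity.
    Step 2: regress the query on those candidates with a ridge penalty
    and rank the candidates by their fitted coefficients."""
    prelim = sorted(
        library, key=lambda name: cosine(query, library[name]), reverse=True
    )[:top_k]
    beta = ridge([library[name] for name in prelim], query, lam)
    return sorted(zip(prelim, beta), key=lambda t: t[1], reverse=True)


# Hypothetical toy library: intensities at five m/z bins per compound.
library = {
    "A": [100, 0, 20, 0, 5],
    "B": [0, 80, 0, 40, 10],
    "C": [10, 70, 0, 50, 0],
    "D": [0, 0, 90, 10, 30],
}
# Hypothetical query: a slightly perturbed copy of compound B.
query = [1, 79, 0, 41, 9]

ranked = identify(query, library)
for name, coef in ranked:
    print(name, round(coef, 3))
```

In this toy case the candidate with the largest ridge coefficient is taken as the identification. Replacing the ridge step with the Lasso would require an iterative solver (for example coordinate descent), which is omitted here for brevity.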