Date on Master's Thesis/Doctoral Dissertation

5-2024

Document Type

Doctoral Dissertation

Degree Name

Ph. D.

Department

Computer Engineering and Computer Science

Degree Program

Computer Science and Engineering, PhD

Committee Chair

Chang, Dar-jen

Committee Member

Kantardzic, Mehmed M.

Committee Member

Imam, Ibrahim N.

Committee Member

Elmaghraby, Adel S.

Committee Member

Park, Juw Won

Author's Keywords

document inversion; inverted document; GPGPU; research computing; computing cluster

Abstract

Bioinformatics has experienced rapid research growth in recent years, as evidenced by the increasing number of articles in biomedical databases such as PubMed, which adds over a million publications every year. This growth also poses a challenge for researchers who need to find relevant citations for their work, so efficient indexing and searching methods for text data are crucial for Bioinformatics. One key technique for information retrieval is document inversion, which creates an inverted index to enable efficient searching through vast collections of documents. This Ph.D. research aims to design the research computing environment and to implement a document inversion system on the multi-core Graphics Processing Unit (GPU) as a multithreaded application using a linear-time, hash-based, single program multiple data algorithm. The GPU is a powerful tool for general-purpose computing, especially for parallel and data-intensive applications. However, the GPU architecture differs from the Central Processing Unit (CPU) architecture, which creates two main challenges for GPU computing. The first challenge is to design the thread blocks and distribute the data among them. The second challenge is to use each type of GPU memory, such as global memory, constant memory, and shared memory, efficiently to achieve high-performance solutions. The dissertation evaluates the performance of the system with two test datasets, one of PubMed abstracts and one of e-commerce product reviews, and shows that the multithreaded application on the GPU performs document inversion around two to three times faster than the sequential one on the CPU. The research computing environments for this work include the Computer Science and Engineering Research Network and the Genomics cluster, a high-performance computing cluster with CPU/GPU computing nodes, large storage devices, and virtual environment systems. The cluster was initially designed for the Bioinformatics researchers and research groups in the Department of Computer Science and Engineering. The dissertation contributes to information retrieval by proposing a novel and efficient document inversion system on the GPU for extensive document collections, and to Bioinformatics researchers by providing a flexible and efficient research computing environment design with massive computing power and sufficient storage space.
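
For readers unfamiliar with document inversion, the idea can be illustrated with a minimal sequential sketch in Python. This is not the dissertation's GPU implementation; it is a CPU-side illustration that assumes a simple lowercase alphanumeric tokenizer and uses a Python dictionary as the hash table. The function name `invert_documents` and the sample documents are hypothetical.

```python
import re


def invert_documents(docs):
    """Build an inverted index mapping each term to the ids of documents containing it.

    Assumption: terms are lowercase alphanumeric runs. The dissertation's
    actual tokenization rules are not described in the abstract.
    """
    index = {}  # hash table: term -> list of document ids in ascending order
    for doc_id, text in enumerate(docs):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings = index.setdefault(term, [])
            # Documents are visited in id order, so comparing against the
            # last entry is enough to avoid duplicate ids.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index


# Hypothetical sample documents, for illustration only.
docs = [
    "GPU computing for document inversion",
    "Document inversion builds an inverted index",
]
index = invert_documents(docs)
print(index["document"])  # [0, 1]
print(index["gpu"])       # [0]
```

Each token costs one expected constant-time hash lookup, so the running time grows linearly with the total number of tokens. A data-parallel version would additionally have to partition the documents among threads and coordinate concurrent updates to the table, which this sketch does not attempt.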
