Date on Master's Thesis/Doctoral Dissertation

5-2009

Document Type

Master's Thesis

Degree Name

M. Eng.

Department

Computer Engineering and Computer Science

Committee Chair

Rouchka, Eric Christian

Subject

Bioinformatics; Genetics--Methodology; Genetics--Data processing

Abstract

Because interactions at the genomic level are complex and an organism contains a large number of proteins, understanding the functions of the genes that are expressed is essential. Creating an analysis pipeline for Expressed Sequence Tags (ESTs) is one way to accomplish this, allowing a researcher to quickly take a set of sequences, perform all necessary analysis operations, and publish the data in a database with a graphical user interface (GUI). The pipeline consists of several steps. First, the data must be preprocessed to remove extraneous sequence data, low-complexity regions, and regions that repeat throughout the genome. Next, the ESTs are assembled into longer contiguous sequences (contigs) that better describe the underlying mRNA. After the contigs have been constructed, putative functions can be assigned to each sequence, whether it is a contig or a singleton. An application of this pipeline to 3906 ESTs generated from trichome tissue of Pelargonium × hortorum (commonly, the geranium plant) yielded 425 contigs when assembled with the CAP3 program. These contigs, along with the 2208 sequences that are not part of a contig, were then searched with BLAST against the non-redundant protein database to assign a putative function to each sequence. Finally, Blast2GO was run on the BLAST results to assign Gene Ontology (GO) terms to each sequence. These annotations were then added to the database for later investigation by researchers. To aid researchers in further analysis of the annotated sequences, a MySQL database was used for data storage and a GUI was developed using Java and JavaServer Pages. In addition, an applet for viewing the Sanger trace files for each sequence is included to help the researcher assess the validity of the data.
