Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.


Computer Engineering and Computer Science

Degree Program

Computer Science and Engineering, PhD

Committee Chair

Rouchka, Eric

Committee Co-Chair (if applicable)

Nasraoui, Olfa

Committee Member

Nasraoui, Olfa

Committee Member

Chang, Dar-Jen

Committee Member

Park, Juw Won

Committee Member

Petruska, Jeffrey

Author's Keywords

custom CDF; microarrays; gene expression; probeset; probe group; UTR


A DNA microarray is a high-throughput technology used to measure relative gene expression. One of the most widely used platforms is the Affymetrix® GeneChip® technology, which detects gene expression levels using probe sets composed of 25-nucleotide probes designed to hybridize with specific gene targets. For a given Affymetrix® GeneChip® platform, the design of the probes is fixed. The method of analysis, however, can change, because probes can be re-annotated and regrouped into newly defined probe sets. This is particularly important because public repositories of microarray datasets, such as ArrayExpress and NCBI’s Gene Expression Omnibus (GEO), have made millions of samples available for computational reanalysis without the need for new biological experiments. One way to change the analysis is to correct the mapping between probe sets and targets by creating custom Chip Description Files (CDFs), which assign probes to probe sets based on the latest genomic information or on specific annotations of interest. Because the default probe sets in Affymetrix® GeneChip® platforms are specific to a gene, transcript or exon, analyses are limited to profiling differential expression at the gene, transcript or individual exon level. However, untranslated regions (UTRs) of mRNA have been shown to have an important influence on the regulation of proteins. We therefore developed a new probe mapping protocol that addresses three issues in Affymetrix® GeneChip® data analysis: removing nonspecific probes, updating the probe-to-target mapping based on the latest genome information, and grouping probes into region-level (UTR, individual exon), gene-level and transcript-level targets of interest, to support a better understanding of the effect of UTRs and individual exons on gene expression levels. Furthermore, we developed an R package, affyCustomCdf, that lets users create custom CDFs dynamically.
The affyCustomCdf tool takes annotations in a General/Gene Transfer Format (GTF) file, aligns probes to gene annotations via Nested Containment List (NCList) indexing, and generates a custom Chip Description File (CDF) that regroups probes into probe sets at the region (UTR and individual exon), transcript or gene level. Our results indicate that removing probes that no longer align to the genome without mismatches, or that align to multiple locations, can help reduce false-positive differential expression, as can removing probes in regions that overlap multiple genes. Moreover, our region-based method can detect changes that would be missed by gene-level and transcript-level analysis. It also allows for a better understanding of 3’ UTR dynamics through the reanalysis of publicly available data.
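
The filtering and regrouping steps described above can be illustrated with a minimal Python sketch. This is not the affyCustomCdf implementation (which is an R package) and it does not use its API: the names `Region`, `Alignment` and `group_probes`, the probe set key format, and the toy coordinates are all hypothetical. For brevity it replaces NCList indexing with a plain linear scan, and it assumes the input contains only perfect-match alignments.

```python
from collections import defaultdict
from typing import NamedTuple


class Region(NamedTuple):
    """One annotated feature, e.g. parsed from a GTF line (hypothetical layout)."""
    chrom: str
    start: int  # 1-based, inclusive, as in GTF
    end: int
    gene_id: str
    kind: str   # e.g. "exon" or "three_prime_utr"


class Alignment(NamedTuple):
    """One perfect-match genomic hit of a 25-mer probe (hypothetical layout)."""
    probe_id: str
    chrom: str
    start: int
    end: int


def group_probes(alignments, regions, level="region"):
    """Regroup probes into probe sets at the 'region' or 'gene' level."""
    # Step 1: discard probes that align to more than one genomic location.
    hits = defaultdict(list)
    for aln in alignments:
        hits[aln.probe_id].append(aln)
    unique = [h[0] for h in hits.values() if len(h) == 1]

    groups = defaultdict(list)
    for aln in unique:
        # Step 2: find features that fully contain the probe.
        # A linear scan stands in for the NCList index here.
        containing = [
            r for r in regions
            if r.chrom == aln.chrom and r.start <= aln.start and aln.end <= r.end
        ]
        genes = {r.gene_id for r in containing}
        # Step 3: drop probes that hit no gene or fall in regions shared by genes.
        if len(genes) != 1:
            continue
        if level == "gene":
            keys = genes
        else:
            keys = {f"{r.gene_id}|{r.kind}|{r.start}-{r.end}" for r in containing}
        for key in keys:
            groups[key].append(aln.probe_id)
    return {key: sorted(ids) for key, ids in groups.items()}


# Toy data: gene G1 has an exon and a 3' UTR; gene G2 overlaps the UTR.
REGIONS = [
    Region("chr1", 100, 200, "G1", "exon"),
    Region("chr1", 201, 320, "G1", "three_prime_utr"),
    Region("chr1", 280, 400, "G2", "exon"),
]
ALIGNMENTS = [
    Alignment("p1", "chr1", 110, 134),
    Alignment("p2", "chr1", 210, 234),
    Alignment("p3", "chr1", 285, 309),  # lies in both G1 and G2: removed
    Alignment("p4", "chr1", 150, 174),
    Alignment("p4", "chr2", 500, 524),  # second hit for p4: removed
    Alignment("p5", "chr1", 120, 144),
]

if __name__ == "__main__":
    print(group_probes(ALIGNMENTS, REGIONS, level="region"))
    print(group_probes(ALIGNMENTS, REGIONS, level="gene"))
```

In this toy example the region-level grouping keeps the exon and the 3’ UTR of G1 as separate probe sets, whereas the gene-level grouping merges them, which is the distinction that makes region-specific changes visible.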