Date on Master's Thesis/Doctoral Dissertation

5-2024

Document Type

Doctoral Dissertation

Degree Name

Ph. D.

Department

Computer Engineering and Computer Science

Degree Program

Computer Science and Engineering, PhD

Committee Chair

Frigui, Hichem

Committee Co-Chair (if applicable)

Nasraoui, Olfa

Committee Member

Nasraoui, Olfa

Committee Member

Karem, Andrew David

Committee Member

Baidya, Sabur Hassan

Committee Member

Inanc, Tamer

Author's Keywords

deep neural networks; DNNs; attention-guided data augmentation; ADA-ViT

Abstract

For over a decade, Deep Neural Networks (DNNs) have progressed rapidly and achieved great success, forming a robust foundation for state-of-the-art machine learning algorithms that have impacted various domains. Advances in data acquisition and processing have undeniably played a major role in these breakthroughs. Data is a crucial component in building successful DNNs, as it enables machine learning models to optimize the complex architectures necessary to perform difficult tasks. However, acquiring large-scale datasets is not enough to learn robust models with generalizable features. Instead, an ideal training set should be diverse enough, and contain enough variation within each class, for the model to learn optimal decision boundaries. The poor performance of a machine learning model can often be traced back to under-represented regions in the feature space of the training data. These sparse regions can prevent the model from capturing large intra-class variations. Data augmentation is a common technique for inflating training datasets with new samples in an attempt to improve model performance. However, these techniques usually focus on expanding the size of the data and do not necessarily aim to cover the under-represented regions of the feature space. This dissertation presents a novel Attention-guided Data Augmentation technique for Vision Transformers, called ADA-ViT. Our method is tailored specifically to Transformer-based vision models. These models are considered the state-of-the-art learning strategy in almost all computer vision applications, and they have attracted more interest in recent research than their classic counterparts, e.g., convolution-based networks. Our proposed data augmentation method aims to improve the diversity of the training set by selecting informative samples according to their potential contribution to improving model performance.
We leverage the attention scores computed within the transformer model to gain insight into the image regions that caused a misclassification. The identified image regions form misclassification concepts that explain the model's limitations in a given class. These learned concepts indicate the presence of under-represented regions in the training dataset that contributed to the misclassifications. We use this information to guide our data augmentation process by identifying new samples and using them to augment the training data, in an effort to improve the coverage of the identified under-represented regions. We achieve this by designing a utility function to rank and select new samples from secondary image repositories based on their similarity to the extracted misclassification concepts. ADA-ViT aims beyond increasing the size of the data. It focuses on improving the diversity of the training set by finding and covering under-represented regions in the feature space of the training data. To the best of our knowledge, no prior work has considered this aspect for Vision Transformer models. An advantage of our approach is that it can leverage available noisy web data repositories for augmentation, thus alleviating the need for large labeled datasets. This is because ADA-ViT uses a ranking system that can filter out noisy and irrelevant samples. We evaluate our data augmentation technique on two computer vision applications under multiple scenarios. We conduct extensive experiments and analysis to demonstrate the problem of under-represented regions in the training feature space and show the effectiveness of our method in addressing this issue. Using benchmark datasets, we also compare our method to baseline models trained using only the available labeled data, and using the labeled data augmented with state-of-the-art data augmentation methods. We show that our proposed augmentation consistently improves the results. We also perform an in-depth analysis to justify the observed improvements.
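
The ranking step described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's actual utility function, which the abstract does not specify. It assumes that misclassification concepts and candidate images are both represented as feature vectors, that similarity is measured with cosine similarity, and that a candidate's utility is its best match against any concept. All names (`utility`, `select_samples`, `min_utility`, the toy vectors) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def utility(candidate: Vector, concepts: Sequence[Vector]) -> float:
    """Assumed utility: the candidate's highest similarity to any concept vector."""
    return max((cosine_similarity(candidate, c) for c in concepts), default=0.0)


def select_samples(
    candidates: Dict[str, Vector],
    concepts: Sequence[Vector],
    top_k: int,
    min_utility: float = 0.5,  # hypothetical cutoff for noisy or irrelevant samples
) -> List[Tuple[str, float]]:
    """Score every candidate, drop low-utility ones, and return the top_k by score."""
    scored = [(name, utility(vec, concepts)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= min_utility]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy data: two concept vectors and three candidates from a noisy repository.
concepts = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "web_a": [0.9, 0.1],   # close to the first concept
    "web_b": [-1.0, 0.0],  # unrelated to both concepts, filtered out
    "web_c": [0.5, 0.5],   # moderately close to both concepts
}
selected = select_samples(candidates, concepts, top_k=2)
selected_names = [name for name, _ in selected]
```

In this toy setup the threshold plays the role of the noise filter: a candidate that resembles no concept is discarded before ranking, so only samples likely to cover an under-represented region are added to the training set.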
