Date on Master's Thesis/Doctoral Dissertation
5-2024
Document Type
Doctoral Dissertation
Degree Name
Ph.D.
Department
Computer Engineering and Computer Science
Degree Program
Computer Science and Engineering, PhD
Committee Chair
Frigui, Hichem
Committee Co-Chair (if applicable)
Nasraoui, Olfa
Committee Member
Karem, Andrew David
Committee Member
Baidya, Sabur Hassan
Committee Member
Inanc, Tamer
Author's Keywords
deep neural networks; DNNs; attention-guided data augmentation; ADA-ViT
Abstract
For over a decade, Deep Neural Networks (DNNs) have progressed rapidly and achieved great success, forming a robust foundation for the state-of-the-art machine learning algorithms that have impacted various domains. Advances in data acquisition and processing have undeniably played a major role in these breakthroughs. Data is a crucial component in building successful DNNs, as it enables machine learning models to optimize the complex architectures necessary for difficult tasks. However, acquiring large-scale datasets is not enough to learn robust models with generalizable features. An ideal training set should be diverse and contain sufficient variation within each class for the model to learn optimal decision boundaries. The poor performance of a machine learning model can often be traced back to under-represented regions in the feature space of the training data; these sparse regions can prevent the model from capturing large intra-class variations. Data augmentation is a common technique used to inflate training datasets with new samples in an attempt to improve model performance. However, existing techniques usually focus on expanding the data in size and do not necessarily aim to cover the under-represented regions of the feature space.
This dissertation presents a novel Attention-guided Data Augmentation technique for Vision Transformers, called ADA-ViT. Our method is tailored specifically to Transformer-based vision models, which are considered the state-of-the-art learning strategy in almost all computer vision applications and have attracted more interest in recent research than their classic counterparts, e.g., convolution-based networks. Our proposed data augmentation method aims to improve the diversity of the training set by selecting informative samples according to their potential contribution to improving model performance. We leverage the attention scores computed within the transformer to gain insight into the image regions that caused a misclassification. The identified regions form misclassification concepts that explain the model's limitations in a given class. These learned concepts indicate under-represented regions in the training dataset that contributed to the misclassifications. We use this information to guide the augmentation process: we design a utility function that ranks and selects new samples from secondary image repositories based on their similarity to the extracted misclassification concepts, and we add the selected samples to the training data to improve coverage of the identified under-represented regions. ADA-ViT thus goes beyond increasing the size of the data; it focuses on improving the diversity of the training set by finding and covering under-represented regions in the feature space of the training data. To the best of our knowledge, no prior work has considered this aspect for Vision Transformer models. A further advantage of our approach is that it leverages readily available noisy web repositories for augmentation, alleviating the need for large labeled datasets, because ADA-ViT's ranking system can filter out noisy and irrelevant samples. We evaluate our data augmentation technique on two computer vision applications under multiple scenarios.
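To make the described pipeline concrete, below is a minimal sketch of its two core steps: using CLS-token attention to locate the patches behind misclassifications, and ranking candidate images against the resulting concept vector. This is an illustration, not the dissertation's implementation; the Hugging Face ViT backbone, the top_k patch count, and the simple mean-pooled concept with cosine-similarity ranking are all assumptions standing in for the actual utility function.

```python
# Sketch only: assumed backbone and scoring scheme, not ADA-ViT's actual code.
import torch
import torch.nn.functional as F
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224")  # assumed backbone
model.eval()

@torch.no_grad()
def concept_embedding(misclassified_imgs, top_k=16):
    """Average embedding of the most-attended patches of misclassified images."""
    out = model(pixel_values=misclassified_imgs, output_attentions=True)
    # Last-layer attention from the CLS token to the image patches,
    # averaged over heads: shape (batch, num_patches).
    cls_attn = out.attentions[-1][:, :, 0, 1:].mean(dim=1)
    # Final-layer patch embeddings (drop the CLS token).
    patches = out.last_hidden_state[:, 1:, :]          # (batch, num_patches, dim)
    idx = cls_attn.topk(top_k, dim=1).indices          # most-attended patches
    top = torch.gather(
        patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    )
    return top.mean(dim=(0, 1))                        # one concept vector

@torch.no_grad()
def rank_candidates(candidate_imgs, concept):
    """Score candidate images (e.g., noisy web images) against a concept."""
    out = model(pixel_values=candidate_imgs)
    feats = out.last_hidden_state[:, 0, :]             # CLS embeddings
    scores = F.cosine_similarity(feats, concept.unsqueeze(0))
    return scores.argsort(descending=True)             # best matches first
```

In this simplified reading, the highest-ranked candidates are the ones most similar to the regions that confused the model, so adding them to the training set targets the under-represented regions rather than enlarging the data indiscriminately.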
We conduct extensive experiments and analyses to demonstrate the problem of under-represented regions in the training feature space and to show the effectiveness of our method in addressing it. Using benchmark datasets, we also compare our method to baseline models trained on the available labeled data alone, as well as to models trained with state-of-the-art data augmentation methods, and show that our proposed augmentation consistently improves the results. Finally, we perform an in-depth analysis to justify the observed improvements.
Recommended Citation
Baili, Nada, "Attention guided data augmentation for improving the classification performance of vision transformers." (2024). Electronic Theses and Dissertations. Paper 4327.
https://doi.org/10.18297/etd/4327