Date on Master's Thesis/Doctoral Dissertation
12-2016
Document Type
Doctoral Dissertation
Degree Name
Ph. D.
Department
Computer Engineering and Computer Science
Degree Program
Computer Science and Engineering, PhD
Committee Chair
Kantardzic, Mehmed
Committee Co-Chair (if applicable)
Elmaghraby, Adel S.
Committee Member
Elmaghraby, Adel S.
Committee Member
Nasraoui, Olfa
Committee Member
Lauf, Adrian
Committee Member
Zurada, Jozaf
Author's Keywords
streaming data; partial labeling; class imbalance; concept drift; classification; multi class
Abstract
Stream processing frameworks are designed to process data that arrives continuously over time. An example is the stream of emails that a user receives every day. Most real-world data streams are also imbalanced, as in the email stream, which contains few spam emails compared with many legitimate ones. Classifying an imbalanced data stream is challenging for several reasons. First, data streams are huge and cannot be stored in memory for one-time processing. Second, if the data is imbalanced, the accuracy on the majority class dominates the results. Third, data streams change over time, which degrades model performance; hence the model should be updated when such changes are detected. Finally, the true labels of all samples are not available immediately after classification, and in real-world applications only a fraction of the data can be labeled, because labeling is expensive and time consuming. In this thesis, a framework for modeling streaming data with imbalanced classes is proposed. This framework is called Reduced Labeled Samples (RLS). RLS is a chunk-based learning framework that builds a model from a partially labeled data stream when the characteristics of the data change. In RLS, only a fraction of the samples are labeled and used in modeling, and the performance is not significantly different from that obtained with 100% labeling. RLS maintains an ensemble of classifiers to boost performance. RLS uses the information from labeled data in a supervised fashion and is also extended to use the information from unlabeled data in a semi-supervised fashion. RLS addresses both binary and multi-class partially labeled data streams, and the results show that the basis of RLS is effective even for multi-class classification problems.
Overall, RLS is shown to be an effective framework for processing imbalanced and partially labeled data streams.
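Two points in the abstract can be illustrated with a small sketch: overall accuracy is dominated by the majority class, and a chunk-based learner sees only a labeled fraction of each chunk. This is not the RLS implementation. The helper names (`chunks`, `pick_labeled`, `accuracy_and_recalls`), the synthetic 5% spam stream, the chunk size of 250, the 10% labeling budget, and the uniform random choice of samples to label are all assumptions made for illustration; the abstract does not say how RLS selects samples for labeling.

```python
import random


def chunks(stream, size):
    """Yield consecutive fixed-size chunks from an iterable stream."""
    chunk = []
    for item in stream:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


def pick_labeled(chunk, fraction, rng):
    """Pick indices of samples to send for labeling.

    Assumption: uniform random selection, used only as a placeholder policy.
    """
    k = max(1, round(fraction * len(chunk)))
    return sorted(rng.sample(range(len(chunk)), k))


def accuracy_and_recalls(y_true, y_pred):
    """Return overall accuracy and a dict of per-class recall."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recalls = {}
    for cls in set(y_true):
        total = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls[cls] = hits / total
    return correct / len(y_true), recalls


# Synthetic imbalanced stream: 1 in 20 messages is spam (assumed ratio).
labels = ["spam" if i % 20 == 0 else "ham" for i in range(1000)]

# A trivial model that always predicts the majority class.
predictions = ["ham"] * len(labels)
accuracy, recalls = accuracy_and_recalls(labels, predictions)
print("accuracy:", accuracy)
print("per-class recall:", recalls)

# Chunk-based processing where only a fraction of each chunk is labeled.
rng = random.Random(0)
for number, chunk in enumerate(chunks(labels, 250)):
    labeled_idx = pick_labeled(chunk, 0.10, rng)
    seen_spam = sum(chunk[i] == "spam" for i in labeled_idx)
    print(f"chunk {number}: {len(labeled_idx)} labeled, {seen_spam} spam among them")
```

The always-"ham" model reaches high accuracy while never detecting spam, which is why per-class measures matter for imbalanced streams. The per-chunk counts also show how few minority samples a small labeling budget may contain.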
Recommended Citation
Arabmakki, Elaheh, "A reduced labeled samples (RLS) framework for classification of imbalanced concept-drifting streaming data." (2016). Electronic Theses and Dissertations. Paper 2602.
https://doi.org/10.18297/etd/2602