Date on Master's Thesis/Doctoral Dissertation

5-2024

Document Type

Doctoral Dissertation

Degree Name

Ph. D.

Department

Computer Engineering and Computer Science

Degree Program

Computer Science and Engineering, PhD

Committee Chair

Yampolskiy, Roman

Committee Co-Chair (if applicable)

Nasraoui, Olfa

Committee Member

Nasraoui, Olfa

Committee Member

Lauf, Adrian

Committee Member

Losavio, Michael

Author's Keywords

Stylometry; multimodal; authorship identification; feature fusion; text mining; source code stylometry

Abstract

This dissertation introduces multimodal stylometry, a novel approach to authorship identification that integrates text and source code features for a comprehensive understanding of an author's unique style. Traditional stylometric methods have primarily focused on either text stylometry or source code stylometry, thereby neglecting the potential insights that multimodality may provide. This research aims to bridge this gap by proposing a framework that combines textual and source code data to enhance the accuracy and reliability of authorship identification. The study begins by reviewing existing literature on authorship identification and stylometry, highlighting the limitations of unimodal approaches. Leveraging recent advancements in multimodal biometrics and feature fusion, the research introduces a methodology that extracts stylometric features from written text and source code. These multimodal features are then integrated using an extended feature fusion technique that introduces an extra layer of feature selection. To validate the proposed approach, a diverse dataset comprising texts and corresponding source code from various authors is curated. The dissertation evaluates the effectiveness of the multimodal approach against unimodal baselines. Furthermore, the research investigates the transferability of the proposed multimodal stylometry framework to distinguishing AI-generated from human-written text and source code. The findings not only advance authorship identification techniques but also hold implications for applications in forensic linguistics, digital humanities, and content analysis. Ultimately, this research underscores the significance of multimodal stylometry in estimating the identity of an author.

Included in

Data Science Commons
