Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.

Department

Computer Engineering and Computer Science

Degree Program

Computer Science and Engineering, PhD

Committee Chair

Kantardzic, Mehmed

Committee Co-Chair (if applicable)

Chang, Dar-Jen

Committee Member

Chang, Dar-Jen

Committee Member

Zhang, Harry

Committee Member

Imam, Ibrahim

Committee Member

Zurada, Jozef

Committee Member

Vickers-Smith, Rachel

Author's Keywords

text mining; social network mining; data science; data mining; NLP

Abstract

As regulations surrounding cannabis continue to develop, demand for cannabis-based products is rising. Products containing cannabidiol (CBD), which do not produce the psychoactive effects commonly associated with THC, have become highly popular in recent years as a potential treatment for a range of conditions, particularly those involving pain or sleep disorders. However, due to current federal policies, these products have yet to undergo comprehensive safety and efficacy testing. Data harvested from social networks, analyzed with natural language processing (NLP) techniques, have been used to investigate healthcare trends such as disease tracking and drug surveillance. Applied to Twitter data, NLP can offer insight into public perceptions of CBD, as well as the marketing tactics used by those selling such loosely regulated substances to the general public. Given the lack of comprehensive clinical testing of CBD, the health claims that sellers make about their products are highly dubious and potentially dangerous, as the ongoing COVID-19 misinformation shows. It is therefore critically important to efficiently identify unsupportable claims to guide public health policy and action. To this end, we propose a framework, the Cannabidiol Tweet Miner (CBD-TM), which uses NLP techniques, including text mining and sentiment analysis, to analyze the similarities and differences between commercial and personal tweets that mention CBD. CBD-TM enables us to identify conditions typically associated with commercial CBD advertising, or conditions not associated with positive sentiment, that are also absent from personal conversations. In this way, we can uncover areas where the public may be misled by CBD sellers.
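The comparison of commercial and personal tweets described above can be illustrated with a minimal sketch. It is not the CBD-TM implementation: the condition lexicon, the function names, the `min_rate` threshold, and the example tweets are all hypothetical, and it assumes the tweets have already been labeled as commercial or personal. Sentiment analysis is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical condition lexicon (assumption); a real study would use a
# curated medical vocabulary rather than five hand-picked words.
CONDITIONS = {"pain", "anxiety", "insomnia", "cancer", "covid"}


def condition_rates(tweets):
    """Return the fraction of tweets that mention each condition at least once."""
    counts = Counter()
    for text in tweets:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(tokens & CONDITIONS)
    total = max(len(tweets), 1)  # avoid division by zero on an empty group
    return {c: counts[c] / total for c in CONDITIONS}


def commercial_only(commercial, personal, min_rate=0.1):
    """Conditions promoted in commercial tweets but absent from personal ones."""
    com = condition_rates(commercial)
    per = condition_rates(personal)
    return sorted(c for c in CONDITIONS if com[c] >= min_rate and per[c] == 0.0)


# Invented example tweets, for illustration only.
commercial = [
    "Our CBD oil relieves pain and fights cancer, order now",
    "CBD gummies for anxiety and insomnia, 20% off",
    "Boost immunity against covid with CBD",
]
personal = [
    "Tried CBD for my back pain, helped a little",
    "cbd before bed does nothing for my insomnia tbh",
    "CBD took the edge off my anxiety today",
]

flagged = commercial_only(commercial, personal)
print(flagged)
```

In this toy data, the flagged conditions are the ones that sellers mention but that never come up in personal conversations, which is the kind of gap the framework is meant to surface.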
Since CBD's rise in popularity, advertisements making bold claims about its benefits have become increasingly prevalent. The COVID-19 pandemic created a new opportunity for sellers to promote products that purportedly treat and/or prevent the virus, CBD among them. Although the U.S. Food and Drug Administration (FDA) issued multiple warnings to CBD sellers, this type of misinformation persists. In response, we extend the CBD-TM framework with an additional layer of tweet classification designed to identify tweets that make potentially misleading claims about CBD's efficacy in treating and/or preventing COVID-19. Our approach uses a transformer-based language model to establish the semantic relationship between statements containing false information that were extracted from the FDA's website and tweets conveying similar false claims. Specifically, we employ transfer learning via pre-trained deep language models, which have performed strongly on a range of natural language processing and understanding tasks, enabling improved misinformation identification in tweets even with relatively small training sets. Furthermore, this extension of CBD-TM can be readily adapted to detect other forms of misinformation.
As conversations surrounding CBD on Twitter evolve, the topics being discussed change, which can lead to concept drift. We observed significant changes within the CBD Twitter data stream with the emergence of COVID-19, which introduced a medical condition that was not discussed in connection with CBD before the pandemic. Such shifts introduce concept drift into CBD-TM, which can degrade our tweet classification models. It is therefore crucial to identify when concept drift occurs in order to maintain model accuracy. To this end, we propose an approach for identifying potential changes within social network streams, allowing us to determine how and when these conversations evolve. Our approach uses a BERT-based topic model to capture how conversations related to CBD change over time. Understanding these topic changes allows us to manage concept drift in CBD-TM more effectively, so that the tweet classification models remain accurate and can continue to identify potentially harmful misinformation related to CBD.
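One simple way to flag a change in topics between time windows is sketched below. This is an illustrative stand-in, not the dissertation's method: it assumes each tweet has already been assigned a topic ID by some topic model (BERT-based or otherwise), and it compares consecutive windows using Jensen-Shannon divergence. The function names, the threshold of 0.2, and the example windows are all assumptions.

```python
import math
from collections import Counter


def topic_distribution(topic_ids, n_topics):
    """Normalize per-tweet topic assignments into a probability vector."""
    counts = Counter(topic_ids)
    total = len(topic_ids) or 1  # guard against an empty window
    return [counts[t] / total for t in range(n_topics)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_points(windows, n_topics, threshold=0.2):
    """Indices of windows whose topic mix differs sharply from the previous one."""
    dists = [topic_distribution(w, n_topics) for w in windows]
    return [
        i
        for i in range(1, len(dists))
        if js_divergence(dists[i - 1], dists[i]) > threshold
    ]


# Invented topic IDs: 0 = pain, 1 = sleep, 2 = COVID-19 (hypothetical labels).
windows = [
    [0, 0, 1, 1, 0, 1],  # before the pandemic
    [0, 1, 1, 0, 0, 1],  # same topic mix, different order
    [2, 2, 0, 2, 1, 2],  # a new topic dominates the stream
]

drifted = drift_points(windows, n_topics=3)
print(drifted)
```

A window flagged this way would be a candidate point for re-examining or retraining the downstream tweet classifiers. The threshold would need tuning on real data.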