Master of Science, University of Toledo, 2023, Cyber Security
Email is a vital component of modern communication. However, the growing volume of spam email poses a significant threat to individuals and organizations and can result in financial and resource losses. Developing effective spam detection mechanisms is therefore essential to safeguarding email security and ensuring safe communication, yet existing spam email detection methods have limitations: they have limited capacity to handle the high volume, complexity, and variability of natural language, and they require extensive feature engineering. Moreover, they have a limited ability to remember information from previous time steps, which results in poor performance, including high false positive rates, low recall, and an inability to detect new types of spam emails.
This thesis proposes a novel spam email detection approach that fine-tunes XLNet through supervised training on a labeled dataset of spam and non-spam emails, without requiring hand-engineered features. The fine-tuned model is then used to predict the class of previously unseen emails, and it is designed to handle the high volume, complexity, and variability of natural language in spam emails. The proposed model is evaluated on several benchmark datasets, namely SpamAssassin, Enron, and Ling-Spam, and its performance is compared with that of existing models. The model either outperforms or is at least comparable to the state-of-the-art (SOTA) models, achieving accuracy, area under the receiver operating characteristic curve (AUC), and F1 scores of 0.9869, 0.9817, and 0.9869 on the SpamAssassin dataset; 0.9892, 0.9893, and 0.9892 on the Enron dataset; 0.9944, 0.9967, and 0.9944 on the Ling-Spam dataset; and 0.9888, 0.9889, and 0.9888 on the combined dataset, respectively.
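To make the three reported metrics concrete, the following is a minimal sketch of how accuracy, AUC, and F1 can be computed for a binary spam classifier. It is not the thesis's evaluation code: the labels, scores, the 0.5 decision threshold, and the function names are all hypothetical, and the F1 shown is the plain binary F1 for the spam class, since the abstract does not state which averaging scheme was used.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of emails whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive (spam) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float], positive: int = 1) -> float:
    """AUC as the probability that a random spam email is scored higher
    than a random non-spam email (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = spam, 0 = non-spam.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]          # assumed model spam probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred), roc_auc(y_true, scores), f1_score(y_true, y_pred))
```

Accuracy and F1 depend on the chosen decision threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is why the three values can differ on the same dataset.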
The proposed model's superior performance in detecting spam emails demonstrates its potential to become a key component of email security measures.
Committee: Jared Oluoch (Committee Chair); Junghwan Kim (Committee Member); Weiqing Sun (Committee Co-Chair)
Subjects: Artificial Intelligence; Computer Engineering; Computer Science; Engineering; Experiments; Information Science; Information Technology