Analysis and classification of spam email using artificial intelligence to identify cyberthreats

  1. Jáñez Martino, Francisco
Supervised by:
  1. Enrique Alegre Gutiérrez Tutor
  2. Víctor González Castro Director
  3. Rocío Aláiz Rodríguez Director

Defence university: Universidad de León

Defence date: 21 December 2023

Committee:
  1. Arturo Montejo Ráez Chair
  2. Laura Fernández Robles Secretary
  3. Petr Motlicek Committee member

Type: Thesis

Abstract

In this thesis, we propose new models, methodologies, approaches and datasets to analyze and identify rising cyberthreats in spam emails. Motivated by our collaboration with the Spanish National Institute of Cybersecurity (INCIBE), we focus on developing applications and conducting studies to improve the early detection of these risky and harmful emails. Several of the contributions presented in this dissertation are planned to be incorporated into tools developed by INCIBE to issue earlier and more detailed warnings to organizations and citizens about potential risks associated with spam emails. Our approach relies heavily on Natural Language Processing, as well as Machine Learning and Deep Learning techniques, mainly centred on supervised learning methods. First, we employed text classification methods to classify spam emails by cybersecurity topic, for the first time in the literature. Our supervised approaches led us to build custom, novel datasets for each contribution. In this case, we created the SPam EMail Classification dataset (SPEMC), which includes eleven classes of spam emails based on cybersecurity topics. SPEMC is composed of two sub-datasets, SPEMC-E-15K and SPEMC-S-15K, which contain emails written in English and Spanish, respectively. We used SPEMC to evaluate combinations of four text representation techniques with four Machine Learning models. The combination of Term Frequency - Inverse Document Frequency (TF-IDF) with Logistic Regression (LR) achieved the highest performance on the English emails, with a Macro F1-score of 0.953, while TF-IDF with Naïve Bayes (NB) achieved 0.945 on the Spanish dataset. In both languages, TF-IDF with LR was the fastest combination, taking 2.0 ms and 2.2 ms per email for English and Spanish, respectively.
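As context for the Macro F1-scores reported above, the following minimal sketch shows how the metric is computed: F1 is calculated for each class separately and the per-class values are averaged with equal weight, so infrequent classes count as much as frequent ones. The function name `macro_f1` and the three topic labels are hypothetical and chosen for illustration only; they are not taken from the thesis or from SPEMC.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only (not the SPEMC classes).
y_true = ["phishing", "phishing", "malware", "scam", "scam", "scam"]
y_pred = ["phishing", "malware", "malware", "scam", "scam", "phishing"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

In this toy case the per-class F1 values are 0.5, 2/3 and 0.8, so the macro average is about 0.656.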
Secondly, we aimed at understanding the role of persuasion in spam emails to combat cybersecurity threats more effectively. We developed intelligent systems that use Natural Language Processing to detect persuasion and the techniques employed at three granularity levels: full emails, sentences, and specific text spans (i.e., a group of one or more words shorter than a sentence). We replicated the Proppy classifier (Barrón-Cedeño et al., 2019) to spot persuasion in full emails and built our binary and multilabel models on top of RoBERTa (Liu et al., 2019) for sentence and text span classification (based on Chernyavskiy et al. (2020)). We created a novel dataset called Persuasion Sentence in Spam Emails (PerSentSE), containing sentences annotated for both binary (persuasion or not) and multilabel classification. For the multilabel approach, we considered eight persuasion techniques: Appeal to authority, Appeal to fear/prejudice, Doubt, Exaggeration or minimization, Flag-waving, Loaded Language, Name Calling or Labeling, and Repetition. We collected the spam emails from the Bruce Guenter repository.
Lastly, our objective was to create an intelligent system capable of detecting potentially risky spam emails for both individuals and organizations. We created Spam Email Risk Classification (SERC-4K), a novel dataset of spam emails labelled with one of two categories, low risk or high risk, according to the potential risk their content poses to users, as well as with a continuous risk value from 1 to 10. The dataset is composed of two sub-datasets, one with spam emails shared by INCIBE (SERC-I) and another collected from the Bruce Guenter repository, Spam Archive (SERC-BG). SERC-I contains English and Spanish emails, while almost all emails in SERC-BG are written in English. First, our approach attempted to extract potentially useful features from headers, text, attachments, URLs and protocols (56 features in total).
Then, the feature sets were evaluated with three popular Machine Learning classifiers, with Random Forest achieving the highest classification performance (F1-score of 0.914). For the regression approach, the Random Forest Regressor achieved the lowest Mean Squared Error (MSE), 0.579. Our work also included a feature evaluation to determine the importance of each feature and feature set. In the design of our methodologies, we considered the influence of dataset shift, as well as the fact that the spam domain is an adversarial environment. Our email processing sought to overcome some spammer strategies, such as image-based spam and hidden text.
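To put the reported MSE in context on the 1 to 10 risk scale, the sketch below computes the mean squared error for a handful of risk scores. The function name and all values are hypothetical, chosen only to illustrate the metric; they are not drawn from SERC-4K.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical risk scores on a 1-10 scale, for illustration only.
true_risk = [2.0, 7.0, 9.0, 4.0]
pred_risk = [3.0, 6.5, 9.0, 5.5]
mse = mean_squared_error(true_risk, pred_risk)
print(mse)
```

Because errors are squared, the MSE is expressed in squared risk units; its square root gives a typical error size on the original scale.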