Botnet activity spotting with artificial intelligence: efficient bot malware detection and social bot identification

  1. Velasco Mata, Javier
Supervised by:
  1. Enrique Alegre Gutiérrez Director
  2. Víctor González Castro Director

Defence university: Universidad de León

Defence date: 21 December 2023

  1. Ricardo Julio Rodríguez Fernández Chair
  2. Laura Fernández Robles Secretary
  3. Alberto Barrón Cedeño Committee member

Type: Thesis


In the cybercrime scope, botnets are networks of bots, automated entities that follow instructions from a cybercriminal. The capacity of these networks to operate en masse has made them one of the most popular tools for carrying out malicious activities, from spam distribution to distributed denial-of-service (DDoS) attacks. This has made botnets one of the most prevalent online threats, causing billions in losses to the global economy. The motivation of this PhD thesis is to research and propose bot detection techniques. Specifically, it focuses on two types of bots: malware bots, i.e., malicious programs that can be installed on victims’ devices without their knowledge; and social bots, i.e., fake accounts on social networks that masquerade as real humans to deceive regular users. The first work is dedicated to the detection of network traffic produced by malware bots. In particular, we aim to improve the performance of botnet traffic classification using Machine Learning by selecting the features that most increase the detection rate. For this purpose, we employ two feature selection techniques, namely Information Gain and Gini Importance, which led to three candidate subsets of five, six and seven features. Then, we evaluate the three feature subsets with three models (Decision Tree, Random Forest and k-Nearest Neighbors). To test their performance, we generate two datasets based on the CTU-13 dataset, namely QB-CTU13 and EQB-CTU13. Finally, we measure the performance as the macro-averaged F1 score over the computational time required to classify a sample. The results show that the highest performance is achieved by Decision Trees using a five-feature set, which obtained a mean F1 score of 0.850, classifying each sample in an average time of 0.78 microseconds.
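Two ingredients of the first work, Information Gain for ranking features and the macro-averaged F1 score for evaluation, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy labels and the restriction to discrete feature values are assumptions made purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels inside each feature-value group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four flows, two classes.
labels = ["bot", "bot", "normal", "normal"]
informative = ["a", "a", "b", "b"]   # separates the classes perfectly
useless = ["a", "b", "a", "b"]       # carries no class information
```

A feature with a higher information gain is a stronger candidate for the reduced feature subsets. Macro averaging gives both classes the same weight regardless of how many samples each one has.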
Nowadays, there are high-bandwidth networks where vast amounts of traffic are generated every second, and it is hard to analyze all that information for threats, especially before significant damage is done. Hence, the second work focuses on real-time detection of botnet traffic, even on high-bandwidth networks. As a solution, we propose an approach capable of carrying out an ultra-fast network analysis (i.e., on time windows of one second) without a significant loss in the F1 score for botnet detection. We compared our model with three other proposals from the literature, and it achieved the best performance: an F1 score of 0.926 with a processing time of 0.007 ms per sample. We also assessed the robustness of our model on saturated networks and on large bandwidths. In particular, our model is capable of working on networks with 10% packet loss, and our results suggest that, using commercial-grade 2.4 GHz cores, our approach would need only four cores for bandwidths of 100 Mbps and 1 Gbps, and 19 cores on 10 Gbps networks. The third and fourth works shift their focus towards social bots (fake accounts on social networks), which are a growing concern due to their promotion of fraudulent content and divisive ideologies. The damage caused by social media bots ranges from individual scams to effects on society as a whole, as they may be used to contaminate public debate with fake news and can thus also influence the political sphere. In the third work, we exploit the graph structure of Twitter to detect bots automatically. Specifically, we propose a novel pipeline approach, based on Kipf and Welling’s Graph Convolutional Network model, which overcomes the limitations of that model when it is used on graphs that are independent of the training data. We obtained an F1 score of 0.784 on the Cresci-rtbust dataset using a version of our proposal trained on seven completely independent datasets, a score 24% higher than the baseline.
Furthermore, we present a novel seed-based cross-validation method to generate class-balanced folds that minimize the intra-fold graph’s edge loss. The new pipeline and cross-validation methods could be applied to any other problem that involves graph data. We have observed that it is easy for a fake account to pose as a human with convincing metadata such as the user name, the account description, the location, and other public and editable information. It is also possible for bots to follow each other, imitating the structure of real communities. Therefore, in the fourth work, we focus on building a Twitter bot detector based on the accounts’ publication activity. For this purpose, we created a novel dataset of Twitter users that includes 17,945 samples manually labeled as bots or humans. Moreover, our dataset includes the users’ public metadata, their who-follow-who relationships within the dataset (ensuring a dense connection between the users), and their most recent publication activity. To the best of our knowledge, our dataset is the largest in terms of completeness and of Twitter users manually labeled as bots or humans. Our social bot detector proposal leverages BERTopic, a BERT-based topic predictor, to classify the tweets of the users into 102 categories. The resulting information is time-windowed at 15-minute intervals to characterize the users’ activity and used to classify them as bots or humans using our proposed classifier, an ensemble of seven LSTM-based models. Our system reported an accuracy of 0.755 and an F1 score of 0.777 on our new dataset.
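The 15-minute time-windowing step of the fourth work can be sketched as follows. The abstract does not specify the exact feature representation, so this example assumes one plausible encoding: a per-window count of tweets for each topic. The function name, its parameters and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)


def window_topic_counts(tweets, start, n_windows, n_topics):
    """Count tweets per topic in consecutive 15-minute windows.

    tweets: iterable of (timestamp, topic_id) pairs for one user.
    Returns n_windows rows, each a list of n_topics counts.
    """
    counts = [[0] * n_topics for _ in range(n_windows)]
    for timestamp, topic_id in tweets:
        # Integer index of the window that contains this tweet.
        index = (timestamp - start) // WINDOW
        # Ignore tweets outside the observed period or with unknown topics.
        if 0 <= index < n_windows and 0 <= topic_id < n_topics:
            counts[index][topic_id] += 1
    return counts


# Hypothetical activity of one user over one hour, with three topics.
start = datetime(2023, 1, 1, 0, 0)
tweets = [
    (start + timedelta(minutes=5), 0),
    (start + timedelta(minutes=14), 0),
    (start + timedelta(minutes=15), 2),
    (start + timedelta(minutes=50), 1),
    (start - timedelta(minutes=1), 0),   # before the period: skipped
]
sequence = window_topic_counts(tweets, start, n_windows=4, n_topics=3)
```

Each row of `sequence` is one time step, so the resulting matrix has the shape a sequence model such as an LSTM expects. In the setting described above, `n_topics` would be 102.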