Detección de amenazas persistentes avanzadas en redes de comunicaciones a partir de datos de flujo [Detection of advanced persistent threats in communication networks from flow data]

  1. Adrián Campazas Vega
Supervised by:
  1. Vicente Matellán Olivera Director
  2. Ángel Manuel Guerrero Higueras Director

Defence university: Universidad de León

Defence date: 12 July 2023

Committee:
  1. José María de Fuentes García-Romero de Tejada Chair
  2. Francisco Javier Rodríguez Lera Secretary
  3. Sonsoles López Pernas Committee member
Department:
  1. ING. MECÁNICA, INFORMÁTICA Y AEROESPACIAL

Type: Thesis

Teseo: 819109 DIALNET

Abstract

Advanced persistent threats (APTs) are a significant concern for governments, organizations, and enterprises because of their persistent and stealthy nature. A key characteristic of APTs is that they generate malicious traffic at various stages of the attack lifecycle. Prior literature has shown that malicious traffic can be detected with machine learning models trained on network packets, which contain all the information exchanged in network communications, including the payload. However, in networks that handle an enormous amount of traffic, analyzing every packet that the routers manage may not be feasible.

To address this challenge, lightweight flow-based protocols are used to analyze network activity in such infrastructures. A flow is defined as a set of IP network-layer datagrams that pass through an observation point in the network during a specific time interval. All datagrams belonging to the same flow share certain characteristics, such as source and destination IP addresses and ports. Unlike network packets, network flows do not store the packet payload, which reduces the computational load on routers but also discards a significant portion of the information contained in the datagrams. Some networks manage so much traffic that, to reduce the computational load on their devices, they select only one packet out of every X when generating network flows. This process is known as sampling.

The primary objective of this study is to detect malicious network traffic, including that which may be generated by APTs, in such infrastructures using machine learning techniques. This research aims to enhance the security of companies, organizations, and end users by identifying and mitigating potential cyber threats through the analysis of network flows. Training machine learning models effectively requires access to accurately labeled datasets.
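The flow definition and the one-in-X packet sampling described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the `Packet` record, the function names, the deterministic sampling rule, and the toy traffic are all assumptions made for illustration, and real flow exporters also apply timeouts and may sample randomly.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet header record; the payload is deliberately absent."""
    timestamp: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int


def sample_packets(packets, one_in):
    """Keep one packet out of every `one_in` (simple deterministic sampling)."""
    return [p for i, p in enumerate(packets) if i % one_in == 0]


def build_flows(packets):
    """Aggregate packets that share the same 5-tuple into flow records."""
    flows = {}
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flow = flows.setdefault(
            key, {"packets": 0, "bytes": 0, "first": p.timestamp, "last": p.timestamp}
        )
        flow["packets"] += 1
        flow["bytes"] += p.size
        flow["first"] = min(flow["first"], p.timestamp)
        flow["last"] = max(flow["last"], p.timestamp)
    return flows


# Invented traffic: one long transfer (1,000 packets) followed by a 20-port scan.
transfer = [Packet(i * 0.001, "10.0.0.1", "10.0.0.2", 40000, 443, "TCP", 1500)
            for i in range(1000)]
scan = [Packet(1.0 + i * 0.001, "10.0.0.9", "10.0.0.2", 50000, port, "TCP", 60)
        for i, port in enumerate(range(1, 21))]
packets = transfer + scan

full_flows = build_flows(packets)
sampled_flows = build_flows(sample_packets(packets, one_in=250))

print(len(full_flows), "flows without sampling")
print(len(sampled_flows), "flows with 1-in-250 sampling")
```

In this toy run the unsampled traffic yields 21 flows, while 1-in-250 sampling leaves only 2, which shows how short-lived flows such as individual scan probes can largely disappear from sampled flow data.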
To obtain such labeled datasets, DOROTHEA has been developed and validated as an implementation of a previously proposed framework. Its primary function is to generate datasets of network flows produced under various sampling thresholds. The tool is designed to be flexible and scalable, and it generates malicious and benign traffic flows in isolation, which allows the generated flows to be labeled unambiguously. DOROTHEA thus makes it possible to train machine learning models with accurate and relevant data, ultimately leading to more effective detection of malicious network traffic.

To evaluate the feasibility of detecting malicious traffic in networks that use flow-based protocols and packet sampling, two distinct methodologies were employed. The first uses supervised learning algorithms to train detection models and investigates how varying the sampling threshold affects their efficacy. The second employs anomaly detection techniques to identify potentially malicious traffic patterns in network flows. Together, these complementary approaches provide a comprehensive evaluation of the ability of machine learning techniques to detect malicious network traffic in complex, high-traffic infrastructures.

In the supervised learning approach, DOROTHEA was used to generate datasets of network flows collected under varying sampling thresholds. These datasets contained instances of port scanning attacks and were generated using sampling thresholds of 1/250, 1/500, 1/1,000, and 1/10,000. The datasets were then used to train and evaluate several machine learning models: KNN, LR, LSVC, LSVC+SGD, MLP, and RF.
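As a rough illustration of the supervised approach, the sketch below labels flows by the isolated capture they came from and classifies new flows with a pure-Python k-nearest-neighbours vote. It is a minimal sketch under stated assumptions, not the thesis pipeline: the two features (packets per flow, mean bytes per packet), the sample values, and the function name `knn_predict` are invented, and no feature scaling is applied.

```python
import math
from collections import Counter

# Hypothetical flow features: (packets in flow, mean bytes per packet).
# Because benign and attack traffic are generated separately, every flow
# in a capture can simply inherit the label of that capture.
benign_capture = [(12, 900), (8, 1200), (30, 1400), (5, 700)]
attack_capture = [(1, 60), (1, 64), (2, 60), (1, 74)]  # port-scan probes

train_X = benign_capture + attack_capture
train_y = ["benign"] * len(benign_capture) + ["malicious"] * len(attack_capture)


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training flows."""
    nearest = sorted(
        (math.dist(features, x), label) for features, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


for flow in [(1, 66), (10, 1000)]:
    print(flow, "->", knn_predict(train_X, train_y, flow))
```

Repeating the same training and evaluation once per sampling threshold would give the kind of per-threshold comparison described above; that loop is omitted here because it depends on datasets that are not part of this record.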
To evaluate how well the supervised models generalize, they were subsequently tested on a dataset collected from the production routers of RedCAYLE, the regional academic network of Castilla y León, and on a publicly available dataset known as BoT-IoT. Testing on datasets from different sources is intended to demonstrate that the supervised learning models can detect malicious network traffic in diverse environments. The results show that machine learning-based detection models can effectively identify malicious network traffic in sampled flow data, although their performance varies significantly with the sampling rate. Specifically, as the sampling threshold increases, certain models lose detection capability. The KNN, MLP, and RF models, however, maintained their detection capability across all of the studied thresholds, and KNN was the most robust of the three.

To investigate whether anomaly detection models can identify malicious traffic in high-traffic network infrastructures, two models, One-Class SVM (OC-SVM) and Isolation Forest (iForest), were evaluated using synthetic sampled flow data and real sampled flow data from the RedCAYLE network. These models were not trained to detect a specific type of attack; instead, they learned normal network traffic patterns and classified any traffic that deviated from those patterns as anomalous. The training datasets contained only benign traffic, while the test datasets included port scanning attacks. To ensure the robustness of the models, the synthetic dataset also included SQL injection attacks, which differ significantly from port scanning attacks.
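The train-on-benign, flag-deviations idea can be sketched with a deliberately simple stand-in. The example below is not OC-SVM or iForest: it is a per-feature z-score baseline, chosen only because it fits in a few lines of standard-library Python. The class name `ZScoreDetector`, the features, the threshold, and all values are assumptions for illustration.

```python
import statistics


class ZScoreDetector:
    """One-class baseline: learn per-feature mean and spread from benign flows."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, benign_flows):
        columns = list(zip(*benign_flows))
        self.means = [statistics.fmean(c) for c in columns]
        # Fall back to 1.0 so a constant feature does not cause division by zero.
        self.stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
        return self

    def is_anomalous(self, flow):
        """Flag a flow if any feature lies more than `threshold` deviations away."""
        return any(
            abs(value - mean) / stdev > self.threshold
            for value, mean, stdev in zip(flow, self.means, self.stdevs)
        )


# Hypothetical benign training flows: (packets in flow, mean bytes per packet).
benign_train = [(10, 900), (12, 1000), (9, 950), (11, 1100), (8, 1050), (10, 1000)]
detector = ZScoreDetector(threshold=3.0).fit(benign_train)

# The test set mixes an unseen benign flow with a port-scan-like probe.
for flow in [(11, 980), (1, 60)]:
    print(flow, "anomalous" if detector.is_anomalous(flow) else "normal")
```

As in the evaluation described above, the detector never sees an attack during training; anything far from the learned benign profile is reported as anomalous, whatever its cause.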
The evaluation showed that both anomaly detection models successfully identified anomalous traffic patterns in the test datasets, which indicates that such models can be an effective tool for detecting malicious network traffic in high-traffic network infrastructures. The OC-SVM model detected network attacks as anomalies with high accuracy in both the synthetic traffic and the flow data collected from RedCAYLE routers, which suggests that anomaly detection models can potentially detect unknown or zero-day attacks.

In conclusion, the experiments conducted in this study demonstrate that malicious traffic can be detected in networks that handle a large amount of traffic by using both supervised machine learning models and anomaly detection models. This capability allows potential victims to be warned of threats and increases the overall security of the network. The thesis is a starting point for further research aimed at improving the detection of attacks in large communication networks.