BotChase: Integrated Unsupervised Learning with Decision Tree Classifier for Graph-Based Bot Detection
Bot detection using machine learning (ML) with network flow-level features has been extensively studied in the literature. However, existing flow-based approaches typically incur a high computational overhead and do not completely capture network communication patterns, which can expose additional aspects of malicious hosts. Recently, bot detection systems that apply ML to communication graph analysis have gained attention as a way to overcome these limitations. A graph-based approach is rather intuitive, as graphs are a true representation of network communications. To overcome the issues of existing models, this project combines supervised and unsupervised algorithms: a model is trained and then applied to new request data to identify whether a request is normal or an attack. Using the unsupervised K-means algorithm, we separate the dataset into Bot (attack) and BENIGN (normal) records. K-means groups similar records into the same cluster; records with a small number of requests are filtered out, and records with a high number of requests are treated as BOT (attack). After separating the records, we apply a graph-based feature extraction technique to the dataset. The dataset is converted into a graph in which each IP address is a vertex and each source is connected to its destination by an edge. Each edge has a weight based on its incoming and outgoing link connections. To obtain the edge weights, we calculate betweenness centrality, incoming edge weight, outgoing edge weight, and alpha centrality weight. From these calculations we extract in_degree, out_degree, in_degree_weight, out_degree_weight, between_ness, clustering, and alpha_centrality as features. Any record with a high number of connections is labeled 1 (BOT); otherwise it is labeled 0 (normal). After extracting features from the graph, we normalize them using the mean value of each feature.
The normalized features are then used to train a decision tree classifier, and the resulting model can be used to predict the type of future requests.
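
As an illustration of the graph-based feature extraction step described above, the following minimal Python sketch builds a directed communication graph from flow records and derives four of the listed features (in_degree, out_degree, in_degree_weight, out_degree_weight). It is a sketch under stated assumptions, not the project's actual implementation: the `flows` records, the function names, the `min_connections` threshold, and the choice to normalize by dividing each feature by its mean over all hosts are all hypothetical. The between_ness, clustering, and alpha_centrality features are omitted for brevity.

```python
from collections import defaultdict

# Hypothetical flow records: (source IP, destination IP, number of requests).
flows = [
    ("10.0.0.1", "10.0.0.9", 40),
    ("10.0.0.2", "10.0.0.9", 35),
    ("10.0.0.3", "10.0.0.9", 50),
    ("10.0.0.9", "10.0.0.4", 2),
    ("10.0.0.5", "10.0.0.4", 1),
]


def graph_features(flows):
    """Return {ip: [in_degree, out_degree, in_degree_weight, out_degree_weight]}."""
    preds = defaultdict(set)    # vertex -> distinct hosts sending to it
    succs = defaultdict(set)    # vertex -> distinct hosts it sends to
    in_w = defaultdict(float)   # vertex -> total weight of incoming edges
    out_w = defaultdict(float)  # vertex -> total weight of outgoing edges
    hosts = set()
    for src, dst, count in flows:
        hosts.update((src, dst))
        succs[src].add(dst)
        preds[dst].add(src)
        out_w[src] += count
        in_w[dst] += count
    return {
        h: [len(preds[h]), len(succs[h]), in_w[h], out_w[h]]
        for h in sorted(hosts)
    }


def mean_normalize(features):
    """Divide each feature by its mean over all hosts (assumed scheme)."""
    rows = list(features.values())
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    return {
        h: [v / m if m else 0.0 for v, m in zip(row, means)]
        for h, row in features.items()
    }


def label_hosts(features, min_connections=3):
    """Label a host 1 (BOT) if it has many connections, else 0 (normal).

    min_connections is a hypothetical threshold chosen for this toy data.
    """
    return {
        h: 1 if row[0] + row[1] >= min_connections else 0
        for h, row in features.items()
    }


features = graph_features(flows)
normalized = mean_normalize(features)
labels = label_hosts(features)

for host in features:
    print(host, features[host], labels[host])
```

The rows of `normalized`, paired with `labels`, are in the shape a decision tree classifier would expect as training input; the classifier itself is left out here to keep the sketch free of third-party dependencies.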