A Hybrid Approach for the Analysis of Feature Selection using Information Gain and BAT Techniques on The Anomaly Detection

Every day, millions of people in many institutions communicate with each other over the Internet. The past two decades have witnessed unprecedented growth in Internet use around the world. Alongside this rapid growth, an ever-increasing number of attacks carried out over the Internet is being reported. In such a difficult environment, Anomaly Detection Systems (ADS) play an important role in monitoring and analyzing daily Internet activity for security breaches and threats. However, the data routinely generated by computer networks are enormous in size, and much of it is of little use. This creates a major challenge for ADSs, which must examine all the features of a dataset to identify intrusive patterns. Feature selection is therefore an important factor in modeling anomaly-based intrusion detection systems: an irrelevant feature can lead to overfitting, which in turn degrades the modeling power of classification algorithms. The objective of this study is to analyze and select the most discriminating input features for constructing effective and computationally efficient schemes for an ADS. In the first step, a heuristic algorithm called IG-BA is proposed for dimensionality reduction; it selects the optimal feature subset based on the concept of entropy. The relevant and meaningful features are then selected before a number of classifiers are applied. Experiments were carried out on the CICIDS-2017 dataset with (1) Random Forest (RF), (2) Bayes Network (BN), (3) Naive Bayes (NB), (4) J48, and (5) Random Tree (RT), and the results show better detection precision and faster execution time. The proposed heuristic algorithm outperforms the existing ones, as it is both more accurate in detection and faster. Among the classifiers, Random Forest emerges as the best when combined with the feature selection technique, owing to its accuracy with the optimally selected features.


1. Introduction
Millions of people in various organizations across continents communicate with each other over the Internet. The past two decades have seen an exponential increase in the number of Internet users; nearly 4 billion people worldwide currently use the Internet [3]. An intrusion detection system (IDS) monitors network traffic to identify malicious events or privacy breaches, and alerts a monitoring station or initiates preventive action against a detected threat. IDSs can be classified in two ways, by their location in the network or by their detection method, as shown in Fig. 1.

Fig. 1. Classification of Intrusion Detection System (IDS)
Host-based IDS: This runs directly on the client PC and examines information such as log files, running processes, and connected clients. If changes are made to important user or operating system files, an alarm is sent to the administrator to take appropriate action [1].

Network-based IDS:
This system monitors and examines packets traveling over a network to identify activities such as denial of service [1,2].

Based on the detection method, an IDS can also be divided into two types: misuse detection and anomaly detection. Misuse detection works by comparing user activity with a stored knowledge base of signatures of known attacks. It checks each incoming connection against the knowledge base and, if there is a match, stops and blocks the connection. This type has a high accuracy rate in detecting known attacks. Anomaly detection recognizes intrusions by tracking irregular behavior in network traffic that may indicate an attack. Abnormal behavior can be defined either as a violation of the accepted thresholds for the frequency of events in a connection or as a violation by the user of the profile built for normal behavior. This approach can be characterized as a statistical, data mining, or learning-based method [4]. An anomaly-based IDS is able to identify new attacks as well as known ones [6]. Moreover, the anomaly-based approach analyzes data based on general properties such as size, connection time, and number of packets. It therefore does not need to see the content of the message, and it can also analyze encrypted protocols. Owing to these advantages, the anomaly detection method is used extensively to detect and prevent network attacks. Previous works [9] - [13] have focused on applying feature selection techniques to identify anomalies more accurately. Previous researchers have generally relied on information gain to analyze significant and relevant features.
In this study, a version of the CICIDS-2017 dataset containing the critical features is used, as it exhibits dense traffic and supports a wide range of anomaly detection methods. As mentioned in [5], a learning model trained on data with many features tends to overfit, which results in decreased performance, higher memory use, and higher computational cost. Information gain tends to help where the data involve complex features with few values. Here, a new mechanism is introduced to select ensemble features and then to group them according to their weight values. Five classification algorithms, namely the J48 classifier, the Naive Bayes classifier (NBC), the Bayes Net classifier (BNC), the Random Tree classifier (RTC), and the Random Forest classifier (RFC), are then applied to each feature group to detect anomalies and attacks in the dataset. The weighted features obtained from information gain are thus checked against the anomaly and attack detection results: the feature groups are validated by comparing their detection results, and the groups that yield higher detection accuracy are regarded as more relevant and significant.
Section 2 presents relevant prior research on this topic. Section 3 briefly describes the dataset and the experimental setup. The experimental part, including the results of this study, is then discussed. Finally, Section 5 presents the conclusions and potential future work.

2. Related works
Most applications today depend on a network or computer system, whose behavior is analyzed for threats by the technique known as intrusion detection. Intrusions compromise properties of the network or computer system, including the integrity, availability, and confidentiality of the data concerned [5]. The study in [8] examined the characteristics of network traffic and identified a number of mechanisms for handling intrusions, mostly filter, wrapper, and combined algorithms. In [15], feature extraction with an ensemble of filter and wrapper methods assigns a weight to every feature, and the highest-ranked features are passed to a clustering approach. In some work, the popular resampling method called synthetic minority oversampling technique (SMOTE) [14] is applied to address the class imbalance problem. Two techniques, Ensemble Feature Selection (EFS) and Principal Component Analysis (PCA), were later combined and applied to an AdaBoost-based IDS to improve classification performance. Information gain (IG), one of the most popular filter methods among researchers, is used as a feature selection mechanism that computes a ranking score for each feature. The ranking weights are then used to determine the optimal features with respect to the class label. Researchers have used weight score thresholds of > 0.4, > 0.001, and > 0.8 [16] [14].

3. Feature selection
Feature selection is the mechanism used to extract important and relevant information. It is generally used to separate features into those that are relevant to the class label and those that are not. Relevant features carry information that is useful for determining the class, whereas non-informative features provide very little information about the class [1]. The main objective of feature selection is to filter out non-informative features and keep the informative ones, so that the maximum information about the class output is passed on. A number of feature selection methods are available for this purpose; they are generally classified into filter, wrapper, and combined (ensemble) approaches [17] [19]. The filter method assesses and extracts relevant features from the data using a statistical approach. The wrapper method selects the relevant feature subset using a classification criterion, but it is computationally very expensive. The ensemble, or integrated, method applies feature selection together with a learning criterion to extract the optimal features from the data. Such ensemble feature selection methods are less expensive than the wrapper method.

Information Gain (IG):
Information Gain is a well-known filter approach in which each feature is evaluated by how much information it provides for identifying the class of attack.
Let F be a feature and C the corresponding class. The entropy of the class and its entropy conditioned on the feature F are given by (1) and (2). From (1) and (2), the Information Gain of feature F is obtained as the reduction in class entropy once F is known. After the IG has been calculated, all features are ranked by their IG value. Finally, the top M features are taken as the subset of relevant, informative features. The selected features, together with their IG values, provide the information that helps to determine the target output class.
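For reference, the standard textbook definitions that this description relies on can be written as follows. This is a sketch under the assumption that the paper uses the usual Shannon-entropy form; the symbols C (class), F (feature), c and f (their values), and the numbering (1) to (3) are chosen here for illustration.

```latex
\begin{align}
H(C) &= -\sum_{c \in C} P(c)\,\log_2 P(c) \tag{1}\\
H(C \mid F) &= -\sum_{f \in F} P(f) \sum_{c \in C} P(c \mid f)\,\log_2 P(c \mid f) \tag{2}\\
IG(C; F) &= H(C) - H(C \mid F) \tag{3}
\end{align}
```

Under these definitions, a feature that splits the records into class-pure groups has H(C | F) = 0 and so attains the maximum gain H(C), while a feature that is constant across all records has a gain of 0.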

3.2. Bat Algorithm (BA)
The bat algorithm [19][20][21] is a computational intelligence and optimization method inspired by the behavior of microbats. Every bat is assumed to fly with a random velocity, which is updated at each iteration together with its frequency and position. In (5), β ∈ [0,1] is a random vector drawn from a uniform distribution.
A solution is obtained by local search, and a new solution for each bat is then generated by a random walk. In the random walk, one term is an error term and the other is a random vector drawn from a uniform or Gaussian distribution.

Proposed method
Machine Learning (ML) based methods have become popular and are used in this study to improve the performance of the Anomaly Detection System (ADS) and to help prevent attacks. An ensemble, optimization-based ML feature selection method is applied first to extract the optimal features, and a set of classifiers is then used to detect the attack type. The approach uses 10-fold cross-validation (CV) during the experiment to validate model performance. Finally, the model classifies traffic as attack or benign. The framework of the proposed method is shown in Figure 2; the overall work is divided into four major parts, given below:

1. Preprocessing: In this step, the original (raw) data are converted into the format required for further analysis.

2. Feature Selection: In the second step, the proposed IG-BA feature selection approach is applied to obtain subsets of the dataset that contain the most relevant features for each attack class.

3. Classification: The last step of the proposed work is classification, which helps to improve the overall performance of the IDS. The classifiers used in this work are: (i) Random Forest (RF), (ii) Random Tree, (iii) naïve Bayes, (iv) Bayesian Network, and (v) J48.

However, the selected features may not always be the best features, because of redundancy among them. To address this redundancy and to reduce dimensionality further, the proposed method introduces the BA as an additional step in feature selection. Feature selection using the IG-BA approach is presented in Algorithm 1. In the proposed method, the first step is population initialization. A set of update rules is then applied to move the bats in the population through the search space. To find the best solution, the BA uses a search based on a local random walk. Next, the relevant feature subset is derived using IG, and a new solution is produced after both the loudness and the pulse rate are updated.

Classification algorithm
Although previous works have used many diverse algorithms, this work uses the following classifiers: (i) Random Forest (RF), (ii) Random Tree, (iii) naïve Bayes, (iv) Bayesian Network, and (v) J48.
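The IG-BA search steps described in this section (population initialization, update rules, local random walk, and loudness and pulse-rate updates) can be illustrated with a minimal binary bat search over feature masks. This is a sketch under stated assumptions and not the paper's Algorithm 1: the function name `binary_bat_search`, the sigmoid binarization of velocities, every parameter default, the example `ig_weights`, and the penalty-based `fitness` are all hypothetical choices made for illustration.

```python
import math
import random


def binary_bat_search(n_features, fitness, n_bats=10, n_iter=30,
                      f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9,
                      r_max=0.5, v_max=6.0, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    # Population initialization: random masks, zero velocities.
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_bats)]
    velocities = [[0.0] * n_features for _ in range(n_bats)]
    loudness = [1.0] * n_bats      # A_i, decreases after accepted moves
    pulse_rate = [0.0] * n_bats    # r_i, increases after accepted moves
    scores = [fitness(p) for p in positions]
    lead = max(range(n_bats), key=lambda i: scores[i])
    best, best_score = positions[lead][:], scores[lead]

    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            # Frequency and velocity update relative to the current best.
            freq = f_min + (f_max - f_min) * rng.random()
            candidate = []
            for d in range(n_features):
                v = velocities[i][d] + (positions[i][d] - best[d]) * freq
                v = max(-v_max, min(v_max, v))  # clamp to keep exp() stable
                velocities[i][d] = v
                # Assumed binarization: sigmoid of velocity as P(bit = 1).
                prob_one = 1.0 / (1.0 + math.exp(-v))
                candidate.append(1 if rng.random() < prob_one else 0)
            # Local random walk: flip one bit of the current best mask.
            if rng.random() > pulse_rate[i]:
                candidate = best[:]
                flip = rng.randrange(n_features)
                candidate[flip] = 1 - candidate[flip]
            score = fitness(candidate)
            # Accept with a probability given by the loudness, then
            # update loudness and pulse rate for this bat.
            if score >= scores[i] and rng.random() < loudness[i]:
                positions[i], scores[i] = candidate, score
                loudness[i] *= alpha
                pulse_rate[i] = r_max * (1.0 - math.exp(-gamma * t))
            if score > best_score:
                best, best_score = candidate[:], score
    return best, best_score


# Hypothetical IG weights for five features; the fitness rewards high-IG
# features and charges a small cost per selected feature.
ig_weights = [0.9, 0.05, 0.6, 0.01, 0.3]


def fitness(mask):
    return sum(w for w, m in zip(ig_weights, mask) if m) - 0.1 * sum(mask)


mask, score = binary_bat_search(len(ig_weights), fitness)
print(mask, score)
```

In a real pipeline the fitness would more plausibly be the cross-validated accuracy of a classifier trained on the masked features; the weight-minus-penalty score above only keeps the sketch self-contained.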

Naive Bayes (NB)
Naive Bayes is a statistical classification algorithm that uses Bayes' theorem to predict the probability of a class. As noted in existing works [26][27], it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes.
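The independence assumption stated above leads to the usual decision rule, sketched here with symbols chosen for illustration (c for a class, x_1, ..., x_n for the attribute values of a record):

```latex
\begin{align*}
P(c \mid x_1, \dots, x_n) &\propto P(c) \prod_{i=1}^{n} P(x_i \mid c) \\
\hat{c} &= \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\end{align*}
```

Each factor P(x_i | c) is estimated from the training data separately, which is what makes the classifier inexpensive to train.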

Bayes Network (BN)
A Bayesian Network (BN) is a model that encodes probabilistic relationships among variables. The precision of the method depends on general assumptions about the behavior of the target system, and any notable departure from these assumptions is likely to reduce detection precision. Bayesian networks have been applied in a few anomaly detection studies [22] [25].

Random Forest (RF)
Random Forest is a classification method that combines a collection of decision trees; the word "forest" refers to this collection of classifiers. The trees differ from one another because the attributes considered at each node are selected at random. A number of works have applied random forest to anomaly detection [22] [24].

Random tree (RT)
A Random Tree is a decision tree built from randomly selected attributes; the complete tree is made up of two elements, nodes and branches. A node represents a test on an attribute, and a branch represents an outcome of that test. The decision leaves hold the final decision, a class label, reached after all the attributes on the path have been evaluated. This method has been included in certain anomaly detection studies [28] [30].

J48
J48, an implementation of C4.5, is a machine learning algorithm from the decision tree family that builds a decision tree from training data using entropy [43]. Unlike ID3, this method creates a decision tree while retaining the ability to generate a sequence of attributes. The J48 algorithm has been applied to anomaly detection in many research works [29].

CICIDS2017 dataset
The dataset [5] was introduced in 2018 by the Canadian Institute for Cybersecurity and is used to detect DDoS attacks. It contains benign and attack traffic that reflects real-world network traffic. The dataset includes 79 features, including the class label, and covers the following major attacks: (i) Brute Force SSH, (ii) Brute Force FTP, (iii) Infiltration, (iv) Heartbleed, (v) Web Attack, (vi) DoS, (vii) Botnet, and (viii) DDoS; the complete attack information is shown in Table 3. A total of 225,746 records of DDoS and benign traffic are included in CICIDS2017, and each record comprises a total of 80 features, such as (i) protocol, (ii) flow ID, (iii) source IP, (iv) destination IP, and (v) source port. The complete records and features are listed in Table 1.

Experimental setup
For initial model fitting, the complete original data are split into two subsets: training data (80%) and test data (20%). Next, the proposed IG-BA feature selection method is applied to extract the optimal feature set. The algorithm helps to remove irrelevant features from the dataset and thereby improves classification performance. After feature selection with the proposed hybrid method, the resulting subset is passed to the classifiers: (i) Random Forest (RF), (ii) Random Tree, (iii) naïve Bayes, (iv) Bayesian Network, and (v) J48.
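The first two steps of this setup, the 80/20 split and the IG-based ranking that precedes the BA search, can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the helper names, the toy `rows` and `labels`, and the use of already-discrete feature values are hypothetical. The real CICIDS2017 features are mostly numeric and would have to be discretized before this form of IG applies.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(C) - H(C | F) for one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle record indices and split them into train and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(len(idx) * (1 - test_fraction)))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def top_m_features(rows, labels, m):
    """Rank feature indices by IG (descending) and keep the top m."""
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels)
             for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return ranked[:m], gains


# Toy data: feature 0 mirrors the label, feature 1 is constant,
# feature 2 is only weakly related to the label.
rows = [(1, 0, 0), (1, 0, 1), (1, 0, 0), (1, 0, 1), (1, 0, 1),
        (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1), (0, 0, 0)]
labels = ["attack"] * 5 + ["benign"] * 5

X_train, y_train, X_test, y_test = train_test_split(rows, labels)
selected, gains = top_m_features(X_train, y_train, m=2)
print(selected, gains)
```

The ranking is computed on the training part only, so that the test records do not influence which features are kept.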

Experimental results
The results of the feature selection methods are shown in Table 4 and Table 5. Random Tree (RT) and Random Forest (RF) produced almost 95% accuracy, higher than the other classification methods. With these features, the classifiers are applied to detect all attacks. It is also observed that Naïve Bayes (NB) performs poorly on normal traffic. The performance of the classification algorithms with a feature set of size 35 is shown in Table 7. Random Forest (RF) produced almost 97% accuracy, a recall of 0.978, and a low FPR of 0.004, better than the other classification methods, although its precision was recorded as NaN. However, the classification algorithms have difficulty detecting Attack 5 traffic. The results for Random Forest (RF), Random Tree (RT), and J48 are promising in detecting Attack 1 to Attack 3 and yield a better FPR. Finally, it is observed that Naïve Bayes (NB) produces a low FPR. Similarly, with 52 features, Random Forest (RF) produced an accuracy of 97.8%, a recall of 0.979, and an FPR of 0.004, better than the other classification algorithms. However, the precision was again recorded as NaN. From this, it is noted that the algorithms failed to detect Attack 5.
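The metrics reported above follow from the confusion-matrix counts of each class, as the short sketch below shows. The function name and the example counts are hypothetical. The sketch also illustrates one common way a NaN precision can arise: if a classifier never predicts a given class, then TP + FP = 0 and precision is undefined. Whether this is the cause of the NaN values for Attack 5 here is an assumption.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision and FPR from confusion-matrix counts.

    A ratio with a zero denominator is returned as NaN instead of
    raising ZeroDivisionError.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "fpr": ratio(fp, fp + tn),
    }


# Hypothetical counts for a class that is never predicted:
# 10 true instances are all missed, 90 other records are all correct.
missed_class = binary_metrics(tp=0, fp=0, tn=90, fn=10)
print(missed_class)
```

With these counts the accuracy is still 0.9 and the FPR is 0, which shows why accuracy and FPR alone can hide a class that is never detected.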

CONCLUSIONS
The proposed method confirms that feature selection improves classification performance on anomaly detection data. The proposed feature selection ranks the features by their weight values using the IG algorithm, which yields ranked feature subsets. Each subset is then processed by the BA, which produces the optimal features for subsequent classification. Overall, Random Forest performs well with all feature set sizes: 15, 28, 35, and 52. J48 also performs better with feature sets of 35 and 52. All traffic types are detected properly using the feature subsets of 35 and 52. However, Naïve Bayes (NB) yields low accuracy compared with the other classifiers. The feature subsets used for classification also help to reduce the FPR. In future work, we plan to study multi-class classification.