Towards Intelligent Machine Learning Models for Intrusion Detection System

The Internet has become an essential resource for mankind, and information security is consequently a perennial concern; hence a more potent Intrusion Detection System (IDS) should be built. Machine learning techniques are used in developing proficient models for IDS. Imbalanced learning is a crucial task for many classification processes, and resampling the training data towards a more balanced distribution is an effective way to combat it; under-sampling and oversampling are the most prevalent techniques. In this paper, the issues of imbalanced data distribution and high dimensionality are addressed using a novel oversampling technique and an innovative feature selection method respectively. Our work proposes a novel hybrid algorithm, HOK-SMOTE, which applies an ordered weighted averaging (OWA) approach for choosing the best features from the KDD Cup 99 data set and K-Means SMOTE for imbalanced learning. An ensemble model is compared against the hybrid algorithm. This ensemble integrates Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gaussian Naïve Bayes (GNB) and Decision Tree (DT) classifiers, and weighted average voting is applied to their predictions. Extensive experimentation was conducted on various oversampling techniques and traditional classifiers. The results indicate that the proposed work is the most accurate among the compared ML techniques, and the precision, recall, F-measure, and ROC curves show notable outcomes. Hence K-Means SMOTE in parallel with ensemble learning gives satisfactory results and a precise solution to imbalanced learning in IDS. We also ascertain whether ensemble modeling or oversampling techniques dominate for the intrusion data set.

SMOTE is capable of handling this imbalanced problem successfully: it generates new instances of the minority class by considering the k-nearest neighbors. Ensemble learning is known to augment predictive ability by unifying single classifiers and has been applied to imbalanced data sets [10]. Bagging [11] is one of the traditional ensemble methods employed for improving classification techniques; another popular approach, Boosting [12] [13], is implemented as a sequential ensemble. Building on the related findings about ensemble methods and single models on agribusiness time-series data [14], we have derived an ensemble framework for IDS. In this proposed work, the authors ascertain whether ensemble modeling or oversampling techniques dominate for the intrusion data set. Our work proposes a novel hybrid algorithm, HOK-SMOTE, which applies an ordered weighted averaging (OWA) approach for choosing the best features from the KDD Cup 99 data set and K-Means SMOTE for imbalanced learning. In contrast, the ensemble methodology is applied for classifying attacks against normal data. Extensive experimentation was conducted on various oversampling techniques and traditional classifiers. The remainder of this paper is organized as follows. Section 2 reviews current methods in feature selection and ensemble learning. Section 3 describes the proposed method. Section 4 presents the empirical work and the results inferred, and compares the performance of the proposed method against other oversampling methods. Lastly, section 5 presents the conclusions.

Related work
Machine learning has caught the attention of many researchers as a source of solutions, especially for wide-ranging big data problems. It operates on high-dimensional data to make prudent predictions and is gaining fresh momentum. On the challenges mentioned above, there are previous works on building prudent IDS through apt feature selection, oversampling, and hybrid techniques. Recently, Salo et al. [15] associated the feature selection approaches of Information Gain and Principal Component Analysis (PCA) with an ensemble learner based on Instance-Based learning (IBK), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). Indira et al. [16] combined the Canberra, City block, Euclidean, Chebyshev, and Minkowski distances into a fuzzy ensemble feature selection and produced remarkable results using ensemble learning. Tama et al. [17] improved IDS based on hybrid feature selection and two-level classifier ensembles on the NSL-KDD and UNSW-NB15 datasets. These topics are well known in the machine learning community, so we do not describe them in detail here.

On Feature Selection
Based on the premise that certain features contribute more to improving classification accuracy, feature selection is considered a key step in data mining. In a study by Wang et al. [18], the original features were transformed using logarithms of the marginal density ratios, procuring novel transformed features and thereby refining the performance of an SVM model. More recently, Vajiheh Hajisalem et al. [19] illustrated a fusion of two methods, artificial bee colony (ABC) and artificial fish swarm (AFS), along with fuzzy C-means clustering (FCM) for dividing the training data set and correlation-based feature selection (CFS) for feature selection. Zhang et al. [20] specified a cost-based feature selection procedure with multi-objective particle swarm optimization (PSO), comparing it with several multi-objective feature selection approaches on five benchmark data sets.

On oversampling techniques
Recently there have been numerous approaches to handling the class imbalance problem. They can be categorized into two groups: data-level approaches and algorithm-level approaches. Data-level approaches include oversampling, under-sampling, and SMOTE (Synthetic Minority Over-sampling Technique). Algorithm-level approaches include the threshold method, one-class learning, and cost-sensitive learning [21]. Random Over-Sampling (ROS) increases the number of minority class instances to rebalance the class distribution in a dataset by arbitrarily replicating minority class samples; consequently the learning rate of this technique is slow. ROS may also cause overfitting, since it can duplicate erroneous samples [22]. Random Under-Sampling (RUS) is another method to balance an imbalanced data set by modifying the data before learning: it removes some majority class instances to rebalance the classes. This approach achieves faster learning because it has fewer data points than the original sample, but it has the major drawback of losing valuable information while randomly eliminating majority class instances, and may therefore cause misclassification by eliminating important patterns in the dataset [23]. Oversampling techniques balance the dataset by enlarging the minority class. A variety of oversampling techniques have been proposed in recent times, such as random oversampling [24] and SMOTE [25]. SMOTE is capable of handling the imbalance problem successfully. It generates new instances of the minority class by considering the k-nearest neighbors: it takes the difference between a particular feature vector and one of its nearest neighbors, multiplies this difference by a random number between 0 and 1, and adds the result to the feature vector to create a new minority instance. Several applications build on the SMOTE technique [26], such as borderline-SMOTE [27], safe-level-SMOTE [28], ADASYN [29], SVM-SMOTE [37] and SMOTE-RSB [30].
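As an illustration of the interpolation step just described, the following is a minimal Python sketch of SMOTE-style synthetic sample generation; the function name `smote_sample`, its parameters, and the toy data are our own illustrative choices, not part of the original SMOTE implementation.

```python
import numpy as np

def smote_sample(X_min, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic minority sample by SMOTE-style interpolation.

    X_min : array of shape (n_minority, n_features), minority class only.
    """
    i = rng.integers(len(X_min))          # pick a minority sample at random
    x = X_min[i]
    d = np.linalg.norm(X_min - x, axis=1) # distances to all minority samples
    neighbors = np.argsort(d)[1:k + 1]    # its k nearest minority neighbors
    x_nn = X_min[rng.choice(neighbors)]   # choose one neighbor at random
    gap = rng.random()                    # random number in [0, 1)
    return x + gap * (x_nn - x)           # interpolate along the line segment

# Example: oversample a toy 2-D minority class
X_min = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]])
synthetic = np.array([smote_sample(X_min) for _ in range(4)])
```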

On ensemble and hybrid approaches
In work by Ren et al. [31], a data augmentation method named DO_IDS was given for constructing IDS. It uses iForest for sampling data and a fusion of a Genetic Algorithm (GA) and Random Forest (RF) for optimizing the sampling ratio; for feature selection, a combination of GA and RF chooses the optimal feature subset. DO_IDS was then assessed on the intrusion detection dataset UNSW-NB15. Recently, Indira et al. gave a novel ensemble model for IDS based on two algorithms, fuzzy ensemble feature selection and fusion of multiple classifiers [32]. Many hybrid approaches using both feature selection and ensemble methods have been produced to improve the performance of IDSs. In the research by Malik et al. [33], a combination of Particle Swarm Optimization (PSO) and Random Forest (RF) is used for dimensionality reduction, with RF for classification, which enhanced the accuracy. Pham et al. [34] built a hybrid model that utilizes the gain ratio technique for feature selection and bagging to combine tree-based base classifiers; experimental results showed that the best performance was achieved by the bagging model that exploited J48 as the base classifier on a 35-feature subset of the NSL-KDD dataset. Abdullah et al. [35] also built an IDS using IG-based feature selection and ensemble learning algorithms; their experiments on the NSL-KDD dataset indicate that the highest accuracy is attained when using RF and PART as base classifiers with the product probability rule.

Proposed Methodology
In view of the challenges in imbalanced data, intelligent techniques exploiting advances in machine learning are applied to implement an effective IDS.
The proposed work addresses the concerns of high dimensionality and imbalanced data distribution with an innovative feature selection method and an oversampling technique respectively. Two different strategies are used in the proposed methodology. One is a novel hybrid algorithm, HOK-SMOTE, which applies an ordered weighted averaging (OWA) approach for choosing the best features from the KDD Cup 99 data set and K-Means SMOTE for imbalanced learning, followed by individual classifiers. The other implements an ensemble model with the OWA feature selection but without sampling techniques, built from four base classifiers; ensemble classification can be more powerful than data sampling at raising classification capability on imbalanced data. By modeling the intelligent algorithm HOK-SMOTE, the minority class of the chosen dataset is oversampled. Figure 1 below depicts the framework of the proposed HOK-SMOTE, which has the following four components:

[Figure 1. Framework of the proposed HOK-SMOTE, beginning with data preparation of the KDD dataset.]
- Data preparation: The first level converts raw data into a structure fit for analysis by applying various preprocessing steps to the original dataset.
- Feature selection: To overcome the high-dimensionality problem, a feature selection approach based on ordered weighted averaging (OWA) is exploited to lessen the dimensionality of the data set.
- K-Means SMOTE oversampling: To address the imbalanced dataset problem, the K-Means Synthetic Minority Oversampling Technique is utilized to oversample the minority (illegitimate) class.
- Ensemble classification: An ensemble classifier is built based on weighted average voting over the base classifiers.
In search of an optimal feature set, filtering methods were used. The intrusion data set contains a majority of features that are quasi-constant. To address this, we exploited aggregation operators for finding the best features. Ordered Weighted Averaging (OWA) [38], introduced by Yager, is the predominant operator in information aggregation and has since been used in many applications. In this paper, the OWA methodology is used to obtain feature scores. The critical part lies in identifying the weights: OWA covers the 'AND', 'OR', and averaging cases, and the weights are learned by analyzing the data set.
Definition 1: An ordered weighted averaging (OWA) operator of dimension $n$ is a mapping $F: X^n \to X$ with an associated weight vector $W = (w_1, w_2, \ldots, w_n)^T$, such that $F(a_1, a_2, \ldots, a_n) = W \cdot B^T = \sum_{j=1}^{n} w_j b_j$, where $B = (b_1, b_2, \ldots, b_n)$ is the argument vector of $F$ sorted in descending order; that is, $b_j$ is the $j$-th largest element of the collection $\{a_1, a_2, \ldots, a_n\}$. The weights must satisfy $w_j \in [0, 1]$ and $\sum_{j=1}^{n} w_j = 1$, and by taking different weights we can implement different OWA operators.

Figure 2 depicts the HOK-SMOTE approach followed in this proposed work. In step 4 of the algorithm, the result of OWA, FS, is passed in; FS is the set of chosen features of the KDD dataset. The process of K-Means SMOTE [40] involves clustering, filtering, and oversampling. In step 5, the entire input space is clustered using K-Means. In step 6, it finds $k$ clusters by minimizing the within-cluster sum of squared error,

$\arg\min \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2$    (1)

where $C_i$ is the $i$-th cluster and $c_i$ its centroid. In step 7, each sample is reassigned to the closest cluster mean and the new mean of each cluster is recomputed, repeating until convergence. Filtering is then done: in step 8, clusters with a high proportion of majority class samples are filtered out, and in step 9, more synthetic samples are assigned to clusters where minority class samples are sparsely distributed. In step 10, each filtered cluster is oversampled by SMOTE. SMOTE(f, n, k) finally gives the oversampled data; its parameters are f, the number of filtered clusters, n, the number of minority samples, and k, the number of nearest neighbors. The effect is a balanced number of samples of both majority and minority classes. The dataset is then fed to the classifier.
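For reference, this clustering–filtering–oversampling procedure is implemented in the imbalanced-learn library; the following is a minimal sketch of applying it, where the toy data and the parameter values (number of clusters, neighbors, balance threshold) are illustrative assumptions rather than the settings used in this paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import KMeansSMOTE

# Toy stand-in for the OWA-selected feature matrix; minority class clustered
rng = np.random.default_rng(42)
X_maj = rng.normal(loc=0.0, size=(900, 17))   # majority ('attack'-like) samples
X_min = rng.normal(loc=3.0, size=(100, 17))   # minority ('normal'-like) samples
X = np.vstack([X_maj, X_min])
y = np.array([0] * 900 + [1] * 100)

oversampler = KMeansSMOTE(
    kmeans_estimator=KMeans(n_clusters=10, n_init=10, random_state=42),
    k_neighbors=5,                   # neighbors used by the SMOTE step
    cluster_balance_threshold=0.1,   # keep clusters with enough minority share
    random_state=42,
)
X_res, y_res = oversampler.fit_resample(X, y)
print(np.bincount(y_res))            # classes are now (approximately) balanced
```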

Do until the dataset is balanced:
  For each minority sample:
    Step 1: Choose a minority class sample $x_{origin}$ and find its k nearest neighbors.
    Step 2: Randomly select an instance $x_k$ among the k nearest neighbors and compute the difference $x_k - x_{origin}$.
    Step 3: Compute the synthetic sample in feature space: $C_{syn} = x_{origin} + (x_k - x_{origin}) \times P_{uniform}$, where $P_{uniform}$ is a uniform random number in $[0, 1]$.
    Step 4: End.

Fig. 2. Proposed hybrid HOK-SMOTE algorithm.

The second part encompasses the ensemble modeling, which is employed to obtain more accurate and diverse classification predictions. Here the ensemble classifier combines predictions from four base learners: the data is trained and models are built with the K-Nearest Neighbor, Gaussian Naïve Bayes, Decision Tree, and Support Vector Machine classifiers. Validation on the testing data yields four predictions. The ensemble algorithm is depicted below in figure 3. The outputs of the four base classifiers are combined by a weighted average voting methodology, through which the final results are obtained. The prediction of class labels is based on the predicted probabilities of the individual classifiers; the weighted average vote is given as

Final decision $= \arg\max_{i} \sum_{j=1}^{k} w_j p_{ij}$    (2)

where $w_j$ is the weight assigned to the $j$-th classifier, $p_{ij}$ is the probability of class $i$ predicted by classifier $j$, $k$ is the total number of classifiers, and $i \in \{0, 1\}$.
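Equation (2) corresponds to soft voting with classifier weights; a minimal sketch using scikit-learn's VotingClassifier follows, where the four base learners match those named above but the hyperparameters and weights shown are illustrative assumptions, not the tuned values.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Weighted average (soft) voting over the four base classifiers of eq. (2).
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("gnb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),  # needs probabilities
    ],
    voting="soft",         # argmax of the weighted average of probabilities
    weights=[1, 1, 1, 2],  # illustrative w_j values only
)
# Usage: ensemble.fit(X_train, y_train); y_pred = ensemble.predict(X_test)
```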

Empirical work and results inferred
Empirical work is done on the KDD Cup 99 dataset [36], which originated from MIT's Lincoln Laboratory. From it, a chunk is taken for experimentation and labeled the "KDD dataset"; its records are drawn in proportion to the records of KDD Cup 99. It contains 10230 samples with 41 features and two class labels {attack, normal}: 9188 'attack' samples and 1042 'normal' samples.
Several experiments were done on this data set, with 50% of the data taken for training and 50% for testing. The histogram of records in the KDD data set is shown in figure 4. The trials on the proposed scheme were conducted through the R and Python interfaces of the Anaconda 3.6 environment [39]. Anaconda is an open-source platform for Python and R that bundles about 100 of the Python packages commonly used for data science, and it is widely used in research areas such as machine learning, artificial intelligence, and data science. The tests were run on a 64-bit operating system with an Intel i5 processor, 1 TB of storage, and 4 GB of RAM. On the KDD data set, normalization is applied. The class labels are assigned as {1, 0} for 'normal' and 'attack' instances respectively, and the features are labeled {F1, F2, …, F41}. The OWA operator is then applied to the data set, taking the fixed-weight averaging case $F^{*}(a_1, a_2, \ldots, a_n) = \frac{1}{n}\sum_{i=1}^{n} a_i$. The parameters of OWA are 'x' and 'w',
where 'x' holds the individual feature values and 'w' is the weight vector (1/length(x)), executed on the R platform. Features are then selected based on the scores obtained, with a threshold of 0.45. The resultant features are F5, F7, F10, F11, F17, F20, F21, F32, F33, F35, F36, F37, F38, F39, F40 and F41, as shown in figure 5; 17 features ensue. Other feature selection methods are applied for comparison with the proposed approach: the numbers of features obtained with chi-square, information gain, and gain ratio are 15, 13, and 12 respectively. In this work, the error rate of OWA is compared with the error rates of information gain, chi-square, and gain ratio, as shown in figure 6 below.
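As an illustration of this scoring step, the following is a minimal Python sketch of an OWA operator per Definition 1 and its fixed-weight averaging case; the function names, the toy data, and the application of the 0.45 threshold are our own illustrative reconstruction, since the original scoring was executed in R.

```python
import numpy as np

def owa(a, w):
    """OWA per Definition 1: weighted sum of the arguments sorted descending."""
    b = np.sort(a)[::-1]                  # argument vector in descending order
    return float(np.dot(w, b))

def owa_feature_scores(X):
    """Score each (normalized) feature column with the averaging-case OWA,
    where w = (1/n, ..., 1/n) makes the operator a plain mean."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)               # fixed averaging weights
    return np.array([owa(X[:, j], w) for j in range(X.shape[1])])

# Toy usage: keep features whose OWA score exceeds the 0.45 threshold
rng = np.random.default_rng(0)
X = rng.random((100, 5))                  # stand-in for the normalized KDD data
scores = owa_feature_scores(X)
selected = np.where(scores > 0.45)[0]
print(scores.round(3), selected)
```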
Before applying oversampling techniques, the data set appears imbalanced, as shown in figure 7; the blue dots are 'attack' samples whereas the orange dots are 'normal' samples. The KDD data set with 17 features is then fed to K-Means SMOTE as described in the HOK-SMOTE algorithm above, which balances the minority samples and gives rise to a new class distribution. In this work, several experiments were made with SMOTE, SVM-SMOTE, ADASYN, Borderline-SMOTE, and K-Means SMOTE; the oversampling results are shown below in figures 8, 9, 10, 11, and 12 respectively. The resampled data is sent to the SVM classifier for obtaining predictions, and later to the GNB, DT and KNN classifiers. The ROC curve of the proposed model is shown in figure 22 below, with an AUC of 1.0; since no single classifier gives less than 0.5 and a perfect classifier gives 1.0, this can be taken as an optimistic measure of the model's worth. In an intrusion detection system both attack and normal data should be correctly predicted, so the accuracy of the model is prioritized above all else in evaluation; it is evident that the model is an optimistic one, and we therefore choose the ROC curve to determine the quality of the proposed model. Based on these results, we recommend either K-Means SMOTE or ensemble modeling with the above base classifiers for imbalanced high-dimensional datasets. Feature subset selection is also one of the inherent attainments for such data sets, and as such we recommend ordered weighted averaging for opting optimal features.
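As a sketch of how ROC curves and the associated precision, recall, and AUC figures can be produced with scikit-learn (the stand-in data, base classifier choice, and variable names are illustrative assumptions; the actual curves were generated from the KDD experiments):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.tree import DecisionTreeClassifier

# Stand-in data mimicking the paper's 50/50 split and class imbalance
X, y = make_classification(n_samples=10230, n_features=17,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]         # score for the positive class

print(classification_report(y_te, clf.predict(X_te)))  # precision/recall/F1
print("AUC:", roc_auc_score(y_te, proba))

fpr, tpr, _ = roc_curve(y_te, proba)          # points of the ROC curve
plt.plot(fpr, tpr); plt.xlabel("FPR"); plt.ylabel("TPR"); plt.show()
```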

Conclusions and discussions
Several machine learning techniques have previously proposed solutions to the challenges of imbalanced learning and of increasing model accuracy, but imbalanced data remains an open issue. Therefore, to combat the data distribution challenge and high dimensionality, an oversampling technique and an innovative feature selection method were emphasized in this paper. Our work proposed a novel hybrid algorithm that applies an ordered weighted averaging (OWA) approach for choosing the best features from the KDD Cup 99 data set and K-Means SMOTE for imbalanced learning. An ensemble model integrating SVM, KNN, Gaussian Naïve Bayes, and DT, with weighted average voting for prediction, was compared against the hybrid algorithm. The results indicate that the proposed work is the most accurate among the compared ML techniques; hence K-Means SMOTE in parallel with ensemble learning has given remarkable results and a precise solution to imbalanced learning in IDS. From the results shown, the proposed ensemble with OWA feature selection was compared with the hybrid algorithm HOK-SMOTE, and it is evident that the ensemble of KNN, GNB, DT, and SVM gives results identical to those of the oversampling techniques. Among all the oversampling techniques observed, K-Means SMOTE is paramount over the others. To our knowledge, no previous work has weighed ensemble modeling methods against data sampling procedures for an intrusion detection data set. As future work, we will explore novel models for ensemble learning and oversampling techniques, and as scope for feature selection we will study other aggregation methods.