Load balancing for Software Defined Network using Machine learning

Software-Defined Networking (SDN) is one of the most revolutionary and prominent technologies in the field of networking. It solves many problems that traditional networks face. Still, an SDN controller can become a bottleneck and be overloaded. To overcome this issue, various researchers have proposed load-balancing schemes, but these rely on only two or three parameters, and they are either static or dynamic. We propose an intelligent technique that forwards TCP/UDP packet traffic based on several parameters (12 parameters, discussed in the latter part of this section). Based on these parameters, we train a model using the KMeans [1] and DBSCAN [2] clustering algorithms and also determine the optimal number of clusters. We have tested it on large numbers of packets: 5000, 10000, 20000, 50000, 100000, and 10000000. We also compare the results of the KMeans and DBSCAN algorithms and discuss other researchers' views.


Introduction
Software Defined Networking (SDN) [3] solves many problems that traditional networks face. However, a huge amount of data is generated in our era, so the controller in an SDN may become overloaded. To overcome this issue, many researchers have proposed solutions to controller overloading, but they consider only two or three parameters to balance the load in SDN. There is therefore a need for a machine learning technique that can take many parameters into account to solve the load-balancing problem. We discuss SDN, load balancing, and clustering in machine learning [4] in later sections of this paper.

Software Defined Network
Traditional networks have tight coupling between the control plane [5] and the data plane [6], which leads to issues with dynamic IP allocation, routing changes, bandwidth management, end-to-end reachability, etc. SDN solves these problems by separating the control layer from the data layer. We can therefore define SDN as a network in which the control layer (which may be located at a different geographical location) is physically separated from the data layer, and a logically centralized controller governs the routing devices. The main layers of the SDN architecture (as shown in Fig. 1) are:
- Data Layer: It contains switching devices such as routers and switches. It is responsible for processing and forwarding packets according to the rules defined in the forwarding table.
- Control Layer: It is also known as the network brain. It is responsible for routing the data. Some functions of this layer are system configuration, exchange of routing table information, and management.
- Management Plane: It is used to access and manage the network equipment.
SDN still needed a standard protocol for communication between the data plane and the control plane. In 2008, this issue was tackled by the OpenFlow protocol [7], a well-known southbound API in SDN that is maintained by the Open Networking Foundation (ONF) [8]. Some of the components of OpenFlow are:
- Flow table
- Port
- Messages
So far we have discussed SDN itself, but an SDN controller can be overloaded by traffic. We therefore discuss load balancing and load balancers in SDN in the following sections.

Load Balancing
Load balancing is a process used to spread load across servers or other computing equipment. Its purpose is to maximize resource utilization, maximize throughput, reduce response time, and avoid overload and crashes. It also helps to avoid failures. Traditionally, a load balancer came as software pre-installed on dedicated equipment; as a result, it was vendor-specific and costly. A software load balancer runs on a virtual machine or on hardware.

Load balancing Types
- Transport layer load balancer [9]: In this method, the load balancer uses information such as the source and destination IP addresses and the ports defined in the packet header.
- Application layer load balancer [10]: It distributes the load to the servers based on application layer information, for example HTTP, cookies, or data.
Our proposed system corresponds to a layer 4 (transport layer) load balancer.
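As a rough illustration of what a transport layer (layer 4) load balancer works with, the sketch below selects a server using only the source and destination IP addresses and ports. The hash-based selection rule, the `pick_server` helper, and the server names are assumptions made for illustration; this is not the method proposed in this paper.

```python
import hashlib

def pick_server(src_ip, src_port, dst_ip, dst_port, servers):
    """Map a layer 4 flow to one server using only header fields."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["server-a", "server-b", "server-c"]  # hypothetical back ends
chosen = pick_server("10.0.0.5", 40312, "10.0.1.9", 80, servers)
same_flow = pick_server("10.0.0.5", 40312, "10.0.1.9", 80, servers)
print(chosen)
```

Because the choice depends only on the header fields, packets of the same flow always reach the same server.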

Load balancing in SDN
There are two approaches to balancing the load in SDN: centralized and distributed. In the centralized approach, a super-controller balances the load between the ordinary controllers. The problem with this approach is that if the super-controller goes down, the entire network goes down; scalability and availability are further issues. The distributed approach addresses these problems: several controllers balance the load among themselves. Its drawback is the communication overhead. Next, we examine the work done by other researchers.

Tabular Related Work
This section shows the previous work done by researchers. It includes the advantages and disadvantages of each approach and the parameters used for load balancing. The work is presented in tabular format.

Proposed Methodology
We used Jupyter Notebook to implement our methodology. Our main aim is to propose a technique for Software-Defined Networks that considers various parameters. For this, we took a dataset from Kaggle.com provided by the Universidad Del Cauca, Popayan, Colombia. The proposed methodology is shown in Fig. 2. When a client sends a request for a particular service to the server, the request first goes to the software load balancer (where the algorithm is implemented). The algorithm first obtains the flow statistics (IP addresses, ports, inter-arrival times, etc.) using CICFlowMeter [30]. Based on these features, it calculates the cluster value of the request and sends the request to the appropriate server. We discuss each step in the upcoming sections.

Fig. 2. Proposed Work Architecture

Flowchart of Proposed Algorithm
In our proposed methodology, when a client sends a request for a particular service to the server, the request first goes to the software load balancer (where the algorithm is implemented). The algorithm first obtains the flow statistics using CICFlowMeter. Based on these features, it calculates the cluster value of the request using KMeans and also checks the threshold of the respective server (the number of requests it can handle). If the load is below the threshold, the request is forwarded to the intermediate nodes. This is how the system works. We discuss the dataset in the next section.

- max: returns the maximum value in a particular column.
All of the above statistics are useful for data preprocessing. In the next section, we discuss data preprocessing.
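The dispatch step of the flowchart (cluster lookup followed by a threshold check) can be sketched as follows. This is a minimal sketch under stated assumptions: the centroids stand in for a trained KMeans model, one server is assumed per cluster, and the helper names `nearest_cluster` and `dispatch` are hypothetical. The paper does not state what happens when a server has reached its threshold, so the sketch simply returns `None` in that case.

```python
import math

def nearest_cluster(features, centroids):
    """Return the index of the centroid closest to the feature vector."""
    distances = [math.dist(features, c) for c in centroids]
    return distances.index(min(distances))

def dispatch(features, centroids, servers, active, threshold):
    """Pick the server for a request, or None if that server is full."""
    cluster = nearest_cluster(features, centroids)
    server = servers[cluster]
    if active[server] < threshold[server]:
        active[server] += 1  # the request is forwarded to this server
        return server
    return None  # assumption: behaviour at the threshold is not specified

centroids = [(0.1, 0.2), (0.8, 0.9)]   # hypothetical trained centroids
servers = ["server-0", "server-1"]     # assumption: one server per cluster
active = {"server-0": 0, "server-1": 1}
threshold = {"server-0": 2, "server-1": 1}

first = dispatch((0.15, 0.25), centroids, servers, active, threshold)
second = dispatch((0.75, 0.95), centroids, servers, active, threshold)
print(first, second)
```

Here the second request maps to a server that is already at its threshold, so it is not forwarded.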

Data Preprocessing
Data preprocessing is the technique of processing raw data so that it can be used efficiently. The steps involved in data preprocessing are:
- Data Cleaning
- Data Transformation
- Data Reduction
In our methodology, we first check for missing data and noisy data, as well as the data type of each parameter. These checks are performed with the pandas and Seaborn modules of Python. On the resulting data, we performed hashing using the applymap() method. To normalize the data so that it lies between 0 and 1, we apply min-max scaling. The min-max scaler uses the following formula:

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

where x is a particular data point, max(x) is the maximum value among the data points, and min(x) is the minimum value among the data points. This can be achieved with the sklearn.preprocessing.MinMaxScaler class. The data is now ready for training purposes. The next step is to train the model using KMeans, as discussed in the next section.
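The preprocessing steps above can be sketched with pandas and scikit-learn. The column names and values below are invented for illustration and are not taken from the dataset. The hash function (`zlib.crc32`) is an assumed stand-in, since the paper does not say which hash was used, and the sketch applies it per column with `Series.map` rather than over the whole frame.

```python
import zlib

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical flow records; the column names are illustrative only.
flows = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"],
    "src_port": [443, 80, 8080, 53],
    "duration": [1.5, None, 4.0, 2.0],
})

# Data cleaning: drop rows with missing values.
flows = flows.dropna().reset_index(drop=True)

# Data transformation: hash the string column to a stable integer.
flows["src_ip"] = flows["src_ip"].map(lambda s: zlib.crc32(s.encode()))

# Min-max scaling: every column ends up in the range [0, 1].
scaled = pd.DataFrame(MinMaxScaler().fit_transform(flows), columns=flows.columns)
print(scaled)
```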

Training Model
In this module, we use KMeans and DBSCAN to perform clustering based on the flow statistics (12 parameters) of our dataset. Both KMeans and DBSCAN rely on the Euclidean distance. The greater the Euclidean distance between two data points, the lower their similarity, and vice versa. In n-dimensional space, the Euclidean distance between two data points x and y can be calculated as:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

where x_i and y_i are the ith components of the data points x and y. The KMeans algorithm aims at minimizing an objective function known as the squared error function, given by:

$$ J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 $$

where k is the number of clusters, C_j is the jth cluster, and c_j is its centroid. One of the major problems KMeans clustering faces is finding the optimal number of clusters. This can be solved by the Elbow Method [32] and the Silhouette Method.
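To make the two quantities concrete, the sketch below computes the Euclidean distance directly and checks that the squared error, summed by hand, matches the `inertia_` value reported by scikit-learn's `KMeans`. The toy data and the `euclidean` helper are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def euclidean(x, y):
    """Euclidean distance between two n-dimensional points."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

# Toy data: two well-separated groups (illustrative values only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Squared error: sum of squared distances from each point to its centroid.
J = sum(euclidean(x, model.cluster_centers_[label]) ** 2
        for x, label in zip(X, model.labels_))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(J, model.inertia_)
```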

Elbow Method
It is one of the most popular methods for finding the optimal number of clusters. It plots the sum of squared errors (SSE) [33] against the values of k. The aim is to select a small value of k that still gives a low SSE; beyond that point, the SSE keeps decreasing toward zero as k increases. In the Elbow method, KMeans is run on the entire dataset for a range of values of k. For each k, the SSE is calculated as:

$$ SSE = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 $$

where x_i is the ith data point and c_j is the centroid of the cluster C_j to which it belongs. We obtained Elbow method results for different numbers of packets.
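A minimal sketch of the Elbow procedure is shown below: KMeans is fitted for a range of k and the SSE (`inertia_` in scikit-learn) is recorded for each value. The synthetic blobs are an assumed stand-in for the scaled flow features, and the range of k is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled flow features (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

sse = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_  # SSE for this value of k

for k, value in sse.items():
    print(k, round(value, 1))
```

Plotting `sse` against k would show the bend (the "elbow") at the point where adding another cluster stops reducing the SSE substantially.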

Silhouette Method
It is used to measure how close each data point in a cluster is to the points in the neighbouring clusters. Its value ranges from -1 to +1. A value of +1 indicates that the sample is far away from its neighbouring cluster and very close to its assigned cluster. Similarly, a value of -1 indicates that the point is closer to its neighbouring cluster than to its assigned cluster. Suppose i is a data point, and a(i) and b(i) are the mean distances between point i and its assigned cluster A and its neighbouring cluster B, respectively. The silhouette s(i) can then be expressed as:

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$

It was tested on different numbers of packets. The KMeans unsupervised clustering algorithm groups the data points into spherical clusters, whereas DBSCAN is suitable for non-convex clusters. DBSCAN is also used to identify outliers or noise. We give a short description of DBSCAN in the next section.
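The silhouette value can be checked on a small example. The sketch below computes s(i) by hand for one point and compares it with scikit-learn's `silhouette_samples`; the four points, the labels, and the `silhouette_of` helper are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for point i."""
    dist = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i])
    own[i] = False  # a(i) excludes the point itself
    a = dist[own].mean()
    # b(i): smallest mean distance to the points of any other cluster.
    b = min(dist[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_of(0)
print(manual, silhouette_samples(X, labels)[0])
print(silhouette_score(X, labels))
```

The mean of s(i) over all points (`silhouette_score`) can be computed for each candidate k, and the k with the highest score is preferred.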

DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is suitable for datasets that have non-convex clusters and outliers. The DBSCAN clustering technique requires two parameters:
1. eps: the radius of the neighbourhood around a data point.
2. minPoints: the minimum number of points required to form a dense region.
We discuss the results obtained in the next section.
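The role of the two parameters can be seen in a small sketch using scikit-learn's `DBSCAN`, where `min_samples` plays the role of minPoints (and counts the point itself). The data and the parameter values are assumptions chosen only to make the behaviour visible.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative values only).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

model = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = model.labels_  # the label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

The isolated point has too few neighbours within eps to join a dense region, so it is labelled as noise rather than forced into a cluster.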

Experimental Result
We tested the approach on several numbers of packets: 5000, 10000, 20000, 50000, 100000, and 1000000 (described below). The following graph shows that KMeans performs better than DBSCAN.

Conclusion
For the proposed methodology, we conclude that KMeans performs better than DBSCAN and can be proposed as a machine learning technique for Software Defined Networks, although the approach has limitations. It takes a total of 12 parameters into account, which may make it a better machine-learning algorithm, and it also considers the forward and backward transmission parameters, which makes it an intelligent and optimized methodology. In the coming years, we would like to test it in an SDN scenario and measure the various load-balancing parameters.

Future Work
The proposed work needs to be implemented in a real-world scenario, such as in Software Defined Network applications. It needs to be tested on larger numbers of packets to measure parameters such as throughput, response time, and latency. Its time complexity also needs to be evaluated, and more general software is needed to capture TCP/IP packets containing more features.