An Optimized Discretization Approach using k-Means Bat Algorithm

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021
Abstract: This study proposes a new discretization approach that uses the k-means and Bat algorithms in the preparation phase of a classification problem. In essence, the Bat algorithm is applied to find the best solution in the search space, and that solution is then used to produce the cluster centroids. The cluster centroids are used to determine appropriate breakpoints for discretization. The proposed discretization approach is applied in experiments with continuous datasets, using Decision Tree, k-Nearest Neighbours and Naïve Bayes classifiers. The proposed approach is evaluated against existing approaches: the K-Means algorithm, hybrid K-Means with Particle Swarm Optimization (PSO) and hybrid K-Means with the Whale Optimization Algorithm (WOA). Classification performance is evaluated in terms of accuracy, recall, f-measure and the receiver operating characteristic (ROC) curve. Nine benchmark continuous datasets are used to test the performance of the proposed algorithm. The proposed algorithm shows better results than the other approaches and performs better in discretization for solving classification problems.


Introduction
Discrete values are necessary for the representation of knowledge in data mining applications. Because their characteristics are very close to the representation of knowledge, discrete values are easier to handle than continuous values. According to Madhu et al. (2014), the conversion of continuous values into discrete data is a major step in data preparation. Thus, continuous attributes need to be converted into discrete values before the data mining process. A continuous value range is divided by boundaries called breakpoints. For example, a distance attribute can be transformed into discrete values represented by intervals: from 0 to 10 km, over 10 km up to 100 km, and over 100 km. The task of mapping continuous values into these ranges is known as discretization, and it is an essential task of data preparation in classification (Cano et al., 2016).
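As a plain illustration of the interval idea above, a continuous value can be mapped to a discrete label by comparing it against an ordered list of breakpoints. This is a minimal sketch; the function name and the 10/100 km bounds simply follow the distance example:

```python
# Minimal sketch of breakpoint-based discretization. The function name and
# the 10/100 km bounds are illustrative, following the distance example.
def discretize(value, breakpoints):
    """Map a continuous value to a 1-based interval label."""
    for i, bp in enumerate(breakpoints):
        if value <= bp:
            return i + 1
    return len(breakpoints) + 1

# Distance in km mapped to the intervals (0, 10], (10, 100], (100, inf)
labels = [discretize(d, [10, 100]) for d in [5, 50, 250]]  # [1, 2, 3]
```

The whole discretization problem then reduces to choosing good breakpoints, which is where clustering and optimization come in.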
Choosing the correct data processing method has a significant impact on dataset classification (AlMuhaideb & Menai, 2016). The major challenge in a classification problem is to obtain better classification performance. There are many ways to improve classification performance, such as feature selection (Uçar, 2020), fuzzy clustering (Xu et al., 2020), enhancement of Random Forest classification (More & Rana, 2020), and discretization (Zhou et al., 2021).
Optimization approaches have been developing intensively since they are widely used to solve real-life problems (Slowik & Kwasnicka, 2017). Recently, the data mining field has adapted its algorithms and methods with advanced optimization, graph theory and matrix computations. In these methods, a matrix representation is used to present the data, while the data mining problem is formulated as an optimization problem with matrix variables (Azham Hussain et al., 2019). The data mining task is then the process of reaching the goal of the optimization problem, by minimizing or maximizing an objective function.
Data preparation is an important process in classification, and discretization is an important part of it. However, most research in discretization lacks an optimization approach (Hacibeyoğlu & Ibrahim, 2016; Lavangnananda & Chattanachot, 2017). In this paper, a better discretization scheme is obtained through an optimization algorithm. The objective of discretization can be framed as finding the best solution to an optimization problem; thus, the whole search space must be searched to find the best solution. A new hybrid optimized discretization approach for the data preparation phase is proposed in this research. Avoiding loss of information while maintaining the accuracy of the classification algorithm is the challenging issue of the discretization process. Discretization of continuous feature values can be used to address this problem: each feature value is divided into discrete ranges, where each range represents a category. This research proposes a new discretization approach based on hybrid K-Means with the Bat algorithm for single-class single-label data. This paper is organized as follows: a literature review on K-Means as a discretization approach and on optimization algorithms is presented in Section 2. Section 3 discusses the proposed discretization approach based on K-Means and the Bat algorithm, together with the description of the datasets. Section 4 describes the discretization methods used for comparison and presents the experimental results, followed by a discussion. The conclusion of this paper is presented in Section 5.

Literature Review
a. K-Means as Discretization Approach
Various discretization approaches can be used for many problems, and discretization can involve one method or more. For example, Fikri et al. (2020) use fuzzy logic and a Random Forest classifier as a discretization approach to improve classification accuracy. Other work employs multivariate discretization (Zamudio-Reyes et al., 2017) and K-Means (MacQueen, 1967) as discretization approaches. K-Means, proposed by J. MacQueen in 1967, is an iterative algorithm. At the beginning, k data points are randomly selected as reference points called centroids.
K-Means can be used as a discretization approach. In (Maryono et al., 2018), K-Means acts as discretization on a mixed-attribute dataset. In another study, K-Means is combined with a discretization technique and a Naïve Bayes classifier (Tahir et al., 2016) and applied in a network intrusion detection system. Moreover, K-Means can be implemented as a discretization approach without combination with another approach, such as in network intrusion detection research (Zhao et al., 2018) and optimal graph clustering (Han et al., 2020).
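The way K-Means centroids translate into discretization breakpoints can be sketched for a single attribute as follows. This is a hedged illustration: the deterministic quantile-based initialization (instead of the random selection described above) and the midpoint rule for deriving breakpoints are assumptions for the sketch, not details taken from the cited works:

```python
# Hedged sketch: deriving discretization breakpoints for one attribute from
# 1-D K-Means centroids. The quantile-based initialisation (instead of the
# random selection described above) and the midpoint rule for breakpoints
# are assumptions for illustration, not details taken from the cited works.
def kmeans_1d(values, k, iters=50):
    svals = sorted(values)
    # deterministic initial centroids at evenly spaced quantiles
    centroids = [svals[(2 * i + 1) * len(svals) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

def breakpoints_from_centroids(centroids):
    # breakpoints placed halfway between adjacent centroids
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]

centroids = kmeans_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 8.9, 9.0, 9.1], 3)
bps = breakpoints_from_centroids(centroids)  # roughly [3.0, 7.0]
```

Any value can then be binned against `bps` exactly as in the breakpoint example of the introduction.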

b. Optimization Algorithm
Recently, real-life problems have been solved using optimization algorithms, and the right choice of optimization algorithm is needed to solve an optimization problem. There are many ways to classify optimization algorithms, depending on their characteristics and focus. One commonly used family is swarm intelligence-based algorithms. This section presents three prominent swarm-based optimization algorithms: the Bat Algorithm (BA), Particle Swarm Optimization (PSO) and the Whale Optimization Algorithm (WOA).
The population-based metaheuristic optimization algorithm (Nguyen et al., 2020) known as Particle Swarm Optimization (PSO) was proposed by Kennedy and Eberhart (1995). PSO simulates the movement of birds randomly looking for food in a search space; every bird is considered a solution, or particle. PSO has been used to solve many kinds of optimization problems, such as scheduling. The Bat Algorithm (BA) was presented by Xin-She Yang (2010) (Nguyen et al., 2020). BA mimics the behaviour of bats, which use echolocation to find prey. The algorithm varies the pulse emission rate and loudness to find the best solution. BA has been employed in various applications.
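A minimal one-dimensional PSO sketch may help fix the idea of particles, personal bests and a global best. The inertia and acceleration coefficients (w, c1, c2) are conventional textbook values, not parameters taken from the cited works:

```python
import random

# Hedged one-dimensional PSO sketch minimising f(x) = x**2. The inertia and
# acceleration coefficients (w, c1, c2) are conventional textbook values,
# not parameters taken from the cited works.
def pso(f, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=1):
    rng = random.Random(seed)
    xs = [rng.uniform(-10.0, 10.0) for _ in range(n_particles)]  # positions
    vs = [0.0] * n_particles                                     # velocities
    pbest = xs[:]                       # best position seen by each particle
    gbest = min(pbest, key=f)           # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # velocity pulled toward the personal and global bests
            vs[i] = (w * vs[i] + c1 * r1 * (pbest[i] - xs[i])
                     + c2 * r2 * (gbest - xs[i]))
            xs[i] += vs[i]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
                if f(xs[i]) < f(gbest):
                    gbest = xs[i]
    return gbest

best = pso(lambda x: x * x)  # converges close to 0
```

The same particle/velocity structure carries over to multidimensional problems by making each position a vector.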

The Proposed Discretization Approach
This research proposes a new discretization approach for discretizing the continuous values of a dataset. To evaluate the effectiveness of the proposed approach, a series of experiments has been conducted.

a. Data Acquisition
Nine continuous datasets obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) are used. UCI was created in 1987 by David Aha (Imran et al., 2013) and fellow graduate students at UC Irvine, and it provides more than 500 datasets to the public for research purposes. The nine continuous datasets used in this research range from 159 to 5000 instances and from 8 to 100 attributes. They come from various domains and consist of different numbers of instances and attributes. Information about the datasets, including the dataset name, number of instances, number of attributes and dimension, is presented in Table 1. These datasets are in Comma-Separated Values (CSV) format, a delimited text file that uses a comma to separate values, for machine learning using WEKA (Waikato Environment for Knowledge Analysis).
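Since the datasets are plain CSV files, loading one for preprocessing can be sketched as follows. The file name and column layout are hypothetical; the last column is assumed to be the class label and all other columns are assumed to be continuous features:

```python
import csv

# Minimal sketch of reading one of the CSV-formatted datasets. The file name
# and column layout are hypothetical; the last column is assumed to be the
# class label and all other columns are assumed to be continuous features.
def load_csv(path):
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        # parse feature columns as floats, keep the class label as text
        rows = [[float(v) for v in row[:-1]] + [row[-1]] for row in reader]
    return header, rows
```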

b. K-Means Algorithm
K-Means is an iterative algorithm. At the beginning, k data points are randomly selected as reference points, also known as centroids, and the data are divided into k clusters. Assume the j-th cluster C_j consists of the data points x_i nearest to its centre point c_j. The locations of the centre points and the assignment of data points are updated repeatedly until an optimum solution is reached. K-Means is defined by the objective in equation (1):

J = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ‖x_i − c_j‖²   (1)

c. Bat Algorithm
The Bat algorithm (BA) mimics bat behaviour, where a group of bats in a population flies randomly to find prey. Each bat detects the nearest prey and updates its position and speed. The bat that is closest to the prey becomes the best bat in the population. In BA, the speed is known as the velocity and each bat in the set represents a solution. A fitness function must be computed for each bat; the bat with the best fitness value becomes the best bat in the population.
This study follows the rules of BA: (i) first, distance detection, in which all bats in the population use echolocation to detect their position relative to the prey.
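The standard BA updates (frequency, velocity and position updates, a local random walk around the best bat, and loudness/pulse-rate adaptation) can be sketched in one dimension. All parameter values here are conventional defaults from the standard BA formulation (Yang, 2010) rather than values from this paper, and the single shared loudness/pulse rate and the 0.1 walk scale are simplifications:

```python
import math
import random

# Hedged sketch of the standard BA updates (Yang, 2010) on a one-dimensional
# objective. All parameter values are conventional defaults rather than
# values from this paper; a single shared loudness and pulse rate (instead
# of per-bat values) and the 0.1 local-walk scale are simplifications.
def bat_algorithm(f, n_bats=20, iters=200, fmin=0.0, fmax=2.0,
                  alpha=0.97, gamma=0.1, seed=3):
    rng = random.Random(seed)
    xs = [rng.uniform(-10.0, 10.0) for _ in range(n_bats)]  # bat positions
    vs = [0.0] * n_bats                                     # bat velocities
    loud, rate = 1.0, 0.5                                   # loudness, pulse rate
    best = min(xs, key=f)
    for t in range(1, iters + 1):
        for i in range(n_bats):
            freq = fmin + (fmax - fmin) * rng.random()      # frequency
            vs[i] += (xs[i] - best) * freq                  # velocity update
            cand = xs[i] + vs[i]                            # position update
            if rng.random() > rate:                         # local walk near best
                cand = best + 0.1 * loud * rng.gauss(0.0, 1.0)
            if rng.random() < loud and f(cand) < f(xs[i]):  # accept better move
                xs[i] = cand
                loud *= alpha                               # loudness decreases
                rate = 0.5 * (1.0 - math.exp(-gamma * t))   # pulse rate rises
            if f(xs[i]) < f(best):
                best = xs[i]
    return best

best = bat_algorithm(lambda x: x * x)  # converges toward 0
```

Decreasing loudness and increasing pulse rate shift the search from exploration to exploitation as the population closes in on the best solution.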

d. Hybrid Discretization K-Means with BA Algorithm
In discretization, the vital task is to determine the breakpoints for the integer values; continuous values can then be assigned integer values such as 1, 2 or 3 according to these breakpoints. In the proposed approach, the cluster centroid of each cluster is determined by BA. The format of the dataset is presented in Table 2. Each bat position consists of the number of features in the dataset, and each solution consists of the number of attributes for that solution in the dataset. For example, assume a dataset DS has 10 features and 15 instances, and the algorithm runs for 20 generations or repetitions. After 20 repetitions, the bat corresponding to instance number 10 is considered the best in the population, and its position becomes the initial centroids for the clusters in the K-Means algorithm.
Let X = {x_1, x_2, …, x_n} be the set of data points in the dataset, and minimize J = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ‖x_i − c_j‖², where ‖x_i − c_j‖ is the Euclidean distance between a point x_i and a centroid c_j, iterated over all points in the j-th cluster, for all k clusters.
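Once the centroids have been fixed (in the proposed approach, found by the BA search), the assignment of integer codes described above reduces to nearest-centroid labelling. A minimal sketch with hypothetical values:

```python
# Hedged sketch of the assignment step: once the centroids are fixed (in the
# proposed approach, found by the BA search), each continuous value is
# replaced by the 1-based index of its nearest centroid. Values are
# hypothetical.
def assign_discrete(values, centroids):
    ordered = sorted(centroids)
    return [min(range(len(ordered)), key=lambda j: abs(v - ordered[j])) + 1
            for v in values]

codes = assign_discrete([0.2, 4.8, 9.3], [1.0, 5.0, 9.0])  # [1, 2, 3]
```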

e. Classifiers Performance
At the end of the optimized discretization process, the results can be evaluated through classifiers. A classifier is a learning algorithm that learns a model from training data. Three classifiers are used in this research: Decision Tree, k-Nearest Neighbours and Naïve Bayes. These classifiers are commonly used in classification (Shafiq et al., 2020).
To compare the classifiers, four classification evaluation criteria are used: accuracy, recall, f-measure and ROC. These performance criteria are used to evaluate the effectiveness of the optimized discretization and feature selection in improving classification accuracy through six experiments.
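The criteria above, apart from ROC (which requires ranked prediction scores rather than hard labels), can be computed directly from confusion-matrix counts. A minimal binary-classification sketch:

```python
# Hedged sketch of the evaluation criteria named above, computed from
# confusion-matrix counts for a binary problem (ROC is omitted because it
# requires ranked prediction scores rather than hard labels).
def metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```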

Results and Discussion
The algorithms used in this experiment are executed using MATLAB. The algorithms are validated using three classifiers (Decision Tree, k-Nearest Neighbours and Naïve Bayes) from WEKA. The goal of this experiment is to validate that the discrete data can improve classification performance in terms of accuracy, recall, f-measure and ROC.
The experiment is conducted by converting all continuous datasets and generating new discrete datasets. Comparisons are made between the proposed approach, the original continuous datasets, and the discrete datasets produced by K-Means, hybrid K-Means with PSO, and hybrid K-Means with WOA.

a. Accuracy of Discrete Datasets
The accuracy results for the Naïve Bayes classifier are shown in Table 3. Eight of the nine datasets achieve better accuracy after the discretization process. Six of these eight datasets use hybrid discretization, and four of them are improved by the same hybrid approach. Table 4 shows the accuracy results for the k-Nearest Neighbours classifier, where six of the nine datasets are improved after the discretization process; two of the approaches are each able to improve three of the six datasets. Using the Decision Tree classifier, five of the nine datasets achieve better results after the discretization process. As shown in Table 5, one approach improved the accuracy of three of the five datasets and another improved the accuracy of two.

b. Recall of Discrete Datasets
The recall results for the Naïve Bayes classifier are shown in Table 6. All datasets obtain good results after the discretization process using hybrid discretization, with eight of the nine datasets improved by the same approach. Table 7 shows that six of the nine datasets are improved after the discretization process for the k-Nearest Neighbours classifier; three of the approaches improved 2, 1 and 3 of the 6 datasets, respectively. Using the Decision Tree classifier, six of the nine datasets are improved after the discretization process, as shown in Table 8: one approach improved two datasets, another improved one dataset and a third improved three of the six datasets.

c. F-Measure of Discrete Datasets
The f-measure results for the Naïve Bayes classifier are shown in Table 9. All datasets obtain good results after the discretization process using hybrid discretization, with seven datasets improved by the same approach. Table 10 shows the f-measure results for the k-Nearest Neighbours classifier, where five of the nine datasets are improved after the discretization process; three of the approaches are able to improve two datasets, one dataset and one dataset, respectively. Using the Decision Tree classifier, five of the nine datasets are improved after the discretization process, as shown in Table 11: one technique improved three of the five datasets, and two other techniques each improved one dataset.

d. ROC of Discrete Datasets
The ROC results for the Naïve Bayes classifier are shown in Table 12. The ROC of five of the nine datasets achieves good results after the discretization process; four of these five datasets use one approach and one dataset uses another. Table 13 shows the ROC results for the k-Nearest Neighbours classifier, where seven of the nine datasets are improved after the discretization process; one technique is able to improve five of the seven datasets, and two other techniques each improve one dataset. Using the Decision Tree classifier, seven of the nine datasets are improved after the discretization process, as shown in Table 14: one approach improved six of the seven datasets and another improved one. In this experiment, two of the approaches obtained the same result for DS1.

Conclusion
In this paper, a new optimized discretization approach was proposed. Experiments were conducted to compare the effectiveness of the proposed approach in improving classification performance over discrete datasets generated from continuous datasets, as well as over discrete datasets produced by other approaches, namely K-Means, hybrid K-Means with PSO and hybrid K-Means with WOA. The experiments show that an optimization algorithm employed during the data preparation step is able to help solve classification problems. The results also show that the optimization algorithm was able to improve classification performance in terms of accuracy, recall, f-measure and ROC.
This research shows that the proposed approach outperforms the continuous datasets and the discrete datasets produced by other approaches on almost all datasets. Thus, BA is a good discretization approach: it is able to maintain the accuracy of the classification algorithm and avoid loss of information. However, the proposed approach still has room for improvement in future research, since it was not able to improve classification performance on all datasets. In the future, this research will be extended to feature selection using optimization algorithms, especially the Bat Algorithm. The optimization algorithms may also be examined with mixed-type attributes and imbalanced datasets.