A Study on GBW-KNN Using Statistical Testing

In the 4th industrial revolution, big data and artificial intelligence are becoming increasingly important, because value can be found by applying artificial intelligence techniques to data generated and accumulated in real time. Various industries use them to provide a variety of services and products to customers and to enhance their competitiveness. The KNN algorithm is one such analysis method; it predicts the class of an unlabeled instance using the classes of nearby neighbors, and it is widely used because it is simpler and easier to understand than other methods. In this study, we propose the GBW-KNN algorithm, which assigns a weight to each instance based on the KNN graph before selecting the nearest neighbors. In addition, statistical tests were conducted to determine whether the difference in performance between KNN and GBW-KNN is significant. The experiments confirmed that GBW-KNN performed better overall and that the difference in performance between the two methods was statistically significant.


Introduction
In the 4th industrial revolution, the boundaries of existing industries are blurring, centering on the development of ICT technology [1]. This is called the big blur phenomenon, and it is accelerating as technologies such as IoT, artificial intelligence, and big data emerge. In particular, new services are emerging through the combination of artificial intelligence and big data, and they are changing the form of existing business. A large amount of data is accumulated in real time thanks to advances in technology capable of storing and processing data. In addition, advances in data mining, machine learning, and deep learning allow us to discover patterns and new value in this data. Because applying this value to products or services can enhance competitiveness, countries and companies are paying attention. Big data refers to data whose volume exceeds the capabilities of existing databases, together with the technology to analyze it. In the past, information was simply individual-centered and at the level of structured data, but big data now includes unstructured data such as text [2,28]. The characteristics of big data can be expressed as the 4Vs: volume, variety, velocity, and value. Techniques for finding or analyzing patterns in big data include statistical techniques, data mining, and artificial intelligence. Machine learning is a field of artificial intelligence that develops algorithms and techniques so that computers can learn; it means making predictions on given attributes after training on data. The process of creating a model from data is called learning. Machine learning is classified into supervised and unsupervised learning depending on the presence or absence of a label. Supervised learning trains a computer with labeled data to predict the result value of a new instance; regression and classification are representative examples.
If the result values are drawn from a fixed set, the task is classification, whereas in regression the result value is continuous. Regression is a model mainly used to estimate data through functional expressions. Unsupervised learning is a method of learning from unlabeled data to find hidden patterns or structures, and clustering is representative [3,4,29]. The goal of a clustering model is to group similar entities by analyzing the characteristics of unlabeled data.

K Nearest Neighbor
KNN is a representative classification algorithm that classifies new objects using labeled data. The KNN algorithm assigns an object with an unknown category to the class of the most similar labeled objects. It is popular in many fields because it is intuitive and simple [5]. The pseudo-code of KNN is as follows [6].
Algorithm: K Nearest Neighbor
  Input: training dataset X, class labels of X: Y, unknown sample x
  Output: class label of x: y
  Classify(X, Y, x):
    for i = 1 to length of X do
      compute distance d(Xi, x)
    end
    select the k smallest distances
    get the class labels of the k nearest neighbors from Y
    compute the majority label of the k nearest neighbors and assign it to x

To classify a new, unlabeled object, the most similar labeled data are used. Methods of measuring similarity include the Euclidean distance, the Manhattan distance, and cosine similarity, of which the Euclidean distance is the most widely used. The distance d between point A (x1, y1) and point B (x2, y2) under the Euclidean method is as follows [7]:

d(A, B) = √((x2 − x1)² + (y2 − y1)²)
The distance between data points is calculated according to the chosen similarity measure, and the k neighbors from the nearest point up to the k-th nearest point are obtained. The majority class among these k neighbors' classes becomes the class of the new object. Various metrics such as accuracy, precision, and recall are used to measure classification performance [8]. The disadvantages of KNN summarized in the table below are:
- Sensitive to noise
- Requires large storage
- Difficult to find the optimal k value
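The pseudo-code above can be sketched in Python as follows (a minimal illustration; the function names and the toy data are ours, not from the paper):

```python
import numpy as np
from collections import Counter

def euclidean(a, b):
    # d(A, B) = sqrt(sum_i (a_i - b_i)^2)
    return np.sqrt(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def knn_classify(X, Y, x, k=3):
    """Classify x by the majority label among its k nearest neighbors in X."""
    distances = [euclidean(row, x) for row in X]
    nearest = np.argsort(distances)[:k]          # indices of the k smallest distances
    labels = [Y[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]  # majority label
```

For example, with two well-separated clusters labeled 0 and 1, a query near the first cluster is assigned label 0 by majority vote of its three nearest neighbors.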

Related Research
The KNN algorithm is one of the simplest classification techniques and a popular machine learning method. KNN has been developed in various ways to improve performance, and weighting methods in particular have been proposed [9,10]. Weighted KNN (WKNN) refers to a KNN that is weighted according to the importance of a feature or object. In addition to taking the inverse of the similarity as the weight, various other methods have been proposed [11,12,13,14].
Dudani proposed the WKNN methodology, a KNN that gives larger weights to nearer neighbors [15]. Dudani assigned the weight w_i to the i-th nearest neighbor as shown below, where d_i is the distance to the i-th neighbor, d_1 the distance to the nearest neighbor, and d_k the distance to the k-th (farthest) neighbor:

w_i = (d_k − d_i) / (d_k − d_1) if d_k ≠ d_1, and w_i = 1 otherwise.  (1)

A dual-weight method was later proposed [16], which squares the weight obtained through Equation 1 suggested by Dudani and uses it as a new weight. The dual weight can be calculated as the following equation:

w′_i = ((d_k − d_i) / (d_k − d_1))²  (2)
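The two weighting schemes can be illustrated concretely as follows (a sketch; the sorted-distance convention d_1 ≤ … ≤ d_k and the function names are ours):

```python
def dudani_weights(dists):
    """Dudani's weights (Equation 1): the nearest neighbor (distance d_1)
    gets weight 1 and the k-th, farthest neighbor (distance d_k) gets 0."""
    d = sorted(dists)
    d1, dk = d[0], d[-1]
    if dk == d1:                 # all k neighbors are equally distant
        return [1.0] * len(d)
    return [(dk - di) / (dk - d1) for di in d]

def dual_weights(dists):
    # Dual weight as described in the text: the square of Dudani's weight.
    return [w ** 2 for w in dudani_weights(dists)]
```

For neighbor distances [1, 2, 3], Dudani's weights are [1, 0.5, 0] and the dual weights [1, 0.25, 0]; squaring sharpens the preference for the closest neighbors.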
In addition, WKNN variants that apply various methods or weight by the reciprocal of similarity were introduced, and derivative studies fusing KNN with other techniques were conducted [17,18,19,20]. One study improved the performance of KNN by fusing it with a Genetic Algorithm. GA is commonly used as a stochastic search method [21]. Yan Xuesong et al. conducted a study applying GA to KNN and were able to derive higher performance than KNN. Daniel Mateos-García, Jorge Garcia Gutierrez, and Jose C. proposed the Simultaneous Weighting of Attributes and Neighbors (SWAN) method [22].
Unlike other WKNN algorithms, this method considers both the contribution of neighbors and the importance of data attributes. Regardless of the analysis method, preprocessing is required, and KNN has also been used for this purpose. Hautamaki, V. proposed a method for detecting and processing outliers using a KNN graph [22]. A two-dimensional KNNG was created, and outliers were determined by counting the in-degree of each object: a minimum in-degree is set, and objects below this criterion are treated as outliers.
In this study, we propose a KNN that uses a KNN graph to give weights to each object and predicts a class using it. Using this method, neighbors can be selected by considering the relationship between the object and the surrounding data. Performance was compared using public data, and statistical tests were performed to confirm whether the difference was significant.

GBW-KNN Algorithm
In this paper, we propose GBW-KNN to improve accuracy. In general, the KNN algorithm calculates the similarity between an unclassified object and the other objects, and finds the k neighbors with the highest similarity. The unclassified object is then assigned the majority class among the labels of the k neighbors. KNN is widely used because it is an easy-to-understand and intuitive method, but it is strongly affected by outliers, and there is the burden of determining k in advance. Outliers are values that lie far from other observations. If neighbors are selected using the conventional KNN method, a point isolated from the bulk of the data can still be selected simply because its absolute distance is small. Therefore, in this paper, we propose GBW-KNN to address this problem.
First, KNN is performed and, for each point, the number of times it is selected as a nearest neighbor is counted. For example, when k = 3, three edges extend from each point, and a point's weight is computed from how many edges from other points connect to it (its in-degree in the KNN graph). When weights are assigned in this way, a point that receives more selections from other points has a smaller weight. When classifying an unknown point, the distance to each candidate neighbor is multiplied by that candidate's weight to update the similarity. With neighbors selected in this way, a point that has received many selections from other points is chosen as a neighbor even if its absolute distance is slightly larger than that of a point isolated from the rest of the data. Therefore, this method selects neighbors by considering the surrounding characteristics of the data, rather than judging by the similarity measure alone.
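The procedure above can be sketched as follows. Note that the paper only states that more-selected points receive smaller weights and that distances are multiplied by the weights; the concrete weight formula w = 1/(1 + in-degree) used here is an illustrative assumption, as are the function names:

```python
import numpy as np

def knn_graph_indegree(X, k):
    """Count, for each training point, how many times it is selected as one
    of the k nearest neighbors of the other training points (its in-degree
    in the KNN graph)."""
    X = np.asarray(X, dtype=float)
    indegree = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # a point is not its own neighbor
        for j in np.argsort(d)[:k]:
            indegree[j] += 1
    return indegree

def gbw_knn_classify(X, Y, x, k):
    """GBW-KNN sketch: distances from the query to each training point are
    multiplied by that point's weight, so frequently selected points become
    effectively closer.  Weight formula is an assumption (see lead-in)."""
    X = np.asarray(X, dtype=float)
    w = 1.0 / (1.0 + knn_graph_indegree(X, k))
    adjusted = w * np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(adjusted)[:k]
    labels = [Y[j] for j in nearest]
    return max(set(labels), key=labels.count)   # majority label
```

Because the adjusted distance shrinks for well-connected points, an isolated point that merely happens to be close to the query is less likely to enter the neighbor set.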

Experiment

Experiment Design
In order to compare the performance of GBW-KNN and KNN proposed in this paper, an experiment was designed as follows. Wisconsin Breast Cancer data and Pima Indian Diabetes data were used for the experiment [23,24]. Each data can be downloaded from the UCI repository and Kaggle. Breast Cancer data consists of 32 attributes and 569 patient data, and Pima Indian data consists of 9 attributes and data of 768 patients.

Figure 1. Experiment Design
Normalization and feature selection were performed to prepare the datasets for the experiment, and GBW-KNN and KNN were implemented in Python. The experiment was repeated 30 times for each k value, and accuracy was used as the performance evaluation measure.
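The evaluation loop described above can be sketched as follows. The min-max normalization scheme, the 70/30 split ratio, and the function names are illustrative assumptions; the paper does not specify these details:

```python
import random
import numpy as np

def min_max_normalize(X):
    """Rescale each attribute to [0, 1], a common preprocessing step for
    distance-based methods such as KNN."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span

def repeated_accuracy(classify, X, Y, k, runs=30, test_frac=0.3, seed=0):
    """Average accuracy over repeated random train/test splits.  `classify`
    is any function with the signature classify(X_train, Y_train, x, k)."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        cut = int(len(X) * test_frac)
        test, train = idx[:cut], idx[cut:]
        Xtr, Ytr = [X[i] for i in train], [Y[i] for i in train]
        correct = sum(classify(Xtr, Ytr, X[i], k) == Y[i] for i in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)
```

Running this once per k value for each of the two classifiers yields the per-k average accuracies that the paper compares.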

Experiment Result
The experimental results for each dataset are as follows. The table and graph summarize the average accuracy over the 30 runs for each k value.

Statistical Testing
The paired t-test is a statistical technique widely used when comparing the performance of classification algorithms [25]. Unlike the independent t-test, the paired t-test analyzes differences within the same group [26]. In the experiment of this paper, the GBW-KNN algorithm is applied immediately after the accuracy of the existing KNN algorithm is derived under the same random seed, so the difference before and after applying the proposed algorithm can be verified with a paired t-test.

In order to perform the t-test, it is necessary to check whether the collected data follow a normal distribution; depending on the result, either a parametric or a nonparametric method is selected [27]. Normality can also be checked graphically, but in this paper the shapiro.test() function was used. In the Shapiro test, if the p-value is greater than or equal to the significance level, the data can be considered to follow a normal distribution. The hypotheses are as follows: the null hypothesis is 'the sample follows a normal distribution' and the alternative hypothesis is 'the sample does not follow a normal distribution'. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is adopted; conversely, if the p-value is greater than 0.05, the null hypothesis that the data follow a normal distribution cannot be rejected.

The above figure shows the Shapiro test result when k = 4 on the Pima dataset. In this case, the null hypothesis can be rejected because the p-value is less than the significance level of 0.05. Since normality is not satisfied, the Wilcoxon signed rank test should be performed instead of the paired t-test. The above figure also shows the Shapiro test result when k = 5 on the breast cancer dataset. Because the p-value is greater than the significance level, the null hypothesis cannot be rejected.
Therefore, since normality is satisfied, the paired t-test can be performed.
The table below shows the Shapiro-Wilk normality test results. For the breast cancer dataset, the paired t-test was applied when K was 5 and 9, and the Wilcoxon signed rank technique when K was 3, 7, and 11. For the Pima dataset, the paired t-test was applied when K was 5, 9, and 11, and the Wilcoxon signed rank technique when K was 3 and 7. If the number of data points is too small or the normality assumption is not satisfied, the test can be performed using a nonparametric method instead of a parametric one; the analysis method was therefore chosen depending on whether each condition satisfied normality. The null hypothesis (H0) and the alternative hypothesis (H1) are as follows: the null hypothesis is 'the difference between the medians of the two groups is 0', and the alternative hypothesis is 'the difference between the medians of the two groups is not 0'.
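The test-selection procedure described above can be sketched in Python with SciPy, whose scipy.stats.shapiro mirrors R's shapiro.test(). The function name choose_test and the choice to check normality on the paired differences (the quantity the paired t-test actually assumes normal) are our assumptions:

```python
import numpy as np
from scipy import stats

def choose_test(acc_knn, acc_gbw, alpha=0.05):
    """Pick between the paired t-test and the Wilcoxon signed rank test by
    first running a Shapiro-Wilk normality check on the paired differences.
    Returns the name of the test used and its p-value."""
    diff = np.asarray(acc_gbw) - np.asarray(acc_knn)
    _, p_norm = stats.shapiro(diff)
    if p_norm >= alpha:                           # normality not rejected
        _, p = stats.ttest_rel(acc_knn, acc_gbw)  # parametric: paired t-test
        return "paired t-test", p
    _, p = stats.wilcoxon(acc_knn, acc_gbw)       # nonparametric fallback
    return "wilcoxon", p
```

A p-value below alpha from the selected test then leads to rejecting the null hypothesis of no difference between the two accuracy samples.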

Figure 6. Result of Paired T test
The above figure shows the result of the paired t-test when k = 5 on the Pima data. The Shapiro test p-value was 0.09996, greater than the significance level of 0.05, so the paired t-test was performed as shown in the figure. The paired t-test p-value was 0.002526, less than 0.05, so the null hypothesis was rejected at the 0.05 significance level, and there was a significant difference in accuracy before and after weighting each point. Like the paired t-test, the Wilcoxon signed rank method also tests using the differences between the paired data. The figure above shows the result of applying the Wilcoxon signed rank test when k = 3 on the breast cancer data. The p-value was 0.0001936, less than 0.05, so the null hypothesis was rejected at the 0.05 significance level, and again there was a significant difference in accuracy before and after weighting each point. For breast cancer, the paired t-test was performed when K was 5 or 9, and each p-value was less than 0.05; when K was 3, 7, or 11, the Wilcoxon signed rank technique was applied, and each p-value was less than 0.05. For the Pima data, the paired t-test was applied when K was 5, 9, and 11, and each p-value was less than 0.05; when K was 3 and 7, the Wilcoxon signed rank technique was applied, and each p-value was less than 0.05. Therefore, there is a significant difference in the accuracy of KNN and GBW-KNN on both datasets.

Conclusion
In the 4th industrial revolution era, the boundaries between industries are blurring. Artificial intelligence is being applied to big data in every industry, and this is expected to increase gradually. Countries and companies try to use it actively because analysis can uncover value that enhances competitiveness. The KNN algorithm is a representative machine learning algorithm that predicts the class of a new, unclassified object. Although it has the advantage of being simple and easy to understand, it has the disadvantages that k must be determined in advance and that it is sensitive to outliers. In this paper, GBW-KNN was developed to improve classification accuracy, and statistical tests were performed to determine whether there is a significant difference in performance compared with the existing KNN. For verification, the Wisconsin Breast Cancer dataset and the Pima Indian Diabetes dataset, both publicly downloadable, were used; experiments were conducted under five k-value conditions, accuracy was measured, and statistical tests were performed. The paired t-test, widely used for comparing the performance of classification algorithms, was conducted, and the Wilcoxon signed rank technique was applied when normality was not satisfied. Through the experiments, the difference before and after weighting with GBW-KNN could be verified. The overall accuracy of the proposed GBW-KNN was superior to that of KNN. A normality test was performed for each dataset and k-value condition, and the statistical test technique was applied according to the result. In all cases, the p-value was less than the significance level of 0.05, and the null hypothesis was rejected. Therefore, it can be said that there is a significant difference between the accuracy of KNN and GBW-KNN.