Feature Selection Based Enhancement of the Accuracy of Classification Algorithms

Feature selection plays a vital role for classification algorithms in machine learning. In real-time data mining, irrelevant features present in a dataset may decrease the accuracy of classification algorithms, so selecting the appropriate and relevant features is an important task in classification problems. The aim of this research work is to increase the accuracy of classification algorithms by applying a feature selection technique to datasets from different domains. The paper also compares the accuracy of different classification algorithms with and without feature selection. The classification algorithms Bayesian Net, Naïve Bayes, Multilayer Perceptron, logistic regression, J48 and Random Forest are used with feature selection on the Breast Cancer, Glass, Iris and Weather datasets for comparison. The experiments are carried out with the Weka tool and datasets from the machine learning repository. Feature selection techniques are effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy and improving result comprehensibility. However, the recent increase in the dimensionality of data poses a severe challenge to many existing feature selection techniques with respect to efficiency and effectiveness.


INTRODUCTION
This research work compares data mining classification algorithms from three families, namely Bayesian classification, function-based classification, and decision tree classification, using different datasets. The accuracy of the classification algorithms is improved using a feature selection method.

Classification Algorithms
Classification is a data mining technique used to predict group membership for data instances. Three types of classification models are considered in this study.
Bayesian classifiers are statistical classifiers. They predict class membership probabilities, that is, the probability that a given tuple belongs to a particular class. They are based on Bayes' theorem. The two types are Naïve Bayesian classification and Bayesian belief networks. In our work, the Bayesian Net and Naïve Bayes classification methods are considered.
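For reference, Bayes' theorem and the Naïve Bayes decision rule can be written as follows. The notation is an assumption introduced here for illustration: C denotes a class and X = (x_1, ..., x_n) the attribute values of a tuple; the second line additionally assumes that the attributes are conditionally independent given the class.

```latex
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}
\\[6pt]
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The tuple is assigned to the class with the largest posterior probability.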
Function-based classification constructs a function that takes an input feature vector X and predicts its outcome Y. In this work, the Multilayer Perceptron (MLP) and Logistic classifier methods are implemented using the Weka tool. An MLP is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. It consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Its current output depends only on the current input instance. It is trained using backpropagation. Logistic regression is, in its basic form, a binary classification model.
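The forward pass of such a fully connected network can be sketched as below. This is a minimal illustration, not the Weka implementation: the sigmoid activation, the layer sizes, and all weight values are assumptions chosen only to make the example runnable, and training by backpropagation is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: every output node sees every input."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, layers):
    """Feed x through each (weights, biases) layer in order."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation


# Hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]),
    ([[1.2, -0.7]], [0.05]),
]
output = mlp_forward([1.0, 0.0], layers)
```

Because the output depends only on the current input and the fixed weights, calling `mlp_forward` twice with the same input returns the same value.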
Decision tree (DT) learning algorithms work by processing and deciding upon attributes of the data. Random Forest (RF) and J48 were used in our experiments. The decision tree mechanism is transparent, and the tree structure can be followed easily to see how a decision is made. It is a predictive modeling technique used in classification, clustering and prediction tasks. Decision tree classification is performed in two phases, tree building and tree pruning. Tree building is done in a top-down manner, whereas tree pruning is done in a bottom-up fashion.

1.2. Feature selection
Feature selection plays a vital role in the field of machine learning. It is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy and improving result comprehensibility. However, the recent increase in the dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness.

Objectives
The main aim of this research work is to enhance the accuracy of classification algorithms with feature selection. To that end, the following objectives are framed.
• To build classifiers using different datasets.
• To create an ensemble of techniques that addresses the shortcomings of existing approaches that are not affected by feature selection.
• To use a feature selection technique to reduce the number of attributes and improve classification accuracy.
• To develop a framework based on different classification algorithms, compare the performance of the classifiers with and without feature selection, and analyze the results.

REVIEW OF LITERATURE
Many researchers have studied classification algorithms combined with feature selection to enhance accuracy on various datasets. Some of the important research works are identified and reviewed in this section. These articles helped us to propose a feature-selection-based method for enhancing the accuracy of classification algorithms.
M. Krishnaveni et al. summarized the feature selection process, the different types of feature selection algorithms (filter, wrapper and hybrid) and their importance. They also analyzed some of the popular existing feature selection algorithms through a literature survey and addressed the strengths and challenges of these algorithms [1]. N. Sai Sragvi Vibhushan et al. discussed predicting the performance of students. A feature selection algorithm removes extraneous data and helps increase the accuracy of the classifier. An ensemble method produces different models and combines them to produce improved results; the authors compared various feature selection algorithms and identified the best one [2].
Fifie Francis et al. concluded that J48 and Naïve Bayes are the most commonly used classification algorithms in prediction analysis, and that the Gain Ratio Attribute Evaluator and the Ranker algorithm are used for feature selection [3]. Maryam Zafar et al. showed that feature selection (FS) improved the quality of prediction models for datasets. FS algorithms eliminate unrelated data from the dataset and increase classifier accuracy; the best features can produce better results [4]. Gupta et al. summarized results for classification techniques in which different feature selection methods were used as the search method, the best features were applied, and the classifier accuracy was measured; they concluded that feature selection plays an important role in classification [13].
Nikhil et al. showed that the accuracy of the classifier improved when a feature selection technique was applied, with the classifier producing better predictions after feature selection [14]. Rohit et al. [15] carried out a comparative analysis of different classification algorithms, measured the accuracy and performance of each classifier, and identified the best classifier.

PROPOSED METHODOLOGY
The proposed work is designed to analyze data mining classification algorithms and to enhance the accuracy of the classifiers using feature selection methods. The work has two phases: finding the accuracy of a classifier without feature selection on a dataset, and finding the accuracy of the same classifier with feature selection on that dataset. Finally, the results with and without feature selection are compared across the different classifiers. The architecture of the proposed methodology is depicted in Figure 1. The proposed method involves the following steps:
Step 1: Take the Breast Cancer, Iris, Glass, and Weather datasets from the machine learning repository.
Step 2: Select the best features using a feature selection method.
Step 3: Select the best feature among all the features using the Information Gain method.
Step 4: Remove the irrelevant features from the datasets before applying the classification algorithms.
Step 5: Apply the classification algorithms Bayesian Net, Naïve Bayes, Multilayer Perceptron, logistic regression, J48 and Random Forest.
Step 6: Find the accuracy of each classification algorithm with and without feature selection.
Step 7: Identify the best classification algorithm for the given dataset.
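The flow of these steps can be sketched in a few lines. Everything below is illustrative and not the experimental setup of this paper: the four-row toy dataset, the relevance scores, and the helper names are hypothetical, and a leave-one-out 1-nearest-neighbour classifier is swapped in as a stand-in for the six Weka classifiers so that the example stays self-contained.

```python
def top_k(scores, k):
    """Steps 2-3: indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def project(rows, keep):
    """Step 4: drop every feature column that was not selected."""
    return [tuple(row[i] for i in keep) for row in rows]


def loo_accuracy(rows, labels):
    """Steps 5-6: leave-one-out accuracy of a 1-nearest-neighbour stand-in."""
    correct = 0
    for i, row in enumerate(rows):
        # Hamming distance to every other row; ties go to the lowest index.
        _, nearest = min(
            (sum(a != b for a, b in zip(row, other)), j)
            for j, other in enumerate(rows)
            if j != i
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def compare(rows, labels, scores, k):
    """Accuracy without and with feature selection, in that order."""
    keep = top_k(scores, k)
    return loo_accuracy(rows, labels), loo_accuracy(project(rows, keep), labels)


# Hypothetical toy data: feature 0 determines the class, feature 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["A", "A", "B", "B"]
scores = [1.0, 0.0]  # hypothetical relevance score per feature

without_fs, with_fs = compare(rows, labels, scores, k=1)
```

On this toy data the noisy feature misleads the nearest-neighbour stand-in, so `with_fs` is higher than `without_fs`; real datasets need not behave this way.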

METHODS AND MATERIALS
To implement the proposed work, a suitable data mining software tool is required first; the Weka tool is used. In addition, four datasets are considered for implementing the classification algorithms with the feature selection method.

Weka tool
The experiments are performed with the data mining tool Weka (Waikato Environment for Knowledge Analysis). This software provides a set of methods and algorithms that help users make better use of the available data and information, including feature selection methods and classification algorithms for data analysis.

Data set Description
In this research study, datasets from different domains, namely Breast Cancer, Glass, Iris and Weather, are used. The Breast Cancer dataset contains 9 attributes and 286 instances. The Iris dataset has 4 attributes and 150 instances. The Glass dataset contains 10 attributes and 214 instances. The fourth dataset, Weather, has 5 attributes and 14 instances. The datasets are described in Table 1.

Classification algorithms with feature selection
The classification algorithms Bayes Net, Naïve Bayes, Multilayer Perceptron, Logistic, J48 and Random Forest (RF) are applied to the four datasets (Breast Cancer, Glass, Iris and Weather) with and without feature selection. For feature selection, InfoGainAttributeEval is used as the attribute evaluator and Ranker as the search method. InfoGainAttributeEval evaluates the worth of an attribute by measuring the information gain with respect to the class.
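To make the ranking concrete, information gain can be computed as the class entropy minus the expected class entropy after splitting on an attribute. The sketch below is a plain re-implementation for nominal attributes, not Weka code. It assumes that the Weather data is the 14-instance nominal weather sample distributed with Weka, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the weighted entropy within each attribute value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(names, rows, labels):
    """Sort attributes by information gain, highest first (Ranker-style)."""
    gains = {
        name: info_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Assumed data: the nominal weather sample (outlook, temperature, humidity, windy).
NAMES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "FALSE"), ("sunny", "hot", "high", "TRUE"),
    ("overcast", "hot", "high", "FALSE"), ("rainy", "mild", "high", "FALSE"),
    ("rainy", "cool", "normal", "FALSE"), ("rainy", "cool", "normal", "TRUE"),
    ("overcast", "cool", "normal", "TRUE"), ("sunny", "mild", "high", "FALSE"),
    ("sunny", "cool", "normal", "FALSE"), ("rainy", "mild", "normal", "FALSE"),
    ("sunny", "mild", "normal", "TRUE"), ("overcast", "mild", "high", "TRUE"),
    ("overcast", "hot", "normal", "FALSE"), ("rainy", "mild", "high", "TRUE"),
]
LABELS = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

ranking = rank_attributes(NAMES, ROWS, LABELS)
# Ranking on this sample: outlook, humidity, windy, temperature.
```

Attributes at the bottom of the ranking are the candidates for removal before the classifiers are trained.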

RESULTS AND DISCUSSION
The classification algorithms are evaluated on the datasets, and the accuracy of each classifier is determined using performance measures after applying the feature selection method. The results of the classifiers with and without feature selection are given in Table 2 for the Breast Cancer dataset, Table 3 for the Iris dataset, Table 4 for the Glass dataset, and Table 5 for the Weather dataset. From the results in Tables 2 to 5, the accuracy of all classification algorithms improves when feature selection is applied. The true positive rates of these four classification algorithms with feature selection on the four datasets are shown in Figures 2 to 5. The figures compare the four classification algorithms and identify the most accurate one. Among these four classifiers, when feature selection is used, J48 is the best classifier for the Breast Cancer dataset, MLP and J48 are the best classifiers for the Iris dataset, RF is the best classifier for the Glass dataset, and MLP is the best classifier for the Weather dataset.

6. CONCLUSIONS
Since feature selection plays a vital role for classification in machine learning and irrelevant features affect the accuracy of the algorithms, only relevant features are taken for classification to obtain better predictions. This research work investigates the performance of different classifiers with a feature selection method on different datasets. The results show that classification with feature selection gives better accuracy. Moreover, depending on the nature of the domain and the number of attributes and instances in the dataset, different classifiers achieve the best accuracy on different datasets. In this study, J48 produced the highest accuracy for the Breast Cancer dataset, MLP and J48 for the Iris dataset, RF for the Glass dataset, and MLP for the Weather dataset. This paper helps researchers identify a suitable classification algorithm for their dataset. In future work, different feature selection methods can be applied to the classification algorithms to improve accuracy further.
Aleyani et al. proposed clustering-based feature selection and summarized various classification techniques for feature selection [5]. Domeniconi et al. explained local feature selection techniques and computational methods [6]. Brodley et al. summarized feature selection techniques for unsupervised learning [7]. Guyon et al. discussed various feature selection methods other than information gain and compared filter-based and wrapper-based approaches [8]. Liu et al. summarized the different techniques that support feature selection and knowledge discovery [9]. Mitra et al. proposed techniques for unsupervised feature selection across different domains and compared different similarity measures for clustering [11]. Abdullah et al. discussed feature selection and data mining classifiers and proposed ensemble techniques for classification that are helpful for decision making [12].

Figure 1: Architecture of the proposed methodology

Figure 2: True Positive Rate of classification algorithms with feature selection using the Breast Cancer dataset

Table 1: Data set description

Performance measures play a vital role in assessing classification algorithms. Precision, recall and accuracy are used to evaluate the classifiers. Precision is the fraction of relevant instances among the retrieved instances.
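The three measures can be computed from the counts of true and false positives and negatives. The sketch below assumes a two-class problem with one class designated as positive; the function name and the example labels are hypothetical and are not results from this study. Recall is the same quantity as the true positive rate.

```python
def binary_metrics(actual, predicted, positive):
    """Precision, recall and accuracy for one designated positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of true positives that were found (true positive rate).
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all instances classified correctly.
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels: 2 true positives, 1 false positive,
# 1 false negative, 1 true negative.
actual = ["yes", "yes", "yes", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no"]
metrics = binary_metrics(actual, predicted, positive="yes")
```

For multi-class datasets such as Iris or Glass, the same function can be applied once per class by treating that class as positive and all others as negative.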

Table 2: Results for the classifiers with and without feature selection (Breast Cancer)

Table 3: Results for the classifiers with and without feature selection (Iris)

Table 4: Results for the different classifiers with and without feature selection (Glass)

Table 5: Results for the different classifiers with and without feature selection (Weather)