A Novel Granularity Optimal Feature Selection based on Multi-Variant Clustering for High Dimensional Data

Clustering is especially difficult in high-dimensional data because a feature subset must be chosen from all the features present in categorical data sources. Subset selection is an effective way to reduce feature dimensionality in data mining and pattern identification; its main aims are to choose an optimal feature subset and to reduce redundancy. This paper describes a granularity-based feature selection approach for handling redundant and irrelevant features in high-dimensional sample data. We propose a Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) model and evaluate its performance. The model consists of two phases: in the first, a graph-theoretic grouping procedure divides features into clusters; in the second, a strongly representative feature is selected from each cluster to form the feature subset. The selected features are largely independent because they come from different clusters, so the proposed clustering has a high probability of producing independent and useful features. Optimal subset selection improves the accuracy of clustering and classification; applied to publicly available data sets, the proposed approach achieves better accuracy than traditional supervised evolutionary approaches.


Introduction
In recent years, many data-retrieval applications, such as gene-related text data, text categorization, and image retrieval, operate over many attributes or instances of features. With the development of data technology, large amounts of data are being collected, but the data itself does not directly yield reliable knowledge. Data mining therefore plays a central role in extracting similar patterns from large data sources, and a key challenge for mining approaches is to separate relevant from irrelevant data in applications with many types of attributes and instances. Because large data sources can contain thousands of attributes, feature selection (FS) is a basic preprocessing step for efficient data mining. Its aim is to identify an optimal feature subset that contains strongly relevant, closely matched data for making good decisions; feature selection can also improve comprehension of a specific domain when the features are expressive. By exploiting inherent similarities among features, feature selection helps to reduce the curse of dimensionality, and optimal feature selection maximizes the importance of the features retained from the data sources. Many feature selection approaches have been introduced for machine learning, and they are commonly classified into four categories: filter, wrapper, embedded, and hybrid methods.
Embedded methods incorporate feature selection as part of the training process of a specific machine learning algorithm and are often more efficient than the other approaches; artificial neural networks and decision-tree methods are typical examples. Wrapper methods use the predictive accuracy of a predetermined learning algorithm to judge the effectiveness of a selected feature subset; their accuracy is usually high, at greater computational cost. Filter methods are independent of any learning algorithm and generalize well, with low complexity but usually lower accuracy. Hybrid methods combine filter and wrapper methods to reduce the search space before applying the wrapper stage. Feature selection can be viewed from the perspective of either feature granulation or sample granulation. For granulation-based feature selection, a genetic algorithm (GA) is used because of its global search capability: each chromosome in the GA represents a candidate feature subset, and the number of selected features defines the size of the granularity. The granularity-based feature selection procedure is described in Figure 1. For sample granulation, a neighborhood rough set is used. Based on these procedures, we propose a Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) model. The model works in two phases: in the first, a graph-theoretic grouping procedure divides features into clusters; in the second, a strongly representative feature is selected from each cluster to form the feature subset. Features drawn from different clusters are largely independent, so the clustering strategy in NGFMCGA has a high probability of producing relevant, useful, and independent features. The proposed NGFMCGA approach was tested on publicly available text data sets; the experimental results show that, compared with other feature selection algorithms, the proposed approach selects only the optimal features and improves performance on well-known attributes.
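The chromosome encoding described above can be illustrated with a minimal sketch (the bit-string encoding is a common GA convention; the paper does not fix an exact representation, so this is an assumption):

```python
# Hypothetical sketch: encoding a feature subset as a GA chromosome.
# Each bit marks whether the corresponding feature is selected; the
# number of selected features is the granularity described above.

def decode_chromosome(chromosome):
    """Return the indices of selected features and the granularity."""
    selected = [i for i, bit in enumerate(chromosome) if bit == 1]
    return selected, len(selected)

chromosome = [1, 0, 1, 1, 0, 0, 1]   # 7 candidate features
features, granularity = decode_chromosome(chromosome)
```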

Review of Related Work
In this section we review different authors' views on feature selection from various data sources. Feature subset selection can be seen as the process of identifying and removing as many irrelevant and redundant features as possible. This is because (1) irrelevant features do not contribute to predictive accuracy, and (2) redundant features do not help in obtaining a better predictor, since they mostly provide information that is already present in other features. Among the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant ones, while others can remove irrelevant features while also handling the redundant ones. The traditional Fast clustering-based feature Selection algorithm (FAST) falls into the latter group.
In general, feature subset selection research has concentrated on searching for relevant features. A well-known example is Relief, which weights each feature according to its ability to separate samples of different classes using distance-based measures. However, Relief is ineffective at removing redundant features: two predictive but highly correlated features are likely both to be highly weighted. ReliefF extends Relief, enabling the method to work with noisy and incomplete data sets and to handle multi-class problems, but it still cannot identify redundant features.
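The Relief weighting idea can be sketched as follows: a feature gains weight when it separates a sample from its nearest neighbor of a different class ("nearest miss") more than from its nearest neighbor of the same class ("nearest hit"). This is a simplified sketch of the classic algorithm, not the paper's implementation:

```python
import random

def relief_weights(X, y, n_iter=20, seed=0):
    """Minimal Relief sketch: weight each feature by how well it
    separates a random sample from its nearest miss versus its
    nearest hit (binary-class, numeric features assumed)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: sqdist(X[i], X[j]))    # nearest hit
        m = min(misses, key=lambda j: sqdist(X[i], X[j]))  # nearest miss
        for f in range(d):
            w[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return w
```

Note that a redundant copy of a predictive feature would receive the same high weight, which illustrates the limitation discussed above.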

Basic Preliminaries
The basic preliminaries used in this research are described in this section.

Genetic Algorithm (GA)
The genetic algorithm is a well-known approach for global search over candidate feature subsets and is widely used for feature optimization. Its basic components are a coding format (a finite set of strings), a fitness metric for each candidate, a selection operator, genetic operators (crossover and mutation), and a parameter set. The general steps of a GA are: (a) generate N random individuals (candidate feature subsets) as the initial population and evaluate them; (b) calculate the fitness of each individual; (c) apply selection, crossover, and mutation to produce the next generation; (d) evaluate the fitness of each individual and record the best feature subset found; (e) terminate when a stopping criterion is met.
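Steps (a)-(e) can be sketched in a compact form. This is a generic GA loop under assumed operator choices (truncation selection, one-point crossover, bit-flip mutation), not the paper's exact procedure:

```python
import random

def genetic_feature_search(n_features, fitness, pop_size=20,
                           p_cross=0.8, p_mut=0.05, generations=30, seed=1):
    """Minimal GA sketch for steps (a)-(e). `fitness` maps a 0/1
    chromosome (feature mask) to a score to maximise (assumption)."""
    rng = random.Random(seed)
    # (a) random initial population
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):              # (e) fixed generation budget
        scored = sorted(pop, key=fitness, reverse=True)   # (b) evaluate
        parents = scored[:pop_size // 2]      # (c) truncation selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < p_cross:        # one-point crossover
                cut = rng.randrange(1, n_features)
                a = a[:cut] + b[cut:]
            # bit-flip mutation
            child = [1 - g if rng.random() < p_mut else g for g in a]
            children.append(child)
        pop = children
    return max(pop, key=fitness)              # (d) best subset found
```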

Granularity Neighborhood relates Rough-Set
In an information retrieval system, granularity measures how finely data is evaluated with respect to the available knowledge: the smaller the data granularity, the stronger the discriminating ability of the knowledge, and vice versa. Consider a classification task formalized as a decision system over a sample data set; the feature set is obtained from the samples, and feature selection decisions are made over neighborhood granules of the samples. All of these notions are used in the proposed implementation for efficient feature selection.
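The neighborhood granules and the rough-set approximation they induce can be sketched as follows. The radius, metric, and helper names here are assumptions for illustration; a smaller radius gives finer granules and therefore a larger, more certain lower approximation of each decision class:

```python
def neighborhood(X, i, delta):
    """Indices of samples within Euclidean distance `delta` of sample i
    (the neighborhood granule of sample i)."""
    xi = X[i]
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    return [j for j in range(len(X)) if dist(X[j], xi) <= delta]

def lower_approximation(X, y, delta, label):
    """Lower approximation of a decision class: samples whose whole
    neighborhood granule carries that label."""
    return [i for i in range(len(X))
            if y[i] == label
            and all(y[j] == label for j in neighborhood(X, i, delta))]
```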

Proposed NGFMCGA Procedure Model for Optimization

Basic Definitions
Feature subset selection aims to identify and remove redundant and irrelevant features. Different features carry different degrees of relevance to one another and to the class. With this in mind, we develop the Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) to effectively remove redundant and irrelevant features and obtain a good feature subset. The goal is to determine the size of the overall feature set under the above conditions; let P be the set of feature characteristics, where each feature is selected from the feature subset, as shown in Figure 3.

Threshold based Fitness Function
The fitness function governs the selection of the feature subset; the fitness value reflects the resulting classification accuracy. Based on the sample distribution, class-related differences are used to measure how well a subset of features is adapted: a selected feature subset should make within-class distances as small as possible and between-class distances as large as possible. Let P0 be the initial population; the convergence rate of the genetic algorithm then depends on the feature representation. The fitness difference of each individual is normalized with a geometric mean with respect to the convergence rate, so faster convergence corresponds to a larger convergence rate for FT and FM, with the convergence rate approaching a maximum of 1.
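The "small within-class, large between-class" criterion can be sketched as a separability ratio over the selected features. The exact normalized form used by the paper is not given, so this is a hedged stand-in:

```python
def subset_fitness(X, y, selected):
    """Sketch of a separability-style fitness: mean between-class
    distance divided by mean within-class distance, computed only
    over the `selected` feature indices (assumption)."""
    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in selected) ** 0.5

    within, between = [], []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            (within if y[i] == y[j] else between).append(dist(X[i], X[j]))
    if not within or not between:
        return 0.0
    # larger ratio => classes are better separated by this subset
    return (sum(between) / len(between)) / (sum(within) / len(within) + 1e-12)
```

A GA can maximise this score directly; an informative feature yields a much larger ratio than a noisy one.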

Algorithm 1: Optimal feature selection (OFS) procedure based on the genetic algorithm.
Based on the above discussion, feature selection is carried out by the granularity-based genetic algorithm procedure of Algorithm 1. Lines 4-9 of Algorithm 1 describe the initialization of the input parameters; fit() in lines 9-13 evaluates the fitness of each individual; and lines 13-22 evaluate the optimal feature selection after feature extraction. Because of the stochastic nature of the genetic algorithm, the proposed algorithm improves the time complexity by decreasing the number of iterations through parameters such as the population size P_size, the mutation rate P_m, the crossover rate P_c, and the length of the individuals L; the optimized time complexity is expressed in terms of these parameters.

Optimization of Granularity Feature Selection based on Neighborhood Genetic Calculation
In this section, a granularity radius is used to explore different attributes for granularity feature selection; the data sets for optimal feature set selection are taken as input to the fitness function. For classification-oriented feature selection over the interpreted feature subspace, feature granularity can improve dimensionality reduction across the features without requiring sample-granularity selection. The granularity feature optimization procedure is described in Algorithm 2.

Algorithm 2: Procedure to select the granularity-optimal feature subset (main while-loop; line 15 obtains the best individual; line 16 returns the optimal granularity-based feature subset).

According to the threshold fitness function, Algorithm 2 performs granularity-optimal feature selection: lines 1-4 give the basic definitions for granularity feature selection, fit2() in lines 6-13 calculates the fitness for granularity-optimal feature selection, and line 14 onward determines the time complexity of the algorithm. Because crossover and mutation act on the fitness of individuals and dimensionality is reduced across the different features, the proposed approach achieves efficient classification accuracy with the optimal features. The time complexity of the optimized granularity feature algorithm is therefore described in terms of the stages of processing multiple attributes with optimal granularity features.
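The outer granularity search of Algorithm 2 can be sketched as a loop over candidate radii, keeping the radius whose selected subset scores best. The candidate radii, the evaluation callback, and the function names here are all illustrative assumptions:

```python
def best_granularity(radii, evaluate):
    """Try each candidate neighborhood radius and return the one with
    the highest evaluation score, together with that score."""
    best_r, best_s = None, float("-inf")
    for r in radii:
        s = evaluate(r)   # e.g. fitness of the subset selected at radius r
        if s > best_s:
            best_r, best_s = r, s
    return best_r, best_s

# Hypothetical scores produced by an inner feature-selection run:
scores = {0.1: 0.62, 0.3: 0.81, 0.5: 0.74}
radius, score = best_granularity([0.1, 0.3, 0.5], scores.get)
```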

Experimental Evaluation
In this section, we describe the performance evaluation of the proposed approach in comparison with different feature subset selection methods over attributes of high-dimensional data.
To analyze the performance of the proposed approach from a comparative perspective, different sample-related features are selected randomly. The data sets are downloaded from the UCI repository with different parameter sequences: https://archive.ics.uci.edu/ml/datasets.php?format=&task=cla&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table. These data sets cover sample attributes of both high and low dimensionality, drawn from real-time statistical analyses of company data, as described in Table 1. The proposed approach is compared with traditional methods such as FCBF [3]. To assess performance on high-dimensional data sets, we report the classification accuracy for the selected features (|S|) and the computation cost t(s), and also compare precision, recall, and memory utilization, together with the time complexity of exploring correlations between features in high-dimensional data sources.

Results
The accuracy of classifying multi-granularity features for different attribute processing values is described in Table 2. As shown in Table 2, the proposed approach yields different accuracy values as different features are explored; compared with traditional approaches, it gives better and more efficient results for processing multi-feature, multi-label data with high-dimensional attribute relations.

Figure 4: Performance evaluation of accuracy compared with different approaches.

As shown in Figure 4, the proposed approach improves accuracy over differently labeled correlations between features for optimal multi-feature selection in high-dimensional data sources. Precision for granularity feature selection at different processing values is described in Table 3 for different attribute relations, and the performance of these values is plotted in Figure 5. As shown in Figure 5, the proposed approach gives a better attribute partition, with precision above 0.5-0.7, whereas the other approaches never exceed 0.5, remaining in the 0.3-0.5 range across the multi-dimensional data sets. Recall values for granularity feature exploration are described in Table 4: these are the correctly identified optimal features for the different data sets, exploring the optimal features among multi-dimensional features with different relations. The corresponding performance evaluation is shown in Figure 6.

Figure 7: Performance of time efficiency for the different data sets.

As the figures above show, the proposed approach gives better optimal feature selection across the multi-dimensional data sets in terms of time, accuracy, precision, and recall for granularity multi-feature selection.

Conclusion
In this paper, a granularity feature selection model, the Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA), is presented. The model consists mainly of genetic-algorithm-based granulation of feature subsets and neighborhood-based sample granulation over multiple features. For granularity feature optimization, a granularity-based genetic algorithm is presented with a fitness calculation and related parameters, which also improves the quality of the selected feature subsets. The proposed approach performs well on optimal subset feature selection and improves classification accuracy in identifying closely matched patterns on synthetic data sets. In future work, we plan to explore a new model for multi-feature optimal selection with multi-dimensional, multi-objective optimization in feature selection, and to further improve efficiency.