Differential Evolution based Cluster optimization for Multi valued data sets

Abstract: In data analysis, items are usually described by a set of characteristics called features, in which each feature holds a single value for each object. In practice, however, some features can hold more than one value, such as a person with several job descriptions, activities, phone numbers, skills, or mailing addresses. Such features are called multi-valued features, and they are mostly treated as null features when data is analysed with machine learning and data mining techniques. In this paper, a proximity function is described that measures the proximity between two objects with multi-valued features and is applied to clustering. This distance measure allows iterative measurement of the similarities among objects as well as among their characteristics. To select the most suitable multi-valued factors, we put forward a model that determines each factor's relative prominence for diverse data mining problems. The proposed model is an evolutionary strategy that uses Differential Evolution, with the degree of membership as the fitness function. The proposed clustering algorithm is called multi-valued attribute data cluster optimization based on Differential Evolution (MVA-DE). It can therefore be used with any cluster analysis mechanism to group similar data. The effectiveness of the model is shown through an experimental performance analysis. The results of the proposed model are compared with other clustering approaches, namely fuzzy c-means based Clustering of Multivalued Attribute Data (FCM-MVA) and K-Means with Tanimoto-based multi-valued data clustering. The findings demonstrate that our approach not only improves on the performance of the traditional similarity measure but also outperforms other clustering algorithms on


Overview
Clustering is a focal point for many researchers, particularly in connection with efficient feature selection. How a clustering method executes depends on the sampled subsets of features, so selecting appropriate features is significant because they carry the essential information of the data. As noted in [1], clustering is important for the accuracy of certain data mining algorithms, and its significance is most visible in high-dimensional data sets, because data mining techniques need considerable computational effort to handle many features. In existing data mining methods a dataset is represented as a table, and its features may be categorical or numerical attributes. Conventional methods perform poorly on realistic databases, because such databases often include attributes that can take several values at a time. For instance, a movie may be classified under several types at once, such as "horror", "romantic", "documentary", and "action". Depending on the database domain, the category of the attributes guides the mining procedure. Categorical attributes that can take multiple values are hardly helped by reducing the dimensionality of the attributes. Several recent works focus on finding efficient membership values for multi-valued attributes, but these values are not always suitable, because the optimal values found may be only weakly connected to the values of dissimilar attributes. The problem addressed here is therefore the assignment of an object to an optimal cluster when its attributes hold several values.
A few scholars have also used multi-valued attributes for clustering together with other procedures. In [2], attribute selection is explained using a diversified set of attributes with various domain ranges. However, these works do not explain the selection procedure when attributes take multiple values at a time. Hence, Multi-Relational Data Mining (MRDM) remains an open area for researchers. It encourages the development of effective techniques for databases that include multiple tables, as described in [3]. The process of MRDM and its related techniques are described and analysed in depth in [4], [5], [6], [7], [8], [9], [10] and [11].
To decrease redundancy in the key table of a database, the authors of [12] proposed generating a new table. That work focuses on a specific attribute category, namely attributes that hold multiple values in a dataset, and reflects the perspective of MRDM. Although numerous selection methods have been introduced to identify attributes with multiple values, only a few works analyse the significance of selecting those attributes. Further, [13] introduced a solution in which a multi-valued attribute with k feasible values is converted into k binary attributes. This allows existing feature selection techniques to be applied. However, the main limitation of this method is that it enlarges the dimensionality of the original data, so its performance may degrade as k increases. To overcome this challenge, this paper describes an approach for clustering data with multi-valued attributes. It also defines a membership measure based on the objects of the dataset attributes, which is used for cluster optimization.
The remainder of the paper is organized as follows. Section 2 outlines the existing literature. Section 3 describes the proposed approach and its optimization using membership. Section 4 presents the experimental results and the evaluation procedure of the proposed approach. Finally, Section 5 concludes the work and suggests directions for future research in this domain.

Review of Research Work
Mining data with multiple values is considerably more difficult than mining single-valued data in terms of processing overhead, redundancy, and implementation. Mining techniques degrade in performance on multi-valued data because the data is extremely dense. To overcome this complication, the works in [14], [15], [16], [17], [18] and [19] propose discretization techniques. By decomposing the range of a continuous attribute into a finite set of adjacent intervals, discretization decreases processing complexity. This technique is especially applicable to mining algorithms whose cost depends on data volume. It also handles categorical attributes effectively, as explained in [20] and [21]. Experimental results show benefits that include faster execution and improved accuracy of learning techniques, as explained in [22]. The work in [22] categorizes discretization approaches as supervised or unsupervised [23], [24], [25], [26], static or dynamic [27], local or global [26], splitting or merging, and direct or incremental. In direct methods, the number of intervals k must be decided in advance, and the continuous attribute is then divided into k intervals at once. Incremental approaches begin with a simple partition and then refine it progressively, although some of them require a termination criterion to stop the discretization.
In [28], constant-width and constant-frequency discretization approaches are proposed, which are unsupervised, global, direct, and static. Among the splitting and merging categories, the methods explained in [28], [29], [30], [31], [32] and [33] are considered effective splitting approaches. Notably, empirical studies indicate that CACC, described in [32], performs better than the others. Merging methods, in turn, apply a test to decide the point at which specific intervals should be merged. An efficient merging approach was introduced in [34]. Like other traditional approaches, this algorithm has limitations, including computational complexity and the need for the user to define several process parameters to carry out the merging procedure.
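The constant-width and constant-frequency discretization schemes mentioned above can be illustrated with a minimal Python sketch. The function names and the tie-handling are assumptions made for illustration and are not taken from [28].

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bin indices 0..k-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Assumption: ties are broken by input order, so equal values may
        # end up in neighbouring bins.
        bins[i] = min(rank * k // len(values), k - 1)
    return bins


values = [1, 2, 3, 10]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 2))
```

Both schemes are direct and static in the sense used above: k is fixed in advance and the intervals are computed in a single pass, without class labels.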
Giannotti et al. [35] suggested a clustering technique for transactional data that uses the k-means algorithm with the Jaccard similarity measure to cluster multi-valued attribute data, but the method converges weakly. Fuyuan Cao [36] suggested a clustering technique for set-valued data called the SV-k-modes algorithm, in which a similarity measure for two objects with multi-valued attributes is defined and a set-valued mode representation of cluster centers is proposed. Wenhao Shu [37] proposed a similarity measure on unlabeled objects. Subsequently, a feature selection method is designed, characterized by mutual information and incorporated in a shrinking universe to speed up the screening of features. Guha et al. [38] offered the ROCK algorithm, an agglomerative hierarchical clustering method that does not scale to large data. It is furthermore hard to obtain interpretable cluster representatives from hierarchical clustering results. F. Giannotti and C. Gozzi [39] describe a model for partitioning and managing transactions, that is, discrete data of variable size. The authors adapt the distance concept used in the K-Means method to reflect the proximity of transactions, and redefine the cluster centroid concept accordingly.
Celebi et al. [40] provided an analysis of clustering strategies addressing the initialization problem, based on the most common initialization methods for k-means clustering. Throughout that study, a large number of data sets were used to evaluate clustering quality. However, the k-means method has other drawbacks. Ghosh and Dubey [41] compared the k-means and fuzzy c-means (FCM) clustering methods on their effectiveness, to support the choice of the right data analysis method. The clustering algorithm considers the data in terms of the positions of the different input data objects. FCM is an unsupervised clustering method that has been applied in agricultural, astronomical, biological, environmental, and medical imaging domains, among other classification and clustering areas.

Classification of Information through Attributes with Multiple Values
In data mining, classification is one of the challenging tasks; its primary focus is to approximate the class of an object from the related class attributes. To evaluate the similarity between objects, a distance measure is employed. Euclidean distance [42] is an efficient distance measure used in this kind of classification algorithm, and it deals with numerical attributes. For categorical attributes, the distance is taken as one (1) for differing values and zero (0) for identical values. If the classification involves attributes that hold multiple values, a specific distance measure is needed, one that can compare the sets of values of two objects. Hence, this contribution considers measures that evaluate the distance between sets of instances. In particular, the measures in [43], [44] and [3] are used for the distance between multi-valued attributes. Because these three distance metrics give the same outcomes, this manuscript uses Tanimoto [44]. The Tanimoto distance between two sets X and Y is denoted DT(X, Y) and is computed by the following formula:

$$D_T(X, Y) = \frac{|X| + |Y| - 2\,|X \cap Y|}{|X| + |Y| - |X \cap Y|}$$
Notably, because it depends only on the size of the intersection of the sets X and Y, this distance metric can be used for discrete-valued data. For continuous values, xi and yj are treated as identical if their difference is less than a pre-defined threshold.
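The Tanimoto distance defined above can be written as a short Python function. This is a minimal sketch; the treatment of two empty sets is an assumption, since the text does not cover that case.

```python
def tanimoto_distance(x, y):
    """Tanimoto distance between two collections treated as sets."""
    x, y = set(x), set(y)
    inter = len(x & y)
    denom = len(x) + len(y) - inter  # equals |X ∪ Y|
    if denom == 0:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return (len(x) + len(y) - 2 * inter) / denom


# Two movies that share one genre out of three distinct genres.
print(tanimoto_distance({"horror", "action"}, {"action", "romantic"}))
```

The distance is 0 for identical sets and 1 for disjoint sets, which is why it suits attributes whose values are sets of discrete items.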

Proximity measure:
Determining the most appropriate values in a multi-valued feature context requires membership-based abstraction, which mines characteristics according to their degree of participation rather than the number of times they occur. The working of our method is outlined in the following sections. The similarity between multi-valued objects used during clustering depends on the multi-valued characteristics. Compared with current metrics, it allows more than one point of comparison when finding similarity for clustering. In this article the similarity between objects is found as follows. The similarity of two multi-valued feature values X and Y is denoted DMA(X, Y) and is determined from the distances between the two sets of elements, that is, by taking into account all possible pairs of values from X and Y. It is computed by aggregating all pairwise distances, as given in the following mathematical model.
Here $d(x_i, y_j)$ is the distance between any pair of values drawn from X and Y, defined as given below.

The proximity between two fixed unordered data vectors, each represented by a set of features, is found from the similarity between their individual dimensions. The dimension similarity is calculated using DMA(X, Y), from which the proximity between the two vectors is obtained.
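One possible reading of the pairwise aggregation described above is sketched below in Python. The specific choices are assumptions made for illustration, not the paper's equations: the value distance is 0/1 (with a threshold for numeric values), DMA is taken as the mean of all pairwise distances, and the record-level proximity is the mean of the per-attribute DMA values. The function names are hypothetical.

```python
def value_distance(a, b, threshold=0.0):
    """Assumed 0/1 distance between two single values."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        # Numeric values count as equal when they differ by at most the threshold.
        return 0.0 if abs(a - b) <= threshold else 1.0
    return 0.0 if a == b else 1.0


def dma(x_values, y_values, threshold=0.0):
    """Assumed aggregate: mean distance over all pairs (x_i, y_j)."""
    pairs = [(a, b) for a in x_values for b in y_values]
    if not pairs:
        return 1.0  # assumption: an empty value set is maximally distant
    return sum(value_distance(a, b, threshold) for a, b in pairs) / len(pairs)


def record_proximity(rec_x, rec_y, threshold=0.0):
    """Assumed record-level proximity: mean of per-attribute DMA values."""
    per_attribute = [dma(xv, yv, threshold) for xv, yv in zip(rec_x, rec_y)]
    return sum(per_attribute) / len(per_attribute)


# Each record is a list of attributes; each attribute is a collection of values.
paper_a = [["ml", "clustering"], [2019]]
paper_b = [["clustering"], [2019]]
print(record_proximity(paper_a, paper_b))
```

Under this averaging assumption, two identical multi-valued sets with more than one element do not receive distance 0, so a different normalisation may be preferable in practice.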

Fuzzy C-Means (FCM) Algorithm:
Let $X$ be a data set with $N$ objects, each characterized by $P$ attributes, where $X = \{X_1, X_2, \dots, X_N\}$ and each $X_j$ is represented by $X_j = \{x_{j,1}, x_{j,2}, \dots, x_{j,P}\}$; the data set can therefore be represented by an $N \times P$ matrix. Let the data set $X$ be partitioned into $C$ fuzzy clusters by a fuzzy clustering algorithm, with the fuzzy partition represented by a matrix $U$ in which each element $\mu_{ij}$ is the degree of membership of the object $X_j$ in cluster $i$, with a value between 0 and 1. Fuzzy C-Means (FCM) is based on minimizing the objective function given below, which depends on $X$, a fuzzy partition $U$ of the data set, and the set of centroids $V$.

$$J_m(U, V : X) = \sum_{i=1}^{C} \sum_{j=1}^{N} (\mu_{ij})^m \, d^2(X_j, V_i)$$

Here $V = \{V_1, V_2, \dots, V_C\}$, and $V_i$ is the centroid of cluster $i$ that is to be determined. The fuzziness of the clusters is determined by the fuzzy index $m \in (1, \infty)$. $d^2(X_j, V_i)$ is the distance between $X_j$ and $V_i$, which is the inner product metric. The trivial solution is excluded by imposing the following conditions on $U$:

$$\sum_{i=1}^{C} \mu_{ij} = 1, \quad j = 1, \dots, N; \qquad 0 < \sum_{j=1}^{N} \mu_{ij} < N, \quad i = 1, \dots, C.$$

The new centroids $\hat{V}_i$ are computed by

$$\hat{V}_i = \frac{\sum_{j=1}^{N} (\mu_{ij})^m X_j}{\sum_{j=1}^{N} (\mu_{ij})^m}$$

Also, the degree of membership $\mu_{ij}$ is updated to $\hat{\mu}_{ij}$ according to

$$\hat{\mu}_{ij} = \left[ \sum_{k=1}^{C} \left( \frac{d(X_j, \hat{V}_i)}{d(X_j, \hat{V}_k)} \right)^{2/(m-1)} \right]^{-1}$$

Step 5: if $|\mu_{ij} - \hat{\mu}_{ij}| < \varepsilon$, stop; otherwise go to step 4, where $\varepsilon \in (0, 1)$ is the termination threshold. The FCM algorithm lets each object belong to every cluster to the degree given by its membership value $\mu_{ij}$. Finally, the algorithm assigns each object to the cluster in which its membership is maximal.

To use the FCM algorithm for multi-valued data, the following construction is made. Let $X = \{X_1, X_2, \dots, X_N\}$ be the set of $N$ multi-valued data objects. First, the method for measuring the distance between a cluster centroid and a datum is defined, along with the method for updating the cluster centroid at each iteration. The distance between a centroid $V_i$ and a multi-valued data point $X_j$ is the similarity measure defined above in Eq. (3). Given the cluster centroid $V_i = \{v_{i,1}, v_{i,2}, \dots, v_{i,P}\}$, each $v_{i,k}$ for $1 \le k \le P$ is updated according to the type of the attribute. If the attribute $A_k$ is numerical, $v_{i,k}$ is updated as given below.
For a categorical attribute $A_k$, the centroid value $v_{i,k}$ is updated as given below.
To apply the Fuzzy C-Means algorithm to multi-valued attributes (FCM-MVA), Eq. (5) is replaced by Eq. (8) to obtain the membership of the objects, and Eq. (6) is replaced by Eq. (9) or Eq. (11) to obtain the updated centroids of the clusters.
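The update step of fuzzy c-means can be sketched in plain Python as follows. The sketch shows only the textbook FCM rules (membership update from a distance matrix, and a membership-weighted mean for a numeric attribute); the paper's multi-valued variants in Eqs. (8), (9) and (11) are not reproduced here. The distance matrix is supplied by the caller, so any proximity measure can be plugged in. Function names and the handling of zero distances are assumptions.

```python
def update_memberships(dist, m=2.0):
    """dist[i][j] is the distance between centroid i and datum j.

    Returns mu, where mu[i][j] is the membership of datum j in cluster i.
    """
    c, n = len(dist), len(dist[0])
    mu = [[0.0] * n for _ in range(c)]
    power = 2.0 / (m - 1.0)  # requires m > 1
    for j in range(n):
        zero = [i for i in range(c) if dist[i][j] == 0.0]
        if zero:
            # Assumption: a datum that coincides with one or more centroids
            # shares its full membership equally among them.
            for i in zero:
                mu[i][j] = 1.0 / len(zero)
            continue
        for i in range(c):
            mu[i][j] = 1.0 / sum((dist[i][j] / dist[k][j]) ** power for k in range(c))
    return mu


def update_numeric_centroid(mu_i, values, m=2.0):
    """Membership-weighted mean of one numeric attribute for one cluster."""
    weights = [u ** m for u in mu_i]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Two clusters, two data points.
mu = update_memberships([[1.0, 3.0], [1.0, 1.0]])
print(mu)
print(update_numeric_centroid([1.0, 0.0], [2.0, 10.0]))
```

Each column of the returned matrix sums to 1, which is the constraint that rules out the trivial solution of the objective function.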

Differential Evolution:
The differential evolution (DE) method [44] has been documented as a robust optimization technique among evolutionary procedures. DE is roughly similar to the genetic algorithm (GA) [45], but differs from GA in how the new population (new genetic variants) is formed. Parent and child chromosomes are evaluated in terms of fitness: if the child chromosomes are fitter, they survive and the parents are discarded; if the parent chromosomes are fitter, the child chromosomes do not survive. A parent chromosome is replaced only by a fitter child chromosome. In other words, either the parents or the fittest of all children survive, whichever is fitter. The various fitness mechanisms and crossover methods adopted by different DE strategies are illustrated in contemporary literature [46], [47], [48], [49]. Research on DE is surveyed in [50].

Creation of Initial Clusters:
Applying FCM-MVA clustering as discussed earlier allows a record to belong to one or more clusters, namely those for which the membership of the record is greater than the membership threshold, which is usually greater than or equal to 0.3. Consequently, the center of one cluster may be a record of another cluster.
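The threshold-based formation of overlapping initial clusters can be sketched as below. The function name and the layout of the membership matrix (`mu[i][j]` for cluster i and record j) are assumptions for illustration; the default threshold of 0.3 follows the value mentioned above.

```python
def initial_clusters(mu, threshold=0.3):
    """Return, for each cluster, the indices of records whose membership
    reaches the threshold. A record may appear in several clusters."""
    return [[j for j, u in enumerate(row) if u >= threshold] for row in mu]


mu = [
    [0.60, 0.35, 0.10],  # memberships of records 0..2 in cluster 0
    [0.40, 0.65, 0.90],  # memberships of records 0..2 in cluster 1
]
print(initial_clusters(mu))
```

Records 0 and 1 land in both clusters here, which is the overlap that the subsequent optimization step works on.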

Optimization of Clusters using DE (MVA-DE)
The initial clusters are treated as a set of input chromosomes, and Differential Evolution is performed on each set of chromosomes, producing a pair of new chromosomes (new clusters). Among the input and resulting chromosomes, the fittest pair survives. The following subsection describes the fitness function used in the Differential Evolution process.

Algorithm for cluster optimization using DE:
For each given pair of clusters, find the sum of the membership values of all objects in both clusters, as described in section 3.2. Then take the pair of clusters with the highest sum of membership values as the optimal pair among all the given pairs.
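The pair fitness described above can be sketched as follows. Representing each cluster as a dictionary that maps a record to its membership degree is an assumption made for illustration, as are the function names.

```python
from itertools import combinations


def pair_fitness(cluster_a, cluster_b):
    """Sum of the membership values of all objects in both clusters."""
    return sum(cluster_a.values()) + sum(cluster_b.values())


def best_pair(clusters):
    """Indices of the pair of clusters with the highest fitness."""
    return max(
        combinations(range(len(clusters)), 2),
        key=lambda p: pair_fitness(clusters[p[0]], clusters[p[1]]),
    )


clusters = [
    {"r1": 0.9, "r2": 0.4},
    {"r2": 0.6, "r3": 0.8},
    {"r4": 0.5},
]
print(best_pair(clusters))
```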
Optimization of Clusters
Upon completion of the initial cluster formation, sort the records of each cluster in descending order of their degree of membership, then perform differential evolution to retain the clusters with maximal fitness, as given in the following algorithm.
Upon completion of the algorithm, the resulting set contains the optimal clusters derived from the multi-valued dataset. To obtain clusters with unique entries, each record is allowed to be part of only one cluster, namely the cluster for which it has the maximal membership degree.
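The crossover and survival steps described above can be sketched in Python under several assumptions: each cluster is a list of records already sorted by descending membership, a record shared by both parents serves as the crossover point, duplicates in a child are dropped, and the fitness is passed in as a callable that re-evaluates a pair of clusters. All function names are hypothetical, and the sketch is not the paper's exact procedure.

```python
def crossover(parent_a, parent_b, pivot):
    """Swap the tails of two sorted clusters at a record they share."""
    ia, ib = parent_a.index(pivot), parent_b.index(pivot)
    head_a, tail_a = parent_a[:ia], parent_a[ia:]
    head_b, tail_b = parent_b[:ib], parent_b[ib:]
    # dict.fromkeys removes duplicate records while keeping their order.
    child_1 = list(dict.fromkeys(head_a + tail_b))
    child_2 = list(dict.fromkeys(head_b + tail_a))
    return child_1, child_2


def evolve_pair(parent_a, parent_b, fitness):
    """Keep the fittest pair among the parents and all crossover children."""
    candidates = [(parent_a, parent_b)]
    for pivot in [r for r in parent_a if r in parent_b]:
        candidates.append(crossover(parent_a, parent_b, pivot))
    return max(candidates, key=lambda pair: fitness(*pair))


def assign_unique(mu):
    """For each record j, the index of the cluster with maximal membership."""
    return [max(range(len(mu)), key=lambda i: mu[i][j]) for j in range(len(mu[0]))]


print(crossover(["r1", "r2", "r3"], ["r4", "r2", "r5"], "r2"))
print(assign_unique([[0.60, 0.35, 0.10], [0.40, 0.65, 0.90]]))
```

Note that simply swapping tails leaves the plain sum of stored membership values unchanged, so the fitness callable is assumed to recompute memberships against the child clusters; otherwise the parents would always tie with their children.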

Simulation Study Phase and Efficiency Observations
This section presents the empirical studies on the datasets, the evaluation procedures, and the results of the proposed approach. To assess the significance of the proposed clustering technique MVA-DE, experiments were also carried out with K-means clustering using the Tanimoto distance measure. The Tanimoto distance between two sets X and Y is denoted D(X, Y) and is computed by the following formula:

$$D = \frac{|x_i| - |y_j|}{|x_i| + |y_j|}$$

For the same purpose, experiments were also carried out with FCM-MVA, described above. The methods were implemented on a machine with an i5 processor and 4 GB of RAM. The scripts for measuring the results on the resulting clusters were written in Python.

The Dataset
This section describes the structure and properties of the real and synthetic datasets used in the experimental study. The real dataset is CORA [52], and the synthetic dataset is generated by extending the structure and volume of the CORA dataset.

Real Dataset
The experiments use the CORA [52] dataset, which includes 2,708 data records and is widely used in research. Each record is a scientific publication belonging to one of seven categories: reinforcement learning, case-based reasoning, probabilistic methods, rule learning, neural networks, genetic algorithms, and theory. Each record is described by entries over a vocabulary of 1,433 unique words, which are referred to as attributes. The two attributes that can hold multiple values are the citing and the cited manuscripts.
Each CORA document includes, as a multi-valued set for these attributes, a subset of the 5,429 citation links. The accuracy and performance of the proposed approach are determined using cluster evaluation parameters including cluster purity, inverse purity, and the cluster harmonic mean (HM). To set this up, the data files are selected by topic and serve as the knowledge base. Clustering these files into corpora then supports the evaluation of the clusters according to the selected parameters.

Synthetic Dataset
The synthetic dataset is generated from the original CORA dataset by adding attributes labeled keywords and indexing. In addition, around 2,000 records are added to the original dataset. With these modifications the total number of records becomes 4,708; the number of simple attributes remains 1,433, while the number of multi-valued attributes increases from 2 to 4.
The metrics purity, inverse purity, and the cluster harmonic mean (HM) play a prominent role in cluster evaluation. The purity of a cluster is the frequency of the dominant category in that cluster [54]. Purity penalizes noise in a cluster, but it does not reward grouping records of the same category together. For instance, if each record is placed in its own cluster, every cluster receives the maximum purity value. Inverse purity is therefore also needed: it focuses on the cluster with the highest recall for each category, and so rewards grouping records of the same category together.
Conversely, a single cluster containing every input record gives the highest inverse purity, because this parameter does not penalize mixing records from different categories. The HM of the clusters is therefore considered in addition to these two parameters. The HM combines purity and inverse purity by matching every category with the cluster that has the highest combined precision and recall [55], [56], [57], termed the F-measure.
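The three evaluation metrics discussed above can be sketched in Python as follows. The sketch uses the common definitions of purity, inverse purity, and the category-weighted F-measure, and assumes non-overlapping clusters in which every record has exactly one category label; it is not necessarily the exact formulation used in [54] to [57].

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of records that belong to the dominant category of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(labels[r] for r in c).values()) for c in clusters if c) / n


def inverse_purity(clusters, labels):
    """For each category, the largest share of it found in a single cluster."""
    n = sum(len(c) for c in clusters)
    categories = {labels[r] for c in clusters for r in c}
    total = 0
    for cat in categories:
        total += max(sum(1 for r in c if labels[r] == cat) for c in clusters)
    return total / n


def f_measure(clusters, labels):
    """Category-weighted best harmonic mean of precision and recall."""
    n = sum(len(c) for c in clusters)
    category_sizes = Counter(labels[r] for c in clusters for r in c)
    score = 0.0
    for cat, size in category_sizes.items():
        best = 0.0
        for c in clusters:
            hits = sum(1 for r in c if labels[r] == cat)
            if hits:
                precision, recall = hits / len(c), hits / size
                best = max(best, 2 * precision * recall / (precision + recall))
        score += size / n * best
    return score


labels = {1: "a", 2: "a", 3: "b", 4: "b"}
clusters = [[1, 2, 3], [4]]
print(purity(clusters, labels), inverse_purity(clusters, labels), f_measure(clusters, labels))
```

The two degenerate cases mentioned in the text can be checked directly: one cluster per record gives purity 1, and a single cluster holding all records gives inverse purity 1, which is why the F-measure is reported alongside both.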

Statistical and Empirical Study of Proposed Work
The proposed solution optimizes the clusters formed from the dataset documents and their multi-valued features, as shown by the high F-measure of those clusters. The purity of each detected cluster is likewise high. Table 1 gives the statistics of the real dataset used in the experimental analysis, and Table 2 gives the outcomes of the clustering techniques applied to the real dataset CORA. These results are shown in Figure 1. To further demonstrate the importance of the suggested approach, k-means clustering is also applied to every document together with its multi-valued attributes, which improves the performance of existing frequency models. The proposed approach achieves the best purity and F-measure values, which are better than those obtained with the earlier methods. The figures below depict the purity and F-measure of the different clusters. The accuracy observed for all the approaches is shown in Figure 4; it represents the proportion of records in an evaluated cluster that truly belong to it. The same assessment is carried out on the synthetic dataset: its statistics are given in Table 3, and the performance metric values obtained from the proposed and the other clustering techniques on the synthetic dataset are given in Table 4. These results are shown in Figure 5.

MVA-DE
The results for the synthetic data show a clear performance advantage for the proposed clustering technique MVA-DE. The cluster purity, accuracy, and cluster harmonic mean observed for MVA-DE are higher than those of k-means clustering with the Tanimoto distance and of FCM-MVA. The cluster-level assessment of these three metrics is shown in Figure 6 (cluster purity), Figure 7 (cluster harmonic mean), and Figure 8 (cluster accuracy). The cluster-level metric values for the proposed MVA-DE are stable and exceed the values of the same metrics for the other two clustering processes.

Conclusion
This contribution proposes an approach for clustering data with multi-valued attributes. The model is an evolutionary strategy that uses Differential Evolution to cluster such data, with the degree of membership as the fitness measure. In contrast to the selection methods of existing approaches, this paper clusters the data by selecting membership values based on the dataset transactions. Clusters are formed on the basis of membership: the values of any tuple are determined through the membership of that tuple with respect to the cluster to which it belongs. The results indicate that the proposed approach selects values for multi-valued features more effectively than existing approaches.
For the empirical study, a real dataset, CORA [52], and a synthetic dataset generated by extending CORA were employed. Cluster performance metrics such as purity, F-measure, and accuracy were used. The results of the empirical study encourage further research in several directions, such as using membership in other approaches and devising additional effective models for selecting significant values of attributes that hold multiple values. Finally, heuristic scales could also be deployed for selecting optimized clusters for these attributes.

The attribute-value representation of $X_j$ is

$$X_j = [A_1 = x_{j,1}] \wedge [A_2 = x_{j,2}] \wedge \dots \wedge [A_P = x_{j,P}]$$

where $x_{j,k} \in \mathrm{DOM}(A_k)$ for $1 \le k \le P$. The objective of the FCM algorithm for multi-valued data (FCM-MVA) is to cluster the data set X into C clusters by minimizing the function $J_m(U, V : X)$:

$$J_m(U, V : X) = \sum_{i=1}^{C} \sum_{j=1}^{N} (\mu_{ij})^m \, d(V_i, X_j)$$

subject to

$$\sum_{i=1}^{C} \mu_{ij} = 1, \quad j = 1, \dots, N; \qquad 0 < \sum_{j=1}^{N} \mu_{ij} < N, \quad i = 1, \dots, C,$$

where $\mu_{ij}$ is the membership degree of data $X_j$ to the $i$-th cluster, given in Eq. (8), and is an element of the $C \times N$ pattern matrix $U = [\mu_{ij}]$. $V = \{V_1, V_2, \dots, V_C\}$ consists of the centroids of the fuzzy clusters. Centroid $V_i$ is represented as $\{v_{i,1}, v_{i,2}, \dots, v_{i,P}\}$, and the parameter $m$ controls the fuzziness of the membership of each datum. The fuzzy k-means algorithm is thus extended to cluster multi-valued data through a fuzzy c-means-type procedure.

Figure 1: Clustering results on the real dataset

Figure 4: Rate of accuracy for the different clusters resulting from all methods

Figure 5: Clustering results on the synthetic dataset

Figure 6: Cluster purity observed for each cluster from the proposed and the other two clustering models

Figure 7: Cluster accuracy observed for each cluster from the proposed and the other two clustering models

Each data object $X_j$ ($1 \le j \le N$) of the set $X = \{X_1, X_2, \dots, X_N\}$ of $N$ multi-valued data is defined by a set of attributes $\{A_1, A_2, A_3, \dots, A_P\}$, in which each attribute $A_k$ is either single-valued or multi-valued. Each $A_k$ has a domain of values denoted $\mathrm{DOM}(A_k) = \{a_k^1, a_k^2, \dots, a_k^{n_k}\}$, where $n_k$ is the number of distinct values of attribute $A_k$ for $1 \le k \le P$. If $A_k$ is a single-valued attribute, each $a_k^i$ ($1 \le i \le n_k$) is a set containing a single value; if $A_k$ is a multi-valued attribute, each $a_k^i$ ($1 \le i \le n_k$) is a set of multiple values. A domain $\mathrm{DOM}(A_k)$ is finite and unordered. Let $X_j$ be denoted by $\{x_{j,1}, x_{j,2}, \dots, x_{j,P}\}$; thus $X_j$ can be logically represented as a conjunction of attribute-value pairs.

Let CS be the set of all clusters formed so far, and let N be the set of newly formed clusters after each evolution of the DE algorithm.

While (CS ∩ N ≠ CS)
Begin
    CS = N
    For each new cluster {c_i | c_i ∈ N}
    Begin
        For each new cluster {c_j | c_j ∈ N, j ≠ i}
        Begin
            T = {t | t ∈ c_i ∩ c_j}    // all common transactions (crossover points) of clusters c_i and c_j
            S = ∅    // an empty set to store the new chromosomes generated by crossover
            Add the pair (c_i, c_j) to S    // move the parent chromosomes (clusters) to the set S
            For each crossover point {t_k | t_k ∈ T}, where 1 ≤ k ≤ |T|
            Begin
                c_i^p = {t ∈ c_i | t precedes t_k}    // subset of c_i whose tuples are predecessors of t_k
                c_i^s = {t ∈ c_i | t does not precede t_k}    // subset of c_i whose tuples are successors of t_k
                c_j^p = {t ∈ c_j | t precedes t_k}    // subset of c_j whose tuples are predecessors of t_k
                c_j^s = {t ∈ c_j | t does not precede t_k}    // subset of c_j whose tuples are successors of t_k
                n_1 = c_i^p ∪ c_j^s
                n_2 = c_j^p ∪ c_i^s
                Add the pair (n_1, n_2) to S    // move the pair of child chromosomes (clusters) to the set S
            End
            Find the fitness of each entry in S as described in sec 3.4.1
            Replace the pair of clusters (c_i, c_j) in N by the pair of clusters in S that has maximum fitness.
        End
    End
End

Table 1: Statistics of the real dataset

Table 2: Outcomes of the clustering techniques applied to the real dataset CORA

Table 3: Statistics of the synthetic dataset

Table 4: Outcomes of the clustering techniques applied to the synthetic CORA dataset