PxEBCA: Proximity Expansion Based Clustering Algorithm

Abstract: Cluster analysis is one of the main techniques for analysing data. It detects groups of similar objects without requiring any grouping criteria to be specified in advance. Detecting clusters is challenging when the clusters vary in size, density and shape. DBSCAN can find arbitrarily shaped clusters along with outliers, but it cannot handle clusters of different density. In this paper we propose PxEBCA, a new density-based clustering method that discovers clusters of arbitrary shape in datasets with varying density. The effectiveness and efficiency of PxEBCA were evaluated experimentally on synthetic data. The results demonstrate that PxEBCA is significantly more effective in discovering clusters of arbitrary shape with varying densities.

DBSCAN [1] relies on two parameters: 1) Eps, the radius of the neighbourhood around a data point p, and 2) MinPts, the minimum number of data points that must lie in that neighbourhood for it to be identified as part of a cluster. Using these two values, the algorithm discovers clusters through the concepts of density reachability and density connectivity. Because density reachability chains from point to neighbouring point rather than measuring distance to a single centre, the algorithm can discover clusters of arbitrary shape. All data points of the dataset are categorized as i) core points, ii) border points or iii) noise points. The limitation of DBSCAN is that it cannot find clusters of varying density.
OPTICS (Ordering Points To Identify the Clustering Structure) [2] improves upon DBSCAN by generalizing its technique: instead of committing to a single Eps, it produces an ordering of the points that represents the clustering structure over a range of Eps values. This allows it to find clusters with varying density. It also assigns each point a score based on the distance to its closest point, which can be used to identify outliers. Directions for further research include improving the efficiency of hypersphere range queries in high-dimensional spaces for which no index structures exist, and extracting cluster information in sparse regions of a dataset, since the method is at its best in dense areas.
DENCLUE [3] is based on kernel density estimation and aims to find dense regions. It was mainly developed to classify large multimedia databases that hold high-dimensional data and contain a large amount of noise.
The authors of [4] propose a density-based clustering algorithm that enhances DBSCAN by first partitioning the dataset with CLARANS, so that the search space is reduced to one segment at a time instead of the entire dataset. This speeds up clustering relative to the original DBSCAN algorithm.
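To make the core, border and noise categories of DBSCAN concrete, the following minimal sketch labels each point of a small dataset. The function name `classify_points`, the sample coordinates and the convention that a point counts as a member of its own Eps-neighbourhood are assumptions made for illustration; the sketch covers only the point categorisation, not the full cluster expansion of [1].

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' in the DBSCAN sense."""
    n = len(points)
    # Eps-neighbourhood of every point (assumption: the point itself is included).
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    # A core point has at least min_pts points in its Eps-neighbourhood.
    core = {i for i in range(n) if len(neigh[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neigh[i]):
            labels.append("border")   # not core, but within Eps of a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical sample: four evenly spaced points and one distant point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(classify_points(pts, eps=1.0, min_pts=3))
```

With Eps = 1 and MinPts = 3, the two inner points are core points, the two end points of the chain are border points and the distant point is noise. A single global Eps of this kind is what prevents DBSCAN from handling groups whose spacing differs by an order of magnitude.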
VDBSCAN [5] arrives at values for the parameters Eps and MinPts by computing, for each point, the distance to its k-th nearest neighbour (called k-dist) and then looking for sharp changes in these distances. This allows it to partition the given dataset and obtain multiple Eps values, from which it generates multiple clusters with different densities. The challenge is that the usefulness of k-dist depends entirely on the characteristics of the dataset.
ST-DBSCAN [6] is an improved density-based clustering algorithm that can find groups according to the non-spatial, spatial and temporal values of the objects. Its authors suggest running the algorithm in parallel to improve performance and finding more useful heuristics for choosing the input parameters Eps and MinPts.
The BRIDGE algorithm [7] combines the procedures of the K-means and DBSCAN algorithms so that each overcomes the limitations of the other. It enables DBSCAN to deal with extremely large datasets and also removes noisy points, improving the results of K-means. It performs K-means first and then density-based clustering, which helps to set the density threshold parameter properly. This approach makes it faster and computationally cost effective.
Density-based clustering methods allow the identification of arbitrary, not necessarily convex, regions of data points that are densely populated. The number of clusters does not need to be specified beforehand; a cluster is defined as a connected region that exceeds a given density threshold. The LSDBC algorithm [8] works on the technique of local scaling and has two input parameters: 1) k, used to order points according to the distance to their k-th neighbour, and 2) α, used to determine the boundary of the current cluster expansion based on its density. The local scaling technique separates clusters using local statistics of the points. These parameters indicate how dense the region around each point is. Beginning with the higher-density regions, the algorithm connects points of dense regions until the density falls below the threshold.
The UDBSCAN algorithm [9] is designed to work on uncertain objects, that is, objects having attributes whose precise value cannot be determined. This may be due to various reasons, including data acquisition or a property of the object itself. The UDBSCAN algorithm uses a deviation function that approximates the value of such an attribute and creates clusters using an associated probability density function.
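The k-dist quantity used by VDBSCAN [5] can be illustrated with a short sketch. It computes the distance from each point to its k-th nearest neighbour, sorts these values and reports where the largest jump occurs. Using the single largest gap as the "sharp change" is an assumption made here for illustration, as are the function names and the sample data; [5] may detect the change differently.

```python
import math

def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)

def sharpest_jump(values):
    """Index i that maximises values[i + 1] - values[i] (assumed heuristic)."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    return max(range(len(gaps)), key=gaps.__getitem__)

# Hypothetical sample: a dense group (spacing 1) and a sparse group (spacing 10).
points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (100, 0), (110, 0), (120, 0), (130, 0)]
kd = k_dist(points, k=3)
cut = sharpest_jump(kd)
print(kd)
print("candidate Eps values:", kd[cut], "and", kd[-1])
```

On this sample the sorted k-dist values separate into a low plateau for the dense group and a high plateau for the sparse group, which is the situation in which one Eps per density level becomes visible.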

PxEBCA: Proximity Expansion Based Clustering Algorithm
The key concept of PxEBCA is the expansion of a neighbourhood by applying the proximity expansion parameter (PEP) to a reference distance. For example, if objects A1 and A2 are the two closest objects in the given space, an object P is in their proximity if either dist(A1,P) <= dist(A1,A2) * PEP or dist(A2,P) <= dist(A1,A2) * PEP, where PEP has a value greater than 1. The algorithm works in two steps: 1) formation of micro clusters and 2) merging of micro clusters. Formation of a micro cluster starts by taking the two closest objects of the dataset as a micro cluster and adding an object to it if the object is in its proximity. In an iterative process, an object P is added to the micro cluster C = {A1, A2, ..., An} if min(dist(P,Ai)) <= max(dist(Aj,Ak)) * PEP [i = 1..n; j, k = 1..n; j != k], that is, if the distance from P to its nearest member of C is at most PEP times the largest distance between any two members of C. This process forms micro clusters of two or more objects. Objects belonging to micro clusters with fewer than 3 objects are considered noise. These objects are merged into a cluster if they are found in the proximity of any cluster during the merging process. Objects that do not belong to any cluster at the end of the process are considered outliers.
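The proximity condition above can be written as a small predicate. This is a minimal sketch that assumes Euclidean distance and represents objects as coordinate tuples; the function name `in_proximity` and the sample values are chosen here for illustration only.

```python
import math
from itertools import combinations

def in_proximity(p, cluster, pep):
    """True if min(dist(p, Ai)) <= max(dist(Aj, Ak)) * pep for the given cluster."""
    # Distance from p to its nearest member of the cluster.
    nearest = min(math.dist(p, a) for a in cluster)
    # Largest distance between any two members of the cluster.
    span = max(math.dist(a, b) for a, b in combinations(cluster, 2))
    return nearest <= span * pep

# Hypothetical micro cluster made of the two closest objects A1 and A2.
cluster = [(0.0, 0.0), (1.0, 0.0)]
print(in_proximity((2.2, 0.0), cluster, pep=1.5))  # nearest is about 1.2, limit 1.5
print(in_proximity((3.0, 0.0), cluster, pep=1.5))  # nearest is 2.0, limit 1.5
```

Because the threshold scales with the distances already present in the cluster, a tightly packed cluster accepts only nearby objects while a loosely packed one accepts objects that are further away, which is how the same PEP value can serve regions of different density.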
In the second step, clusters are merged based on intra-cluster distance and inter-cluster distance. Starting with the two closest clusters, clusters are merged if they are in proximity of each other. In this process, objects identified as noise are also added to a cluster if they are within its proximity. The iterative merging process stops when no clusters are merged in an iteration.
Object in proximity: An object P is in the proximity of a cluster C = {A1, A2, ..., An} if min(dist(P,Ai)) <= max(dist(Aj,Ak)) * PEP [i = 1..n; j, k = 1..n; j != k].
Intra cluster distance: The intra cluster distance of a cluster C is the average distance between all pairs of objects of the cluster under consideration. IntraClustDist(C) = ∑ dist(Ai,Aj) / ∑ k [i, j = 1..n, i < j; k = 1..n-1], that is, the sum of all pairwise distances divided by the number of pairs, n(n-1)/2.
Inter cluster distance: The inter cluster distance is the distance between the two nearest objects of the two clusters under consideration. InterClustDist(C1,C2) = min(dist(A1i,A2j)), taken over all objects A1i of C1 and A2j of C2.
Part-1 Formation of Micro Clusters:
Step-1: Prepare the distance matrix, which holds the distance from each object to every other object.
Step-2: Read the (next) two closest objects from the distance matrix.
Step-3: (i) If exactly one of the two objects belongs to a micro cluster, the other object is added to it if it is in the proximity of that micro cluster; otherwise it is considered noise.
(ii) If neither of the two objects belongs to any micro cluster, the two objects are considered a new micro cluster.
(iii) If both objects belong to micro clusters, the pair is skipped.
The iterative process repeats Step-2 and Step-3 until all the objects have been read.
Part-2 Merging of Clusters:
Step-1: The two closest clusters are read.
Step-2: These two clusters are merged if they are in proximity of each other.
Step-3: The iterative process repeats Step-1 and Step-2 until no clusters are merged in an iteration.
Step-4: Each object considered noise is merged into its nearest cluster if it is in the proximity of that cluster.
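The following sketch follows the micro cluster formation steps listed above and adds helper functions for the intra cluster and inter cluster distances. It is an illustration under stated assumptions, not the authors' implementation: distances are Euclidean, the sorted list of pairs stands in for the distance matrix, an unread object paired with a noise object is left for a later pair, and all names and sample data are hypothetical. The merging test of Part-2 is not reproduced because the text does not state its exact form.

```python
import math
from itertools import combinations

NOISE = -1

def form_micro_clusters(points, pep):
    """Part-1 sketch: return one label per point (micro cluster id or NOISE)."""
    n = len(points)

    def d(i, j):
        return math.dist(points[i], points[j])

    # Step-1: all pairwise distances, ordered so the closest pair is read first.
    pairs = sorted(combinations(range(n), 2), key=lambda ij: d(*ij))
    labels = [None] * n      # None means "not read yet"
    members = {}             # micro cluster id -> indices of its objects

    def in_proximity(p, idx):
        nearest = min(d(p, a) for a in idx)
        span = max(d(a, b) for a, b in combinations(idx, 2))
        return nearest <= span * pep

    for i, j in pairs:                                  # Step-2
        if labels[i] is None and labels[j] is None:     # Step-3 (ii)
            cid = len(members)
            members[cid] = [i, j]
            labels[i] = labels[j] = cid
        elif labels[i] is None or labels[j] is None:    # Step-3 (i)
            known, new = (i, j) if labels[j] is None else (j, i)
            cid = labels[known]
            if cid == NOISE:
                continue     # assumption: wait for a pair with a clustered object
            if in_proximity(new, members[cid]):
                members[cid].append(new)
                labels[new] = cid
            else:
                labels[new] = NOISE
        # Step-3 (iii): both objects already read, so the pair is skipped.

    # Micro clusters with fewer than 3 objects are treated as noise.
    for idx in members.values():
        if len(idx) < 3:
            for p in idx:
                labels[p] = NOISE
    return [NOISE if lab is None else lab for lab in labels]

def intra_cluster_dist(cluster):
    """Average distance over all pairs of objects in one cluster."""
    dists = [math.dist(a, b) for a, b in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def inter_cluster_dist(c1, c2):
    """Distance between the two nearest objects of two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

points = [(0, 0), (1, 0), (2, 0), (3, 0),            # dense group
          (100, 0), (110, 0), (120, 0), (130, 0),    # sparse group
          (500, 500)]                                # isolated object
labels = form_micro_clusters(points, pep=1.5)
print(labels)
print(inter_cluster_dist(points[0:4], points[4:8]))
```

On this sample, with PEP = 1.5, the dense and the sparse group each form their own micro cluster although their spacing differs by a factor of ten, and the isolated object is labelled noise. The inter cluster distance between the two groups is 97.0.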

Performance Evaluation:
In this section, the performance of the algorithm is evaluated using 2-dimensional synthetic datasets. To compare the performance of PxEBCA with DBSCAN, we use four synthetic sample datasets, which are shown in Figure 1. We tested the proposed algorithm on these datasets with different values of the proximity expansion parameter. The proposed algorithm works on both sparse and dense data and is capable of handling the density variations that exist within a dataset. The clusters detected by the proposed algorithm show considerable variation in density. From the content of the above-mentioned tables, it has been observed that the calculated Rand Index, Dunn Index, error rate and F-Measure give very promising results.
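As an illustration of one of the measures named above, the Rand Index is the fraction of object pairs on which two labelings agree, meaning that both place the pair in the same cluster or both place it in different clusters. The sketch below is a generic implementation of that definition; the function name and the sample labelings are hypothetical and are not taken from the experiments, and noise is simply treated as one more label.

```python
from itertools import combinations

def rand_index(truth, pred):
    """Fraction of object pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth == same_pred:
            agree += 1
        total += 1
    return agree / total

# Hypothetical labelings of six objects; one object is assigned differently.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(rand_index(truth, pred))   # 10 of the 15 pairs agree
```

A value of 1 means the two labelings induce exactly the same grouping, regardless of how the cluster ids are numbered.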

Conclusion:
In this paper we presented a new clustering algorithm that overcomes a key challenge of density-based clustering algorithms: our approach works well for datasets with varying densities. This is achieved by expanding the neighbourhood through the application of the proximity expansion parameter to a reference distance. We evaluated performance on synthetic data. The results of these experiments demonstrate that PxEBCA is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm DBSCAN.