Binary Priority Outlier Classifier Based Outlier Elimination

Outliers are records that deviate from the normal behavioral pattern of a dataset, and they pose a serious problem when analysing data. In recent years there has been extensive research into identifying these outliers. Identifying them not only improves the analysis of data but also enables many applications. This paper presents a way of identifying outliers based on priorities assigned to the attributes. The priorities are summed for each record in the dataset and the resulting pattern is analysed. A concept based on the interquartile range is then used to eliminate the outliers. The classifier thus divides the dataset into two classes: outliers and normal data.


Introduction
Outlier detection has long been used to detect anomalous behaviour. Outliers are caused by machine faults, fraud, human error, or simply natural deviation. Identifying them provides extremely valuable information, and early identification can help prevent catastrophic consequences. Victoria Hodge et al. discuss various algorithms for outlier detection and compare them by analysing their advantages and disadvantages [1]. Robert L. Lipnik et al. discuss how molecular descriptors and molecular toxicity can help identify outlier behaviour and provide information about the predictive capability of such models. Their paper uses a QSAR baseline prediction and compares it with toxicity levels to identify outliers, improving the classification process by removing them [2]. Yang Zing et al. discuss how identifying outliers in wireless sensor networks can reveal valuable information such as noise, errors and malicious attacks affecting the network. Traditional outlier methods fail on wireless sensor networks because of requirements and limitations specific to such networks. Their paper provides a taxonomy and a comparative table for selecting among the available wireless-network outlier detectors based on data type, outlier type, outlier identity and its degree [3]. Jorm Laurikkala et al. discuss the identification of outliers with the help of box plots in the field of medicine. Multivariate outliers were identified by plotting Mahalanobis distances on a box plot, while univariate outliers were detected directly from a box plot. Identifying the outliers not only increased the predictive ability of the classification, but domain experts also recognised most of the flagged records as genuine outliers in their area [4]. Sofie Verbaeten et al. use outlier detection in noisy training sets where certain records are mislabelled. They apply outlier methods as a pre-processing step and then proceed with classification, using a number of filtering techniques such as cross validation, boosting and bagging. They evaluate these techniques in an Inductive Logic Programming setting and use decision trees to construct the ensembles [5]. Outliers provide valuable information about the data and should not be ignored. Identifying them not only enables various applications but also gives industries a clean dataset and the ability to recognise the normal patterns in the data.

Methodology
The algorithm uses the posterior probability from Bayesian statistics as a measure for assigning a priority to each attribute.
Posterior probability - In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account.
Let us have a prior belief that the probability distribution function is P(Θ) and observations X with likelihood P(X|Θ); the posterior probability is then given by Bayes' theorem:

P(Θ|X) = P(X|Θ) P(Θ) / P(X)
The outliers are eliminated using the interquartile range (IQR). The IQR is used in statistics to summarise the spread of a set of numbers, and it is preferred over the range because it is more robust for identifying outliers.
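The IQR rule above can be sketched as follows. This is a minimal illustration on made-up numbers, using the common Tukey fences (1.5 × IQR beyond the quartiles); the paper does not state which multiplier it uses.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: quartiles +/- k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

scores = [10, 12, 11, 13, 12, 11, 40]  # 40 is an obvious outlier
low, high = iqr_bounds(scores)
outliers = [v for v in scores if v < low or v > high]
print(outliers)  # -> [40]
```

Any value outside the fences is flagged, which is exactly why the IQR is less sensitive to the extreme values than the plain range.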

Work Flow:
A dataset is first imported for outliers detection.
The Binary Outlier Classifier identifies the Outliers.
A dataset without the outliers is generated.
(1) For each attribute in A, assign each distinct attribute value a priority from 0 to K, in increments of 1, based on P(L|A).
(2) For each imported training record, calculate the sum of the priorities of the attribute values that occur in that record. The above algorithm identifies outliers for a single label only; the priorities for the other labels can be calculated in the same way.
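The two steps above can be sketched with pandas. This is a simplified illustration on a toy table: the column names and values are hypothetical, and the sketch assumes priorities are integer ranks 0..K assigned in ascending order of the conditional probability of each attribute value given the label.

```python
import pandas as pd

# Toy categorical data; column names and values are illustrative only.
df = pd.DataFrame({
    "workclass": ["Private", "Private", "Gov", "Private", "Gov"],
    "education": ["HS", "HS", "BS", "BS", "HS"],
    "label":     ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

def priority_table(df, attr, label_col="label", label="<=50K"):
    """Rank each distinct value of `attr` by P(value | label) and
    assign integer priorities 0..K in ascending order of probability."""
    subset = df[df[label_col] == label]
    probs = subset[attr].value_counts(normalize=True).sort_values()
    return {value: rank for rank, value in enumerate(probs.index)}

attrs = ["workclass", "education"]
tables = {a: priority_table(df, a) for a in attrs}

# Step (2): weight of a record = sum of the priorities of its values.
df["weight"] = sum(df[a].map(tables[a]) for a in attrs)
print(df["weight"].tolist())  # -> [2, 2, 0, 1, 1]
```

Records sharing common attribute values end up with similar weights, which is what makes the later IQR step meaningful.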

Data Set & Tool used
The Adult Census Income dataset is downloaded from the UCI Machine Learning Repository. It is a multivariate dataset with 48842 instances and 14 attributes; the attributes are categorical and integer. The tool used is Jupyter Notebook. The Jupyter Notebook is a web application that enables you to create and share documents containing live code, equations, visualizations and narrative text. Uses include data cleansing and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. The Notebook can be used to code in many languages, including Python, R, Julia, and Scala.

Data Pre-processing
One of the main shortcomings of the given algorithm is that it only works for categorical data, because the algorithm needs to calculate a posterior probability for each distinct attribute value, and for continuous attributes the number of distinct values may be extremely large. It is therefore important to convert continuous data into categorical data. The attributes age, capital gain and capital loss are converted into categorical data, and records with missing values are removed.
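Converting a continuous attribute into a categorical one can be done with pandas binning. The bin edges and category names below are illustrative assumptions; the paper does not specify the exact intervals used for age, capital gain or capital loss.

```python
import pandas as pd

ages = pd.Series([22, 35, 48, 67, 19, 53])

# Illustrative bin edges; the paper does not state the actual intervals.
age_groups = pd.cut(
    ages,
    bins=[0, 25, 45, 65, 120],
    labels=["young", "adult", "middle-aged", "senior"],
)
print(age_groups.tolist())
```

After binning, each attribute has a small, fixed set of distinct values, so the per-value conditional probabilities become meaningful to estimate.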

Implementing the algorithm
The pandas library in Python is used for the implementation. Pandas provides statistical and data-manipulation tools for Python. The implementation starts by calculating the conditional probability of each attribute value and placing the values in ascending order. Based on the assigned probabilities, a weight is then calculated for each record of the same label. On implementation it becomes clear that records with the same label have weights that lie close together, so an interquartile-range rule can be used to identify weights that lie far away.
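The final elimination step can be sketched as follows: given the per-record weights, any record whose weight falls outside the IQR fences is classified as an outlier. The weight values here are hypothetical stand-ins for the sums produced by the priority step.

```python
import pandas as pd

# Hypothetical per-record weights from summing attribute priorities.
weights = pd.Series([8, 9, 9, 10, 8, 9, 2, 17])

q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Binary classification: True = outlier, False = normal data.
is_outlier = (weights < low) | (weights > high)
print(is_outlier.tolist())
```

The boolean mask realises the "binary" aspect of the classifier: it splits the dataset into the outlier class and the normal class, and dropping the flagged rows yields the cleaned dataset.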

Conclusion
This paper has presented an algorithm for identifying outliers. The algorithm classified 3.27% of the dataset as outliers. While the number may seem small, these records provide extremely valuable information. By using the posterior probability to assign a weight to each record, the algorithm formed clusters of weights and eliminated as outliers those records falling outside them. Hence the algorithm can identify outliers in any dataset, provided the dataset has categorical data.