do not necessarily reflect the views of UKDiss.com.
A novel hybrid outlier detection by using distance based clustering and transitive relation rules
Abstract: The courtesy of speed of data growing and its storage into datasets shows that most of these datasets are unstructured in many cases. Clustering is one of the most important unsupervised approaches that it deals with finding a structure in a collection of unlabelled data. In many datasets, several data exist that have long distance to other data points that they mostly considered as exception, fault or, etc. In this paper, we use a distance base model with use of transitive relation rule to cluster data and detecting the outliers. This research presents a novel distance based model that can determine a number as neighbourhood distance by using genetic algorithm on transitive relation rules that it can minimize outliers and maximize clusters. This method searches the problem space to find a threshold for each data point that if two points have a distance less than the threshold, take them to the same cluster. We use this model on “Perfume data” and “The Animal Gestation” datasets. By implementing the proposed model, we reach to various propositions like, number of clusters, number of outliers and the largest threshold number to detect the data points that have not been normal distribution. We found three clusters with one outlier in “perfume data” and four clusters with one outlier in “The Animal Gestation” dataset. By deploying the proposed model, we reach to a high accurate clustering and also detect outliers in the datasets with a neighbourhood distance without knowing the number of clusters.
Keywords: Clustering, Outlier Detection, Transitive Relation, Genetic Algorithm, Distance Based model and Machine Learning
- Introduction
Outlier is defined as an observation that is inconsistent with the remainder of the set of data. These data points are detected by many methods like distance based, cluster based and etc. outlier detection is an extensive body of work on clustering algorithms. Outliers are objects that not located in clusters of a dataset, usually called noise or unrelated data. Only a few approaches are directly concerned with outlier detection [1].
Searching for outliers is an important area of research in the data mining world. For example, there are many applications of outlier detection like: credit card fraud detection, discovery of criminal activities, weather prediction, etc. All existing algorithms for outlier detection have high computation costs; however, most of the clustering algorithms try to cluster the data without considering the outliers.
The other problem in clustering is finding the number of clusters that it initializes manually. In our method, the main concern is finding the clusters with finding the outliers.
In this paper, we proposed a method for clustering and outliers’ detection that it works with distance neighbourhood. This distance calculated by genetic algorithm, and it tried to minimize the number of outliers with maximizing the number of clusters. With use of transitive relation rule, each cluster is detected. Based on this rule the outliers are not several points, which are far from the center of clusters, but the outliers are several data points, which are far from any other points.
- Literature Review
Clustering has been used in many areas, and it tries to find some distributions and patterns in unlabelled data sets. Clustering algorithms are used for grouping related data, and it is categorized under unsupervised learning. In clustering, objects divide to cluster based on similarity.
The similarities’ objects in the same group are high, while the similarities between objects in different groups are low. There are many methods and algorithms to cluster data; -means [2, 3], genetic algorithm (GA) [4], c-means [5], binary [6, 7], hierarchical [8] and etc. Also there are many methods that used hybrid or combined algorithms, like hybrid clustering with -means and GA algorithms that with using of GA, it determines the number of clusters and initial center and next they used the k-means [9], or a combined model of fuzzy c-means and GA [10]. Similarly, there is a research based on firefly algorithm and GA, which use the firefly algorithm to initial population of GA. Firefly is an algorithm that is used to optimization problems [11]. These methods with attention to data can be applied.
Clustering methods have widely applications. [12] proposed a method for clustering of smart phone images. They defined a method that can detect fraud in images, which is taken with a mobile and recognizing type of the mobile. They used True Proposed Rate (TRP) and Support Vector Machine (SVM) to divide image samples into clusters.
Furthermore, in another research by [13], there is a research about image segmentation with clustering or there is a research from [14] for text clustering and many more other. In this research, they used two methods, term variance (TV) and document frequency (DF) as a feature extractor. This feature extractor selects non-frequent and frequent features that after creating the sub list, they used Principal Component Analysis (PCA) to reduce the feature space. All of these researches can present the importance of clustering in any space of data.
K-means is one the most common method in clustering. This method cluster data in K group and in each group the objects have minimum distance and have maximum distance with other data in other clusters [2]. The center of each cluster is very important, because, it determines the area of each cluster. Then, the data with clusters have center too. In all of these approaches, we need to set the number of clusters and various clustering algorithms, are unable to determine the number of clusters in a given dataset. Moreover, some methods can detect a number of clusters; Two-stage Genetic Clustering Algorithm (TGCA) is one of them. This algorithm can determine automatically the number of clusters and partition data between those groups [15, 16].
In another research [17] use parametric modelling of the quantization error for determining the number of clusters or [18] used an integrating entropy for data that contain numerical attributes and categorical attributes to find the number of clusters in mix data.
None of the clustering methods can be implemented for all purposes. There have been hundreds of clustering algorithms. There is a research by [19] that use clustering based on Near Neighbour Influence (CNNI), this method is inspired by the idea of near neighbours.
In another research, this idea is well proposed by Newtonian law of gravity approach. This algorithm reduced an effect of noise on data, and it isn’t sensible to the initial positions of the centroids. The main idea of this method is based on law of gravity, and it shows the celestial object applies a gravity force to the movable objects and changed their positions [20].
Hebooul, Hachouf and etc [21] proposed a method based on Incremental Neural Network (INN). In this constructive model, they use labelled and unlabelled data to learn space structure, and the classifier train with labelled data, after that INN produced the posterior probabilities of each unlabelled sample data to different classes. The unlabelled sample data that has higher certainty of belonging to one class is then classified by the classifier. The most confidently classified unlabelled data with their predicted labels are added to the labelled set. This procedure is repeated until all unlabelled data fitted to a labelled set.
Meta heuristic methods like GA, firefly, ant colony and etc. have more shared in clustering. These methods can have many applications, like clustering web documents [22].
İnkaya, Kayalıgil and etc. [23] proposed an approach for clustering data, based on ant colony, which the number of clusters is unknown, and clusters may have arbitrary shapes. Their method has two steps for pre-processing data before initiation and neighbourhood construction and data set reduction. In another research a combination of GA and k-means is used for clustering big data. In this paper with using of a hybrid method, the clustering of big data make their results more accurate as they implemented k-means algorithm to generate initial center of clusters and then use of these points as initial population for GA to reduce the sum square error [24].
All of these algorithms can cluster data in different way and situation. Today’s databases typically contain millions of items. In these databases, there are many outliers. In statistics, an outlier is an observation point that is distant from other observations. Outlier detection has been used in variety of applications to identify crimes, network intrusion, stock market, medical data analysis [25].
Outlier detection laid in supervised, semi-supervised and unsupervised algorithm groups. Outlier analysis is closely related to changing point and event detection [26]. Most of the distribution-based approaches for detecting outliers use KNN (k-nearest neighbour) approaches and a combined model of a global and local outlier detector with KNN can be more effective. KNN methods are very sensitive to the value of and with different value of , it has different value for top outliers [27].
In this paper, we proposed a model for data clustering with a specific neighbourhood distance between data in each cluster. The similarity between data in each cluster is a particular distance and two data in a different cluster have distinctive distance too. In the next section, the full map our proposed work is presented.
- Proposed Work
The most common application of clustering is to explore the data and find all possible meaningful groups in a dataset. They also used in many distance-based methods for detecting the outliers; however, the value of the distance must be predefined. In this paper, we are proposing a method for clustering data that use a distance value based on Euclidean distance and specific value neighbourhood for clustering data inevitably. Clustering is a method for grouping related data in the same group and unrelated data have a different group. The similarity scales in many methods are Euclidean distance:
Where and are two different points in dataset. In each cluster, we have.
In this method, we use genetic algorithm as a method to search the problem space. The initial population is a group of chromosomes that each chromosome has some gens. In this approach, we use binary chromosome. Binary chromosome is a sequence of ones and zeros.
If it has n gens, it can present an integer positive number in range.
Therefore, we use this coding method as all distances between points are positive. Our method steps are as follows:
- Initial Population
- A population with binary chromosomes, which each chromosome has genes and x is the largest distance between two pair points:
- For initial population, we use random number.
- Calculate Euclidean distance between all two points.
- Fitness function:
- Set the maximum neighbourhood distance, which we want to have all two points in a cluster.
- Label all pairs of data points that have fewer distances from maximum neighbourhood distance and create groups with these points.
- Labelling all points that have fewer distances from maximum neighbourhood distance of any points in a group with the same label of the group.
- Add that new data point to the group.
- Repeat steps 2.1 to 2.3 for all data points.
- Calculate the summation of outliers and threshold as.
- Cross-over and Mutation
- Terminate Rule
- In this method, we have several terminated rules that, reaching to the maximum iteration or to the answer of two of them.
- Write Result
In continue we explain these steps.
In Fig. 1, we can see the proposed method workflow. In Fig. 1, is a labelling operator and is grouping operator for grouping all points with one or several common points in a same group. In the following flowchart, the evaluation is calculated the fitness function to have the optimal answer.
Fig. 1. The view of distance calculating of the proposed method
3.1 Calculation
In the first step, we calculate the Euclidean distance between all pair data:
Where, is an upper triangular matric and each is an Euclidean distance between the data points and. It is clear that and then the matric is upper triangular. In Fig. 2, there is a view of this calculation.
Fig. 2. The view of distance calculating
In this step, we find the number of gens for each chromosome. We use the maximum distance number in D matric, because it is a large number that it can be consisted of all data. With this distance, we can access to all data points and then; we have only one cluster, which consists of the whole dataset. Fig. 3 represents the edges of a cluster and shows that edge 1 has the maximum distance between the other edges.
Fig. 3. View of maximum distance between data
After calculating matric, we should label all data, which are less from maximum neighbourhood distance Threshold (T). In this stage, the labelling of all pairs of data point with maximum neighbourhood distance or less is initiated, and it is shown in Fig. 4.
Fig. 4. Grouping each pair of data points
Where R is a relation over a set. Therefore, points 1, 2, 3 and 4 are in a same group. Fig. 5 shown the final cluster.
Fig. 5. Grouping data and final cluster
In this method, with use transitive relation. We can determine the number of clusters based on neighbourhood distance. In this model, with attention to fig. 5 in the neighbourhood of cluster 1 and 2, there is not a data point with maximum neighbourhood distance or less. Adding the new points to a cluster repeat until all points with maximum neighbourhood distance ending. An advantage of this method is, as we have many clusters of pair points on places with high density of data points, which are close enough together, we can have one or several points’ collectivises.
Fig. 6. A view of outlier’s data
The proposed method shows in the following algorithm (Alg.1). The grouping data with using transitive relation rule is generated; T is a threshold for Euclidean distance between data.
for i←1:n-1
for j←i+1:n
if(Euclidean_distance(x_i,x_j)≤ T)
if(Label(x_i)==∅ && Label(x_j)==∅)
Label(x_j)←i;
Label(x_i)←i;
Elseif(Label(x_i)<> ∅ && Label(x_j)== ∅)
Label(x_j)←Label(x_i);
Elseif(Label(x_i)== ∅ && Label(x_j)<> ∅)
Label(x_i)←Label(x_j);
Elseif(Label(x_i)<> ∅ && Label(x_j)<> ∅)
M←Min(Label(x_i), Label(x_j));
Label(x_j)←M;
Label(x_i)←M;
End
End
End
End
Alg.1. Pseudo-code for grouping data section
If we have some outlier data points, which those points are far from any cluster and other points, then we can conclude that the density in a neighbour of those points is very low, and the method ignores them as they aren’t in a neighbourhood area that is already defined. In Fig. 6, if the maximum neighbourhood distance be a small number, data points 1 and 2 are outlier and with this distance only we have one cluster as for any distance, we need at least two points in that area. With this method, we can also find the outlier points. Then the other advantage of this method in addition to clustering is distance base outlier detection. In Fig. 7, we can see the six points as outliers.
Fig. 7. (Up picture) A 3D view of distance-based outliers in 3-dimentional data. (Down picture) A 2D view of that.
After these steps, the goal of the problem that we want to reach to it, is minimizing the number of outliers and threshold. The short model of the problem is look like this:
Formula 1. Model of the problem
Where, is the threshold and is the number of outliers, based on the threshold, and are coefficients of determination of importance. Each one of them is laid between ranges, that 0 means, it is unimportant and one means high importance. In this model, we use.
In each iteration, the value of T changed to minimize the result. For minimizing the result, it should minimize the outliers () and the threshold. The reason of this work is, we want to minimize the threshold until, we reach to the minimum number of outliers in clusters and on the other hand, with minimizing of, we want to expand the area of each cluster until any data points that are not so far from the clusters, be a part of them.
- Analysis & Design of the Proposed Model
In this paper, we use “perfume data” and “The Animal Gestation” datasets, which is included a massive data. The first dataset has 20 rows and 28 columns. Table 1, represents a short view of this dataset [28]:
Table 1. A short view of “perfume_data” dataset
Ajayeb | 64558 | 64556 | 64543 | 64543 | 64541 | 64543 |
Ajmal | 60502 | 60489 | 61485 | 60487 | 61485 | 61513 |
amreaj | 57040 | 57040 | 57040 | 58041 | 58041 | 58041 |
Aood | 71083 | 72087 | 71091 | 71095 | 71099 | 72103 |
asgar_ali | 68209 | 68209 | 68216 | 68216 | 68223 | 68223 |
bukhoor | 71046 | 71046 | 71046 | 71046 | 71046 | 71046 |
burberrry | 61096 | 61096 | 60093 | 60092 | 60093 | 60093 |
dehenalaod | 68132 | 69137 | 69137 | 68137 | 68137 | 69142 |
Junaid | 71590 | 71575 | 71574 | 71560 | 71560 | 71559 |
kausar | 74631 | 74649 | 74650 | 74650 | 74650 | 74632 |
In the first step, we calculated Euclidean distance between all two points. Table 2 is a view of matric. This matric is upper triangular, and every represent the Euclidean distance between each pair of points and. This algorithm uses the Euclidean distance measure for finding similarity between objects.
Table 2. A short view of D’s matric.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
1 | 0 | 19733.47 | 35244.23 | 39021.77 | 19436.78 | 31041.19 | 21472.63 | 23137.87 | 41075.02 |
2 | 0 | 0 | 15907.12 | 58500.68 | 38934.61 | 50514.59 | 4564.554 | 42664.98 | 60575.3 |
3 | 0 | 0 | 0 | 74087.55 | 54523.06 | 66142.96 | 14470.77 | 58233.36 | 76127.76 |
4 | 0 | 0 | 0 | 0 | 19805.82 | 9545.539 | 60224.86 | 16301.62 | 4483.097 |
5 | 0 | 0 | 0 | 0 | 0 | 12071.62 | 40657.26 | 4957.088 | 21904.23 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 52251.7 | 9174.637 | 11348.84 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 44468.47 | 62344.66 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18125.04 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
In this stage, we read the outlier score from input to set groups based on Euclidean distance. After that by implementing the Alg.1, the method can create the cluster and detect outlier with a specific value for the outlier scores (T). This result is shown in Fig.7.
If value of chosen an enough big number, it consists of whole data, and then we have only one cluster with all of data. Additionally, if value of chosen a suitable value, it can group data into several clusters and detects the outliers based on Euclidean distance.
The second dataset that we use in this method has 40 rows and 2 columns. Table 4 represents a small part of this dataset. This dataset consists of an average of lifespan (in years) and typical gestation period (in days) for 40 different species of mammals [29].
Fig.7. Upper picture shows 500 generations, and the down picture shows two clusters for T=16072
Table 4. A short view of dataset “The Animal Gestation”
animal | gestation | longevity |
baboon | 187 | 20 |
bear, black | 219 | 18 |
bear, grizzly | 225 | 25 |
bear, polar | 240 | 20 |
beaver | 122 | 5 |
buffalo | 278 | 15 |
camel | 406 | 12 |
cat | 63 | 12 |
chimpanzee | 231 | 20 |
Genetic algorithm and outlier detection for this dataset is shown in Fig. 8. In this run, the value of T is 26, and it contains whole data except one data point, which is so far from the other points (the black point in Fig.8).
Fig.8. Left picture shows 500 generations and the right picture shows four clusters for T=26
The data point on up right side of Fig.8 is an outlier. There appears to be one outlier, indicating an animal with an exceptionally long longevity and gestation period (This animal happens to be the elephant). The threshold of this dataset is 26, and each point in a neighbor of the other points that have distanced less than 26 will have the same cluster; we investigate this issue in Fig.9.
Fig.9. “The Animal Gestation” clusters in the real scale plot
In Fig.9 the circles around each data points show the threshold () area around each point. Every point that is enough too closed to have same cluster and with this distance value and with attention to transitive relation rule, the data is clustered. After that, each data point that is not belonged to any cluster is labelled as an outlier. Table 5 represents the summary of this work.
Table 5. Summary of results.
Name | Generations | T | Cluster(s) | Number of data | Outlier(s) | Percentage of outliers |
perfume_data | 500 | 16072 | 2 | 20 | 1 | 5% |
The Animal Gestation | 500 | 26 | 4 | 40 | 1 | 2.5% |
- Conclusion
One of the most important operations on data, especially big, is classification and clustering. This paper proposed a novel model to cluster data and detect outliers based on Euclidean distance. This approach determines outliers with use of a distance-based model, and any data point that there is not in any cluster, labelled as an outlier. Outlier detection based on clustering is done by this method. Without determining the number of clusters or threshold distance for this approach, it can only detect clusters and outliers based on the transitive relation rules. The Proposed method can be useful in large datasets and big data. In this paper, we use GA method with a fitness function for clustering and outlier detection. GA method searched the problem space to minimize the number of outliers and maximize the number of clusters. The result of this method on “perfume data” data is two clusters and one outliers and on “The Animal Gestation” is four clusters and one outlier. For the future of this work, the proposed method can combine to another algorithm like memetic or a hybrid technique of that.
References
[1] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” in ACM sigmod record, 2000, pp. 93-104.
[2] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.
[3] O. M. Ebadati E and S. Babaie, “Implementation of Two Stages k-Means Algorithm to Apply a Payment System Provider Framework in Banking Systems,” in Artificial Intelligence Perspectives and Applications. vol. 347, R. Silhavy, R. Senkerik, Z. K. Oplatkova, Z. Prokopova, and P. Silhavy, Eds., ed: Springer International Publishing, 2015, pp. 203-213.
[4] Y. Liu, X. Wu, and Y. Shen, “Automatic clustering using genetic algorithms,” Applied Mathematics and Computation, vol. 218, pp. 1267-1279, 2011.
[5] Z.-g. Liu, Q. Pan, J. Dezert, and G. Mercier, “Credal c-means clustering method based on belief functions,” Knowledge-Based Systems, vol. 74, pp. 119-132, 2015.
[6] A. Hatamlou, “In search of optimal centroids on data clustering using a binary search algorithm,” Pattern Recognition Letters, vol. 33, pp. 1756-1760, 2012.
[7] J.-P. Barthélemy and F. Brucker, “Binary clustering,” Discrete Applied Mathematics, vol. 156, pp. 1237-1250, 2008.
[8] F. Queyroi and S. Kirgizov, “Suppression distance computation for hierarchical clusterings,” Information Processing Letters, vol. 115, pp. 689-693, 2015.
[9] M. A. Rahman and M. Z. Islam, “A hybrid clustering technique combining a novel genetic algorithm with K-Means,” Knowledge-Based Systems, vol. 71, pp. 345-365, 2014.
[10] S. Wikaisuksakul, “A multi-objective genetic algorithm with fuzzy c-means for automatic data clustering,” Applied Soft Computing, vol. 24, pp. 679-691, 2014.
[11] K. Kaushik and V. Arora, “A Hybrid Data Clustering Using Firefly Algorithm Based Improved Genetic Algorithm,” Procedia Computer Science, vol. 58, pp. 249-256, 2015.
[12] L. J. G. Villalba, A. L. S. Orozco, and J. R. Corripio, “Smartphone image clustering,” Expert Systems with Applications, vol. 42, pp. 1927-1940, 2015.
[13] F. Zhao, H. Liu, and J. Fan, “A multiobjective spatial fuzzy clustering algorithm for image segmentation,” Applied Soft Computing, vol. 30, pp. 48-57, 2015.
[14] K. K. Bharti and P. K. Singh, “Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering,” Expert Systems with Applications, vol. 42, pp. 3105-3114, 2015.
[15] H. He and Y. Tan, “A two-stage genetic algorithm for automatic clustering,” Neurocomputing, vol. 81, pp. 49-59, 2012.
[16] S. Razavi, O. M. Ebadati E., S. Asadi, and H. Kaur, “An Efficient Grouping Genetic Algorithm for Data Clustering and Big Data Analysis,” in Computational Intelligence for Big Data Analysis. vol. 19, D. P. Acharjya, S. Dehuri, and S. Sanyal, Eds., ed: Springer International Publishing, 2015, pp. 119-142.
[17] A. Kolesnikov, E. Trichina, and T. Kauranne, “Estimating the number of clusters in a numerical data set via quantization error modeling,” Pattern Recognition, vol. 48, pp. 941-952, 2015.
[18] J. Liang, X. Zhao, D. Li, F. Cao, and C. Dang, “Determining the number of clusters using information entropy for mixed data,” Pattern Recognition, vol. 45, pp. 2251-2265, 2012.
[19] X. Chen, “A new clustering algorithm based on near neighbor influence,” Expert Systems with Applications, 2015.
[20] A. Bahrololoum, H. Nezamabadi-pour, and S. Saryazdi, “A data clustering approach based on universal gravity rule,” Engineering Applications of Artificial Intelligence, vol. 45, pp. 415-428, 2015.
[21] A. Hebooul, F. Hachouf, and A. Boulemnadjel, “A new Incremental Neural Network for Simultaneous Clustering and Classification,” Neurocomputing, 2015.
[22] M. Mahdavi, M. H. Chehreghani, H. Abolhassani, and R. Forsati, “Novel meta-heuristic algorithms for clustering web documents,” Applied Mathematics and Computation, vol. 201, pp. 441-451, 2008.
[23] T. İnkaya, S. Kayalıgil, and N. E. Özdemirel, “Ant Colony Optimization based clustering methodology,” Applied Soft Computing, vol. 28, pp. 301-311, 2015.
[24] E. O. M. Ebadati and M. M. Tabrizi, “A Hybrid Clustering Technique to Improve Big Data Accessibility Based on Machine Learning Approaches,” in Information Systems Design and Intelligent Applications: Proceedings of Third International Conference INDIA 2016, Volume 1, C. S. Satapathy, K. J. Mandal, K. S. Udgata, and V. Bhateja, Eds., ed New Delhi: Springer India, 2016, pp. 413-423.
[25] A. Christy, G. M. Gandhi, and S. Vaithyasubramanian, “Cluster Based Outlier Detection Algorithm for Healthcare Data,” Procedia Computer Science, vol. 50, pp. 209-215, 2015.
[26] M. Gupta, J. Gao, C. Aggarwal, and J. Han, “Outlier detection for temporal data,” Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 5, pp. 1-129, 2014.
[27] X. Wang, X. L. Wang, Y. Ma, and D. M. Wilkes, “A fast MST-inspired kNN-based outlier detection method,” Information Systems, vol. 48, pp. 89-112, 2015.
[28] (2002 – 2003). UCI Machine Learning Repository: Perfume Data Data Set. Available: https://archive.ics.uci.edu/ml/datasets/Perfume+Data
[29] M. S. Hoffman, The World Almanac and Book of Facts 1993: The Authority Since 1868, 1993.