
A Novel Hybrid Outlier Detection by Using Distance Based Clustering and Transitive Relation Rules


Abstract: The rapid growth of data and its storage in datasets means that, in many cases, these datasets are unstructured. Clustering is one of the most important unsupervised approaches; it deals with finding structure in a collection of unlabelled data. Many datasets contain points that lie far from all other points and are usually considered exceptions, faults, etc. In this paper, we use a distance-based model together with the transitive relation rule to cluster data and detect outliers. This research presents a novel distance-based model that determines a neighbourhood distance by applying a genetic algorithm to transitive relation rules, minimizing the number of outliers while maximizing the number of clusters. The method searches the problem space for a threshold such that any two points whose distance is less than the threshold are placed in the same cluster. We apply this model to the "Perfume Data" and "The Animal Gestation" datasets. The proposed model yields several results, such as the number of clusters, the number of outliers, and the largest threshold for detecting data points that do not follow the normal distribution. We found two clusters with one outlier in the "Perfume Data" dataset and four clusters with one outlier in "The Animal Gestation" dataset. The proposed model achieves highly accurate clustering and also detects outliers in the datasets via a neighbourhood distance, without knowing the number of clusters in advance.

Keywords: Clustering, Outlier Detection, Transitive Relation, Genetic Algorithm, Distance-Based Model, Machine Learning

  1. Introduction

An outlier is defined as an observation that is inconsistent with the remainder of the data set. Such points are detected by many methods, such as distance-based and cluster-based approaches. Outlier detection draws on an extensive body of work on clustering algorithms. Outliers are objects that do not lie in any cluster of a dataset and are usually called noise or unrelated data. Only a few approaches are directly concerned with outlier detection [1].

Searching for outliers is an important area of research in data mining, with many applications such as credit card fraud detection, discovery of criminal activities, and weather prediction. Existing algorithms for outlier detection have high computational costs; moreover, most clustering algorithms try to cluster the data without considering the outliers.

Another problem in clustering is that the number of clusters must usually be initialized manually. In our method, the main concern is finding the clusters while also finding the outliers.

In this paper, we propose a method for clustering and outlier detection that works with a neighbourhood distance. This distance is calculated by a genetic algorithm, which tries to minimize the number of outliers while maximizing the number of clusters. Each cluster is detected using the transitive relation rule. Under this rule, outliers are not points that are far from the centres of clusters, but points that are far from any other point.

  2. Literature Review

Clustering has been used in many areas; it tries to find distributions and patterns in unlabelled data sets. Clustering algorithms are used for grouping related data and fall under unsupervised learning. In clustering, objects are divided into clusters based on similarity.

The similarity between objects in the same group is high, while the similarity between objects in different groups is low. There are many methods and algorithms for clustering data: k-means [2, 3], genetic algorithms (GA) [4], c-means [5], binary [6, 7], hierarchical [8], and others. There are also many hybrid or combined algorithms, such as hybrid clustering with k-means and GA, where the GA determines the number of clusters and the initial centres and k-means is then applied [9], or a combined model of fuzzy c-means and GA [10]. Similarly, there is research based on the firefly algorithm and GA, which uses the firefly algorithm to initialize the GA population; firefly is an algorithm used for optimization problems [11]. These methods can be applied with attention to the data at hand.

Clustering methods have a wide range of applications. [12] proposed a method for clustering smartphone images. They defined a method that can detect fraud in images taken with a mobile phone and recognize the type of phone. They used the True Positive Rate (TPR) and a Support Vector Machine (SVM) to divide image samples into clusters.

Furthermore, [13] studied image segmentation with clustering, and [14] studied text clustering, among many others. In the latter research, two methods, term variance (TV) and document frequency (DF), were used as feature extractors. This feature extractor selects non-frequent and frequent features; after creating the sub-list, Principal Component Analysis (PCA) was used to reduce the feature space. All of these studies demonstrate the importance of clustering in any data domain.

K-means is one of the most common clustering methods. It clusters data into k groups such that objects within a group have minimum distance to each other and maximum distance to data in other clusters [2]. The centre of each cluster is very important, because it determines the area of the cluster; data with k clusters therefore have k centres. In all of these approaches, the number of clusters must be set in advance, and various clustering algorithms are unable to determine the number of clusters in a given dataset. However, some methods can detect the number of clusters; the Two-stage Genetic Clustering Algorithm (TGCA) is one of them. This algorithm can automatically determine the number of clusters and partition the data among those groups [15, 16].

In other research, [17] used parametric modelling of the quantization error to determine the number of clusters, and [18] used integrated entropy to find the number of clusters in mixed data containing both numerical and categorical attributes.

No clustering method can be applied for all purposes; hundreds of clustering algorithms exist. [19] proposed Clustering based on Near Neighbour Influence (CNNI), a method inspired by the idea of near neighbours.

In another study, a similar idea was formulated using a Newtonian law-of-gravity approach. This algorithm reduces the effect of noise on the data and is not sensitive to the initial positions of the centroids. The main idea is based on the law of gravity: a celestial object applies a gravitational force to movable objects and changes their positions [20].

Hebooul, Hachouf, et al. [21] proposed a method based on an Incremental Neural Network (INN). In this constructive model, they use labelled and unlabelled data to learn the space structure; the classifier is trained with labelled data, after which the INN produces the posterior probabilities of each unlabelled sample belonging to the different classes. Unlabelled samples with higher certainty of belonging to one class are then classified by the classifier. The most confidently classified unlabelled data, with their predicted labels, are added to the labelled set. This procedure is repeated until all unlabelled data are fitted to the labelled set.

Metaheuristic methods such as GA, firefly, and ant colony optimization have a large share in clustering. These methods have many applications, such as clustering web documents [22].

İnkaya, Kayalıgil, et al. [23] proposed an ant-colony-based approach for clustering data in which the number of clusters is unknown and clusters may have arbitrary shapes. Their method has two pre-processing steps before initiation: neighbourhood construction and data set reduction. In another study, a combination of GA and k-means is used for clustering big data. This hybrid method makes the clustering results more accurate: the k-means algorithm generates initial cluster centres, which are then used as the initial population for a GA that reduces the sum of squared errors [24].

All of these algorithms can cluster data in different ways and situations. Today's databases typically contain millions of items, and many of them contain outliers. In statistics, an outlier is an observation point that is distant from other observations. Outlier detection has been used in a variety of applications, such as identifying crime, network intrusion, stock market anomalies, and medical data analysis [25].

Outlier detection methods fall into supervised, semi-supervised, and unsupervised groups. Outlier analysis is closely related to change-point and event detection [26]. Most distribution-based approaches for detecting outliers use k-nearest neighbour (KNN) approaches, and a combined model of a global and local outlier detector with KNN can be more effective. KNN methods are very sensitive to the value of k; with different values of k, the set of top outliers differs [27].

In this paper, we propose a model for data clustering with a specific neighbourhood distance between data points in each cluster. The similarity between data points in the same cluster corresponds to a particular distance, and two data points in different clusters have a distinct distance as well. In the next section, the full map of our proposed work is presented.

  3. Proposed Work

The most common application of clustering is to explore the data and find all possible meaningful groups in a dataset. Clusters are also used in many distance-based methods for detecting outliers; however, the value of the distance must be predefined. In this paper, we propose a method for clustering data that uses a distance value based on the Euclidean distance and a specific neighbourhood value to cluster data automatically. Clustering groups related data into the same group, while unrelated data fall into different groups. The similarity measure in many methods is the Euclidean distance:

d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )

where x and y are two different points in the dataset.

In this method, we use a genetic algorithm to search the problem space. The initial population is a group of chromosomes, and each chromosome has some genes. In this approach, we use binary chromosomes: sequences of ones and zeros.

If a chromosome has n genes, it can represent a non-negative integer in the range [0, 2^n − 1].
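Assuming this standard binary encoding (the helper names below are ours, not from the paper), the decoding of a chromosome into a candidate threshold can be sketched in Python:

```python
import random

def decode(chromosome):
    """Convert a binary chromosome (list of 0/1 genes) to a non-negative integer."""
    value = 0
    for gene in chromosome:
        value = value * 2 + gene
    return value

def random_chromosome(n_genes):
    """Random binary chromosome for the initial population."""
    return [random.randint(0, 1) for _ in range(n_genes)]

# A 4-gene chromosome can encode any threshold in [0, 15]:
print(decode([1, 0, 1, 1]))  # → 11
```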

Therefore, we use this coding method, as all distances between points are positive. The steps of our method are as follows:

  1. Initial population:
    1. Create a population of binary chromosomes, where each chromosome has n genes, with n chosen so that the encoded range covers x, the largest distance between any pair of points.
    2. Initialize the population with random numbers.
    3. Calculate the Euclidean distance between every pair of points.
  2. Fitness function:
    1. Set the maximum neighbourhood distance within which any two points belong to the same cluster.
    2. Label all pairs of data points whose distance is at most the maximum neighbourhood distance and create groups from these points.
    3. Give any point within the maximum neighbourhood distance of any point in a group the same label as that group.
    4. Add that new data point to the group.
    5. Repeat steps 2.2 to 2.4 for all data points.
    6. Calculate the fitness as a weighted sum of the number of outliers and the threshold.
  3. Cross-over and mutation.
  4. Termination rule: the method has several termination rules, namely reaching the maximum number of iterations or reaching the answer; meeting either of them stops the run.
  5. Write the result.
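The paper does not detail the genetic operators of step 3; a minimal sketch using standard one-point crossover and bit-flip mutation (a common choice for binary chromosomes, not necessarily the authors' exact operators) might look like:

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap the gene tails of two binary chromosomes at a random cut point."""
    point = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def bit_flip_mutation(chromosome, rate=0.05):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if random.random() < rate else g for g in chromosome]
```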

In the following, we explain these steps.

Fig. 1 shows the workflow of the proposed method. In Fig. 1, one operator labels the points and another groups all points sharing one or several common points into the same group. In the flowchart, the evaluation computes the fitness function to obtain the optimal answer.

Fig. 1. The workflow of the proposed method

3.1 Calculation

In the first step, we calculate the Euclidean distance between every pair of data points:

D = [d_ij], where D is an upper-triangular matrix and each d_ij is the Euclidean distance between data points x_i and x_j. Since d_ij = d_ji and d_ii = 0, the matrix is upper triangular. Fig. 2 gives a view of this calculation.

Fig. 2. A view of the distance calculation
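The pairwise distance computation can be sketched as follows (a minimal Python version; the function name is ours):

```python
import math

def distance_matrix(points):
    """Upper-triangular matrix of Euclidean distances d_ij for i < j."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            D[i][j] = math.sqrt(sum((a - b) ** 2
                                    for a, b in zip(points[i], points[j])))
    return D

D = distance_matrix([(0, 0), (3, 4), (6, 8)])
print(D[0][1])  # → 5.0
```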

In this step, we determine the number of genes for each chromosome. We use the maximum distance in the D matrix, because it is a number large enough to cover all the data. With this distance, we can reach all data points, and we then have only one cluster, which contains the whole dataset. Fig. 3 represents the edges of a cluster and shows that edge 1 has the maximum distance among the edges.


Fig. 3. View of maximum distance between data

After calculating the D matrix, we label all data points whose distance is less than the maximum neighbourhood distance, the threshold (T). In this stage, the labelling of all pairs of data points at the maximum neighbourhood distance or less is initiated, as shown in Fig. 4.

Fig. 4. Grouping each pair of data points

where R is a transitive relation over the set. Therefore, points 1, 2, 3, and 4 are in the same group. Fig. 5 shows the final cluster.

Fig. 5. Grouping data and final cluster

In this method, we use the transitive relation to determine the number of clusters based on the neighbourhood distance. In this model, as Fig. 5 shows, there is no data point within the maximum neighbourhood distance between clusters 1 and 2. Adding new points to a cluster is repeated until no points within the maximum neighbourhood distance remain. An advantage of this method is that, in places with a high density of data points that are close enough together, the many clusters of paired points merge into one or several collective clusters.

Fig. 6. A view of outlier data

The proposed method is shown in the following algorithm (Alg. 1). It groups the data using the transitive relation rule; T is the threshold on the Euclidean distance between data points.

for i ← 1 to n−1
   for j ← i+1 to n
      if Euclidean_distance(x_i, x_j) ≤ T
         if Label(x_i) = ∅ and Label(x_j) = ∅
            Label(x_i) ← i
            Label(x_j) ← i
         elseif Label(x_i) ≠ ∅ and Label(x_j) = ∅
            Label(x_j) ← Label(x_i)
         elseif Label(x_i) = ∅ and Label(x_j) ≠ ∅
            Label(x_i) ← Label(x_j)
         elseif Label(x_i) ≠ ∅ and Label(x_j) ≠ ∅
            M ← Min(Label(x_i), Label(x_j))
            Label(x_i) ← M
            Label(x_j) ← M
         end
      end
   end
end

Here ∅ denotes "no label yet".

Alg.1. Pseudo-code for grouping data section
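Alg. 1 can be sketched directly in Python (0 denoting "no label yet"; function names are ours). One subtlety: when two already-labelled groups meet, assigning the smaller label only to the pair leaves the rest of the other group behind, so the sketch below relabels every member of the group with the larger label, which keeps transitivity across merged groups:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def group_points(points, T):
    """Label points so that any pair within distance T shares a cluster,
    transitively.  Points left with label 0 belong to no cluster (outliers)."""
    n = len(points)
    label = [0] * n
    for i in range(n - 1):
        for j in range(i + 1, n):
            if euclidean(points[i], points[j]) <= T:
                if label[i] == 0 and label[j] == 0:
                    label[i] = label[j] = i + 1
                elif label[i] != 0 and label[j] == 0:
                    label[j] = label[i]
                elif label[i] == 0 and label[j] != 0:
                    label[i] = label[j]
                elif label[i] != label[j]:
                    # Merge two labelled groups: relabel every member of the
                    # group carrying the larger label.
                    keep, drop = sorted((label[i], label[j]))
                    label = [keep if l == drop else l for l in label]
    return label

labels = group_points([(0, 0), (1, 0), (2, 0), (10, 10)], T=1.5)
# points 0, 1, 2 share one label via transitivity; point 3 stays 0 → outlier
```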

If some outlier data points are far from every cluster and from all other points, we can conclude that the density in the neighbourhood of those points is very low, and the method ignores them, as they are not within the defined neighbourhood area. In Fig. 6, if the maximum neighbourhood distance is a small number, data points 1 and 2 are outliers, and with this distance we have only one cluster, since for any distance we need at least two points in the area. With this method, we can also find the outlier points; thus, in addition to clustering, another advantage of this method is distance-based outlier detection. In Fig. 7, we can see six points as outliers.

Fig. 7. (Top) A 3D view of distance-based outliers in 3-dimensional data. (Bottom) A 2D view of the same data.

After these steps, the goal is to minimize the number of outliers together with the threshold. The short model of the problem looks like this:

min f(T) = α · O(T) + β · T

Formula 1. Model of the problem

where T is the threshold, O(T) is the number of outliers for that threshold, and α and β are coefficients determining the importance of each term. Each coefficient lies in the range [0, 1], where 0 means the term is unimportant and 1 means it is highly important.

In each iteration, the value of T changes to minimize the result. Minimizing the result requires minimizing both the number of outliers O(T) and the threshold T. The reason is that, on the one hand, we want to shrink the threshold until we reach the minimum number of outliers in clusters, and on the other hand, by minimizing the outliers, we want to expand the area of each cluster until any data points that are not too far from the clusters become part of them.
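This trade-off can be sketched as a fitness function to be minimized. The weights and normalization below are placeholders, since the paper does not state the exact coefficient values; a point counts as an outlier when it has no neighbour within T:

```python
import math

def count_outliers(points, T):
    """A point with no neighbour within distance T joins no cluster."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return sum(1 for i, p in enumerate(points)
               if not any(j != i and dist(p, points[j]) <= T
                          for j in range(len(points))))

def fitness(T, points, alpha=0.5, beta=0.5, t_max=1.0):
    """Weighted objective to minimize: alpha * outliers + beta * normalized T.
    The alpha/beta values and the normalization by t_max are assumptions,
    not taken from the paper."""
    return alpha * count_outliers(points, T) + beta * (T / t_max)
```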

  4. Analysis & Design of the Proposed Model

In this paper, we use the "perfume data" and "The Animal Gestation" datasets. The first dataset has 20 rows and 28 columns. Table 1 presents a short view of this dataset [28]:

Table 1. A short view of “perfume_data” dataset

Ajayeb 64558 64556 64543 64543 64541 64543
Ajmal 60502 60489 61485 60487 61485 61513
amreaj 57040 57040 57040 58041 58041 58041
Aood 71083 72087 71091 71095 71099 72103
asgar_ali 68209 68209 68216 68216 68223 68223
bukhoor 71046 71046 71046 71046 71046 71046
burberrry 61096 61096 60093 60092 60093 60093
dehenalaod 68132 69137 69137 68137 68137 69142
Junaid 71590 71575 71574 71560 71560 71559
kausar 74631 74649 74650 74650 74650 74632

In the first step, we calculate the Euclidean distance between every pair of points. Table 2 shows a view of the D matrix. This matrix is upper triangular, and each d_ij represents the Euclidean distance between the pair of points x_i and x_j. The algorithm uses the Euclidean distance as the similarity measure between objects.

Table 2. A short view of the D matrix.

  1 2 3 4 5 6 7 8 9
1 0 19733.47 35244.23 39021.77 19436.78 31041.19 21472.63 23137.87 41075.02
2 0 0 15907.12 58500.68 38934.61 50514.59 4564.554 42664.98 60575.3
3 0 0 0 74087.55 54523.06 66142.96 14470.77 58233.36 76127.76
4 0 0 0 0 19805.82 9545.539 60224.86 16301.62 4483.097
5 0 0 0 0 0 12071.62 40657.26 4957.088 21904.23
6 0 0 0 0 0 0 52251.7 9174.637 11348.84
7 0 0 0 0 0 0 0 44468.47 62344.66
8 0 0 0 0 0 0 0 0 18125.04
9 0 0 0 0 0 0 0 0 0

In this stage, we read the outlier score from the input to set groups based on the Euclidean distance. Then, by running Alg. 1, the method creates the clusters and detects outliers for a specific value of the outlier score (T). The result is shown in Fig. 7.

If the value of T is chosen large enough, it covers the whole dataset, and we obtain only one cluster containing all of the data. If the value of T is chosen suitably, the method groups the data into several clusters and detects the outliers based on the Euclidean distance.

The second dataset that we use has 40 rows and 2 columns. Table 4 presents a small part of this dataset, which consists of the average lifespan (in years) and typical gestation period (in days) for 40 different species of mammals [29].


Fig. 7. The top picture shows 500 generations, and the bottom picture shows two clusters for T = 16072

Table 4. A short view of dataset “The Animal Gestation”

animal gestation longevity
baboon 187 20
bear, black 219 18
bear, grizzly 225 25
bear, polar 240 20
beaver 122 5
buffalo 278 15
camel 406 12
cat 63 12
chimpanzee 231 20

The genetic algorithm and outlier detection results for this dataset are shown in Fig. 8. In this run, the value of T is 26, and all data points fall into clusters except one point that is far from all the others (the black point in Fig. 8).


Fig. 8. The left picture shows 500 generations, and the right picture shows four clusters for T = 26

The data point in the upper right of Fig. 8 is an outlier. There appears to be one outlier, indicating an animal with an exceptionally long lifespan and gestation period (this animal happens to be the elephant). The threshold for this dataset is 26, and every point within distance 26 of another point belongs to the same cluster; we investigate this in Fig. 9.


Fig.9. “The Animal Gestation” clusters in the real scale plot

In Fig. 9, the circle around each data point shows the threshold (T) area around that point. Every point close enough to another shares its cluster; with this distance value, following the transitive relation rule, the data are clustered. After that, each data point that does not belong to any cluster is labelled as an outlier. Table 5 summarizes this work.

Table 5. Summary of results.

Name Generations T Cluster(s) Number of data Outlier(s) Percentage of outliers
perfume_data 500 16072 2 20 1 5%
The Animal Gestation 500 26 4 40 1 2.5%
  5. Conclusion

One of the most important operations on data, especially big data, is classification and clustering. This paper proposed a novel model to cluster data and detect outliers based on the Euclidean distance. The approach determines outliers with a distance-based model: any data point that does not lie in any cluster is labelled as an outlier. Outlier detection based on clustering is thus performed by this method. Without being given the number of clusters or the threshold distance, the approach detects clusters and outliers based only on the transitive relation rules. The proposed method can be useful for large datasets and big data. In this paper, we use a GA with a fitness function for clustering and outlier detection; the GA searches the problem space to minimize the number of outliers and maximize the number of clusters. The results of this method are two clusters and one outlier on the "perfume data" dataset and four clusters and one outlier on "The Animal Gestation" dataset. In future work, the proposed method could be combined with another algorithm, such as a memetic algorithm, or a hybrid technique thereof.

References

[1] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” in ACM sigmod record, 2000, pp. 93-104.

[2] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.

[3] O. M. Ebadati E and S. Babaie, “Implementation of Two Stages k-Means Algorithm to Apply a Payment System Provider Framework in Banking Systems,” in Artificial Intelligence Perspectives and Applications. vol. 347, R. Silhavy, R. Senkerik, Z. K. Oplatkova, Z. Prokopova, and P. Silhavy, Eds., ed: Springer International Publishing, 2015, pp. 203-213.

[4] Y. Liu, X. Wu, and Y. Shen, “Automatic clustering using genetic algorithms,” Applied Mathematics and Computation, vol. 218, pp. 1267-1279, 2011.

[5] Z.-g. Liu, Q. Pan, J. Dezert, and G. Mercier, “Credal c-means clustering method based on belief functions,” Knowledge-Based Systems, vol. 74, pp. 119-132, 2015.

[6] A. Hatamlou, “In search of optimal centroids on data clustering using a binary search algorithm,” Pattern Recognition Letters, vol. 33, pp. 1756-1760, 2012.

[7] J.-P. Barthélemy and F. Brucker, “Binary clustering,” Discrete Applied Mathematics, vol. 156, pp. 1237-1250, 2008.

[8] F. Queyroi and S. Kirgizov, “Suppression distance computation for hierarchical clusterings,” Information Processing Letters, vol. 115, pp. 689-693, 2015.

[9] M. A. Rahman and M. Z. Islam, “A hybrid clustering technique combining a novel genetic algorithm with K-Means,” Knowledge-Based Systems, vol. 71, pp. 345-365, 2014.

[10] S. Wikaisuksakul, “A multi-objective genetic algorithm with fuzzy c-means for automatic data clustering,” Applied Soft Computing, vol. 24, pp. 679-691, 2014.

[11] K. Kaushik and V. Arora, “A Hybrid Data Clustering Using Firefly Algorithm Based Improved Genetic Algorithm,” Procedia Computer Science, vol. 58, pp. 249-256, 2015.

[12] L. J. G. Villalba, A. L. S. Orozco, and J. R. Corripio, “Smartphone image clustering,” Expert Systems with Applications, vol. 42, pp. 1927-1940, 2015.

[13] F. Zhao, H. Liu, and J. Fan, “A multiobjective spatial fuzzy clustering algorithm for image segmentation,” Applied Soft Computing, vol. 30, pp. 48-57, 2015.

[14] K. K. Bharti and P. K. Singh, “Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering,” Expert Systems with Applications, vol. 42, pp. 3105-3114, 2015.

[15] H. He and Y. Tan, “A two-stage genetic algorithm for automatic clustering,” Neurocomputing, vol. 81, pp. 49-59, 2012.

[16] S. Razavi, O. M. Ebadati E., S. Asadi, and H. Kaur, “An Efficient Grouping Genetic Algorithm for Data Clustering and Big Data Analysis,” in Computational Intelligence for Big Data Analysis. vol. 19, D. P. Acharjya, S. Dehuri, and S. Sanyal, Eds., ed: Springer International Publishing, 2015, pp. 119-142.

[17] A. Kolesnikov, E. Trichina, and T. Kauranne, “Estimating the number of clusters in a numerical data set via quantization error modeling,” Pattern Recognition, vol. 48, pp. 941-952, 2015.

[18] J. Liang, X. Zhao, D. Li, F. Cao, and C. Dang, “Determining the number of clusters using information entropy for mixed data,” Pattern Recognition, vol. 45, pp. 2251-2265, 2012.

[19] X. Chen, “A new clustering algorithm based on near neighbor influence,” Expert Systems with Applications, 2015.

[20] A. Bahrololoum, H. Nezamabadi-pour, and S. Saryazdi, “A data clustering approach based on universal gravity rule,” Engineering Applications of Artificial Intelligence, vol. 45, pp. 415-428, 2015.

[21] A. Hebooul, F. Hachouf, and A. Boulemnadjel, “A new Incremental Neural Network for Simultaneous Clustering and Classification,” Neurocomputing, 2015.

[22] M. Mahdavi, M. H. Chehreghani, H. Abolhassani, and R. Forsati, “Novel meta-heuristic algorithms for clustering web documents,” Applied Mathematics and Computation, vol. 201, pp. 441-451, 2008.

[23] T. İnkaya, S. Kayalıgil, and N. E. Özdemirel, “Ant Colony Optimization based clustering methodology,” Applied Soft Computing, vol. 28, pp. 301-311, 2015.

[24] E. O. M. Ebadati and M. M. Tabrizi, “A Hybrid Clustering Technique to Improve Big Data Accessibility Based on Machine Learning Approaches,” in Information Systems Design and Intelligent Applications: Proceedings of Third International Conference INDIA 2016, Volume 1, C. S. Satapathy, K. J. Mandal, K. S. Udgata, and V. Bhateja, Eds., ed New Delhi: Springer India, 2016, pp. 413-423.

[25] A. Christy, G. M. Gandhi, and S. Vaithyasubramanian, “Cluster Based Outlier Detection Algorithm for Healthcare Data,” Procedia Computer Science, vol. 50, pp. 209-215, 2015.

[26] M. Gupta, J. Gao, C. Aggarwal, and J. Han, “Outlier detection for temporal data,” Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 5, pp. 1-129, 2014.

[27] X. Wang, X. L. Wang, Y. Ma, and D. M. Wilkes, “A fast MST-inspired kNN-based outlier detection method,” Information Systems, vol. 48, pp. 89-112, 2015.

[28] (2002 – 2003). UCI Machine Learning Repository: Perfume Data Data Set. Available: https://archive.ics.uci.edu/ml/datasets/Perfume+Data

[29] M. S. Hoffman, The World Almanac and Book of Facts 1993: The Authority Since 1868, 1993.


