


Source: site.iugaza.edu.ps/wp-content/uploads/Motaz2006BS_Research.pdf

The Islamic University
Faculty of Information Technology
Department of Computer Science
Gaza - Palestine

Class Outliers Mining: Distance-Based Approach

A project submitted to the Department of Computer Science in partial fulfillment of the requirement for the Degree of Bachelor of Science in Computer Science

by

Motaz K. Saad

Supervisor

Dr. Nabil M. Hewahi

March, 2006


Acknowledgments

I would like to thank my parents very much for their prayers, patience, motivation, and continuous support. I also extend my thanks to all my family members for their motivation and support. I am very grateful to my supervisor Dr. Nabil M. Hewahi for his constant guidance, moral support, and the very good views he provided me.

Class Outliers Mining: Distance-Based Approach. 2


Abstract

In large datasets, identifying an exception or rare case with respect to a group of similar cases is considered a very significant problem. The traditional problem (Outlier Mining) is to find exceptions or rare cases in a dataset irrespective of the class label of these cases; they are considered rare events with respect to the whole dataset. In this research, we discuss the problem of Class Outliers Mining and a method to find those outliers. The general framework of this problem is “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels”. We introduce a novel definition of Class Outlier, and propose the Class Outlier Factor (COF), which measures the degree to which a data object is a Class Outlier. Our work includes a proposal of a new efficient algorithm for mining Class Outliers, experimental results on various domains of real-world datasets, and finally a comparison study with other related methods.


To the Memory of My Father:

Khalid H. Saad.


Table of Contents

Acknowledgments
Abstract
1. Introduction and Motivation
   1.1. Outlier Definitions
   1.2. Outliers Detection Methods
        1.2.1. Distribution Based
        1.2.2. Clustering
        1.2.3. Depth Based
        1.2.4. Distance-Based
        1.2.5. Model Based Approach: Neural Networks (NNs)
   1.3. Distance-Based Approach
        1.3.1. K Nearest Neighbors Approach (KNN)
        1.3.2. Density Based
   1.4. Research Motivation
2. The Proposed Approach
   2.1. The Used Definitions and Terms
        2.1.1. Distance (Similarity) Function
        2.1.2. K Nearest Neighbors (KNN)
        2.1.3. PCL
        2.1.4. Deviation
        2.1.5. K-Distance (The Density Factor)
        2.1.6. Class Outlier
        2.1.7. Class Outlier Factor (COF)
   2.2. The Proposed Algorithm (CODB Algorithm)
3. Experimental Results
   3.1. Experiments
        3.1.1. Experiment I (votes dataset)
        3.1.2. Experiment II (hepatitis dataset)
        3.1.3. Experiment III (heart-statlog dataset)
        3.1.4. Experiment IV (credits approval dataset)
        3.1.5. Experiment V (vehicle dataset)
   3.2. Experimental Results Analysis Study
4. Comparison Study
5. Conclusion
References


Index of Figures

Figure 1: Identifying Outliers
Figure 2: Statistical Approach Outlier Detection (normal distribution)
Figure 3: A schematic view of a fully connected Replicator Neural Network
Figure 4: Distance-Based Approach: K Nearest Neighbors (KNN)
Figure 5: The difference between Distance-based Outliers definitions
Figure 6: The Concept of Density Based Outlier
Figure 7: Party Behavior of votes dataset
Figure 8: The K Nearest Neighbors of the instance T (K = 3)
Figure 9: The probability of the class label with respect to the class label of its nearest neighbors
Figure 10: Deviation: y2 deviates from instances with y class more than y1
Figure 11: K-Distance (Density Factor)
Figure 12: CODB Algorithm
Figure 13: COF_Rank Algorithm
Figure 14: The CODB algorithm flowchart
Figure 15: SOF vs COF


Index of Tables

Table 1: The nearest neighbors of instance #407 of votes dataset
Table 2: The nearest neighbors of instance #31 of hepatitis dataset
Table 3: The 7 Nearest Neighbors of the Instance #69 of heart-statlog dataset
Table 4: The top 20 Class Outliers of votes dataset
Table 5: The 7NN of the top 10 class outliers of votes dataset
Table 6: The top 10 Class Outliers of hepatitis dataset
Table 7: The 7NN of the top 10 class outliers of hepatitis dataset
Table 8: The top 10 Class Outliers of heart-statlog dataset
Table 9: The 7NN of the top 10 class outliers of heart-statlog dataset
Table 10: The top 10 Class Outliers of credit-a dataset
Table 11: The 7NN of the top 10 class outliers of credit-a dataset
Table 12: The top 10 Class Outliers of vehicle dataset
Table 13: The 9NN of the top 10 class outliers of vehicle dataset
Table 14: The top 20 Semantic Outliers for votes dataset


1. Introduction and Motivation

Detecting outliers, examples in a database with unusual properties, is an important data mining task. Recently, researchers have begun focusing on this problem and have attempted to apply algorithms for finding outliers to tasks such as fraud detection [7], identifying computer network intrusions [10, 28], data cleaning [37], detecting employers with poor injury histories [24], and several other problem domains (e.g., surveillance and auditing, stock market analysis, health monitoring systems, insurance, banking, and telecommunication, etc.). The problem of detecting rare events, deviant objects, and exceptions is very important. Methods for finding such outliers in large data sets are drawing increasing attention [3, 4, 12, 17, 18, 24, 25, 26, 27, 36]. Figure 1 shows the outliers of a data set in n-dimensional space, marked with circles. For a set of numerical data, an outlier is any value that is markedly smaller or larger than the other values. For example, in the data set {3, 5, 4, 4, 6, 2, 25, 5, 6, 2}, the value 25 is an outlier.


Figure 1: Identifying Outliers.


1.1. Outlier Definitions

1.1.1. An Outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism [13].

1.1.2. An Outlier is a data object that does not comply with the general behavior of the data. It can be considered noise (one person's noise could be another person's signal) or an exception, but it is quite useful in fraud detection and rare events analysis [12].

1.2. Outliers Detection Methods

1.2.1. Distribution Based

Methods in this category are typically found in statistics textbooks. They deploy some standard distribution model (e.g., normal) and flag as outliers those points which deviate from the model [4, 13, 36].

For arbitrary data sets without any prior knowledge of the distribution of points, we have to perform expensive tests to determine which model fits the data best, if any. Figure 2 depicts the normal distribution of the points of a data set.

Outlier detection has a long history in statistics [4, 13], but has largely focused on data that is univariate, and with a known (or parametric) distribution. These two limitations have restricted the ability to apply these types of methods to large real-world databases which typically have many different fields and have no easy way of characterizing the multivariate distribution of examples.


1.2.2. Clustering

Many clustering algorithms detect outliers as by-products [17]. However, since their main objective is clustering, they are not optimized for outlier detection. Furthermore, the outlierness criteria are often implicit and cannot easily be inferred from the clustering procedures. An intriguing clustering algorithm using the fractal dimension has been suggested by [3]; however, it has not been demonstrated on real datasets.

1.2.3. Depth Based

This approach is based on computational geometry and computes different layers of k-d convex hulls [18]. Points in the outer layer are potentially flagged as outliers. However, these algorithms suffer from the curse of dimensionality.

1.2.4. Distance-Based

This definition, DB(p, d), was originally proposed by Knorr and Ng [25]: a point in a data set DB is a distance-based outlier if at least a fraction p of the points in DB are further than d from it


Figure 2: Statistical Approach Outlier Detection (normal distribution).


[12]. This outlier definition is based on a single, global criterion determined by the parameters p and d, and cannot cope with local density variations.
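A direct, quadratic reading of this definition can be sketched as follows. The distance function `dist` is supplied by the caller, and taking the fraction over all n points (rather than some other normalization) is a simplifying assumption of this sketch, not part of the original formulation:

```python
def db_outliers(points, p, d, dist):
    """Return indices of DB(p, d) outliers: points for which at least a
    fraction p of the data set lies further than distance d away."""
    n = len(points)
    outliers = []
    for i, x in enumerate(points):
        # count how many other points are further than d from x
        far = sum(1 for j, y in enumerate(points) if j != i and dist(x, y) > d)
        if far / n >= p:
            outliers.append(i)
    return outliers
```

For example, with 1-D points [0, 1, 2, 100], p = 0.7 and d = 10, only the point 100 qualifies, since three quarters of the data lie further than 10 from it.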

In Section 1.3, we discuss the distance-based approach in detail, due to its relation to our approach.

1.2.5. Model Based Approach: Neural Networks (NNs)

This approach was proposed by Hawkins et al. [14]. It employs multi-layer perceptron neural networks with three hidden layers, and the same number of output neurons and input neurons, to model the data. These neural networks are known as replicator neural networks (RNNs). In the RNN model the input variables are also the output variables, so that the RNN forms an implicit, compressed model of the data during training.

A measure of outlyingness of individuals is then developed as the reconstruction error of individual data points.

The RNN is a feed-forward multi-layer perceptron with three hidden layers sandwiched between an input layer and an output layer. The function of the RNN is to reproduce the input data pattern at the output layer with error minimized through training. Both input and output layers have n units, corresponding to the n features of the training data. The number of units in the three hidden layers is chosen experimentally to minimize the average reconstruction error across all training patterns.

The model structure is shown in Figure 3 and summarized in the points below:

i. Use a replicator 4-layer feed-forward neural network.

ii. Input variables are the target output during training.

iii. Replicator Neural Networks (RNNs) form a compressed model of the training data.

iv. Outlyingness → reconstruction error.
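The scoring step in point iv can be sketched independently of any particular network. Here `model` is a hypothetical callable standing in for a trained replicator network (mapping an input vector to its reconstruction); outlyingness is then just the mean squared reconstruction error:

```python
def reconstruction_error(model, x):
    """Mean squared error between an input vector and its reconstruction."""
    x_hat = model(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def rank_by_outlyingness(model, data):
    """Rank records so the worst-reconstructed (most outlying) come first."""
    return sorted(data, key=lambda x: reconstruction_error(model, x), reverse=True)
```

A record the compressed model reproduces poorly sits far from the bulk of the training data, which is exactly the RNN notion of an outlier.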


1.3. Distance-Based Approach

This approach has two major methods: K Nearest Neighbors and Density Based.

1.3.1. K Nearest Neighbors Approach (KNN)

Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task [24, 26, 27].

Although distance is an effective non-parametric approach to detecting outliers, the drawback is the amount of computation time required. Straightforward algorithms, such as those based on nested loops, typically require O(N²) distance computations. This quadratic scaling means that it will be very difficult to mine outliers as we tackle increasingly larger data sets. This is a major problem for many real databases, where there are often millions of records.

Some of these approaches use a simple nested-loop algorithm that is quadratic in the worst case, and there are some attempts to improve it to give near-linear time performance


Figure 3: A schematic view of a fully connected Replicator Neural Network.


when the data is in random order [5].

A popular method of identifying outliers is by examining the distance to an example's nearest neighbors [2, 23, 24, 35] as shown in Figure 4. In this approach, one looks at the local neighborhood of points for an example typically defined by the K nearest examples (also known as neighbors). If the neighboring points are relatively close, then the example is considered normal; if the neighboring points are far away, then the example is considered unusual. The advantages of distance-based outliers are that no explicit distribution needs to be defined to determine unusualness, and that it can be applied to any feature space for which we can define a distance measure. Some other distance-based outlier definitions:

i. Outliers are the examples for which there are fewer than p other examples within distance d [23, 24].


Figure 4: Distance-Based Approach: K Nearest Neighbors (KNN).


ii. Outliers are the top n examples whose distance to the Kth

nearest neighbor is greatest [35].

iii.Outliers are the top n examples whose average distance to the K nearest neighbors is greatest [10, 24].

There are several minor differences between these definitions, as shown in Figure 5. The first definition does not provide a ranking and requires specifying a distance parameter d. Ramaswamy et al. [10] argue that this parameter can be difficult to determine and may involve trial and error to guess an appropriate value. The second definition only considers the distance to the Kth neighbor and ignores information about closer points. Finally, the last definition accounts for the distance to each neighbor but is slower to calculate than definitions 1 or 2. However, all of these definitions are based on a nearest-neighbors density estimate, treating points in low-probability regions as outliers.
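Definitions ii and iii differ only in how the K neighbor distances are aggregated; a brute-force O(N²) sketch with a caller-supplied `dist` makes the contrast concrete:

```python
def knn_outlier_scores(points, k, dist, use_average=True):
    """Score each point by its K nearest neighbor distances.
    use_average=True  -> definition iii (average distance to the K NN)
    use_average=False -> definition ii  (distance to the Kth NN only)
    The top-n scoring points are the outliers."""
    scores = []
    for i, x in enumerate(points):
        dists = sorted(dist(x, y) for j, y in enumerate(points) if j != i)
        knn = dists[:k]
        scores.append(sum(knn) / k if use_average else knn[-1])
    return scores
```

Either way, an isolated point accumulates large neighbor distances and rises to the top of the ranking.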

Researchers have tried a variety of approaches to find these outliers efficiently. The simplest are those using nested loops [23, 24, 35]. In the basic version, one compares each example with every other example to determine its K nearest


Figure 5: The difference between distance-based outlier definitions. Both p1 and p2 have 10 neighbors; in some approaches p1 and p2 have the same outlierness ranking, and in other approaches p1 has a larger outlierness ranking than p2.


neighbors. Given the neighbors for each example in the data set, simply select the top n candidates according to the outlier definition. This approach has quadratic complexity as we must make all pairwise distance computations between examples.

1.3.2. Density Based

This was proposed by Breunig et al. [8] and by Ester et al. [11]. It relies on the local outlier factor (LOF: the average of the ratios of the density of example p and the density of its nearest neighbors) of each point, which depends on the local density of its neighborhood. The neighborhood is defined by the distance to the MinPts-th nearest neighbor, where MinPts is the minimum number of points of the nearest neighbors. Figure 6 shows the concept of density-based outliers and the difference between the nearest neighbors approach and the density-based approach: in the nearest neighbors (NN) approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers. In typical use, points with a high LOF are flagged as outliers. The process proceeds as follows:

- Compute density of local neighborhood for each point.

- Compute LOF.

- Larger LOF → Outliers.
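The full LOF formulation uses reachability distances; the sketch below keeps only the core density-ratio idea from the steps above, approximating density as the inverse of the average KNN distance. It is a simplification for illustration, not a faithful implementation of [8]:

```python
def simplified_lof(points, k, dist):
    """Approximate LOF: for each point, the average density of its K nearest
    neighbors divided by its own density. Scores well above 1 suggest outliers."""
    def knn(i):
        by_dist = sorted((dist(points[i], points[j]), j)
                         for j in range(len(points)) if j != i)
        return [j for _, j in by_dist[:k]]

    def density(i):
        avg = sum(dist(points[i], points[j]) for j in knn(i)) / k
        return 1.0 / (avg + 1e-12)  # guard against zero distance

    return [sum(density(j) for j in knn(i)) / (k * density(i))
            for i in range(len(points))]
```

Because the score is a ratio against the neighbors' densities, a point on the edge of a sparse cluster is judged relative to that cluster, which is exactly the local-density behavior the distance-based definitions lack.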

The density-based approach was proposed primarily to deal with the local density problems of the distance-based method. However, selecting MinPts is non-trivial: in order to detect outlying clusters, MinPts has to be as large as the size of these clusters.


It should be mentioned that the clustering-based approach can be classified under the distance-based approach, because it basically uses distance-based techniques.

1.4. Research Motivation

From the previous discussion, we notice that none of the mentioned approaches consider the class labels of the data set; rather, they focus on the observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. This means all the previous methods operate on the overall data set without looking closely at each class label separately.

Obviously, in the K Nearest Neighbor systems, it is


Figure 6: The Concept of Density Based Outlier. In the nearest neighbors (NN) approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.


expected that the instances in the KNN set have the same class label. However, this is not always true. It is reasonable to take those instances whose class label differs from that of the majority of their KNN as class outliers, with consideration also given to other factors. More details will be discussed in Chapter 2.
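As an illustration only (the actual method, with the additional factors, is developed in Chapter 2), the basic candidate test, namely that an instance's class label disagrees with the majority label of its K nearest neighbors, can be sketched as:

```python
from collections import Counter

def class_outlier_candidates(X, y, k, dist):
    """Indices of instances whose class label differs from the majority
    class label of their K nearest neighbors (candidate class outliers)."""
    candidates = []
    for i in range(len(X)):
        by_dist = sorted((dist(X[i], X[j]), j)
                         for j in range(len(X)) if j != i)
        neighbor_labels = [y[j] for _, j in by_dist[:k]]
        majority = Counter(neighbor_labels).most_common(1)[0][0]
        if y[i] != majority:
            candidates.append(i)
    return candidates
```

For example, a lone "democrat" sitting inside a tight group of "republican" instances is returned as a candidate, mirroring the votes example below.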

To show the significance of class outliers, consider the following examples. Consider the problem of finding a voter from the Democratic party who behaves or acts like Republicans; in other words, what percentage of Democrats act like (have similar ideas to) Republicans, and vice versa? Table 1 shows the nearest neighbors of instance #407 in the votes dataset [6]. The table shows that the class label of instance #407 is different from that of its neighbors, although it is very similar to them. More information about the domain, dataset, and experimental details is presented in Chapter 3.

The 7 Nearest Neighbors of the Instance #407

Inst #   Issue: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16   Class

407 n n n y y y n n n n y y y y n n democrat

306 n n n y y y n n n n n y y y n n republican

83 n n n y y y n n n n n y y y n n republican

87 n n n y y y n n n n n y y y n n republican

303 n n n y y y n n n n n y y y n n republican

119 n n n y y y n n n n n y y y n n republican

339 y n n y y y n n n n y y y y n n republican

Table 1: The nearest neighbors of instance #407 of votes dataset

To understand why instance #407 (Democrat) is close to a set of Republican instances, an analysis was performed on the whole dataset, and it was found that the majority of Democrats vote anti-Republican on issues 3, 4, 9, 12 and 13, whereas instance #407 voted pro-Republican on all of these issues. Figure 7 depicts the party behavior on the issues 3, 4, 9, 12 and


13. For example, a minority of Republicans vote "yes" on issue 3, although the majority of Democrats vote "yes". Similarly for the same issue, a majority of Republicans and a minority of Democrats vote "no".

In medical/biological domains, consider the problem of finding the exceptional case (or cases) in a group of similar cases, where the class label is a medical diagnosis (like "live" and "die" in the hepatitis dataset, or "absent" and "present" in the heart-statlog dataset [7]). The question is why the class label of an instance is "die" while the class labels of its KNN are "live". Table 2 shows a case where most of the inputs for instance #31


Figure 7: Party Behavior of votes dataset.


are mostly similar to those of its seven nearest neighbors, but its class is different. Instance #31 is considered to be a class outlier. Given this table, we are not trying to explain how this case happened medically; it remains an interesting question to be answered by doctors. Table 3 shows the same interesting observation.

The 7 Nearest Neighbors of the Instance #31

Inst. #   Att. # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19   Class

31 62 female no no yes yes no yes no no no no no 1 105.3 60 3.82 61.85 no DIE

69 44 female no no yes yes no yes no no no no no 1.6 68 68 3.7 61.85 no LIVE

25 27 female no no yes yes no yes no no no no no 0.8 95 46 3.8 100 no LIVE

64 49 female yes no yes yes no yes no no no no no 0.8 103 43 3.5 66 no LIVE

29 51 female no yes yes yes no yes no no no no no 1 78 58 4.6 52 no LIVE

55 37 female no no yes yes yes yes no no no no no 0.8 92 59 3.82 61.85 no LIVE

83 67 male no no yes yes no yes no no no no no 1.5 179 69 2.9 61.85 no LIVE

Table 2: The nearest neighbors of instance #31 of hepatitis dataset.

The 7 Nearest Neighbors of the Instance #69

Inst. #   Att. # 1 2 3 4 5 6 7 8 9 10 11 12 13   Class

69 47 1 3 108 243 0 0 152 0 0 1 0 3 present

62 44 1 3 120 226 0 0 169 0 0 1 0 3 absent

150 41 1 3 112 250 0 0 179 0 0 1 0 3 absent

179 50 1 3 129 196 0 0 163 0 0 1 0 3 absent

38 42 1 3 130 180 0 0 150 0 0 1 0 3 absent

253 51 1 3 110 175 0 0 123 0 0.6 1 0 3 absent

23 47 1 4 112 204 0 0 143 0 0.1 1 0 3 absent

Table 3: The 7 Nearest Neighbors of the Instance #69 of heart-statlog dataset


We believe that class outliers have advantages and applications similar to those of traditional outliers, such as data preprocessing and cleaning, credit card fraud detection, network intrusion detection, stock market analysis, health care and monitoring, etc. (in general, the problem of detecting rare events, deviant objects, and exceptions). Furthermore, class outliers have very promising potential advantages, applications, and new future research directions.

To the best of our knowledge, the problem of “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels” has only been explicitly considered in [15, 16, 32]. The proposed methods are the Semantic Outlier [15], Cross-Outlier detection [32], and the Class Outlier [16].

He et al. [15] tried to find meaningful outliers using what they called the Semantic Outlier Factor (SOF). The approach is based on running a clustering algorithm on a data set with class labels; it is expected that the instances in every output cluster have the same class label. However, this is not always true, and it is reasonable to take those instances whose class label differs from that of the majority of the cluster as semantic outliers. A semantic outlier is defined as a data point which behaves differently from other data points in the same class, while looking normal with respect to data points in another class.

Papadimitriou and Faloutsos [32] tried to solve the problem: given two sets (or classes) of objects, find those which deviate with respect to the other set. Those points are called cross-outliers, and the problem is identified as cross-outlier detection. In this case we have a primary set P in which we want to discover cross-outliers with respect to a reference set R (detecting outlying observations: discover points p ∈ P that “arouse suspicions” with respect to points r ∈ R). The proposed solution is to use a statistically intuitive criterion for outlier flagging (the local neighborhood size differs by more than three standard deviations from the local average), with no magic cut-offs. Papadimitriou and Faloutsos [32] considered that


some single-class approaches may be modified to deal with multiple classes, but the task is non-trivial. The general problem is open and provides promising future research directions. The authors generally considered that the existing approaches for the single-set problem are not immediately extensible to cross-outlier detection; also, several outlier definitions themselves cannot be extended.
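A heavily simplified sketch of the flagging criterion described above, with two deliberate departures from [32]: the neighborhood is a fixed radius r chosen by the caller, and the count is compared against the global (not local) average:

```python
def cross_outliers(P, R, r, dist):
    """Flag points of the primary set P whose count of reference-set (R)
    neighbors within radius r deviates by more than three standard
    deviations from the average such count."""
    counts = [sum(1 for q in R if dist(p, q) <= r) for p in P]
    mean = sum(counts) / len(counts)
    std = (sum((c - mean) ** 2 for c in counts) / len(counts)) ** 0.5
    return [i for i, c in enumerate(counts) if abs(c - mean) > 3 * std]
```

A point of P sitting far from the whole reference set has a near-zero neighbor count and is flagged, while points embedded in R are not.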

He et al. [16] tried to find a general framework for all types of outliers (traditional outliers and class outliers) and to generalize the contributions of [32, 16] by proposing a practical solution and extending existing outlier detection algorithms. The generalization considers not only outliers that deviate with respect to their own class, but also outliers that deviate with respect to other classes. In addition, potential applications in customer relationship management (CRM) are introduced.

In this research we propose a new method for mining class outliers based on a distance-based approach and nearest neighbors, introducing the concept of the COF (Class Outlier Factor), which represents the degree of being a class outlier. We also try to overcome some limitations of the related methods.

The main limitation of the previously proposed methods is that they do not handle numeric or mixed datasets. Moreover, in [38, 39] clustering has to be performed as a pre-processing step, which surely increases the computational complexity. In addition, the approach used in [15, 16] bases the rank of an outlier on the probability of its occurrence within a certain cluster, which might produce the same rank for very far and very close outliers in the cluster. In [32], the proposed approach does not handle datasets with more than two classes, whereas our proposed method takes care of this problem.


The main contributions of our research are the following:

● Distance-Based Class Outlier definition.

● Introducing the Concept of COF (Class Outlier Factor)

● CODB Algorithm for mining Class Outliers.

● CODB Algorithm implementation using the Weka framework [39].

● Experimental results of testing the algorithm on real world data sets, which show the capability of COF to find class outliers.

● Comparison study with results of other related methods.

Details of the proposed approach are discussed in Chapter 2. Chapter 3 presents the experimental results, followed by a comparison study with results obtained by other previously proposed methods in Chapter 4. Finally, Chapter 5 contains the conclusions and the future work.


2. The Proposed Approach

2.1. The Used Definitions and Terms

Before going into the details of the proposed approach, we shall give the following definitions.

2.1.1. Distance (Similarity) Function

Given a data set D = {t1, t2, t3, ..., tn} of tuples where each tuple ti = <ti1, ti2, ti3, ..., tim, Ci> contains m attributes and the class label Ci, the similarity function based on the Euclidean Distance [12] between two data tuples, X = <x1, x2, x3, ...., xm> and Y = <y1, y2, y3,..., ym> (excluding the class labels) is

d_2(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

A generalization of the Euclidean function is the Minkowski similarity function [12]:

d_q(X, Y) = \sqrt[q]{\sum_{i=1}^{m} w_i \, |x_i - y_i|^q}

The Euclidean function results by setting q to 2 and each weight, wi, to 1.

The Manhattan distance,

d_1(X, Y) = \sum_{i=1}^{m} |x_i - y_i|

results by setting q to 1 and each weight, wi, to 1.
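As a minimal illustration (a sketch, not the thesis's Weka implementation), the Minkowski family above can be written in Python; the helper names `minkowski`, `euclidean`, and `manhattan` are ours, with the Euclidean and Manhattan distances as the q = 2 and q = 1 special cases with unit weights:

```python
def minkowski(x, y, q=2, weights=None):
    """Minkowski distance d_q(X, Y) = (sum_i w_i * |x_i - y_i|^q)^(1/q)."""
    if weights is None:
        weights = [1.0] * len(x)           # default: every weight w_i = 1
    return sum(w * abs(a - b) ** q
               for w, a, b in zip(weights, x, y)) ** (1.0 / q)

def euclidean(x, y):
    """Euclidean distance: q = 2, all weights 1."""
    return minkowski(x, y, q=2)

def manhattan(x, y):
    """Manhattan distance: q = 1, all weights 1."""
    return minkowski(x, y, q=1)
```

For example, `euclidean([0, 0], [3, 4])` gives 5.0 and `manhattan([0, 0], [3, 4])` gives 7.0.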


The distance function is used to determine similarity. For numeric attributes it is usually based on the Euclidean distance after normalizing the attributes. An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0 [12].

Normalization is particularly useful for classification algorithms such as neural networks, and for distance-based methods such as nearest neighbor classification and clustering. When using the neural network back-propagation algorithm for classification mining, normalizing the input values of each attribute measured in the training samples helps speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). There are many methods for data normalization: min-max normalization, z-score normalization, and normalization by decimal scaling [12].

We use min-max normalization for our distance function. Min-max normalization performs a linear transformation on the original data. Suppose that MinA and MaxA are the minimum and the maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [newMinA, newMaxA] by computing

v' = \frac{v - Min_A}{Max_A - Min_A} \, (newMax_A - newMin_A) + newMin_A

Min-max normalization preserves the relationships among the original data values [12].

In our distance function we use min-max normalization with a scaling range between 0.0 and 1.0, so the above formula becomes


v' = \frac{v - Min_A}{Max_A - Min_A} \, (1.0 - 0.0) + 0.0

v' = \frac{v - Min_A}{Max_A - Min_A}
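The normalization step above can be sketched as follows; `min_max_normalize` is an illustrative helper name, applied to one attribute column at a time:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a column of numeric values into [new_min, new_max],
    preserving the relationships among the original values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]
```

With the default range [0.0, 1.0] the expression reduces to (v - Min_A) / (Max_A - Min_A), exactly as in the simplified formula.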

Symbolic (nominal) features are more problematic as they do not fit in the Euclidean feature space model. To overcome this problem, similarity between symbolic features is determined by counting the matching features. This is a much weaker function as there may be several concepts based on entirely different features, all of which match the current example to the same degree. For domains containing a mixture of numeric and symbolic features the Euclidean distance function is adopted, with the distance between two symbolic values trivialized to zero if the features are the same, and one if they are not. This mismatch between Euclidean feature space and symbolic features means that pure nearest neighbor systems usually perform better in numeric domains than in symbolic ones.
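The mixed numeric/symbolic distance described above can be sketched as follows (an illustrative helper, assuming numeric attributes have already been normalized; symbolic values contribute 0 on a match and 1 on a mismatch):

```python
def mixed_distance(x, y):
    """Euclidean-style distance over tuples mixing numeric and symbolic
    attributes: numeric attributes contribute their squared difference,
    symbolic attributes contribute 0 if equal and 1 otherwise."""
    total = 0.0
    for a, b in zip(x, y):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            total += (a - b) ** 2          # numeric attribute
        else:
            total += 0.0 if a == b else 1.0  # symbolic attribute
    return total ** 0.5
```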

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:

1. d(i, j) ≥ 0 : Distance is nonnegative number.

2. d(i, i) = 0 : The distance of an object and itself is 0.

3. d(i, j) = d(j, i) : Distance is a symmetric function.

4. d(i, j) ≤ d(i, h) + d(h, j) : Going directly from object i to object j in space is no longer than making a detour over any other object h (triangle inequality).

2.1.2. K Nearest Neighbors (KNN)

For any positive integer K, the K-Nearest Neighbors of a tuple ti are the K closest tuples in the data set, as shown in Figure 8.
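A minimal sketch of the K nearest neighbors computation, assuming each instance is represented as a pair (attributes, class_label) and an arbitrary distance function over the attribute vectors:

```python
def k_nearest_neighbors(dataset, t, k, dist):
    """Return the K instances in `dataset` closest to instance `t`.
    Each instance is a pair (attributes, class_label); the class label
    is excluded from the distance computation."""
    others = [u for u in dataset if u is not t]
    return sorted(others, key=lambda u: dist(t[0], u[0]))[:k]
```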

2.1.3. PCL

PCL(T): The Probability of the class label of the instance T with respect to the class labels of its K Nearest Neighbors.

For example, suppose we are working with the 7 nearest neighbors of an instance T (including itself) on a data set with two class labels x and y, where 5 of these neighbors have the class label x and 2 have the class label y, as shown in Figure 9. The instance T has the class label y, so the PCL of the instance T (the probability of its class label y relative to the class labels of the nearest neighbors) is 2/7.
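The PCL computation on such a neighborhood can be sketched as follows (an illustrative helper, with instances represented as (attributes, class_label) pairs and the neighborhood including T itself, as in the example):

```python
def pcl(t, neighborhood):
    """PCL(T): the fraction of instances in T's K-neighborhood (T included)
    that carry T's own class label."""
    return sum(1 for u in neighborhood if u[1] == t[1]) / len(neighborhood)
```

For the example above (2 of the 7 neighbors, T included, share T's label y), this returns 2/7.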


Figure 8: the K Nearest Neighbors of the instance T (K = 3).


2.1.4. Deviation

Given a subset DCL = {t1, t2, t3, ..., th} of a data set D = {t1, t2, t3, ..., tn}, where h is the number of instances in DCL and n is the number of instances in D, and given an instance T, DCL contains all the instances that have the same class label as the instance T.

The Deviation of T is how much the instance T deviates from the DCL subset. It is computed by summing the distances between the instance T and every instance in DCL:

Deviation(T) = \sum_{i=1}^{h} d(T, t_i), \quad t_i \in DCL
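The Deviation computation can be sketched as follows (an illustrative helper; T's own zero distance to itself is harmlessly included in the sum):

```python
def deviation(t, dataset, dist):
    """Deviation(T): the sum of distances between T and every instance in
    DCL, the subset of the data set sharing T's class label."""
    return sum(dist(t[0], u[0]) for u in dataset if u[1] == t[1])
```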


Figure 9: The probability of the class label with respect to class label of its nearest neighbors.


To demonstrate the importance of deviation, Figure 10 illustrates that although the PCL values of y1 and y2 are the same (K = 5), y2 deviates more than y1 from the instances with class y. This reflects the class outlier decision in favor of y2.

2.1.5. K-Distance (The Density Factor)

K-Distance(T) is the sum of the distances between the instance T and its K nearest neighbors, i.e. a measure of how close the K nearest neighbor instances are to the instance T.


Figure 10: Deviation: y2 deviates from instances with y class more than y1.


KDist(T) = \sum_{i=1}^{K} d(T, t_i)

As shown in Figure 11, the KNN of the instance y (the bold y) in Figure 11.B are much closer (higher density) to it than the KNN of the instance y (the bold y) in Figure 11.A. Although the PCL(T) for both instances in Figure 11.A and Figure 11.B is 2/7, the instance y in Figure 11.B is considered a stronger class outlier than the instance y in Figure 11.A.
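The K-Distance (density factor) can be sketched as follows, given a list of the K nearest neighbors obtained beforehand (an illustrative helper):

```python
def k_dist(t, neighbors, dist):
    """KDist(T): the sum of distances between T and its K nearest neighbors;
    a smaller value means a denser neighborhood around T."""
    return sum(dist(t[0], u[0]) for u in neighbors)
```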

2.1.6. Class Outlier

Class Outliers are the top N instances which satisfy the following:

1. The K-Distance to its K nearest neighbors is the least.

2. Its Deviation is the greatest.

3. Has a different class label from that of its K nearest neighbors.


Figure 11: K-Distance (Density Factor).


2.1.7. Class Outlier Factor (COF)

COF (Class Outlier Factor): the Class Outlier Factor of the instance T is its degree of being a class outlier, and is defined as:

COF(T) = K \cdot PCL(T) + \alpha \cdot \frac{1}{Deviation(T)} + \beta \cdot KDist(T)

where PCL(T), Deviation(T) and KDist(T) are described in Sections 2.1.3, 2.1.4 and 2.1.5 respectively.

As shown above, we scale PCL(T) from [1/K, 1] to [1, K] by multiplying it by K. The α and β factors control the importance and effects of Deviation and K-Distance, where 0 ≤ α ≤ M and 0 ≤ β ≤ 1. M is a changeable value based on the application domain and the initial experimental results: if the Deviation is in the hundreds, for example, the best value for α is 100; if the Deviation is in the tens, then the best value for α is 10, and so on. For more details see Chapter 3, where we present experimental results and show how these factors affect the conclusions.
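Putting the three components together, a COF ranking can be sketched as follows. This is a minimal illustration under our own assumptions: instances are (attributes, class_label) pairs, and the K-neighborhood counts T itself (its distance is 0), matching the PCL example earlier in this section:

```python
def cof(t, dataset, k, alpha, beta, dist):
    """COF(T) = K*PCL(T) + alpha*(1/Deviation(T)) + beta*KDist(T)."""
    # K-neighborhood of t, counting t itself (distance 0)
    neighborhood = sorted(dataset, key=lambda u: dist(t[0], u[0]))[:k]
    # PCL: fraction of the neighborhood sharing t's class label
    pcl = sum(1 for u in neighborhood if u[1] == t[1]) / len(neighborhood)
    # Deviation: sum of distances to all instances with t's class label
    deviation = sum(dist(t[0], u[0]) for u in dataset if u[1] == t[1])
    # KDist: sum of distances to the K-neighborhood (density factor)
    kdist = sum(dist(t[0], u[0]) for u in neighborhood)
    return k * pcl + alpha / deviation + beta * kdist
```

A smaller COF value indicates a stronger class outlier.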

2.2. The Proposed Algorithm (CODB Algorithm)

In this section we present the proposed algorithm, which we call “Class Outliers: Distance-Based” (the CODB Algorithm). Figure 12 shows the pseudocode of the CODB algorithm, Figure 13 presents the COF_Rank procedure which is called by CODB, and Figure 14 depicts the CODB algorithm flowchart.

The main concept of CODB is to rank each instance in the dataset D. This is done by calling the COF_Rank procedure after providing CODB with all necessary data such as the values of α, β and K (the number of nearest neighbors). The COF_Rank procedure finds the rank of each instance using the


formula of Section 2.1.7 and gives the rank back to CODB. CODB maintains a list of only the instances of the top N class outliers. The lower the COF value of an instance, the higher the priority of the instance to be a class outlier.

CODB Algorithm
Input:
    D = {t1, t2, t3, ..., tn}   /* Dataset */
    n   /* Dataset size */
    α   /* Alpha factor */
    β   /* Beta factor */
    K   /* Number of nearest neighbors */
    N   /* Number of top class outliers */

Output:
    The top N class outliers and their COF (Class Outlier Factor) values

Process:
    /* Initialize an empty list for the top N class outliers */
    COF_List = new List with size N;
    /* Initialize the Class Outlier Factor */
    COF = 0;
    /* Rank each instance in the data set with COF, keeping only the
       top N class outliers in COF_List */
    for i = 1 to n {
        COF = COF_Rank(Instance(i), K, α, β); // compute COF for the instance i
        if COF_List.Size() < N then           // if the list is not full
            COF_List.Add(Instance(i));        // then add the instance i to the list
        else {
            /* Keep only the top N class outliers (those with smaller COF values) */
            if COF < COF_List.GetMax() then {
                /* Remove the instance with the highest COF value */
                COF_List.RemoveMax();
                COF_List.Add(Instance(i));
            }
        }
    }
    /* Print out the top N class outliers list */
    print(COF_List);

Figure 12: CODB Algorithm.


Procedure: COF_Rank( Instance(i), K, α, β )

Output:
    COF /* Class Outlier Factor (degree of outlierness: smaller → top class outlier) */

Process:
    /* Initialize PCL (the probability of the class label of the instance T with
       respect to the class labels of its K nearest neighbors) */
    PCL = 0;
    /* Initialize Deviation (how much the instance T deviates from the instances
       that have the same class label, i.e. how much T differs from the other
       instances in its class) */
    Deviation = 0;
    /* Initialize KDist (the sum of the distances between the instance T and its
       K nearest neighbors, i.e. how close the K nearest neighbor instances are to T) */
    KDist = 0;

    PCL = PCL(T);
    /* Deviation(T): Σ of the distances between T and the instances in the same class */
    Deviation = Deviation(T);
    /* KDist(T): Σ of the distances between T and the K neighbor instances */
    KDist = KDist(T);
    COF = PCL + (α / Deviation) + (β * KDist);

    return COF;

Figure 13: COF_Rank Algorithm
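The loop of Figures 12 and 13 can be sketched end-to-end in Python. This is a self-contained illustration under our own assumptions (instances as (attributes, class_label) pairs, the K-neighborhood counting the instance itself), not the actual Weka-based implementation:

```python
import heapq

def codb(dataset, k, alpha, beta, n, dist):
    """Sketch of the CODB loop: compute a COF-style rank for every instance
    and keep the n instances with the smallest values (smaller COF means a
    stronger class outlier). Returns (COF, index) pairs."""
    def cof_rank(t):
        # K-neighborhood of t, counting t itself (its distance is 0)
        neighborhood = sorted(dataset, key=lambda u: dist(t[0], u[0]))[:k]
        pcl = sum(1 for u in neighborhood if u[1] == t[1]) / len(neighborhood)
        deviation = sum(dist(t[0], u[0]) for u in dataset if u[1] == t[1])
        kdist = sum(dist(t[0], u[0]) for u in neighborhood)
        return k * pcl + alpha / deviation + beta * kdist

    ranked = [(cof_rank(t), i) for i, t in enumerate(dataset)]
    return heapq.nsmallest(n, ranked)   # the top n class outliers
```

Instead of the explicit RemoveMax/Add bookkeeping of Figure 12, this sketch uses a heap-based selection, which yields the same top N list.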



Figure 14: The CODB algorithm flowchart.


3. Experimental Results

The CODB algorithm has been applied to five different real world datasets, all publicly available at the UCI machine learning repository [6]. The datasets are chosen from various domains, may have single or mixed data types, and have two or more class labels. This variety is used to test our proposed algorithm and show its capabilities.

3.1. Experiments

3.1.1. Experiment I (votes dataset)

The dataset of votes [6] (1984 United States Congressional Voting Records Database) is from the Congressional Quarterly Almanac (CQA), 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc., Washington, D.C., 1985. This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes:

• Voted for, paired for, and announced for (these three simplified to yea).

• Voted against, paired against, and announced against (these three simplified to nay).

• Voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).

There are 16 attributes plus the class attribute (17 attributes in total), all boolean valued. There are 435 instances belonging to two classes, i.e. democrat or republican. The class distribution is 267 democrats and 168 republicans (61.38% democrats, 38.62% republicans).


The attributes information and their possible values are listed below:

1. handicapped-infants: (y,n).

2. water-project-cost-sharing: (y,n).

3. adoption-of-the-budget-resolution: (y,n).

4. physician-fee-freeze: (y,n).

5. el-salvador-aid: (y,n).

6. religious-groups-in-schools: (y,n).

7. anti-satellite-test-ban: (y,n).

8. aid-to-nicaraguan-contras: (y,n).

9. mx-missile: (y,n).

10. immigration: (y,n).

11. synfuels-corporation-cutback: (y,n).

12. education-spending: (y,n).

13. superfund-right-to-sue: (y,n).

14. crime: (y,n).

15. duty-free-exports: (y,n).

16. export-administration-act-south-africa: (y,n).

17. Class Name: (democrat, republican).

Missing attribute values are denoted by "?" in the original dataset. It is important to recognize that "?" in this database does not mean that the value of the attribute is unknown; it means simply that the value is not "yea" or "nay" [6]. So we replaced every "?" by a "noVote" value to represent the real position of the voter, and to avoid handling it as a missing value in the experiments.

The following inputs are provided to the implemented algorithm:

K: 7

Top N COF: 20


Distance type: Euclidean Distance

α: 100

β: 0.1

Remove Instance With Missing Values: false

Replace Missing Values: false

Table 4 shows the top 20 class outliers, whereas Table 5 shows the distance of each chosen outlier instance from its K nearest neighbors (7 nearest neighbors).

#    Instance #   PCL   Deviation   KDist   COF
1    407          1     896.24      6.0     1.71158
2    375          1     881.78      8.07    1.92051
3    388          1     857.35      8.07    1.92375
4    161          1     819.35      8.49    1.97058
5    267          1     523.66      8.49    2.03949
6    71           1     535.52      9.44    2.13061
7    77           1     799.64      10.39   2.16429
8    325          2     846.64      8.49    2.96664
9    160          2     829.17      8.49    2.96913
10   382          2     851.76      9.02    3.01986
11   176          2     519.94      8.49    3.04086
12   384          2     832.95      9.34    3.0543
13   365          2     836.65      9.44    3.0634
14   6            2     849.33      9.76    3.0934
15   355          2     524.28      9.12    3.10283
16   164          2     845.19      10.07   3.12576
17   402          2     480.79      10.93   3.30081
18   151          2     879.84      12.0    3.31366
19   173          3     839.0       8.07    3.9263
20   75           3     841.28      9.44    4.06275

Table 4: The top 20 Class Outliers of votes dataset.


Instance #    Attribute values and class    Distance

The 7 Nearest Neighbors of the Instance #407

407 n,n,n,y,y,y,n,n,n,n,y,y,y,y,n,n,democrat 0.0

306 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.0

83 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.0

87 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.0

303 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.0

119 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.0

339 y,n,n,y,y,y,n,n,n,n,y,y,y,y,n,n,republican 1.0

The 7 Nearest Neighbors of the Instance #375

375 n,y,n,y,y,y,n,n,n,n,y,y,n,y,n,n,democrat 0.0

324 n,y,n,y,y,y,n,n,n,n,y,y,y,y,n,n,republican 1.0

154 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

55 n,y,n,y,y,y,n,n,n,y,y,y,y,y,n,n,republican 1.41421

30 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

35 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

61 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

The 7 Nearest Neighbors of the Instance #388

388 n,y,y,y,y,y,n,n,n,n,n,y,y,y,n,noVote,democrat 0.0

1 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,noVote,republican 1.41421

35 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

30 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

8 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

33 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

61 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican 1.41421

The 7 Nearest Neighbors of the Instance #161

161 n,n,n,n,y,y,y,n,n,n,n,y,y,y,n,y,democrat 0.0

283 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

204 n,n,y,y,y,y,y,n,n,n,n,y,y,y,n,y,republican 1.41421

163 n,y,n,y,y,y,y,n,n,n,n,y,y,y,n,y,republican 1.41421

278 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

378 n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

302 n,n,n,y,y,y,y,n,n,y,n,y,y,y,n,y,republican 1.41421

Table 5: The 7NN of the top 10 class outliers of votes dataset.


Instance #    Attribute values and class    Distance

The 7 Nearest Neighbors of the Instance #267

267 y,n,n,n,n,n,y,y,y,y,n,n,n,y,n,y,republican 0.0

265 y,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y,democrat 1.41421

169 y,n,y,n,n,n,y,y,y,y,y,n,n,y,n,y,democrat 1.41421

254 y,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y,democrat 1.41421

260 y,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y,democrat 1.41421

280 n,n,y,n,n,n,y,y,y,y,n,n,n,y,n,y,democrat 1.41421

255 y,n,y,n,n,n,y,y,y,y,n,n,n,y,y,y,democrat 1.41421

The 7 Nearest Neighbors of the Instance #71

71 y,y,y,y,n,n,y,y,y,y,y,n,n,y,n,y,republican 0.0

209 y,y,y,n,n,n,y,y,y,y,y,n,n,n,n,y,democrat 1.41421

169 y,n,y,n,n,n,y,y,y,y,y,n,n,y,n,y,democrat 1.41421

326 y,y,n,y,n,n,y,y,y,n,y,n,n,y,n,y,democrat 1.41421

241 y,n,y,n,n,n,y,y,y,y,y,n,n,y,y,y,democrat 1.73205

328 y,y,y,n,n,n,y,y,y,n,y,n,n,n,n,y,democrat 1.73205

63 y,y,y,n,n,n,y,y,y,n,y,n,n,n,n,y,democrat 1.73205

The 7 Nearest Neighbors of the Instance #77

77 n,y,y,y,y,y,n,y,y,y,y,y,y,y,n,y,democrat 0.0

231 n,y,n,y,y,y,n,n,y,y,n,y,y,y,n,y,republican 1.73205

56 n,y,n,y,y,y,n,n,n,y,y,y,y,y,n,y,republican 1.73205

148 n,y,n,y,y,y,n,n,n,y,y,y,y,y,n,y,republican 1.73205

229 n,y,y,y,y,y,y,n,y,y,n,y,y,y,n,y,republican 1.73205

349 n,y,y,y,y,y,y,y,y,n,n,y,y,y,n,y,republican 1.73205

313 n,y,y,y,y,y,n,n,n,y,n,y,y,y,n,y,republican 1.73205

The 7 Nearest Neighbors of the Instance #325

325 n,y,n,n,y,y,n,n,noVote,n,n,y,y,y,n,y,democrat 0.0

160 n,y,n,n,y,y,n,n,n,n,n,y,y,y,y,y,democrat 1.41421

33 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

106 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

8 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

146 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

253 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

Table 5: The 7NN of the top 10 class outliers of votes dataset. (Continued).


Instance #    Attribute values and class    Distance

The 7 Nearest Neighbors of the Instance #160

160 n,y,n,n,y,y,n,n,n,n,n,y,y,y,y,y,democrat 0.0

146 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

8 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

33 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

5 n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y,democrat 1.41421

230 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

106 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y,republican 1.41421

The 7 Nearest Neighbors of the Instance #382

382 y,y,n,y,y,y,n,n,n,n,y,n,y,y,n,noVote,democrat 0.0

364 y,y,n,y,y,y,n,n,n,n,y,n,y,y,n,y,republican 1.0

392 y,y,n,y,y,y,n,n,n,n,y,y,y,y,n,y,republican 1.41421

37 y,y,n,y,y,y,n,n,n,n,n,n,y,y,n,y,republican 1.41421

11 n,y,n,y,y,y,n,n,n,n,y,noVote,y,y,noVote,noVote,republican 1.73205

164 y,y,n,n,y,y,n,n,n,y,y,y,y,y,n,noVote,democrat 1.73205

1 n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,noVote,republican 1.73205

Table 5: The 7NN of the top 10 class outliers of votes dataset. (Continued).

3.1.2. Experiment II (hepatitis dataset)

The dataset of hepatitis [6] contains 155 instances belonging to two classes, i.e. positive or negative for hepatitis, described by 20 attributes (including the class label attribute), among which 6 attributes are continuous and the remaining 13 attributes are categorical. The class distribution is 32 DIE and 123 LIVE (20.65% DIE, 79.35% LIVE).

The attribute information is the following:

1. AGE: 10, 20, 30, 40, 50, 60, 70, 80 (Continuous)

2. SEX: male, female

3. STEROID: no, yes

4. ANTIVIRALS: no, yes


5. FATIGUE: no, yes

6. MALAISE: no, yes

7. ANOREXIA: no, yes

8. LIVER BIG: no, yes

9. LIVER FIRM: no, yes

10. SPLEEN PALPABLE: no, yes

11. SPIDERS: no, yes

12. ASCITES: no, yes

13. VARICES: no, yes

14. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00 (Continuous)

15. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250 (Continuous)

16. SGOT: 13, 100, 200, 300, 400, 500 (Continuous)

17. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0 (Continuous)

18. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90 (Continuous)

19. HISTOLOGY: no, yes

20. Class: DIE, LIVE

Some attributes contain missing values. All the missing values for nominal and numeric attributes in the dataset are replaced with the mode (the most common value) and the mean (the average value) of the attribute, respectively, obtained from the original data.
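This mode/mean imputation can be sketched as follows for a single attribute column; `impute_column` is an illustrative helper, not the preprocessing tool actually used in the experiments:

```python
from collections import Counter

def impute_column(values, numeric, missing="?"):
    """Replace missing entries in one attribute column: numeric columns
    get the mean of the observed values, nominal columns get the mode
    (the most common observed value)."""
    observed = [v for v in values if v != missing]
    if numeric:
        fill = sum(observed) / len(observed)   # mean of observed values
    else:
        fill = Counter(observed).most_common(1)[0][0]  # mode
    return [fill if v == missing else v for v in values]
```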

The following inputs are provided to the implemented algorithm:

K: 7

Top N COF: 10

Distance type: Euclidean Distance

α: 10

β: 0.1

Remove Instance With Missing Values: false

Replace Missing Values: true


Table 6 shows the top 10 class outliers, whereas Table 7 shows the distance of each chosen outlier instance from its K nearest neighbors (7 nearest neighbors).

#    Instance #   PCL   Deviation   KDist   COF
1    31           1     65.89       5.11    1.66273
2    35           1     69.02       6.71    1.81633
3    134          1     75.53       7.72    1.90424
4    98           1     72.95       8.3     1.96694
5    128          2     269.39      7.02    2.73911
6    120          2     69.15       7.47    2.89204
7    71           2     70.37       7.76    2.91841
8    30           2     75.59       9.06    3.03814
9    76           2     76.99       9.91    3.12085
10   126          2     296.85      7.21    3.75509

Table 6: The top 10 Class Outliers of hepatitis dataset.


Instance #    Attribute values and class    Distance

The 7 Nearest Neighbors of the Instance #31

31 62,female,no,no,yes,yes,no,yes,no,no,no,no,no,1,105.33,60,3.82,61.85,no,DIE 0.0

69 44,female,no,no,yes,yes,no,yes,no,no,no,no,no,1.6,68,68,3.7,61.85,no,LIVE 0.30083

25 27,female,no,no,yes,yes,no,yes,no,no,no,no,no,0.8,95,46,3.8,100,no,LIVE 0.62545

64 49,female,yes,no,yes,yes,no,yes,no,no,no,no,no,0.8,103,43,3.5,66,no,LIVE 1.02086

29 51,female,no,yes,yes,yes,no,yes,no,no,no,no,no,1,78,58,4.6,52,no,LIVE 1.03787

55 37,female,no,no,yes,yes,yes,yes,no,no,no,no,no,0.8,92,59,3.82,61.85,no,LIVE 1.06166

83 67,male,no,no,yes,yes,no,yes,no,no,no,no,no,1.5,179,69,2.9,61.85,no,LIVE 1.06296

The 7 Nearest Neighbors of the Instance #35

35 37,female,yes,no,yes,no,no,yes,no,no,yes,no,no,0.6,67,28,4.2,61.85,no,DIE 0.0

34 35,female,yes,no,yes,no,no,yes,no,no,no,no,no,0.9,58,92,4.3,73,no,LIVE 1.01321

68 39,female,yes,no,yes,no,no,yes,no,no,no,no,no,1,34,15,4,54,no,LIVE 1.01355

54 30,female,yes,no,yes,no,no,yes,no,no,no,no,no,0.7,50,78,4.2,74,no,LIVE 1.01728

16 66,female,yes,no,yes,no,no,yes,no,no,no,no,no,1.2,102,53,4.3,61.85,no,LIVE 1.09175

2 78,female,yes,no,yes,no,no,yes,no,no,no,no,no,0.7,96,32,4,61.85,no,LIVE 1.1608

51 39,female,yes,no,no,no,no,yes,no,no,no,no,no,1,85,20,4,61.85,no,LIVE 1.41785

The 7 Nearest Neighbors of the Instance #134

134 38,female,no,no,no,no,no,yes,yes,no,no,no,no,0.4,243,49,3.8,90,yes,DIE 0.0

114 36,female,no,no,no,no,no,yes,no,no,no,no,no,1.1,141,75,3.3,61.85,yes,LIVE 1.11681

102 27,female,no,no,yes,no,no,yes,yes,no,no,no,no,2.4,168,227,3,66,yes,LIVE 1.15769

93 52,female,no,no,no,no,no,yes,no,no,no,no,no,1.5,105.33,69,2.9,61.85,yes,LIVE 1.20219

148 20,female,no,no,no,no,no,yes,no,no,no,no,no,0.9,89,152,4,61.85,yes,LIVE 1.22639

117 50,female,yes,no,no,no,no,yes,no,no,no,no,no,1,139,81,3.9,62,yes,LIVE 1.50519

33 26,male,no,no,no,no,no,yes,yes,no,no,no,no,0.5,135,29,3.8,60,no,LIVE 1.51027

The 7 Nearest Neighbors of the Instance #98

98 47,female,yes,no,no,no,no,yes,no,no,yes,no,yes,2,84,23,4.2,66,yes,DIE 0.0

137 38,female,yes,no,no,no,no,yes,yes,no,yes,no,yes,1.6,130,140,3.5,56,yes,LIVE 1.05762

92 33,female,yes,no,no,no,no,yes,no,no,no,no,no,1,105.33,60,4,61.85,yes,LIVE 1.43851

117 50,female,yes,no,no,no,no,yes,no,no,no,no,no,1,139,81,3.9,62,yes,LIVE 1.44059

149 36,female,yes,no,no,no,no,yes,no,no,no,no,no,0.6,120,30,4,61.85,yes,LIVE 1.44187

18 38,female,yes,no,no,no,no,yes,no,no,no,no,no,0.7,53,42,4.1,85,yes,LIVE 1.44755

101 22,female,yes,no,no,no,no,yes,no,no,no,no,no,0.7,105.33,24,3.82,61.85,yes,LIVE 1.47255

Table 7: The 7NN of the top 10 class outliers of hepatitis dataset.


Instance #    Attribute values and class    Distance

The 7 Nearest Neighbors of the Instance #128

128 54,female,no,no,yes,yes,no,yes,no,no,no,yes,no,1.2,85,92,3.1,66,yes,LIVE 0.0

109 33,female,no,no,yes,yes,no,yes,no,no,no,yes,no,0.7,63,80,3,31,yes,DIE 0.47094

141 54,female,no,no,yes,yes,no,yes,no,yes,no,yes,no,3.9,120,28,3.5,43,yes,DIE 1.10074

129 57,female,no,no,yes,yes,no,yes,no,no,yes,yes,no,4.6,82,55,3.3,30,yes,DIE 1.15415

118 61,female,no,no,yes,yes,no,yes,no,no,yes,no,no,1.43,105.33,85.89,3.82,61.85,yes,DIE 1.43036

69 44,female,no,no,yes,yes,no,yes,no,no,no,no,no,1.6,68,68,3.7,61.85,no,LIVE 1.43149

31 62,female,no,no,yes,yes,no,yes,no,no,no,no,no,1,105.33,60,3.82,61.85,no,DIE 1.43219

The 7 Nearest Neighbors of the Instance #120

120 56,female,no,no,yes,yes,yes,no,yes,no,yes,no,no,2.9,90,153,4,61.85,yes,DIE 0.0

97 44,female,no,no,yes,yes,no,no,yes,no,yes,no,no,3,114,65,3.5,61.85,yes,LIVE 1.03416

152 61,female,no,no,yes,yes,no,no,yes,no,yes,no,no,0.8,75,20,4.1,61.85,yes,LIVE 1.0616

89 38,female,no,no,yes,yes,yes,no,yes,no,no,no,no,0.6,76,18,4.4,84,yes,LIVE 1.12216

140 36,female,no,no,yes,yes,yes,no,yes,no,yes,no,yes,1.7,295,60,2.7,61.85,yes,LIVE 1.34064

71 34,female,no,no,yes,yes,no,no,yes,no,yes,no,no,2.8,127,182,3.82,61.85,no,DIE 1.45568

132 48,female,yes,no,yes,yes,yes,yes,yes,no,yes,no,no,2,158,278,3.8,61.85,yes,LIVE 1.4599

The 7 Nearest Neighbors of the Instance #71

71 34,female,no,no,yes,yes,no,no,yes,no,yes,no,no,2.8,127,182,3.82,61.85,no,DIE 0.0

97 44,female,no,no,yes,yes,no,no,yes,no,yes,no,no,3,114,65,3.5,61.85,yes,LIVE 1.0307

78 34,female,no,no,yes,no,no,no,yes,no,yes,no,no,1,72,46,4.4,57,no,LIVE 1.07851

152 61,female,no,no,yes,yes,no,no,yes,no,yes,no,no,0.8,75,20,4.1,61.85,yes,LIVE 1.1485

120 56,female,no,no,yes,yes,yes,no,yes,no,yes,no,no,2.9,90,153,4,61.85,yes,DIE 1.45568

96 30,female,no,no,yes,yes,no,yes,yes,no,yes,no,no,0.8,147,128,3.9,100,yes,LIVE 1.49309

28 61,female,no,no,yes,no,no,no,yes,no,no,no,no,1.3,78,25,3.8,100,no,LIVE 1.55647

The 7 Nearest Neighbors of the Instance #30

30 39,female,no,yes,yes,yes,no,yes,yes,no,no,no,no,2.3,280,98,3.8,40,no,DIE 0.0

29 51,female,no,yes,yes,yes,no,yes,no,no,no,no,no,1,78,58,4.6,52,no,LIVE 1.29382

75 32,female,no,yes,yes,yes,no,yes,no,no,no,no,no,1,55,45,4.1,56,no,LIVE 1.3324

73 36,female,no,no,yes,yes,yes,yes,yes,no,no,no,no,1,105.33,45,4,57,no,LIVE 1.57797

12 41,female,yes,yes,yes,no,no,yes,yes,no,no,no,no,0.9,81,60,3.9,52,no,LIVE 1.61234

31 62,female,no,no,yes,yes,no,yes,no,no,no,no,no,1,105.33,60,3.82,61.85,no,DIE 1.61445

26 49,female,no,yes,yes,yes,yes,yes,yes,no,yes,no,no,0.6,85,48,3.7,61.85,no,LIVE 1.62745

Table 7: The 7NN of the top 10 class outliers of hepatitis dataset. (Continued).



The 7 Nearest Neighbors of the Instance #76

76 58,female,yes,no,yes,no,no,no,yes,yes,yes,no,no,2,167,242,3.3,61.85,no,DIE 0.0

99 60,female,no,no,yes,no,no,no,yes,yes,yes,no,no,1.43,105.33,40,3.82,61.85,yes,LIVE 1.47474

21 27,female,yes,no,yes,yes,yes,no,yes,yes,yes,no,no,1.2,133,98,4.1,39,no,LIVE 1.53487

78 34,female,no,no,yes,no,no,no,yes,no,yes,no,no,1,72,46,4.4,57,no,LIVE 1.55538

71 34,female,no,no,yes,yes,no,no,yes,no,yes,no,no,2.8,127,182,3.82,61.85,no,DIE 1.78062

90 50,male,no,no,yes,no,no,no,yes,yes,yes,no,no,0.9,230,117,3.4,41,yes,LIVE 1.78069

27 58,male,yes,no,yes,no,no,yes,yes,no,yes,no,no,1.4,175,55,2.7,36,no,LIVE 1.78333

The 7 Nearest Neighbors of the Instance #126

126 28,female,yes,no,yes,yes,yes,yes,no,no,yes,yes,no,1,105.33,20,4,61.85,yes,LIVE 0.0

144 45,female,yes,no,yes,yes,yes,yes,no,no,yes,yes,no,1.9,105.33,114,2.4,61.85,yes,DIE 0.48107

121 20,female,no,no,yes,yes,yes,yes,no,no,yes,yes,no,1,160,118,2.9,23,yes,LIVE 1.13767

67 57,female,yes,no,yes,yes,yes,yes,no,no,yes,yes,no,4.1,105.33,48,2.6,73,no,DIE 1.20387

150 46,female,yes,no,yes,yes,yes,yes,no,no,yes,yes,yes,7.6,105.33,242,3.3,50,yes,DIE 1.40076

87 30,female,yes,no,yes,yes,yes,yes,yes,no,yes,yes,yes,2.5,165,64,2.8,61.85,yes,DIE 1.47331

132 48,female,yes,no,yes,yes,yes,yes,yes,no,yes,no,no,2,158,278,3.8,61.85,yes,LIVE 1.51734

Table 7: The 7NN of the top 10 class outliers of hepatitis dataset. (Continued).

3.1.3. Experiment III (heart-statlog dataset)

The dataset of heart-statlog [6] contains 270 instances belonging to two classes, i.e. absent or present for heart disease, described by 14 attributes (including the class label attribute), among which 7 attributes are continuous and 6 are nominal or boolean. The class distribution is 150 absent, 120 present (55.56% absent, 44.44% present). There are no missing values in the dataset.

The attribute information is the following:

1. age (Continuous)

2. sex (2 values) (boolean: 0, 1)

3. chest pain type (4 values) (Nominal)


4. resting blood pressure (Continuous)

5. serum cholestoral in mg/dl (Continuous)

6. fasting blood sugar (boolean: > 120 mg/dl → 1, < 120 mg/dl → 0)

7. resting electrocardiographic results (values 0,1,2) (Nominal)

8. maximum heart rate achieved (Continuous)

9. exercise induced angina (boolean: 0, 1)

10. oldpeak = ST depression induced by exercise relative to rest (Continuous)

11. the slope of the peak exercise ST segment (Continuous)

12. number of major vessels (0-3) colored by fluoroscopy (Continuous)

13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect (Nominal)

14. Class: absent, present.

The following inputs are provided to the implemented algorithm:

K: 7

Top N COF: 10

Distance type: Euclidean Distance

α: 100

β: 0.1

Remove Instance With Missing Values: false

Replace Missing Values: false

Table 8 shows the top 10 class outliers whereas table 9 shows the distance of each chosen outlier instance from its K nearest neighbors (7 nearest neighbors).


#    Instance #    PCL    Deviation    KDist    COF
1    69            1      206.0        1.61     1.64619
2    11            1      276.9        3.2      1.6814
3    207           1      268.89       3.66     1.73806
4    258           1      205.36       2.85     1.77229
5    177           1      202.1        4.2      1.91521
6    169           1      234.94       5.01     1.92677
7    67            1      247.44       5.32     1.93567
8    175           1      250.37       7.69     2.16821
9    3             2      277.4        2.75     2.63559
10   91            2      220.28       2.1      2.66381

Table 8: The top 10 Class Outliers of heart-statlog dataset.


Instance #        Distance

The 7 Nearest Neighbors of the Instance #69

69 47,1,3,108,243,0,0,152,0,0,1,0,3,present 0.0

62 44,1,3,120,226,0,0,169,0,0,1,0,3,absent 0.18727

150 41,1,3,112,250,0,0,179,0,0,1,0,3,absent 0.24451

179 50,1,3,129,196,0,0,163,0,0,1,0,3,absent 0.24844

38 42,1,3,130,180,0,0,150,0,0,1,0,3,absent 0.27358

253 51,1,3,110,175,0,0,123,0,0.6,1,0,3,absent 0.29962

23 47,1,4,112,204,0,0,143,0,0.1,1,0,3,absent 0.35418

The 7 Nearest Neighbors of the Instance #11

11 53,1,4,142,226,0,2,111,1,0,1,0,7,absent 0.0

82 58,1,4,150,270,0,2,111,1,0.8,1,0,7,present 0.20806

36 61,1,4,140,207,0,2,138,1,1.9,1,1,7,present 0.52680

34 50,1,4,144,200,0,2,126,1,0.9,2,0,7,present 0.54034

193 35,1,4,126,282,0,2,156,1,0,1,0,7,present 0.54567

65 57,1,4,150,276,0,2,112,1,0.6,2,1,6,present 0.67728

204 55,1,4,160,289,0,2,145,1,0.8,2,1,7,present 0.70454

The 7 Nearest Neighbors of the Instance #207

207 58,1,3,105,240,0,2,154,1,0.6,2,0,7,absent 0.0

7 59,1,4,110,239,0,2,142,1,1.2,2,1,7,present 0.49259

34 50,1,4,144,200,0,2,126,1,0.9,2,0,7,present 0.57500

202 60,1,4,125,258,0,2,141,1,2.8,2,1,7,present 0.63008

237 43,1,4,120,177,0,2,120,1,2.5,2,0,7,present 0.64090

147 40,1,4,110,167,0,2,114,1,2,2,0,7,present 0.65266

92 54,1,4,124,266,0,2,109,1,2.2,2,1,7,present 0.67039

The 7 Nearest Neighbors of the Instance #258

258 64,1,3,140,335,0,0,158,0,0,1,0,3,present 0.0

162 55,1,2,130,262,0,0,155,0,0,1,0,3,absent 0.42833

179 50,1,3,129,196,0,0,163,0,0,1,0,3,absent 0.44498

239 52,1,2,120,325,0,0,172,0,0.2,1,0,3,absent 0.47137

190 54,1,4,140,239,0,0,160,0,1.2,1,0,3,absent 0.49015

222 57,1,3,150,168,0,0,174,0,1.6,1,0,3,absent 0.50701

263 49,1,2,130,266,0,0,171,0,0.6,1,0,3,absent 0.51156

Table 9: The 7NN of the top 10 class outliers of heart-statlog dataset.



The 7 Nearest Neighbors of the Instance #177

177 46,1,3,150,231,0,0,147,0,3.6,2,0,3,present 0.0

209 37,1,3,130,250,0,0,187,0,3.5,3,0,3,absent 0.64508

22 43,1,4,115,303,0,0,181,0,1.2,2,0,3,absent 0.68430

222 57,1,3,150,168,0,0,174,0,1.6,1,0,3,absent 0.68538

259 43,1,4,150,247,0,0,171,0,1.5,1,0,3,absent 0.71738

185 43,1,3,130,315,0,0,162,0,1.9,1,1,3,absent 0.72505

190 54,1,4,140,239,0,0,160,0,1.2,1,0,3,absent 0.74687

The 7 Nearest Neighbors of the Instance #169

169 65,1,1,138,282,1,2,174,0,1.4,2,1,3,present 0.0

170 69,1,1,160,234,1,2,131,0,0.1,2,1,3,absent 0.46232

86 62,1,2,128,208,1,2,140,0,0,1,0,3,absent 0.79494

211 51,1,3,125,245,1,2,166,0,2.4,2,0,3,absent 0.83219

45 58,1,3,140,211,1,2,165,0,0,1,0,3,absent 0.95350

167 53,1,3,130,197,1,2,152,0,1.2,3,0,3,absent 0.96987

64 63,1,1,145,233,1,2,150,0,2.3,3,0,6,absent 0.99843

The 7 Nearest Neighbors of the Instance #67

67 58,0,2,136,319,1,2,152,0,0,1,2,3,present 0.0

52 65,0,3,140,417,1,2,157,0,0.8,1,1,3,absent 0.55954

29 71,0,3,110,265,1,2,130,0,0,1,1,3,absent 0.63178

228 58,0,1,150,283,1,2,162,0,1,1,0,3,absent 0.78205

24 54,0,2,132,288,1,2,159,1,0,1,1,3,absent 1.06176

184 53,1,3,130,246,1,2,173,0,0,1,3,3,absent 1.13567

244 51,0,3,140,308,0,2,142,0,1.5,1,1,3,absent 1.14451

The 7 Nearest Neighbors of the Instance #175

175 62,0,4,138,294,1,0,106,0,1.9,2,3,3,present 0.0

153 64,0,4,130,303,0,0,122,0,2,2,2,3,absent 1.06496

57 60,0,3,120,178,1,0,96,0,0,1,0,3,absent 1.24963

74 67,0,4,106,223,0,0,142,0,0.3,1,2,3,absent 1.27730

113 54,0,3,135,304,1,0,170,0,0,1,0,3,absent 1.31256

68 44,0,3,118,242,0,0,149,0,0.3,2,1,3,absent 1.38572

194 48,1,3,124,255,1,0,175,0,0,1,2,3,absent 1.39786

Table 9: The 7NN of the top 10 class outliers of heart-statlog dataset. (Cont.).



The 7 Nearest Neighbors of the Instance #3

3 64,1,4,128,263,0,0,105,1,0.2,2,1,7,absent 0.0

122 57,1,4,152,274,0,0,88,1,1.2,2,1,7,present 0.34061

257 55,1,4,132,353,0,0,132,1,1.2,2,1,7,present 0.38379

126 62,1,4,120,267,0,0,99,1,1.8,2,2,7,present 0.43281

145 53,1,4,123,282,0,0,95,1,2,2,2,7,present 0.50779

220 54,1,4,110,239,0,0,126,1,2.8,2,1,7,present 0.52612

84 57,1,4,110,201,0,0,126,1,1.5,2,0,6,absent 0.55983

The 7 Nearest Neighbors of the Instance #91

91 61,0,4,130,330,0,2,169,0,0,1,0,3,present 0.0

112 60,0,4,158,305,0,2,161,0,0,1,0,3,present 0.27784

236 53,0,4,138,234,0,2,160,0,0,1,0,3,absent 0.29365

166 50,0,4,110,254,0,2,159,0,0,1,0,3,absent 0.35221

14 57,0,4,128,303,0,2,159,0,0,1,1,3,absent 0.35782

216 63,0,3,135,252,0,2,172,0,0,1,0,3,absent 0.38381

27 51,0,3,120,295,0,2,157,0,0.6,1,0,3,absent 0.43308

Table 9: The 7NN of the top 10 class outliers of heart-statlog dataset (Continued).

3.1.4. Experiment IV (credits approval dataset)

The dataset of credit approval (credit-a) [6] contains 690 instances belonging to two classes, i.e. “+” or “-” for credit approval, described by 16 attributes (including the class label attribute) among which 6 attributes are continuous and the remaining 10 attributes are categorical. The class distribution is 307 “+”, 383 “-” (44.49% “+”, 55.51% “-”). All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.


This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

The attribute information is the following:

1. A1: b, a.

2. A2: continuous.

3. A3: continuous.

4. A4: u, y, l, t.

5. A5: g, p, gg.

6. A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.

7. A7: v, h, bb, j, n, z, dd, ff, o.

8. A8: continuous.

9. A9: t, f.

10. A10: t, f.

11. A11: continuous.

12. A12: t, f.

13. A13: g, p, s.

14. A14: continuous.

15. A15: continuous.

16. A16: +,- (class attribute)

The following inputs are provided to the implemented algorithm:

K: 7

Top N COF: 10

Distance type: Euclidean Distance

α: 100

β: 0.1

Remove Instance With Missing Values: false


Replace Missing Values: false

Table 10 shows the top 10 class outliers whereas table 11 shows the distance of each chosen outlier instance from its K nearest neighbors (7 nearest neighbors).
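The distances reported in table 11 are computed over a mix of continuous and nominal attributes. The exact normalization is not restated here, so the sketch below is an assumption rather than the project's actual implementation: range-normalized differences for continuous attributes combined with a 0/1 mismatch for nominal ones under a Euclidean metric (the function name, indices, and ranges are illustrative).

```python
import math

# Hedged sketch of a mixed-attribute Euclidean distance; the range
# normalization is an assumption, not the project's actual code.
def mixed_euclidean(x, y, numeric_idx, ranges):
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in numeric_idx:
            d = (a - b) / ranges[i]      # range-normalized numeric difference
        else:
            d = 0.0 if a == b else 1.0   # nominal attribute: simple mismatch
        total += d * d
    return math.sqrt(total)

# Toy example with one nominal attribute and two numeric ones (the values
# and attribute ranges are made up for illustration):
x = ("a", 25.42, 1.13)
y = ("b", 28.17, 0.38)
d = mixed_euclidean(x, y, numeric_idx={1, 2}, ranges={1: 60.0, 2: 28.0})
```

Under this convention a single nominal mismatch contributes a full 1.0 to the squared distance, which is consistent with the jumps of roughly 1.0 visible between neighbors in table 11 whenever a nominal attribute differs.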

#    Instance #    PCL    Deviation    KDist    COF
1    115           1      833.32       0.85     1.2047
2    523           1      837.58       1.12     1.23097
3    110           1      806.66       1.65     1.28872
4    99            1      841.87       2.26     1.34454
5    320           1      841.87       2.26     1.34454
6    48            1      669.1        2.77     1.42608
7    348           1      820.44       3.73     1.49474
8    546           1      877.5        4.42     1.55615
9    116           1      865.02       4.58     1.57336
10   13            1      636.3        4.55     1.61253

Table 10: The top 10 Class Outliers of credit-a dataset.


Instance #        Distance

The 7 Nearest Neighbors of the Instance #115

115 a,25.42,1.13,u,g,q,v,1.29,t,t,2,f,g,200,0,- 0.0

518 a,28.17,0.38,u,g,q,v,0.59,t,t,4,f,g,80,0,+ 0.08678

235 a,20.67,1.84,u,g,q,v,2.09,t,t,5,f,g,220,2503,+ 0.09620

182 a,20.67,3,u,g,q,v,0.17,t,t,3,f,g,100,6,+ 0.11776

63 a,20.42,0.84,u,g,q,v,1.59,t,t,1,f,g,0,0,+ 0.12685

184 a,22.42,5.67,u,g,q,v,2.59,t,t,7,f,g,129,3257,+ 0.19565

17 a,23.25,5.88,u,g,q,v,3.17,t,t,10,f,g,120,245,+ 0.22374

The 7 Nearest Neighbors of the Instance #523

523 a,22.5,8.5,u,g,q,v,1.75,t,t,10,f,g,80,990,- 0.0

17 a,23.25,5.88,u,g,q,v,3.17,t,t,10,f,g,120,245,+ 0.10888

178 a,18.42,9.25,u,g,q,v,1.21,t,t,4,f,g,60,540,+ 0.11393

184 a,22.42,5.67,u,g,q,v,2.59,t,t,7,f,g,129,3257,+ 0.11929

182 a,20.67,3,u,g,q,v,0.17,t,t,3,f,g,100,6,+ 0.23140

235 a,20.67,1.84,u,g,q,v,2.09,t,t,5,f,g,220,2503,+ 0.26126

243 a,18.75,7.5,u,g,q,v,2.71,t,t,5,f,g,?,26726,+ 0.28106

The 7 Nearest Neighbors of the Instance #110

110 b,29.17,3.5,u,g,w,v,3.5,t,t,3,t,g,329,0,- 0.0

3 b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+ 0.13923

556 b,29.58,4.5,u,g,w,v,7.5,t,t,2,t,g,330,0,+ 0.14572

35 b,27.83,1.5,u,g,w,v,2,t,t,11,t,g,434,35,+ 0.15903

29 b,42.08,1.04,u,g,w,v,5,t,t,6,t,g,500,10000,+ 0.25979

43 b,39.58,13.92,u,g,w,v,8.63,t,t,6,t,g,70,0,+ 0.46257

570 b,59.5,2.75,u,g,w,v,1.75,t,t,5,t,g,60,58,+ 0.48113

The 7 Nearest Neighbors of the Instance #99

99 a,28.5,1,u,g,q,v,1,t,t,2,t,g,167,500,- 0.0

24 a,41.17,6.5,u,g,q,v,0.5,t,t,3,t,g,145,0,+ 0.27488

203 a,20.75,10.25,u,g,q,v,0.71,t,t,2,t,g,49,0,+ 0.35543

34 a,22.58,10.75,u,g,q,v,0.42,t,t,5,t,g,0,560,+ 0.37226

123 a,44.17,6.67,u,g,q,v,7.38,t,t,3,t,g,0,0,+ 0.39206

124 a,23.5,9,u,g,q,v,8.5,t,t,5,t,g,120,0,+ 0.39890

14 a,45.83,10.5,u,g,q,v,5,t,t,7,t,g,0,0,+ 0.46400

Table 11: The 7NN of the top 10 class outliers of credit-a dataset.



The 7 Nearest Neighbors of the Instance #320

320 b,21.25,1.5,u,g,w,v,1.5,f,f,0,f,g,150,8,+ 0.0

373 b,26.25,1.54,u,g,w,v,0.13,f,f,0,f,g,100,0,- 0.09278

422 b,29.42,1.25,u,g,w,v,1.75,f,f,0,f,g,200,0,- 0.12600

324 b,33.67,1.25,u,g,w,v,1.17,f,f,0,f,g,120,0,- 0.18795

670 b,47.17,5.84,u,g,w,v,5.5,f,f,0,f,g,465,150,- 0.46947

455 b,36.17,18.13,u,g,w,v,0.09,f,f,0,f,g,320,3552,- 0.64329

421 b,20.42,1.09,u,g,q,v,1.5,f,f,0,f,g,108,7,- 1.00041

The 7 Nearest Neighbors of the Instance #48

48 b,41.5,1.54,u,g,i,bb,3.5,f,f,0,f,g,216,0,+ 0.0

650 b,48.08,3.75,u,g,i,bb,1,f,f,0,f,g,100,2,- 0.16456

458 b,36.17,5.5,u,g,i,bb,5,f,f,0,f,g,210,687,- 0.17103

343 b,33.75,2.75,u,g,i,bb,0,f,f,0,f,g,180,0,- 0.17566

391 b,39.92,5,u,g,i,bb,0.21,f,f,0,f,g,550,0,- 0.23885

640 b,34.17,2.75,u,g,i,bb,2.5,f,f,0,t,g,232,200,- 1.00763

413 b,40.58,1.5,u,g,i,bb,0,f,f,0,f,s,300,0,- 1.00848

The 7 Nearest Neighbors of the Instance #348

348 b,63.33,0.54,u,g,c,v,0.59,t,t,3,t,g,180,0,- 0.0

498 b,25.75,0.5,u,g,c,v,1.46,t,t,5,t,g,312,0,+ 0.57056

40 b,34.17,9.17,u,g,c,v,4.5,t,t,12,t,g,0,221,+ 0.57645

589 b,25.33,0.58,u,g,c,v,0.29,t,t,7,t,g,96,5124,+ 0.57844

153 b,23.08,2.5,u,g,c,v,1.09,t,t,11,t,g,60,2184,+ 0.62441

599 b,20.5,2.42,u,g,c,v,2,t,t,11,t,g,200,3000,+ 0.66107

517 b,16.08,0.75,u,g,c,v,1.75,t,t,5,t,g,352,690,+ 0.71757

The 7 Nearest Neighbors of the Instance #546

546 b,23.58,0.46,y,p,w,v,2.63,t,t,6,t,g,208,347,- 0.0

509 b,21,4.79,y,p,w,v,2.25,t,t,1,t,g,80,300,+ 0.18777

510 b,13.75,4,y,p,w,v,1.75,t,t,2,t,g,120,1000,+ 0.21052

193 b,22.67,1.59,y,p,w,v,3.09,t,t,6,f,g,80,0,+ 1.00308

588 b,26.67,1.75,y,p,c,v,1,t,t,5,t,g,160,5777,+ 1.00562

571 b,21,3,y,p,d,v,1.09,t,t,8,t,g,160,1,+ 1.00704

198 b,27.58,2.04,y,p,aa,v,2,t,t,3,t,g,370,560,+ 1.00790

Table 11: The 7NN of the top 10 class outliers of credit-a dataset. (Cont.).



The 7 Nearest Neighbors of the Instance #116

116 b,37.75,7,u,g,q,h,11.5,t,t,7,t,g,300,5,- 0.0

134 b,32.67,5.5,u,g,q,h,5.5,t,t,12,t,g,408,1000,+ 0.24822

251 b,41.42,5,u,g,q,h,5,t,t,6,t,g,470,0,+ 0.26002

210 b,39.33,5.88,u,g,cc,h,10,t,t,14,t,g,399,0,+ 1.00912

564 b,42.17,5.04,u,g,q,h,12.75,t,f,0,t,g,92,0,+ 1.01635

246 b,45,8.5,u,g,cc,h,14,t,t,1,t,g,88,2000,+ 1.02083

168 b,36.67,3.25,u,g,q,h,9,t,f,0,t,g,102,639,+ 1.02306

The 7 Nearest Neighbors of the Instance #13

13 b,48.08,6.04,u,g,k,v,0.04,f,f,0,f,g,0,2690,+ 0.00000

468 b,22.08,2.34,u,g,k,v,0.75,f,f,0,f,g,180,0,- 0.42405

680 b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,- 0.49709

254 b,?,0.63,u,g,k,v,0.25,f,f,0,f,g,380,2010,- 0.58319

341 b,42.75,4.09,u,g,aa,v,0.04,f,f,0,f,g,108,100,- 1.00742

526 b,41.58,1.75,u,g,k,v,0.21,t,f,0,f,g,160,0,- 1.01990

451 b,39.5,1.63,u,g,c,v,1.5,f,f,0,f,g,0,316,- 1.02210

Table 11: The 7NN of the top 10 class outliers of credit-a dataset. (Continued).

3.1.5. Experiment V (vehicle dataset)

The dataset of vehicle [6] is from the Turing Institute, Glasgow, Scotland. It contains 846 instances belonging to four classes, i.e. opel, saab, bus and van, described by 19 attributes (including the class label attribute), among which 18 attributes are continuous. The class distribution is 212 opel, 217 saab, 218 bus and 199 van (25.06% opel, 25.65% saab, 25.77% bus, 23.52% van). There are no missing values in the dataset. This experiment is very interesting because all the attributes of the dataset are continuous; furthermore, the dataset includes multiple class labels which are almost equally distributed.

The attribute information is the following:

1. compactness (continuous)


2. circularity (continuous)

3. distance circularity (continuous)

4. radius ratio (continuous)

5. pr. axis aspect ratio (continuous)

6. max. length aspect ratio (continuous)

7. scatter ratio (continuous)

8. elongatedness (continuous)

9. pr.axis rectangularity (continuous)

10. max. length rectangularity (continuous)

11. scaled variance_major (continuous)

12. scaled variance_minor (continuous)

13. scaled radius of gyration (continuous)

14. skewness about major (continuous)

15. skewness about minor (continuous)

16. kurtosis about major (continuous)

17. kurtosis about minor (continuous)

18. hollows ratio (continuous)

19. class: opel, saab, bus, van.

The following inputs are provided to the implemented algorithm:

K: 9

Top N COF: 10

Distance type: Euclidean Distance

α: 100

β: 0.1

Remove Instance With Missing Values: false

Replace Missing Values: false

Table 12 shows the top 10 class outliers whereas table 13 shows the distance of each chosen outlier instance from its K nearest neighbors (9 nearest neighbors).

#    Instance #    PCL    Deviation    KDist    COF
1    349           1      289.25       2.61     1.60661
2    216           1      233.31       1.93     1.6217
3    599           1      263.36       2.43     1.62245
4    422           1      293.19       2.82     1.62287
5    163           1      243.16       2.16     1.6273
6    32            1      257.48       2.48     1.63681
7    806           1      216.87       2.05     1.66612
8    645           1      218.67       2.25     1.68263
9    113           1      361.99       4.15     1.69109
10   451           1      230.82       2.72     1.7051

Table 12: The top 10 Class Outliers of vehicle dataset.


Instance #        Distance

The 9 Nearest Neighbors of the Instance #349

349 89,40,69,147,58,6,132,50,18,137,155,260,151,61,16,6,203,209,opel 0.0

460 90,41,62,147,60,6,128,52,18,141,149,246,157,61,13,4,201,208,van 0.21985

469 92,40,62,144,59,8,127,52,17,139,149,241,150,62,13,1,204,210,van 0.25482

703 93,43,78,162,64,8,137,48,18,145,156,281,159,63,17,12,203,210,van 0.30403

629 90,42,63,144,59,7,131,50,18,142,154,259,162,65,15,3,197,204,van 0.31923

636 96,41,69,153,56,7,141,47,18,141,162,297,169,61,11,8,202,209,saab 0.33394

403 96,39,77,160,62,8,140,47,18,150,161,294,124,62,15,3,201,208,van 0.36438

148 90,43,72,172,59,8,154,42,19,144,174,360,158,61,15,9,203,209,saab 0.39062

330 98,44,78,160,63,8,142,47,18,148,160,300,171,63,19,2,201,207,van 0.42197

The 9 Nearest Neighbors of the Instance #216

216 84,44,77,150,59,5,152,44,19,143,175,344,177,77,8,2,183,187,saab 0.0

276 83,46,73,137,59,6,148,45,19,146,167,327,183,75,8,0,185,191,bus 0.21283

368 84,45,68,148,64,6,146,46,19,142,168,317,180,75,5,1,183,187,bus 0.21956

208 86,46,70,149,65,8,149,45,19,146,170,331,185,77,6,6,183,188,bus 0.22412

632 86,44,70,140,64,6,148,45,19,145,170,322,185,82,10,1,181,183,bus 0.24139

423 85,45,70,120,54,7,149,45,19,145,169,326,186,81,8,4,181,184,bus 0.24349

39 81,45,68,169,73,6,151,44,19,146,173,336,186,75,7,0,183,189,bus 0.25910

645 86,44,77,155,60,7,152,44,19,141,174,345,161,72,9,0,187,192,opel 0.26308

808 83,46,68,139,59,6,150,44,19,146,172,336,183,74,5,3,185,191,bus 0.26725

The 9 Nearest Neighbors of the Instance #599

599 93,39,63,146,58,7,128,52,18,134,149,246,158,63,9,7,198,204,saab 0.0

204 89,40,58,137,58,7,122,54,17,140,146,225,150,63,7,4,199,206,van 0.23894

268 86,39,60,140,60,7,119,55,17,134,140,212,141,61,7,8,200,207,van 0.28835

460 90,41,62,147,60,6,128,52,18,141,149,246,157,61,13,4,201,208,van 0.29689

607 86,39,62,129,59,6,116,57,17,135,137,203,145,64,7,9,199,204,van 0.29716

754 91,41,64,148,61,8,129,51,18,142,161,249,153,68,6,12,194,201,van 0.30660

262 89,40,60,131,56,6,118,56,17,137,143,209,153,65,10,8,193,199,van 0.32386

537 86,40,66,139,59,7,122,54,17,139,145,225,143,63,7,11,202,208,van 0.32801

55 94,36,66,151,61,8,133,50,18,135,154,265,119,62,9,3,201,208,van 0.34760

Table 13: The 9NN of the top 10 class outliers of vehicle dataset.



The 9 Nearest Neighbors of the Instance #422

422 90,34,66,158,59,7,140,47,18,124,165,298,117,61,1,3,201,207,saab 0.0

673 91,35,66,159,59,7,147,45,19,131,169,322,123,64,1,1,197,203,opel 0.25780

660 88,35,60,143,59,7,128,52,18,129,147,246,109,62,1,6,202,209,van 0.26971

519 88,39,76,155,62,8,137,48,18,137,156,281,124,63,3,6,201,209,van 0.34470

541 88,34,58,140,59,6,127,52,18,130,148,243,113,63,4,10,199,206,van 0.35032

32 93,35,66,154,59,6,142,46,18,128,162,304,120,64,5,13,197,202,opel 0.38854

279 94,37,73,186,71,7,154,42,19,127,171,362,132,67,2,8,197,206,bus 0.39759

295 90,38,75,164,64,7,151,43,19,131,168,345,139,66,0,0,195,204,bus 0.40310

601 93,39,78,164,66,8,139,48,18,140,157,290,126,64,4,7,201,208,van 0.40611

The 9 Nearest Neighbors of the Instance #163

163 85,40,72,139,59,5,132,50,18,135,159,260,150,68,3,9,191,195,saab 0.0

483 86,38,76,143,59,8,142,47,18,131,167,301,138,71,5,10,189,196,van 0.23179

316 91,41,66,131,56,9,126,53,18,144,159,237,155,72,3,10,191,194,van 0.25511

340 89,40,72,155,63,7,146,45,19,135,175,321,145,72,4,10,192,196,bus 0.26946

286 83,41,70,155,65,7,144,46,19,141,168,309,147,71,4,12,188,195,bus 0.27079

211 86,37,69,150,63,8,138,48,18,134,163,284,124,71,1,6,189,195,van 0.27496

46 91,43,70,133,55,8,130,51,18,146,159,253,156,70,1,8,190,194,van 0.27545

774 94,37,72,146,60,9,133,50,18,135,161,262,128,69,2,7,192,195,van 0.28875

514 89,38,74,138,59,7,136,49,18,133,167,278,128,72,7,7,189,193,van 0.29409

The 9 Nearest Neighbors of the Instance #32

32 93,35,66,154,59,6,142,46,18,128,162,304,120,64,5,13,197,202,opel 0.0

227 94,35,66,147,62,9,131,50,18,127,159,258,115,66,8,7,196,201,van 0.26709

435 85,37,68,145,60,6,130,51,18,130,150,253,121,65,3,14,195,203,van 0.29721

10 86,36,70,143,61,9,133,50,18,130,153,266,127,66,2,10,194,202,van 0.30324

419 93,34,72,144,56,6,133,50,18,123,158,263,125,63,5,20,200,206,saab 0.30658

256 91,36,77,157,56,7,155,42,19,126,177,361,123,65,8,15,195,201,saab 0.30713

753 91,36,72,162,60,8,150,44,19,133,166,334,121,63,2,22,196,205,saab 0.33312

541 88,34,58,140,59,6,127,52,18,130,148,243,113,63,4,10,199,206,van 0.33425

767 88,39,70,166,66,7,148,44,19,134,167,332,143,69,5,13,193,201,bus 0.33563

Table 13: The 9NN of the top 10 class outliers of vehicle dataset. (Continued).



The 9 Nearest Neighbors of the Instance #806

806 88,45,82,155,56,8,154,43,19,149,180,357,170,69,3,0,188,193,saab 0.0

358 87,45,82,164,60,8,156,42,19,144,181,366,174,70,2,2,190,196,opel 0.17285

387 90,47,85,145,58,9,152,44,19,155,175,345,184,73,4,2,186,197,van 0.24231

287 88,43,84,136,55,11,154,44,19,150,174,350,164,73,6,2,185,196,van 0.25329

305 86,45,73,152,63,6,149,44,19,145,170,335,176,71,6,1,189,196,bus 0.25702

124 85,45,71,150,63,8,143,46,19,147,171,307,179,72,2,3,187,196,van 0.27650

80 87,46,71,159,66,6,151,44,19,146,175,343,189,73,2,0,186,190,bus 0.27689

366 90,47,85,149,60,10,155,43,19,155,179,355,186,75,1,5,185,196,van 0.28373

25 85,45,80,154,64,9,147,45,19,148,169,324,174,71,1,4,188,199,van 0.28760

The 9 Nearest Neighbors of the Instance #645

645 86,44,77,155,60,7,152,44,19,141,174,345,161,72,9,0,187,192,opel 0.0

276 83,46,73,137,59,6,148,45,19,146,167,327,183,75,8,0,185,191,bus 0.23804

305 86,45,73,152,63,6,149,44,19,145,170,335,176,71,6,1,189,196,bus 0.24678

216 84,44,77,150,59,5,152,44,19,143,175,344,177,77,8,2,183,187,saab 0.26308

626 83,44,70,166,69,5,143,46,18,143,166,306,170,69,7,6,188,193,bus 0.28693

393 86,47,75,165,68,6,154,43,19,146,176,356,190,74,7,3,188,194,bus 0.28911

287 88,43,84,136,55,11,154,44,19,150,174,350,164,73,6,2,185,196,van 0.29813

192 93,43,76,149,57,7,149,44,19,143,172,335,176,69,14,0,189,194,saab 0.31473

627 88,44,71,165,70,7,144,46,19,141,167,312,172,71,4,4,188,193,bus 0.31637

The 9 Nearest Neighbors of the Instance #113

113 88,35,50,121,58,5,114,59,17,122,132,192,138,74,21,4,182,187,opel 0.0

751 85,36,51,115,56,5,119,57,17,124,139,207,127,81,13,5,181,184,van 0.41368

297 82,37,66,126,54,7,132,52,18,127,148,252,142,72,17,7,183,187,saab 0.44176

120 89,37,54,119,53,5,134,50,18,127,151,266,146,79,16,14,184,185,saab 0.50258

834 82,36,51,114,53,4,135,50,18,126,150,268,144,86,15,4,181,182,saab 0.51485

103 92,38,60,130,62,5,114,58,17,132,135,194,137,72,14,5,190,194,van 0.54152

289 88,37,57,132,62,6,135,50,18,125,151,265,144,83,16,16,180,184,saab 0.54642

138 88,37,63,130,58,5,125,54,18,130,141,230,145,74,14,20,184,188,saab 0.59244

533 89,41,63,134,59,6,123,55,17,137,148,223,150,76,12,3,186,188,van 0.59512

Table 13: The 9NN of the top 10 class outliers of vehicle dataset. (Continued).



The 9 Nearest Neighbors of the Instance #451

451 94,37,74,169,59,7,162,41,20,133,178,394,130,63,6,6,198,204,opel 0.0

279 94,37,73,186,71,7,154,42,19,127,171,362,132,67,2,8,197,206,bus 0.29612

646 90,38,79,185,69,6,160,40,20,130,178,393,133,66,2,14,198,205,bus 0.32887

256 91,36,77,157,56,7,155,42,19,126,177,361,123,65,8,15,195,201,saab 0.33441

542 93,39,86,180,59,9,167,39,20,134,186,418,129,63,6,17,197,204,saab 0.34376

43 93,37,76,183,63,8,164,40,20,134,191,405,139,67,4,7,192,197,saab 0.35013

812 98,38,72,192,69,5,166,38,20,131,189,427,138,70,1,3,200,202,bus 0.35224

548 94,39,75,184,72,8,155,42,19,133,175,365,145,70,4,5,192,200,bus 0.35567

697 92,37,75,184,70,6,154,42,19,131,184,363,127,71,0,4,198,202,bus 0.35731

Table 13: The 9NN of the top 10 class outliers of vehicle dataset. (Continued).

Since this experiment was performed on a dataset with more than two class labels, we shall analyze the experimental results shown in tables 12 and 13 to examine the ranking mechanism. Consider instance #349 (opel), which is ranked at the top despite the fact that more than one class label appears in its surrounding. It is to be noticed that the PCLs of its surrounding are 2/9 (saab) and 6/9 (van). Compare this case with instance #599, where only one class dominates its surrounding (i.e., the PCL of its surrounding is 8/9 (van)). The justification can be extracted from table 12, where the Deviation is 289.25 and 263.36 for instances #349 and #599 respectively. We performed further investigation of the neighbors of instance #349, especially the surrounding instances that have the second minority PCL, which is 2/9 (saab); these instances are #636 and #148. We found that both PCL(636) and PCL(148) are 2/9, which means they are also considered class outliers, but at a rank beyond 10.

Comparing instances #349 and #163, they are almost similar (the PCLs of their surroundings are 2/9 and 6/9), but the Deviation of instance #163 is 243.16, which is less than the Deviation of instance #349.
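The PCL fractions quoted above can be read directly off the neighbor lists in table 13 by counting class labels. The sketch below illustrates this counting; the function name is ours, not the project's.

```python
from collections import Counter

# Illustrative sketch: per-class PCL fractions among the K nearest neighbors.
def pcl_of_surrounding(neighbor_labels, k):
    counts = Counter(neighbor_labels)
    return {label: f"{n}/{k}" for label, n in counts.items()}

# Class labels of the neighbors of instance #349 in table 13
# (the KNN region of size K = 9 also counts the instance itself, opel):
neighbors_349 = ["van", "van", "van", "van", "saab", "van", "saab", "van"]
print(pcl_of_surrounding(neighbors_349, 9))  # {'van': '6/9', 'saab': '2/9'}
```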


The explanation described above reflects the importance of the Deviation factor.

3.2. Experimental Results Analysis Study

In this section, we shall consider only the votes dataset to understand the mechanism used to obtain the class outlier rank, as shown previously in table 4. The same concept can be observed in the other applications.

As mentioned in section 2.1, we scale PCL(T) from [1/K, 1] to [1, K] by multiplying it by K. The α and β factors control the importance and effects of Deviation and K-Distance. The ranges of α and β are 0 ≤ α ≤ M and 0 ≤ β ≤ 1, where M is a changeable value based on the application domain and the initial experimental results. For example, if the Deviation is in hundreds, then the best value for α is 100; if the Deviation is in tens, then the best value for α is 10, and so on. Referring to our main formula

COF(T) = K × PCL(T) + α × (1 / Deviation(T)) + β × KDist(T)

The main goal of scaling PCL and of the α and β factors is to obtain the COF value in the format X.YYYY, where X reflects the scaled PCL and YYYY reflects the Deviation and KDist factors. We consider PCL the most important factor in deciding Class Outlierness. The α and β factors make a trade-off between the importance of Deviation and KDist, and their proposed ranges are chosen to maintain that trade-off.

Examining the results of experiment I: the Deviation was in hundreds, so the best choice for α was 100, and KDist was in tens, so the best choice for β was 0.1, which always keeps the effect of KDist in the YYYY portion.
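As a minimal sketch of how this scoring step could be implemented (a hypothetical helper; we assume PCL, Deviation, and KDist have already been computed for each instance):

```python
def cof(pcl, deviation, k_dist, k=7, alpha=100, beta=0.1):
    """Class Outlier Factor: smaller values indicate stronger class outliers.

    pcl       -- probability of the instance's class among its K neighbors
    deviation -- how much the instance deviates from instances of its class
    k_dist    -- distance to the instance's K nearest neighbors
    alpha     -- weight of 1/Deviation (e.g. 100 when Deviation is in hundreds)
    beta      -- weight of KDist (e.g. 0.1 when KDist is in tens)
    """
    return k * pcl + alpha * (1.0 / deviation) + beta * k_dist

# An instance alone of its class among its 7 neighbors (PCL = 1/7) with a
# large Deviation and a small KDist scores in the 1.YYYY range, placing it
# near the top of the class outlier list.
score = cof(pcl=1/7, deviation=289.25, k_dist=5.0)
```

Here the integer part of the score comes from the scaled PCL (K · PCL = 1), while the fractional part combines the Deviation and KDist contributions, which matches the X.YYYY reading described above.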


The optimal value of K is determined by a trial-and-error technique. Many factors affect this choice; the dataset size and the number of classes are particularly important. A very high value of K blurs localized regions of the search space (e.g., instances of the same class enter the KNN region), which can lead to a wrong estimation of PCL. On the other hand, an extremely low value of K means KNN is not well utilized and gives a wrong impression about the importance of PCL.

Odd values of K make more sense because we would like PCL to show a clear bias toward one class.
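As an illustration of this trial-and-error selection, a sketch that enumerates the candidate values of K (restricted to odd numbers, as argued above) might look like this; the upper bound k_max = 15 is an assumption for illustration only:

```python
def candidate_ks(n_instances, k_max=15):
    """Odd K values from 3 up to k_max (never exceeding the dataset size).

    Odd K avoids an even split among the neighbors, so PCL always yields
    a clear majority class.
    """
    return list(range(3, min(k_max, n_instances - 1) + 1, 2))

# For the votes dataset (435 instances), each candidate K would be tried
# and the resulting top-ranked class outliers inspected before fixing K.
print(candidate_ks(435))  # [3, 5, 7, 9, 11, 13, 15]
```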


4. Comparison Study

In this chapter, we perform a comparison study with He's method [15, 16]. Below is an overview of He's formula used for the ranking score.

Let A1, ..., Am be a set of attributes with domains D1, ..., Dm respectively. Let the dataset D be a set of records where each record t ∈ D1 × ... × Dm. The result of a clustering algorithm executed on D is denoted C = {C1, C2, ..., Ck}, where Ci ∩ Cj = ∅ (i ≠ j) and C1 ∪ C2 ∪ ... ∪ Ck = D; the number of clusters is k.

Suppose CL is an additional attribute of D that distinguishes the class of each record and takes values from the set {cl1, cl2, ..., clp}. The output C = {C1, C2, ..., Ck} is just as described above. We define Pr(cli | D) and Pr(cli | Cj) as the frequency of cli in D and in Cj, respectively:

Pr(cli | D) = |{t | t.CL = cli, t ∈ D}| / |D|,

Pr(cli | Cj) = |{t | t.CL = cli, t ∈ Cj}| / |Cj|.

Given a set of records R and a record t, the similarity between R and t is defined as:

similarity(t, R) = ( Σ_{i=1}^{|R|} similarity(t, ti) ) / |R|, where ti ∈ R.


Semantic Outlier Factor of a record t: suppose the clustering algorithm assigns t to Ck, the class value of t is cli, and R is the subset of D with class value cli. The semantic outlier factor of t is defined as:

SOF(t) = Pr(cli | Ck) · similarity(t, R) / Pr(cli | D).
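The two definitions above can be sketched directly in code (a hedged illustration with hypothetical argument names, not He's original implementation):

```python
def avg_similarity(t, R, similarity):
    """similarity(t, R): the mean of similarity(t, ti) over all ti in R."""
    return sum(similarity(t, ti) for ti in R) / len(R)

def sof(p_class_in_cluster, sim_t_R, p_class_in_dataset):
    """Semantic Outlier Factor of a record t.

    p_class_in_cluster -- Pr(cli | Ck), frequency of t's class in its cluster
    sim_t_R            -- similarity(t, R), R being the records of t's class
    p_class_in_dataset -- Pr(cli | D), frequency of t's class in the dataset
    """
    return p_class_in_cluster * sim_t_R / p_class_in_dataset

# A record whose class is rare in its cluster (3/100) but common in the
# dataset (1/2) gets a small SOF, flagging it as a semantic outlier.
score = sof(0.03, 0.6, 0.5)
```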

In the following, we summarize the main difference between the SOF formula and our proposed COF formula.

Figure 15 illustrates the difference between the ranking criteria of our proposed approach and the semantic outlier approach (COF vs. SOF). Suppose the sizes of both clusters A and B are 100, and the probability of class x with respect to the cluster is 3/100 in both cases A and B. For x1, similarity(x1, R) is the same in both cases. Under SOF, x1 therefore has the same rank in cases A and B, but under COF the ranks differ because the PCL of x1 is 3/7 in case A and 1/7 in case B (assuming K = 7).
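The PCL difference between the two cases can be reproduced with a short illustration (hypothetical labels; we assume the K-neighborhood label list includes the instance itself, so that three instances of class x among K = 7 give PCL = 3/7):

```python
from collections import Counter

def pcl(instance_label, neighborhood_labels):
    """Probability of the instance's class among its K-neighborhood labels."""
    counts = Counter(neighborhood_labels)
    return counts[instance_label] / len(neighborhood_labels)

# Case A: x1 has two other instances of class x within its 7-NN region.
case_a = ['x', 'x', 'x', 'o', 'o', 'o', 'o']
# Case B: x1 is the only instance of class x within its 7-NN region.
case_b = ['x', 'o', 'o', 'o', 'o', 'o', 'o']

pcl_a = pcl('x', case_a)  # equals 3/7, a weaker class outlier
pcl_b = pcl('x', case_b)  # equals 1/7, a stronger class outlier
```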


Figure 15: SOF vs COF.


#   Instance #   SOF        #   Instance #   SOF
1   176          0.3036     11  375          1.4520
2   71           0.3394     12  151          1.4927
3   355          0.3645     13  372          1.4950
4   267          0.3659     14  388          1.6365
5   183          0.8726     15  2            1.6489
6   97           0.9892     16  382          1.6727
7   88           1.0724     17  215          1.7010
8   402          1.1690     18  164          1.7168
9   407          1.3309     19  6            1.7236
10  248          1.3487     20  325          1.7259

Table 14: The top 20 Semantic Outliers for the votes dataset.

The experimental results obtained by applying SOF to the votes dataset are shown in Table 14. Comparing them with our results in Table 4, we notice that instance #176 is the top (rank 1) outlier using SOF (Table 14), whereas using COF its rank is 11 (Table 4). Note that the PCL of this instance is 2/7, which indicates that there is another instance of the same class among its seven nearest neighbors.

Instance #407 is ranked first using COF, while it has rank 9 using SOF. From our observation, instance #407 is alone of its class type among its seven nearest neighbors. Moreover, its Deviation is the greatest, which implies a certain uniqueness in the instance's behavior, and its K-Distance is very small (a high density of instances of other class types). The rank of 9 that SOF assigns to instance #407 indicates its inability to recognize such important cases. We believe that ranking Class Outliers using KNN gives more reasonable results.


5. Conclusion

We have presented a novel approach for Class Outlier mining based on the K nearest neighbors, using a distance-based similarity function to determine the nearest neighbors. We motivated Class Outliers and their significance as exceptional cases, proposed a Class Outlier definition, and introduced a ranking score, the Class Outlier Factor (COF), to measure the degree to which an object is a Class Outlier.

Beyond the problem definition and motivation, we proposed an efficient algorithm for mining and detecting Class Outliers. An implementation has been developed using the Weka framework [39]. The implementation has been tested on datasets from various domains (medical, business, and others) and of different types (continuous, nominal with small numbers of values, nominal with larger numbers of values, and mixed). The experimental results were very interesting and reasonable. Furthermore, a comparison study has been performed with related methods.

In future work, we shall study enhancing the performance efficiency of the algorithm, which currently requires O(N²) time. We also plan to propose a Class Outlier Detection Model, and to take advantage of the output of this work to devise a scheme for inducing Censored Production Rules (CPRs) [30] from large datasets.


References

[1] Aha, D., Kibler, D.: Instance-based learning algorithms, Machine Learning, vol. 6, pp. 37-66, 1991.

[2] Angiulli, F., Pizzuti, C.: Fast Outlier detection in high dimensional spaces, In Proceedings of the Sixth European Conference on the Principles of Data Mining and Knowledge Discovery, pp. 15-26, 2002.

[3] Barbarà, D., Chen, P.: Using the fractal dimension to cluster datasets, In: Proc. KDD, pp. 260–264, 2000.

[4] Barnett, V., Lewis, T.: Outliers in Statistical Data, John Wiley, 1994.

[5] Bay, S. D., and Schwabacher, M.: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule, Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[6] Blake C., Keogh E., Merz C. J.: UCI Repository of Machine Learning Databases, [Online Available]: http://www.ics.uci.edu/~mlearn/MLRepository.htm, 1998.

[7] Bolton, R. J., Hand, D. J.: Statistical fraud detection: A review (with discussion), Statistical Science, 17(3): pp. 235-255, 2002.

[8] Breunig, M., Kriegel, H., Ng, R., Sander, J.: LOF: Identifying density-based local outliers, In: Proc. SIGMOD Conf, pp. 93–104, 2000.

[9] Dunham, M. H.: Data Mining Introductory and Advanced Topics, Prentice Hall, 2003.

[10] Eskin E., Arnold A., Prerau M., Portnoy L., Stolfo S.: A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data, In Data Mining for Security Applications, 2002.


[11] Ester M., Kriegel H.-P., Sander J., Xu X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), Portland, OR. pp. 226-231, 1996.

[12] Han, J., Kamber, M.: Data Mining: Concepts and Techniques, San Francisco, Morgan Kaufmann, 2001.

[13] Hawkins, D.: Identification of Outliers, Chapman and Hall, 1980.

[14] Hawkins, S., He, H. X., Williams, G. J., Baxter, R. A.: Outlier detection using replicator neural networks, In Proc. of the Fifth Int. Conf. and Data Warehousing and Knowledge Discovery (DaWaK02), 2002.

[15] He, Z., Deng, S., Xu., X.: Outlier detection integrating semantic knowledge, In: Proc. of WAIM’02, pp. 126-131, 2002.

[16] He, Z., Xu, X., Huang, J., Deng, S.: Mining Class Outliers: Concepts, Algorithms and Applications in CRM, Expert Systems with Applications (ESWA'04), 27(4): pp. 681-697, 2004.

[17] Jain, A., Murty, M., Flynn, P.: Data clustering: A review, ACM Comp, Surveys 31, 264–323, 1999.

[18] Johnson, T., Kwok, I., Ng, R.: Fast computation of 2-dimensional depth contours, In: Proc. KDD. pp. 224–228, 1998.

[19] Joshi, M., Agarwal, R., Kumar, V.: Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction, ACM SIGMOD, 2001.

[20] Joshi, M., Agarwal, R., Kumar, V.: Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong?, ACM SIGKDD, 2002.

[21] Joshi, M., Kumar, V.: CREDOS: Classification using Ripple Down Structure, ICDE, 2003.

[22] Khan, M., Ding, Q., Perrizo, W.: k-Nearest Neighbor Classification on Spatial Data Streams Using P-trees, PAKDD, pp. 517-518, 2002.

[23] Knorr E. M., Ng. R. T.: Finding intensional knowledge of distance-based outliers, In Proceedings of the 25th VLDB Conference, 1999.

[24] Knorr, E., Ng, R., Tucakov, V.: Distance-based outliers: Algorithms and applications, VLDB Journal 8, pp. 237–253, 2000.

[25] Knorr, E., Ng, R.: A unified notion of outliers: Properties and computation, In: Proc. KDD. pp. 219–222, 1997.

[26] Knorr, E., Ng, R.: Finding intentional knowledge of distance-based outliers, In: Proc. VLDB. pp. 211–222, 1999.

[27] Knorr, E.M., Ng, R.: Algorithms for mining distance-based outliers in large datasets, In: Proc. VLDB pp. 392–403, 1998.

[28] Lane, T., Brodley, C. E.: Temporal sequence learning and data reduction for anomaly detection, ACM Transactions on Information and System Security, 2(3): pp. 295-331, 1999.

[29] Martin, B.: Instance-Based learning: Nearest Neighbor With Generalization, Master Thesis, University of Waikato, Hamilton, New Zealand, 1995.

[30] Michalski, R. S., Winston, P. H.: Variable Precision Logic, Artificial Intelligence Journal 29, Elsevier Science Publishers B.V. (North-Holland), pp. 121-146, 1986.

[31] Okamoto, S., Yugami, N.: Effects of domain characteristics on instance-based learning algorithms, Theoretical Computer Science, 1(298): pp. 207-233, 2003.

[32] Papadimitriou, S., Faloutsos C.: Cross-outlier detection, In: Proc. of SSTD’03, pp. 199-213, 2003.


[33] Quinlan, J.R.: C4.5: Program for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA., 1993.

[34] Quinlan, J.R.: Induction of Decision Trees, Machine Learning, 1: pp. 81–106, 1986.

[35] Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets, In Proceedings of the ACM SIGMOD Conference, pp. 427-438, 2000.

[36] Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection, John Wiley and Sons, 1987.

[37] Rulequest Research, Gritbot, http://www.rulequest.com

[38] Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining, Addison-Wesley, 2005.

[39] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition), San Francisco, Morgan Kaufmann, 2005.
