
Knowledge-Based Systems 24 (2011) 269–274


Finding key attribute subset in dataset for outlier detection

Peng Yang *, Qingsheng Zhu
College of Computer Science, Chongqing University, Chongqing 400044, China

Article info

Article history:
Received 22 March 2010
Received in revised form 10 September 2010
Accepted 10 September 2010
Available online 19 September 2010

Keywords: Outlier detection; Key attribute subset; Outlying reduction; Data mining; High dimensional dataset

0950-7051/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2010.09.003
* Corresponding author. E-mail addresses: [email protected] (P. Yang), [email protected] (Q. Zhu).

Abstract

Detection of outliers in high dimensional datasets has found important applications in many fields, yet the unexpected time consumption is likely to hinder its practical use. Thus, it makes sense to build an efficient method for finding meaningful outliers and analyzing their intentional knowledge. In this paper, we utilize the concept of rough sets to construct a method for outlying reduction, based on an outlier detection and analysis system. By defining outlying partition similarity, we can mine outliers on a key attribute subset rather than on the full dimensional attribute set of the dataset, as long as the similarity between the outlying partitions produced on them is large enough. For this purpose, we propose a novel method for finding the key attribute subset of a dataset, which starts by seeking all outliers on the full attribute set and then searches through all outlying attribute subsets for these points. After that, it determines the key attribute subset in accordance with the similarity between outlying partitions. Experiments show that our method finds the key attribute subset more efficiently than previous methods, thereby improving the feasibility of outlier detection.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

An outlier is defined as a data point which is considerably dissimilar or inconsistent with the rest of the data. Outlier detection finds applications in various fields, such as telecom or credit card fraud [1], medical analysis [2], text mining [3], network intrusion [4], and so on. All known outlier detection methods can be classified into several categories, including methods based on data distribution [5], depth [6], distance [7], density [8] and clustering [9]. Most of these studies, however, merely focus on how to identify outliers and fail to provide intentional knowledge of outliers. Nevertheless, some researchers have paid attention to why an identified outlier is exceptional and what plays the important role in its formation. For example, Knorr and Ng [10] pointed out that intentional knowledge can help the user to understand the data. They proposed a method for finding the strongest as well as weak outliers, along with their corresponding structural intentional knowledge. Though two algorithms were developed to speed it up, the method by Knorr and Ng still fails to work efficiently on sparse high dimensional data. With the help of categorical and behavioral similarities of outliers, Chen and Tang [11] introduced the notion of outlying patterns as intentional knowledge of outliers, and proposed algorithms to mine those patterns and knowledge sets of distance-based outliers.


Their algorithms, however, require some domain knowledge and thus lack general application. Bandyopadhyay and Santra [12] presented a genetic solution to the outlier detection problem. They defined outliers by examining those projections of the data along which the data points have abnormal or inconsistent behavior. In order to determine the projections, an evolutionary search technique is employed, using a grid count tree data structure to compute the sparsity factor. Ghoting et al. [13] presented a fast algorithm, RBRP, for mining distance-based outliers, particularly targeted at high dimensional datasets. It scales log-linearly as a function of the number of data points and linearly as a function of the dimensionality. In addition, Ye et al. [14] proposed an algorithm to detect projected outliers in high dimensional mixed attribute datasets. Combined with information entropy, they defined a novel measure of anomaly subspace, in which meaningful outliers can be detected and explained. Unlike previous projected outlier detection methods, the dimensionality of the anomaly subspace need not be decided beforehand. However, only a bottom-up method has been implemented so far to find the anomaly subspace.

The above related work tackles outlier detection and analysis on the full dimensional attribute space, whereby some important information about the subsets on which these outliers exist may be omitted. Moreover, attribute reduction techniques have never been used directly to find outliers. In this paper, we introduce the concepts of outlying reduction and key attribute subset (KAS), and propose an effective method to find the KAS of a dataset, which can produce an outlying partition approximating that on the full dimensional attribute set.


As a preliminary example, assume that we have historical technical statistics of the NBA available and need to find outliers (perhaps the outstanding players) on some attribute set selected by a domain expert. Since detecting outliers on the full attribute set of a large dataset is likely to be time-consuming, we find the KAS on a sample (e.g. the data from one season) instead of on the whole set, and then detect the outliers on the KAS from the whole dataset. To this end, an efficient method for finding the KAS is proposed in this paper. The method first finds all outliers on the full attribute set, after which it searches through all outlying attribute subsets for these points. As a result, the KAS can be identified by calculating the outlying partition similarity between each outlying attribute subset and the full attribute set. Experimental results show that our novel method can be efficiently applied to high dimensional datasets for outlier detection.

The rest of this paper is organized as follows. Section 2 outlines the concepts of rough set and outlying reduction, followed by the notion of KAS. Section 3 presents an efficient method for finding the KAS. Experimental results demonstrating the effectiveness and efficiency of our proposed method are shown in Section 4. The paper finishes with the conclusions given in Section 5.

2. Rough set theory and outlying reduction

Rough set theory was proposed by Pawlak to deal with the classificatory analysis of data tables. It has been successfully applied to data analysis tasks in the field of data mining and knowledge discovery [15-17]. In rough set theory, a dataset is represented as a table, in which each row represents an object and each column represents an attribute that can be measured for the object. This table is called an information system. More formally, it can be represented by a 2-tuple (X, A), where X is a non-empty finite set of objects, called the universe, and A is a non-empty finite set of attributes such that ∀a ∈ A, a: X → Va, where Va is called the value set of a.

An important concept of rough set theory is attribute reduction, which aims to reduce the number of attributes while preserving a certain property that we want. An attribute reduction is a minimal subset of a given attribute set, and its induced rules have almost the same level of performance as the full dimensional attribute set [18]. Many researchers have studied attribute reduction. Shen et al. [19] proposed an attribute reduction algorithm that keeps the dependency degree invariant; however, their algorithm lacks a mathematical basis. Hu et al. [20] proposed a reduction method that uses information entropy to measure the significance of attributes; the method, however, cannot analyze the structure of the attribute reduction. Tsang et al. [21] proposed a formal notion of attribute reduction based on fuzzy approximation operators of existing fuzzy rough sets. They also analyzed the mathematical structure of the attribute reduction by using the discernibility matrix approach with strict mathematical reasoning. Their method performs well on some numerical problems.

In this section, we extend the concept of attribute reduction to the outlier detection problem. Since outlier detection aims to find the objects which are markedly different from other objects in a dataset, it can be considered as a partition problem, i.e., identifying whether an object is abnormal or not. In particular, we are interested in the property of the outlying partition, which a reduction should preserve, as far as possible, relative to the original partition provided by the full attribute set. If most outliers detected on the full attribute set are still abnormal on some attribute subset, such a subset will be considered as an attribute reduction in accordance with the outlying partition.

In analogy to a rough set, an outlier detection and analysis system (ODAS) here is represented by a 3-tuple (X, A, F), where X and A are the same as those in a rough set, and F represents an outlier detection method, typically a distance-based algorithm.

Definition 1. Let XO be the set of outliers obtained by applying algorithm F on dataset X. The outlying partition (OP) of X is defined as C = (XO, X − XO) if XO ≠ ∅.

Actually, every object in X belongs to exactly one set of the partition, i.e., it is either an outlier or a normal object.

Definition 2. Let CS and CR be the OPs of X on S and R respectively, where S, R ⊆ A. S is called a relative outlying reduction set (RORS) of R if S ⊂ R and CS = CR.

Definition 3. Let S be a RORS of R, where S ⊂ R ⊆ A. If no T ⊂ S is a RORS of S, then S is called the outlying reduction set (ORS) of R.

Note that the ORS of the full attribute set A can replace A for outlier detection in a high dimensional dataset. Finding outliers on such an outlying reduction set is more efficient, whilst acquiring the same outlying partition as on the full attribute set A. For an attribute subset which is not the ORS of A, however, if the outlying partition produced on that subset approximates the one produced on A, we can still find and analyze outliers on that subset rather than on A to achieve efficiency.

Definition 4. Let CS and CR be the OPs of X on S and R respectively, where S, R ⊆ A. The outlying partition similarity (OPS) between CS and CR, denoted as ops(CS, CR), is defined as follows:

ops(C_S, C_R) = w_1 \cdot \frac{\mathrm{card}(X_{OS} \cap X_{OR})}{\mathrm{card}(X_{OS} \cup X_{OR})} + w_2 \cdot \frac{\mathrm{card}(X_{OS} \cap X_{OR})}{\mathrm{card}(X_{OS})} + w_3 \cdot \frac{\mathrm{card}(X_{OS} \cap X_{OR})}{\mathrm{card}(X_{OR})} \quad (1)

Note that the card function here calculates the cardinality of a set. w1, w2 and w3 are the weights of the support, confidence and inclusion degree of outlying partition CS relative to CR. The value of OPS indicates the approximation degree of outlier detection between attribute sets S and R. Obviously, we have 0 ≤ ops(CS, CR) ≤ 1, and ops(CS, CR) = 1 holds if XOS = XOR.
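To make Eq. (1) concrete, the following minimal sketch computes the OPS of two outlier sets. The function name `ops` and the default equal weights w1 = w2 = w3 = 1/3 are our own illustrative choices (the same weights are used in the worked example of Section 3.2.3), not values prescribed by the paper.

```python
def ops(xo_s, xo_r, w=(1/3, 1/3, 1/3)):
    """Outlying partition similarity of Eq. (1) between two outlier sets.

    xo_s, xo_r -- sets of outliers detected on attribute sets S and R.
    w          -- weights (w1, w2, w3) for support, confidence and inclusion degree.
    """
    xo_s, xo_r = set(xo_s), set(xo_r)
    inter = len(xo_s & xo_r)
    union = len(xo_s | xo_r)
    if union == 0:                      # neither partition contains outliers
        return 1.0
    w1, w2, w3 = w
    return (w1 * inter / union
            + w2 * (inter / len(xo_s) if xo_s else 0.0)
            + w3 * (inter / len(xo_r) if xo_r else 0.0))
```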

Definition 5. Let CA be the OP of dataset X obtained by applying algorithm F on the full attribute set A. Assume that, among all subsets of A, S attains mops = max_S ops(C_S, C_A). Then S is called the key attribute subset (KAS) of X.

Intuitively, the outliers found on the KAS are the closest to those found on the full attribute set A of dataset X, and the value of mops represents this approximation degree. If the degree is greater than a desirable threshold ε, the KAS is called the ε-approximation outlying reduction of A, and we consider that the KAS has almost the same level of performance as the full attribute set A for finding outliers in X. Obviously, the KAS will be the ORS of A if ε = 1. Given a dataset X and its outlier set XOA, our task is to find the KAS of X. Whenever the corresponding value of mops is large enough, we can find and analyze outliers on the KAS rather than on A to achieve higher efficiency.

3. Finding KAS of dataset

3.1. Brute-force method

Given an ODAS (X, A, F) as defined in Section 2, a naive method for finding the KAS of X is based on brute-force search, in which all subsets of A need to be examined. Since there are 2^d − 2 non-empty proper subsets of A, we first obtain the outlier set on each subset: ∀{S | S ⊂ A}, run outlier detection algorithm F to obtain the outlying partition CS. Then we calculate the outlying partition similarity between CS and CA. The attribute subset whose OP has the maximum ops will be the KAS. On high dimensional datasets, however, this method is computationally infeasible because the number of subsets increases exponentially with the dimensionality d. Thus, we need a more efficient method for finding the KAS, which is discussed in the next subsection.
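As a reference point, a minimal sketch of this brute-force search is shown below; it assumes a user-supplied detector `detect_outliers(X, attrs)` returning the outlier set on a given attribute subset, and reuses the hypothetical `ops` helper sketched in Section 2.

```python
from itertools import combinations

def brute_force_kas(X, attributes, detect_outliers, ops):
    """Exhaustively score every non-empty proper attribute subset (2^d - 2 of them)."""
    full_outliers = detect_outliers(X, attributes)
    best_subset, best_score = None, -1.0
    for r in range(1, len(attributes)):               # subset sizes 1 .. d-1
        for subset in combinations(attributes, r):
            outliers = detect_outliers(X, list(subset))
            score = ops(outliers, full_outliers)
            if score > best_score:
                best_subset, best_score = list(subset), score
    return best_subset, best_score                     # KAS and its mops value
```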

3.2. Our proposed method

Before discussing our proposed method, we give the following definitions:

Definition 6. Let p be an object in dataset X, and S be a subset of the full attribute set A. ∀p ∈ X on A, ∏S(p) is called the projection of p on S.

Definition 7. Let p be an object in dataset X, and S be a subset of the full attribute set A. If ∏S(p) is an outlier, S is called an outlying attribute subset (OAS) w.r.t. p.

Since the outlying partition produced on the KAS approximates that produced on A, we need not search for the KAS among all 2^d − 2 subsets of A. Instead, the KAS can be found only among the OASs, on which the projections of the outliers found on A still remain abnormal. The idea of our proposed method for finding the KAS is as follows.

1. Detect all outliers of X by applying algorithm F on A, and then find all OASs for these outliers.

2. Apply F on each derived OAS and obtain its corresponding outlier set as well as its OP.

3. Calculate ops between each OP obtained in the previous step and the OP produced on A.

According to Definition 5, the OAS whose OP attains the value mops will be the KAS.
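A compact driver for these three steps might look like the sketch below; `detect_outliers`, `find_oas_for_point` and `ops` are hypothetical helpers standing in for the HilOut detector of Section 3.2.1, the Jump_FindOAS search of Section 3.2.2 (returning OASs as tuples of attribute names), and Eq. (1), respectively.

```python
def find_kas(X, attributes, detect_outliers, find_oas_for_point, ops):
    """Three-step KAS search: outliers on A, their OASs, then the best-scoring OAS."""
    full_outliers = detect_outliers(X, attributes)             # step 1: outliers on A
    candidate_subsets = set()
    for p in full_outliers:                                    # step 1 (cont.): OASs per outlier
        candidate_subsets.update(find_oas_for_point(X, attributes, p))
    best_subset, best_score = None, -1.0
    for subset in candidate_subsets:                           # steps 2-3: score each OAS
        outliers = detect_outliers(X, list(subset))
        score = ops(outliers, full_outliers)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score
```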

3.2.1. Outlier detection on A

It is worth noting that in the ODAS (X, A, F), F can be any existing outlier detection algorithm. In this paper, we use the HilOut algorithm [22] due to its simplicity. In this algorithm, the outlying degree of each point p can be calculated. Given an integer K, the outlying degree of a point p is defined as follows:

out(p) = \sum_{i=1}^{K} \mathrm{Dist}(p, p_i), \quad (2)

where p_i denotes the ith nearest neighbor of p. Thus, given n, the expected number of outliers in the dataset, and an application-dependent parameter K, the HilOut algorithm finds the n points with the maximum out score, i.e., the top n outliers.

The algorithm makes use of the notion of a space-filling curve to linearize the dataset, and it consists of two phases. The first phase provides an approximate solution after at most d + 1 sorts and scans of the dataset; it isolates candidate outliers and reduces this candidate set at each iteration. If the size of this set becomes n, the algorithm stops and reports the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase.
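Independently of HilOut's space-filling-curve machinery, Eq. (2) itself is easy to compute by brute force. The following sketch scores every point by the sum of Euclidean distances to its K nearest neighbours and returns the top n; this naive O(N^2) version is ours, not the authors' algorithm, and is only meant to show what the out score measures.

```python
import numpy as np

def top_n_outliers(X, n, K):
    """Rank points by out(p) = sum of distances to the K nearest neighbours (Eq. (2)).

    X -- (N, d) array of numeric attribute values.
    Returns the indices of the n points with the largest outlying degree.
    """
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences, shape (N, N, d)
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # Euclidean distance matrix
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbour
    knn = np.sort(dist, axis=1)[:, :K]            # K smallest distances per point
    out = knn.sum(axis=1)                         # outlying degree out(p)
    return np.argsort(out)[::-1][:n]              # indices of the top n scores
```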

3.2.2. Finding OAS for a given point

Note that in the whole process, finding the OASs for a given outlier in XOA is crucial. We can utilize the lattice structure to optimize the subset search problem. Since the out value of a point p on some attribute set cannot be less than that on any of its subsets [23], i.e., ∀S1, S2 ⊆ A, if S2 ⊆ S1 then out_{S1}(p) ≥ out_{S2}(p), the following property can be derived:

Property 1. If point p is not an outlier on S (S ⊆ A), then it cannot be an outlier on any attribute set that is a subset of S; conversely, if p is an outlier on S, it will be an outlier on any attribute set that is a superset of S.

This property allows us to speed up the pruning operation. Once a point is definitely judged to be an outlier on some attribute set S, all supersets of S are OASs and need not be processed any more; conversely, once it is judged not to be an outlier on S, no subset of S can be an OAS and such subsets need not be processed either.

If there are only a few small subsets containing the given point as an outlier, a bottom-up search will waste much time before most OASs are encountered; a top-down search has the analogous problem. We therefore begin the search for the OASs of a given point from an intermediate level of the lattice. The Jump_FindOAS algorithm is presented below. It receives as input the dataset X, the full d-dimensional attribute set A and one of its outliers o obtained by applying algorithm F, the outlying threshold T, and the lattice level k (1 < k < d). It returns all OASs on which o is still considered an outlier.

Algorithm Jump_FindOAS
Input: Dataset X, attribute set A, outlier o, threshold T, lattice level k
Output: All OASs S on which ∏S(o) is an outlier

OASUP = OASDOWN = ∅; CurrDim = k;
Q = Insert_up_subsets(A, CurrDim);
Ascend_subsets(Q);
WHILE (Q ≠ ∅) {
    A' = POP_subset(Q);
    IF (|A'| = k) THEN
        OASDOWN = OASDOWN ∪ DOWN_FindOAS(A', o, T);
    ELSEIF (out_{A'}(o) ≥ T) THEN {
        OASUP = OASUP ∪ A';
        Prune_up(A');
    }
}
Return (OASUP ∪ OASDOWN);

In the algorithm, CurrDim represents the dimensionality of the currently processed attribute subset. OASUP stores the OASs whose dimensionality is larger than CurrDim, and is initially empty.

The Insert_up_subsets function inserts into queue Q all the subsets of A whose dimensionality is not less than CurrDim.

The Ascend_subsets function sorts the subsets in Q in ascending order of cardinality.

In each iteration, the POP_subset function removes the attribute subset at the front of Q; call it A'. If the cardinality of A' equals k, the DOWN_FindOAS function is called. Otherwise, if the out value of o on A' is not less than the given threshold T, i.e., o is an outlier on this subset, A' is saved in OASUP and the Prune_up function removes from Q all supersets of A'. The whole procedure terminates when Q becomes empty.

Finally, the algorithm returns OASUP ∪ OASDOWN, i.e., all OASs of o.

The DOWN_FindOAS function is described below; it returns the OASs whose dimensionality is less than k. Similarly, OASDOWN stores such OASs and is initially empty.



Function DOWN_FindOAS
Input: Attribute set A' with dimensionality k, outlier o, threshold T
Output: All OASs S ⊆ A' on which ∏S(o) is an outlier

OASDOWN = ∅;
Q = Insert_down_subsets(A');
Descend_subsets(Q);
WHILE (Q ≠ ∅) {
    B = POP_subset(Q);
    IF (out_B(o) ≥ T) THEN
        OASDOWN = OASDOWN ∪ B;
    ELSE Prune_down(B);
}
Return (OASDOWN);

The Insert_down_subsets function inserts into queue Q all the subsets of A'.

The Descend_subsets function sorts the subsets in Q in descending order of cardinality.

In each iteration, the POP_subset function removes the attribute subset at the front of Q; call it B. If the out value of o on B is not less than the given threshold T, i.e., o is an outlier on this subset, B is saved in OASDOWN. Otherwise, the Prune_down function is called to remove from Q all subsets of B. The whole procedure terminates when Q becomes empty.

Note that all attribute subsets with the same dimensionality in queue Q can additionally be sorted in decreasing order of their out values, which allows the algorithm to visit the most promising OASs earlier and thus, hopefully, accelerate the convergence of the pruning.
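For intuition, a minimal, non-optimized rendering of this jump-lattice search is given below; `out_score(X, subset, p)` is a hypothetical callable computing Eq. (2) for point p on the given attribute subset, and the queue handling is simplified (subsets are enumerated level by level, without the within-level out-value ordering mentioned above).

```python
from itertools import combinations

def jump_find_oas(X, attributes, o, T, k, out_score):
    """Find all OASs of outlier o, starting the lattice search at level k.

    Upward part: scan subsets of size > k in ascending size order; a superset of a
    known OAS is an OAS without re-scoring (Property 1).
    Downward part: below each size-k subset, scan subsets in descending size order;
    a subset of a set on which o is not an outlier is pruned (Property 1).
    """
    d = len(attributes)

    # upward search over subsets of size k+1 .. d-1 (ascending)
    oas_up = set()
    for r in range(k + 1, d):
        for subset in map(frozenset, combinations(attributes, r)):
            if any(s <= subset for s in oas_up):        # superset of a known OAS
                oas_up.add(subset)
            elif out_score(X, subset, o) >= T:
                oas_up.add(subset)

    # downward search below every size-k subset (descending)
    oas_down, not_oas = set(), set()
    for base in map(frozenset, combinations(attributes, k)):
        for r in range(k, 0, -1):
            for subset in map(frozenset, combinations(base, r)):
                if subset in oas_down or subset in not_oas:
                    continue
                if any(subset <= q for q in not_oas):    # subset of a non-outlying set
                    not_oas.add(subset)
                elif out_score(X, subset, o) >= T:
                    oas_down.add(subset)
                else:
                    not_oas.add(subset)

    return oas_up | oas_down
```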

3.2.3. Finding KAS among OASs

Analogously, the OASs of all outliers in XOA can be obtained by applying the Jump_FindOAS algorithm described above. After that, the outlying partition similarity between each COAS and CA can be worked out. Obviously, the KAS of dataset X is among these OASs, and is easily determined according to Definition 5.

For example, consider a dataset with 4 outliers {o1, o2, o3, o4} on the full dimensional attribute set {a1, a2, a3, a4}. Assume that the corresponding OASs of all the outliers obtained by applying the Jump_FindOAS algorithm are shown in Table 1, while the outlier set obtained on each OAS is shown in Table 2.

Table 1
The outlying attribute subsets.

Outliers    OASs
o1          {a1}, {a2, a3}, {a2, a5}
o2          {a1, a3}, {a2, a3}, {a3, a5}
o3          {a2, a3}
o4          {a4}

Table 2
The outliers on each OAS.

OASs        Outlying projection
{a1}        o1
{a4}        o4, o5
{a1, a3}    o2
{a2, a3}    o1, o2, o3
{a2, a5}    o1
{a3, a5}    o2

Obviously, the KAS is {a2, a3}, since according to Definition 5, with w1 = w2 = w3 = 1/3, we have

mops = ops(C_{a2,a3}, C_{a1,a2,a3,a4}) = (3/4 + 3/3 + 3/4)/3 = 0.833.

This means that the KAS {a2, a3} is the 0.833-approximation outlying reduction of {a1, a2, a3, a4}. In other words, if we can accept an outlier detection error of 0.167, the full dimensional attribute set {a1, a2, a3, a4} can be replaced by the attribute subset {a2, a3} for finding outliers on this kind of dataset.
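Using the `ops` sketch from Section 2, this value can be checked directly (assuming the same equal weights):

```python
# outliers on the full attribute set {a1, a2, a3, a4} and on the candidate KAS {a2, a3}
full_set_outliers = {"o1", "o2", "o3", "o4"}
kas_outliers = {"o1", "o2", "o3"}

print(round(ops(kas_outliers, full_set_outliers), 3))   # -> 0.833
```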

4. Experimental results

By experiments, we show that our proposed method can efficiently find the KAS of a dataset. All experiments were performed on a 1.6 GHz Pentium PC with 1 GB of main memory running Windows XP.

We first implemented our proposed method to find the KAS of real life datasets, and then compared the performance of outlier detection on different attribute sets. The results indicate the efficiency of mining on the KAS, whilst retaining almost the same ability to produce the outlier partition. We also evaluate the performance of different strategies for finding the KAS on synthetic datasets in Section 4.2.

4.1. Real life dataset

Experiment 1. We constructed a dataset for analysis by using the technical statistics of the NBA 2008-2009 season obtained from "http://nba.sports.sina.com.cn/", which consists of 381 instances. Since the dataset contains only numeric attribute values, we ensure that all the columns are given equal weight by transforming each value a in a column to (a − ā)/σa, where ā is the average value of the column and σa its standard deviation. We picked the more offensive statistics, containing the player's name, the points per game (PPG), the total points (PTS), the field goal percentage (FG%), the free throw percentage (FT%), the average playing minutes per game (MIN), and the number of games (G). Based on these statistics we derived a dataset with the 5-dimensional attribute set {PPG, FG%, FT%, MIN, G} by eliminating redundant attributes such as "name" and "PTS" (PTS = PPG × G).
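This column standardization can be rendered in a couple of lines; the pandas call and the file name below are hypothetical, and the column names follow Table 3.

```python
import pandas as pd

df = pd.read_csv("nba_2008_2009.csv")                      # hypothetical file of the season statistics
cols = ["PPG", "FG%", "FT%", "MIN", "G"]
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()   # (a - mean) / std, per column
```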

We ran the HilOut algorithm on this attribute set to find the top 10 outliers; the results are shown in Table 3. Note that the parameter K is set to 5 in the algorithm, and the average execution time is 22 s. We then searched for the KAS of the dataset. When running the Jump_FindOAS algorithm, the lattice level k is set to 3, and the threshold T is set to twice the average outlying degree calculated on the full attribute set [23]. Due to lack of space, we do not list all the outlying attribute subsets obtained by Jump_FindOAS. Finally, we identified the outlying attribute subset {PPG, FG%, FT%, G} as the KAS, which is a 0.76-approximation outlying reduction of the full 5-dimensional attribute set.

Table 3
The identified top 10 outliers on full attribute set {PPG, FG%, FT%, MIN, G}.

Names             PPG    FG%    FT%    MIN    G
Kobe Bryant       26.8   46.7   85.6   36.1   82
Dirk Nowitzki     25.9   47.9   89.0   37.7   81
Danny Granger     25.8   44.7   87.8   36.2   67
Chris Paul        22.8   50.3   86.8   38.5   78
Chris Bosh        22.7   48.7   81.7   38.0   77
Antawn Jamison    22.2   46.8   75.4   38.2   81
Tony Parker       22.0   50.6   78.2   34.1   72
Joe Johnson       21.4   43.7   82.6   39.5   79
Devin Harris      21.3   43.8   82.0   36.1   69
David West        21.0   47.2   88.4   39.2   76

Maximum           29.8   65.6   95.4   39.9   82
Median            10.9   46.3   67.5   29.1   68
Minimum            2.1   32.9   35.3   12.2    2
Mean              12.1   46.2   77.1   27.9   66.6

Table 4
The identified top 10 outliers on KAS {PPG, FG%, FT%, G}.

Names             PPG    FG%    FT%    G
Kobe Bryant       26.8   46.7   85.6   82
Dirk Nowitzki     25.9   47.9   89.0   81
Chris Paul        22.8   50.3   86.8   78
Danny Granger     25.8   44.7   87.8   67
Chris Bosh        22.7   48.7   81.7   77
Vince Carter      20.8   43.7   81.7   80
Tony Parker       22.0   50.6   78.2   72
Devin Harris      21.3   43.8   82.0   69
David West        21.0   47.2   88.4   76
Dwight Howard     20.6   57.2   59.4   79



The results of applying the HilOut algorithm on the KAS to find the top 10 outliers are shown in Table 4. Compared with the results in Table 3, only two outliers differ: Vince Carter and Dwight Howard replace Antawn Jamison and Joe Johnson. However, running the HilOut algorithm on the KAS took merely 19 s on average. Thus, detecting outliers on the KAS actually reduces execution time whilst maintaining almost the same ability to produce the outlier partition. As shown in the tables, the outliers are those players who played the most games and scored the most points per game.

Experiment 2. The other real dataset, obtained from a mobile communications corporation, is used to further validate our proposed method. Owing to business security considerations, the corporation name is not revealed in this paper. The dataset includes 200,000 records with 14 attributes, which were selected by the domain expert of the corporation. Table 5 lists the description of each attribute. Note that the Manhattan distance function is used on categorical attributes, while the Euclidean distance function is used on integer and real attributes, for calculating the out value of each point in the experiments. The parameter K used in the HilOut algorithm is set to 100, and the lattice level k is set to 7 in the Jump_FindOAS algorithm. We ran 30 repetitions and recorded the average execution time.

We first found 200 outliers (candidate calling fraud) on the full attribute set of this dataset and then found the outlying attribute subsets for them. Among the OASs, {ChargeLastMonth, ChargeCurrMonth, NameMark, PayType, CallCounts, CallDuration, RoamDuration, Balance} is identified as the KAS, which is a 0.81-approximation outlying reduction of the full 14-dimensional attribute set. The average execution time of running the HilOut algorithm on the KAS to find the top 200 outliers is only 488 s, while that on the full dimensional attribute set is 776 s. Clearly, applying the HilOut algorithm on the KAS is significantly faster than on the full attribute set. The relatively small sacrifice in detection precision from finding outliers on the KAS is offset by the large performance gain.

Table 5
Description of the full attribute set.

Attribute          Type          Description
OnlineDay          Integer       Days since becoming a subscriber
Balance            Real          Current balance of the subscriber
ChargeLastMonth    Real          Charge of the last month
ChargeCurrMonth    Real          Charge of this month
BrandID            Categorical   The brand selected by the subscriber
NameMark           Categorical   Whether the private information has been recorded
IDSameCounts       Integer       Counts of subscribers with the same identification
PayType            Categorical   Type of payment
CallCounts         Integer       Counts of calls
CallDuration       Real          Duration of calls
RoamCounts         Integer       Counts of roaming
RoamDuration       Real          Duration of roaming
GPRSDuration       Real          Duration of GPRS service
NewserCharge       Real          Charge of using new services
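The mixed distance used in this experiment (Manhattan on categorical attributes, Euclidean on numeric ones) could be sketched as follows; treating a categorical mismatch as a unit difference and combining the two parts by a plain sum are our reading of the setup, not details spelled out in the paper.

```python
import math

def mixed_distance(x, y, categorical):
    """Distance between two records: Manhattan over categorical attributes,
    Euclidean over numeric ones (one possible reading of the experimental setup).

    x, y        -- dicts mapping attribute name -> value.
    categorical -- set of attribute names treated as categorical.
    """
    manhattan = sum(0.0 if x[a] == y[a] else 1.0            # unit difference on mismatch
                    for a in x if a in categorical)
    euclidean = math.sqrt(sum((x[a] - y[a]) ** 2
                              for a in x if a not in categorical))
    return manhattan + euclidean
```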

4.2. Synthetic dataset

We constructed several synthetic datasets to study the scalability of our proposed method for finding the KAS. Each dataset consists of N objects with d attributes, where N ∈ {1 × 10³, 10 × 10³, 100 × 10³, 1000 × 10³, 5000 × 10³} and d ∈ {5, 10, 15, 20, 25}. Attribute values are obtained by a process similar to that described in [24]. Initially, 0.99·N values are generated according to a normal distribution N(0, 1) and sorted. Then, these values are grouped into 10 equally spaced bins and replaced with the center of the bin they belong to. Finally, 0.01·N outliers are generated as follows: 50% of the corresponding attribute values are set to the center of the most populated bins, and 50% are set to randomly chosen bin centers. We implemented both the brute-force method and our proposed method for KAS searching. Note that our proposed method includes three versions, denoted top-down search, bottom-up search and jump-lattice search respectively, which utilize different strategies to find the OASs for a given point. Figs. 1 and 2 show the execution times of the different searching methods and pruning strategies, respectively, averaged over 10 trials.

When using the brute-force searching strategy, we first enumerate all 2^d − 2 non-empty subsets of the full d-dimensional attribute set, then run the HilOut algorithm on each subset to obtain its outlying partition. Finally we calculate the OPS of each outlying partition relative to that obtained on the full attribute set and identify the KAS. Thus, the time complexity of the brute-force based method is O(2^d · d² · N). On the other hand, when using our proposed method, we first run the HilOut algorithm on the full attribute set to obtain the outlier set; let n_o denote its cardinality. Next, we run the Jump_FindOAS algorithm to find the OASs for these n_o outliers, and let n_1 be the number of OASs; this step takes O(n_o · 2^d) time. After that, we run the HilOut algorithm on each OAS to obtain its outlying partition, with which the KAS can be easily identified; this takes O(n_1 · d² · N) time. As a result, the total execution time of our proposed method is O(n_o · 2^d + n_1 · d² · N).

Fig. 1 shows, on a logarithmic scale, the execution times obtained by varying the size N of the dataset from 1 × 10³ to 5000 × 10³ for various values of d. Solid lines relate to our proposed method (the jump-lattice search version) for finding the KAS, and dashed lines to the brute-force based (B-F) method. The total execution time increases almost linearly, in accordance with the time complexity analyzed above. Note that brute-force search can work well for lower dimensional datasets. When d = 5, the number of OASs approaches that of all attribute subsets; in this case, our method for finding the KAS takes more CPU time because of the additional OAS searching. However, the efficiency of brute-force search degrades greatly as the dimensionality d increases, since it needs to process 2^d − 2 attribute subsets. When d > 10, brute-force search cannot terminate in a reasonable amount of time, and therefore we do not report the results for these cases in the figure. Obviously, our proposed method scales well as the dataset dimensionality increases.

[Fig. 1. Experimental results of different KAS searching methods: execution time (log2 (sec)) versus the size of dataset (×10³) for our method (d = 5, 10, 15, 20, 25) and the B-F method (d = 5, 10).]

[Fig. 2. Experimental results of different pruning strategies: execution time (sec) versus the dimensionality of dataset for jump-lattice, top-down and bottom-up search.]



Fig. 2 shows the execution times obtained by varying the dimensionality d of the datasets from 5 to 25 for N = 100 × 10³. We do not report the curves for the other dataset sizes since they are analogous. The jump-lattice search achieves the highest performance among the three versions of our proposed method. The top-down search only employs a downward pruning strategy, i.e., once a point is definitely judged to be an outlier on some attribute set S, all supersets of S are OASs and need not be processed any more. The bottom-up search only uses an upward pruning strategy, i.e., once a point is not an outlier on some attribute set S, no subset of S can be an OAS and such subsets need not be processed any more. The jump-lattice search, however, utilizes the Jump_FindOAS algorithm to select the starting level in the lattice, which achieves a hybrid of upward and downward searching. Therefore, as the dimensionality increases, the execution times of the top-down and bottom-up searches increase faster than that of the jump-lattice search, which testifies to the efficiency of the pruning strategy in the Jump_FindOAS algorithm.

5. Conclusions

In this paper, we applied rough set theory to construct an outlier detection and analysis system. By defining the concepts of outlying reduction and the KAS, outlying reduction of a dataset can be realized, since running conventional outlier detection algorithms on the KAS produces an outlying partition approximating that produced on the original full dimensional attribute set. Moreover, we presented an efficient pruning strategy for finding the KAS. Experimental results demonstrate that our proposed method is more efficient and more effective than previous methods, and hence it has the potential to be used for outlier detection in practical applications. Future research efforts will include extensive experiments on different high dimensional real life datasets and more intelligent selection of the lattice level k for the Jump_FindOAS algorithm.

Acknowledgments

We are grateful to the referees for their valuable comments and suggestions. This work is supported by the National Natural Science Foundation of China (No. 61073058) and the Chongqing University Postgraduates' Science and Innovation Fund (No. 200811A1B0080297).

References

[1] Shuyan Chen, Wei Wang, Henk van Zuylen, A comparison of outlier detection algorithms for ITS data, Expert Systems with Applications 37 (2) (2010) 1169-1178.
[2] Paul L. Canner, Y.B. Huang, Curtis L. Meinert, On the detection of outlier clinics in medical and surgical trials, Controlled Clinical Trials 2 (3) (1981) 231-240.
[3] François Jacquenet, Christine Largeron, Discovering unexpected documents in corpora, Knowledge-Based Systems 22 (6) (2009) 421-429.
[4] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, E. Vázquez, Anomaly-based network intrusion detection: techniques, systems and challenges, Computers and Security 28 (1-2) (2009) 18-28.
[5] S. Mittnik, S.T. Rachev, G. Samorodnitsky, The distribution of test statistics for outlier detection in heavy-tailed samples, Mathematical and Computer Modelling 34 (9-11) (2001) 1171-1183.
[6] H.J. Escalante, A comparison of outlier detection algorithms for machine learning, Programming and Computer Software (2005) 228-237.
[7] E. Knorr, R. Ng, Algorithms for mining distance-based outliers in large datasets, in: Proceedings of the 24th Conference on VLDB, 1998, pp. 392-403.
[8] M.M. Breunig, H.P. Kriegel, R.T. Ng, LOF: identifying density-based local outliers, in: Proceedings of ACM Conference, 2000, pp. 93-104.
[9] Zhenxia Xue, Youlin Shang, Aifen Feng, Semi-supervised outlier detection based on fuzzy rough C-means clustering, Mathematics and Computers in Simulation 80 (9) (2010) 1911-1921.
[10] E. Knorr, R. Ng, Finding intensional knowledge of distance-based outliers, in: Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999, pp. 211-222.
[11] Zhixiang Chen, Jian Tang, Modeling and efficient mining of intentional knowledge of outliers, in: Proceedings of the Seventh International Database Engineering and Applications Symposium, 2003, pp. 1-10.
[12] Sanghamitra Bandyopadhyay, Santanu Santra, A genetic approach for efficient outlier detection in projected space, Pattern Recognition 41 (4) (2008) 1338-1349.
[13] A. Ghoting, S. Parthasarathy, M.E. Otey, Fast mining of distance-based outliers in high-dimensional datasets, Data Mining and Knowledge Discovery 16 (3) (2008) 349-364.
[14] Mao Ye, Xue Li, Maria E. Orlowska, Projected outlier detection in high-dimensional mixed-attributes data set, Expert Systems with Applications 36 (3) (2009) 7104-7113.
[15] Tianrui Li, Da Ruan, Geert Wets, Jing Song, Yang Xu, A rough sets based characteristic relation approach for dynamic attribute generalization in data mining, Knowledge-Based Systems 20 (5) (2007) 485-494.
[16] Zuqiang Meng, Zhongzhi Shi, A fast approach to attribute reduction in incomplete decision systems with tolerance relation-based rough sets, Information Sciences 179 (16) (2009) 2774-2793.
[17] M. Inuiguchi, Y. Yoshioka, Y. Kusunoki, Variable-precision dominance-based rough set approach and attribute reduction, International Journal of Approximate Reasoning 50 (8) (2009) 1199-1214.
[18] Yiyu Yao, Yan Zhao, Attribute reduction in decision-theoretic rough set models, Information Sciences 178 (2008) 3356-3373.
[19] Q. Shen, R. Jensen, Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring, Pattern Recognition 37 (2004) 1351-1363.
[20] Q.H. Hu, D.R. Yu, Z.X. Xie, Information-preserving hybrid data reduction based on fuzzy-rough techniques, Pattern Recognition Letters 27 (2006) 414-423.
[21] E.C.C. Tsang, D.G. Chen, D.S. Yeung, X.Z. Wang, J.W.T. Lee, Attributes reduction using fuzzy rough sets, IEEE Transactions on Fuzzy Systems 16 (5) (2008) 1130-1141.
[22] Fabrizio Angiulli, Clara Pizzuti, Outlier mining in large high-dimensional data sets, IEEE Transactions on Knowledge and Data Engineering 17 (2) (2005) 203-215.
[23] Ji Zhang, Hai Wang, Detecting outlying subspaces for high dimensional data: the new task, algorithms, and performance, Knowledge and Information Systems 10 (3) (2006) 333-355.
[24] Fabrizio Angiulli, Luigi Palopoli, Detecting outlying properties of exceptional objects, ACM Transactions on Database Systems 34 (1) (2009) 1-62.