A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams

Ruoyi Jiang, Hongliang Fei, and Jun Huan

Abstract—Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal component analysis (PCA) has been extensively applied to detecting anomalies in network data streams. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, or anomaly localization. In this paper, we propose novel sparse PCA methods to perform anomaly detection and localization for network data streams. Our key observation is that we can localize anomalies by identifying a sparse low-dimensional space that captures the abnormal events in data streams. To better capture the sources of anomalies, we incorporate the structure information of the network stream data in our anomaly localization framework. Furthermore, we extend our joint sparse PCA framework with a multidimensional Karhunen-Loève Expansion that considers both the spatial and temporal domains of data streams to stabilize localization performance. We have performed comprehensive experimental studies of the proposed methods and have compared our methods with the state of the art using three real-world data sets from different application domains. Our experimental studies demonstrate the utility of the proposed methods.

Index Terms—Anomaly detection, anomaly localization, PCA, network data stream, joint sparsity, optimization


1 INTRODUCTION

DETERMINING anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest in the data mining community [1], [2], [3], [4]. Applications of this work can be found in network traffic data [4], sensor network streams [1], social networks [3], cloud computing [5], and finance networks [2], among others.

The common limitation of the aforementioned methods is that they are incapable of determining the sources that contribute most to the observed anomalies. With fast-accumulating stream data, an outstanding data analysis issue is anomaly localization, where we aim to discover the specific sources that contribute most to the observed anomalies. Anomaly localization in network data streams is critical to many applications, including monitoring the state of buildings [6] or locating the sites of flooding and forest fires [7]. In the stock market, pinpointing the change points in a set of stock price time series is critical for making intelligent trading decisions [8]. For network security, localizing the sources of the most serious threats in computer networks helps ensure network security [9].

Principal component analysis (PCA) is arguably the most widely applied unsupervised anomaly detection technique for network data streams [9], [10], [11]. However, a fundamental problem of PCA, as claimed in [12], is that current PCA-based anomaly detection methods cannot be applied to anomaly localization. We believe that the major obstacle to extending PCA techniques to anomaly localization lies in the mixed nature of the abnormal space. In particular, the projection of the data streams in the abnormal subspace is a combination of data from all the sources, which makes any localization difficult. Our key observation is that if we manage to identify a low-dimensional approximation of the abnormal subspace using a subset of sources, we "localize" the abnormal sources. The starting point of our investigation is hence the recently studied sparse PCA framework [13], where PCA is formalized as a sparse regression problem in which each principal component (PC) is a sparse linear combination of the original sources. However, sparse PCA does not fit directly into our problems in that it enforces sparsity indiscriminately in both the normal and abnormal subspaces. In this paper, we explore two directions in improving sparse PCA for anomaly detection and localization.

First, we develop a new regularization scheme to simultaneously calculate the normal subspace and the sparse abnormal subspace. In the normal subspace, we do not add any regularization but use the same normal subspace as ordinary PCA for anomaly detection. In the abnormal subspace, we enforce that different PCs share the same sparse structure; hence, anomaly localization becomes possible. We call this method joint sparse PCA (JSPCA).

Second, we observe that abnormal streams are usually correlated with each other. For example, in the stock market, index changes in different countries are often correlated. To incorporate stream correlation in anomaly localization, we design a graph-guided sparse PCA (GJSPCA) technique. Our experimental studies demonstrate the effectiveness of the proposed approaches on three real-world data sets from financial markets, wireless sensor networks, and machinery operating condition studies.

A major drawback of PCA-based anomaly detection methods is that their performance is very sensitive to the number of PCs representing the normal subspace. To overcome this problem, we introduce a multidimensional Karhunen-Loève Expansion (KLE) as an extension of PCA (one-dimensional KLE) to consider the spatial correlation among different sources and the temporal correlation among different time stamps [14]. The corresponding methods are named joint sparse KLE (JSKLE) and graph-guided sparse KLE (GJSKLE), respectively. The experimental results demonstrate that JSKLE and GJSKLE effectively stabilize localization performance when changing the number of PCs representing the normal subspace.

As an example of anomaly detection and anomaly localization in network data streams, we show the normalized stock index streams of eight countries over a period of three months in Fig. 1. We notice an anomaly in the marked window between time stamps 25 and 42. In that window, sources 1, 4, 5, 6, and 8 (denoted by dotted lines) are normal sources. Sources 2, 3, and 7 (denoted by solid lines) are abnormal because they follow a different trend from that of the other sources. In the marked window, the three abnormal sources clearly share the same increasing trend while the rest share a decreasing trend.

The remainder of the paper is organized as follows: In Section 2, we present related work on anomaly localization. In Section 3, we discuss the challenges of applying PCA to anomaly localization. In Section 4, we introduce the formulations of JSPCA and GJSPCA, their extended versions JSKLE and GJSKLE, and the related optimization algorithms. We present our experimental study in Section 5 and conclude in Section 6.

2 RELATED WORK

Existing work on anomaly localization from network data streams can be roughly divided into two categories: those at the source level and those at the network level. The source-level anomaly localization approaches embed a detection algorithm at each stream source, resulting in a fully distributed anomaly detection system [5], [15], [16]. The major problem of these approaches is that source-level anomalies may not be indicative of network-level anomalies because they ignore the rest of the network [10].

To improve source-level anomaly localization methods, several algorithms have recently been proposed to localize anomalies at the network level. Brauckhoff et al. [17] applied association rule mining to network traffic data to extract abnormal flows from the large set of candidate flows. Their work is based on the assumption that anomalies often result in many flows with similar characteristics. Such an assumption holds in network traffic data streams but may not be true in other data streams such as finance data. Keogh et al. [18] proposed a nearest neighbor-based approach to identify abnormal subsequences within univariate time series data using sliding windows. They extracted all possible subsequences and located the one with the largest Euclidean distance from its closest nonoverlapping subsequence. However, the method only works for univariate time series generated from a single source. In addition, if the data are distributed on a non-Euclidean manifold, two subsequences may appear deceptively close as measured by their Euclidean distance [19]. Lung-Yut-Fong et al. [20] developed a nonparametric change-point test based on U-statistics to detect and localize change points in high-dimensional network traffic data. The limitation is that the method is specifically designed for the denial-of-service (DOS) attack in communication networks and cannot easily be generalized to other types of network data streams.

Closely related to our work, Ide et al. [21], [22] measured the change of the neighborhood graph for each source to perform anomaly localization and developed a method called stochastic nearest neighbor (SNN). Hirose et al. [23] designed an algorithm named eigen equation compression (EEC) to localize anomalies by measuring the deviation of the covariance matrix of neighborhood sources. In these two studies, one has to build a neighborhood graph for each source for each time interval, which is unlikely to scale to a large number of sources. Another closely related work to ours is the stream projected outlier detector (SPOT) [24], in which a subspace is learned with a genetic algorithm from a potentially huge number of subsets of sources, and outliers in the temporal domain are detected in the reduced subspace. The limitation of their work is that they used a genetic algorithm to select a subset of sources; the computational complexity of finding the optimal set grows exponentially with the number of features, and there is no guarantee of reaching the optimal subset of sources.


Fig. 1. Illustration of time-evolving stock indices data. Indexes 2, 3, and 7 in solid lines are abnormal.

Our work formalizes anomaly localization via a sparse regularization framework and solves it efficiently with a convex optimization technique. Furthermore, the anomalies may not be observable in the original space. Instead of working in the original space, we localize anomalies in the abnormal subspace, in which the anomalous behaviors of the data are significant. Compared with [24], Yang and Zhou [25] learned the subspace with locally linear embedding and PCA and then detected outliers in the reduced space. However, there is no mapping between the newly learned space and the original data space; therefore, it is not applicable to anomaly localization. Cao et al. [26] partitioned data streams within a window into clusters based on their similarity, and outliers were detected in each individual cluster. A stream is an outlier if the number of streams lying within a predefined distance of it is smaller than k. However, if normal instances do not have enough close neighbors, or if abnormal instances have enough close neighbors, the technique may have high false-positive and false-negative rates.

We have investigated the anomaly localization problem in our previous publications [27], [28]. In [27], we proposed a two-step approach where we first compute the normal subspace from ordinary PCA and then derive a sparse abnormal subspace on the residual data subtracted from the original data. The critical limitation of the two-stage method is that after removing the abnormal subspace, the resulting data are a linear combination of all the sources, so it is very difficult to identify which source contributes most to the observed anomaly. In [28], we designed a single-step approach to jointly learn the normal subspace for anomaly detection and the sparse abnormal subspace for anomaly localization. In this paper, we substantially extend [28] by generalizing our proposed JSPCA framework to KLE. KLE considers both the temporal and spatial correlation of the data, and it has been shown to reduce the sensitivity to the choice of the number of PCs [14]. We also extend the experimental study by adding one more data set. Our experimental studies demonstrate the effectiveness of the proposed methods over the state of the art.

PCA has been extensively applied to anomaly detection in network data streams [9], [10], [11]. For example, Lakhina et al. applied PCA to detect network traffic anomalies. Huang et al. [10] developed a distributed PCA anomaly detector by equipping each source with a local filter. Brauckhoff et al. [14] consider both the temporal and spatial correlation of streamed data by extending PCA to KLE and solve the sensitivity problem of PCA identified by Ringberg et al. [12]. The major limitation of these works, as pointed out in [12], is that PCA cannot be applied to anomaly localization.

3 PRELIMINARIES

We introduce the notations used in this paper and background information regarding PCA and sparse PCA.

3.1 Notation

We use bold uppercase letters such as $\mathbf{X}$ to denote a matrix and bold lowercase letters such as $\mathbf{x}$ to denote a vector. Greek letters such as $\lambda_1, \lambda_2$ are Lagrangian multipliers. $\langle \mathbf{A}, \mathbf{B} \rangle$ represents the matrix inner product, defined as $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^T \mathbf{B})$, where $\mathrm{tr}$ denotes the matrix trace. Given a matrix $\mathbf{X}$, we use $x_{ij}$ to denote the entry of $\mathbf{X}$ at the $i$th row and $j$th column. We use $x_i$ to represent the $i$th entry of a vector $\mathbf{x}$. $\|\mathbf{x}\|_p = (\sum_{i=1}^{n} |x_i|^p)^{1/p}$ denotes the $l_p$ norm of the vector $\mathbf{x} \in \mathbb{R}^n$. Given a matrix $\mathbf{X} = [\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_n]^T \in \mathbb{R}^{n \times p}$, $\|\mathbf{X}\|_{1,q} = \sum_{i=1}^{n} \|\tilde{\mathbf{x}}_i\|_q$ is the $l_1/l_q$ norm of the matrix $\mathbf{X}$, where $\tilde{\mathbf{x}}_i$ is the $i$th row of $\mathbf{X}$ in column vector form. Unless stated otherwise, all vectors are column vectors. In Table 1, we summarize the notation used in this paper.

TABLE 1. Notations in the Paper

3.2 Network Data Streams

Our work focuses on data streams that are collected from multiple sources. We refer to the set of data stream sources collectively as a network because we often have information regarding the structure of the sources.

Following [29], network data streams are multivariate time series $S$ from $p$ sources, where $S = \{S_i(t)\}$ and $i \in [1, p]$; $p$ is the dimensionality of the network data streams. Each function $S_i : \mathbb{R} \rightarrow \mathbb{R}$ is a source. A source is also called a "node" in the communication network community and a "feature" in the data mining and machine learning community.

Typically, we focus on time series sampled at (synchronized) discrete time stamps $\{t_1, t_2, \ldots, t_n\}$. In such cases, the network data streams are represented as a matrix $\mathbf{X} = (x_{i,j})$, where $i \in [1, n]$, $j \in [1, p]$, and $x_{i,j}$ is the reading of stream source $j$ at the time sample $t_i$.

3.3 Applying PCA for Anomaly Localization

Our goal is to explore a PCA-based method for performing anomaly detection and localization simultaneously. PCA-based anomaly detection techniques have been widely investigated in [9], [10], [11]. In applying PCA to anomaly detection, one first constructs the normal subspace spanned by $\mathbf{V}^{(1)}$ (the top $k$ PCs) and the abnormal subspace spanned by $\mathbf{V}^{(2)}$ (the remaining $p-k$ PCs), then projects the original data on $\mathbf{V}^{(1)}$ and $\mathbf{V}^{(2)}$ as

$$\mathbf{X} = \mathbf{X}\mathbf{V}^{(1)}\mathbf{V}^{(1)T} + \mathbf{X}\mathbf{V}^{(2)}\mathbf{V}^{(2)T} = \mathbf{X}_n + \mathbf{X}_a, \qquad (1)$$

where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the data matrix with $n$ time stamps from $p$ data sources, and $\mathbf{X}_n$ and $\mathbf{X}_a$ are the projections of $\mathbf{X}$ on the normal subspace and the abnormal subspace, respectively.

The underlying assumption of PCA-based anomaly detection is that $\mathbf{X}_n$ corresponds to the regular trends and $\mathbf{X}_a$ captures the abnormal behaviors in the data streams. By performing statistical testing on the squared prediction error $SPE = \mathrm{tr}(\mathbf{X}_a^T \mathbf{X}_a)$, one determines whether an anomaly happens [9], [10]. The larger the $SPE$ is, the more likely an anomaly exists.
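As a concrete illustration of this detection scheme, the following is a minimal NumPy sketch; the function and variable names are ours, not the paper's, and the detection threshold on $SPE$ is left to the user:

```python
import numpy as np

def pca_spe(X, k):
    """Split a column-centered data matrix X (n x p) into its
    projections on the normal subspace (top-k PCs) and the abnormal
    subspace (remaining p-k PCs), and return the squared prediction
    error SPE = tr(Xa^T Xa) used as the detection statistic."""
    # The PCs are the right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V1 = Vt[:k].T                # normal subspace V(1)
    V2 = Vt[k:].T                # abnormal subspace V(2)
    Xn = X @ V1 @ V1.T           # projection on normal subspace
    Xa = X @ V2 @ V2.T           # projection on abnormal subspace
    spe = np.trace(Xa.T @ Xa)    # equivalently, ||Xa||_F^2
    return Xn, Xa, spe
```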

Although PCA has been widely studied for anomaly detection, it is not applicable to anomaly localization. The fundamental problem, as claimed in [12], lies in the fact that there is no direct mapping between the two matrices $\mathbf{V}^{(1)}$ and $\mathbf{V}^{(2)}$ and the data sources. Specifically, let $\mathbf{V}^{(2)}$ be the last $p-k$ PCs that span the abnormal subspace; $\mathbf{X}_a$ is essentially an aggregated operation that performs a linear combination of all the data sources, as follows:

$$\mathbf{X}_a = \mathbf{X}\mathbf{V}^{(2)}\mathbf{V}^{(2)T} = \left[\sum_{j=1}^{p} \mathbf{x}_j \tilde{\mathbf{v}}_j^T \tilde{\mathbf{v}}_i\right]_{i=1,\ldots,p}, \qquad (2)$$

where $\mathbf{x}_j$ is the data from the $j$th source and $\tilde{\mathbf{v}}_j$ is the transpose of the $j$th row of $\mathbf{V}^{(2)}$. Considering the $i$th column of $\mathbf{X}_a$, $\sum_{j=1}^{p} \mathbf{x}_j \tilde{\mathbf{v}}_j^T \tilde{\mathbf{v}}_i$, there is no correspondence between the original $i$th column of $\mathbf{X}$ and the $i$th column of $\mathbf{X}_a$. Such an aggregation makes it difficult for PCA to identify the particular sources that are responsible for the observed anomalies.

Although all the previous works claim that PCA-based anomaly detection methods cannot do localization, we solve the problem of anomaly localization in a reverse way. Instead of locating the anomalies directly, we filter normal sources to identify anomalies by employing the fact that the normal subspace captures the general trend of the data and normal sources have little or no projection on the abnormal subspace. The following provides a sufficient condition for data sources to have no projection on the abnormal subspace.

Suppose $\mathcal{I} = \{i \mid \tilde{\mathbf{v}}_i = \mathbf{0}\}$ is the set that contains all the indices of the zero rows of $\mathbf{V}^{(2)}$; then, $\forall i \in \mathcal{I}$, $\mathbf{x}_i$ has no projection on the abnormal subspace. In other words, these sources make no contribution to the abnormal behavior. Let $\mathbf{V}^{(2)} = [\tilde{\mathbf{v}}_1, \tilde{\mathbf{v}}_2, \ldots, \tilde{\mathbf{v}}_p]^T$, consider the squared prediction error $SPE = \mathrm{tr}(\mathbf{X}_a^T \mathbf{X}_a)$, and plug in (2):

$$\mathrm{tr}\left(\mathbf{X}_a^T \mathbf{X}_a\right) = \mathrm{tr}\left(\mathbf{X}_a \mathbf{X}_a^T\right) = \mathrm{tr}\left(\mathbf{V}^{(2)T} \mathbf{X}^T \mathbf{X} \mathbf{V}^{(2)}\right) = \mathrm{tr}\left(\left(\sum_{j=1}^{p} \mathbf{x}_j \tilde{\mathbf{v}}_j^T\right)^T \left(\sum_{j=1}^{p} \mathbf{x}_j \tilde{\mathbf{v}}_j^T\right)\right). \qquad (3)$$

From (3), it is clear that $\forall i \in \mathcal{I}$, the data $\mathbf{x}_i$ from source $i$ have no projection on the abnormal subspace and hence can be excluded from the statistics used for anomaly detection. We call such a pattern, with entire rows of zeros, "joint sparsity."
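The sufficient condition above translates directly into a filtering rule. A minimal sketch, under the assumption that $\mathbf{V}^{(2)}$ is available as a NumPy array and with a numerical tolerance of our own choosing:

```python
import numpy as np

def sources_without_projection(V2, tol=1e-10):
    """Return the index set I = {i : v~_i = 0}: sources whose row of
    the abnormal subspace V2 (p x (p-k)) is numerically zero and which
    therefore contribute nothing to the SPE statistic in (3)."""
    row_norms = np.linalg.norm(V2, axis=1)
    return np.where(row_norms <= tol)[0]
```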

Unfortunately, ordinary PCA does not afford sparsity in the PCs. Sparse PCA is a recently developed algorithm in which each PC is a sparse linear combination of the original sources [13]. However, existing sparse PCA methods have no guarantee that different PCs share the same sparse representation and hence no guarantee of joint sparsity. To illustrate the point, we plotted the entries of each PC for ordinary PCA (Fig. 2a) and for sparse PCA (Fig. 2b) for the stock data set shown in Fig. 1. White blocks indicate zero entries, and a darker color indicates a larger absolute loading. Sparse PCA produces sparse entries, but that alone does not indicate the sources that contribute most to the observed anomaly.

Below, we present our extensions of PCA that enable us to reduce dimensionality in the abnormal subspace.

4 METHODOLOGY

In this section, we propose a novel regularization framework called JSPCA to enforce joint sparsity in the PCs of the abnormal space while preserving the PCs in the normal subspace, so that we can perform simultaneous anomaly detection and anomaly localization. Starting from JSPCA, we propose two extensions. In the first extension, we consider the network topology in the original data, incorporate this topology into JSPCA, and develop an approach called GJSPCA. In the second, we extend JSPCA and GJSPCA to JSKLE and GJSKLE, which take the temporal correlation into account in addition to the spatial correlation considered in JSPCA and GJSPCA.

Before formally providing the detailed methods, we give an overall workflow of our method, as shown in Fig. 3 for JSPCA. Note that the remaining methods share the same flow, and we only show JSPCA for simplicity.

Given several data streams, the first step is to calculate a set of PCs with an ordinary normal subspace and an abnormal subspace with joint sparsity. Our example, as shown in Fig. 3, is a network with three sources, so a $3 \times 3$ PC matrix is calculated by JSPCA. The first PC, with nonzero entries, represents the normal subspace. The difference between the original data and its projection on the normal subspace is used for anomaly detection. The remaining two PCs represent the abnormal subspace, with the first two rows being zero but the last row being nonzero. Based on the abnormal subspace, the second step is to calculate the abnormal scores. A larger score indicates a larger possibility that the corresponding source is abnormal. Therefore, we complete the tasks of anomaly detection and localization simultaneously.

4.1 Joint Sparse PCA

Our objective here is to derive a set of PCs $\mathbf{V} = [\mathbf{V}^{(1)}, \mathbf{V}^{(3)}]$ such that $\mathbf{V}^{(1)}$ is the normal subspace and $\mathbf{V}^{(3)}$ is a sparse approximation of the abnormal subspace with joint sparsity. The following regularization framework guarantees the two properties simultaneously:

$$\min_{\mathbf{V}^{(1)}, \mathbf{V}^{(3)}} \frac{1}{2}\left\|\mathbf{X} - \mathbf{X}\mathbf{V}^{(1)}\mathbf{V}^{(1)T} - \mathbf{X}\mathbf{V}^{(3)}\mathbf{V}^{(3)T}\right\|_F^2 + \lambda \|\mathbf{V}^{(3)}\|_{1,2} \quad \text{s.t. } \mathbf{V}^T\mathbf{V} = \mathbf{I}_{p \times p}. \qquad (4)$$


Fig. 2. Comparing PCA and sparse PCA.


Using one variable $\mathbf{V}$, we simplify (4) as

$$\min_{\mathbf{V}} \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 + \lambda \|\mathbf{W} \circ \mathbf{V}\|_{1,2} \quad \text{s.t. } \mathbf{V}^T\mathbf{V} = \mathbf{I}_{p \times p}. \qquad (5)$$

Here, $\circ$ is the Hadamard product operator (entrywise product), $\lambda$ is a scalar controlling the balance between sparsity and fitness, and $\mathbf{W} = [\tilde{\mathbf{w}}_1, \ldots, \tilde{\mathbf{w}}_p]^T$ with $\tilde{\mathbf{w}}_j$ defined below:

$$\tilde{\mathbf{w}}_j = [\underbrace{0, \ldots, 0}_{k}, \underbrace{1, \ldots, 1}_{p-k}]^T, \quad j = 1, \ldots, p. \qquad (6)$$

The regularization term $\|\mathbf{W} \circ \mathbf{V}\|_{1,2}$ is called the group lasso penalty [30], in which the $L_2$ norm is used to aggregate the coefficients within a group and the $L_1$ norm is applied to achieve sparsity among groups. In our framework, each group corresponds to a row of the abnormal subspace matrix $\mathbf{V}^{(3)}$, and the $L_1/L_2$ penalty enforces joint sparsity for each source across the abnormal subspace.
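For concreteness, the penalty can be evaluated in a few lines. This sketch is ours, assuming $\mathbf{V}$ is the full $p \times p$ loading matrix and the mask of (6) zeroes the first $k$ columns:

```python
import numpy as np

def group_lasso_penalty(V, k):
    """Compute ||W o V||_{1,2}: the mask W zeroes the first k columns
    (the normal subspace), the L2 norm aggregates each remaining row
    (one group per source), and the L1 norm (the sum over rows)
    induces row-wise, i.e., joint, sparsity."""
    masked = V.copy()
    masked[:, :k] = 0.0                          # W o V
    return float(np.sum(np.linalg.norm(masked, axis=1)))
```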

The major disadvantage of (5) is that it poses a difficult optimization problem, since the first term (the trace norm) is concave and the second term (the $L_1/L_2$ norm) is convex. A similar situation was first investigated in sparse PCA [13] with the elastic net penalty [31], in which two variables and an alternating optimization algorithm were introduced. Here, we share the same least squares loss term but adopt a different regularization term. Motivated by Zou et al. [13], we consider a relaxed version:

$$\min_{\mathbf{A}, \mathbf{B}} \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T\|_F^2 + \lambda \|\mathbf{W} \circ \mathbf{B}\|_{1,2} \quad \text{s.t. } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{p \times p}, \qquad (7)$$

where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{p \times p}$. The advantage of the new formalization is twofold. First, (7) is convex in each subproblem when fixing one variable and optimizing the other; as asserted in [13], disregarding the lasso penalty, the solution of (7) corresponds to exact PCA. Second, we only impose the penalty on the remaining $p-k$ PCs and preserve the top $k$ PCs representing the normal subspace from ordinary PCA. Such a formalization guarantees that we have the ordinary normal subspace for anomaly detection and the sparse abnormal subspace for anomaly localization. Note that Jenatton et al. [32] recently proposed a structured sparse PCA, which is similar to our formalization, but their structure is defined on groups and cannot be directly applied to anomaly localization.

Fig. 4a demonstrates the PCs generated from JSPCA for the stock market data shown in Fig. 1. Joint sparsity across the PCs in the abnormal subspace pinpoints the abnormal sources 2, 3, and 7 by filtering out the normal sources 1, 4, 5, 6, and 8. Such a result matches the ground truth in Fig. 1.

4.2 Anomaly Scoring

To quantitatively measure the degree of anomaly for each source, we define the anomaly score and the normalized anomaly score as follows:

Definition 4.1. Given $p$ sources and the approximated abnormal subspace represented by $\mathbf{V}^{(3)}$ from JSPCA, the anomaly score for source $i$, $i = 1, \ldots, p$, is defined on the normalized $L_1$ norm of the $i$th row $\tilde{\mathbf{v}}_i$ of $\mathbf{V}^{(3)}$:

$$\phi_i = \frac{\sum_{j=k+1}^{p} |\tilde{v}_{ij}|}{p-k}, \qquad (8)$$

where $\tilde{v}_{ij}$ is the $j$th entry of $\tilde{\mathbf{v}}_i$.

For each input data matrix $\mathbf{X}$, (8) results in a vector $\boldsymbol{\phi} = [\phi_1, \ldots, \phi_p]^T$ of anomaly scores. The normalized score for source $i$ is defined as $\tilde{\phi}_i = \phi_i / \max\{\phi_i, i = 1, \ldots, p\}$.
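A direct transcription of Definition 4.1, assuming the abnormal subspace $\mathbf{V}^{(3)}$ is available as a $p \times (p-k)$ NumPy array (the symbol $\phi$ and the function name are ours):

```python
import numpy as np

def anomaly_scores(V3):
    """Anomaly scores per Definition 4.1: the L1 norm of each row of
    the sparse abnormal subspace V3 (p x (p-k)), divided by p-k, then
    normalized by the maximum score across sources."""
    p_minus_k = V3.shape[1]
    phi = np.abs(V3).sum(axis=1) / p_minus_k   # raw scores, eq. (8)
    m = phi.max()
    return phi / m if m > 0 else phi           # normalized scores
```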


Fig. 4. Comparing JSPCA and GJSPCA.

Fig. 3. Demonstration of the system architecture of JSPCA on three network data streams with one anomaly (solid line) and two normal streams (dotted lines).


A higher score indicates a higher probability that a source is abnormal. We show the anomaly scores obtained from PCA, SPCA, and JSPCA for the stock data in Fig. 5. JSPCA succeeds in localizing the three anomalies by assigning nonzero scores to the anomalous sources and zero to the normal ones, while PCA and SPCA both fail. With abnormal scores, we can rank abnormality or generate an ROC curve to evaluate localization performance. Below, we give a skeleton of the algorithm for computing abnormal scores; the detailed optimization algorithm is introduced later.

Algorithm 1. Anomaly Localization with JSPCA
1: Input: $\mathbf{X}$, $k$, and $\lambda_1$.
2: Output: anomaly scores.
3: Calculate a set of PCs $\mathbf{V} = [\mathbf{V}^{(1)}, \mathbf{V}^{(3)}]$ (matrix $\mathbf{B}$ in equation (7)), where $\mathbf{V}^{(1)}$ is the normal subspace and $\mathbf{V}^{(3)}$ is the abnormal subspace with joint sparsity;
4: Compute the abnormal score for each source by Definition 4.1;

4.3 Graph-Guided JSPCA

In many real-world applications, the sources generating the data streams may have structure, which may or may not change with time. As in the example of Fig. 1, the stock indices from sources 2, 3, and 7 are closely correlated over a long time interval. If sources 2 and 3 are anomalies, as demonstrated in Fig. 4a, it is very likely that source 7 is an anomaly as well. This observation motivates us to develop a regularization framework that enforces smoothness across features. In particular, we model the structure among sources with an undirected graph, where each node represents a source and each edge encodes a possible structural relationship. We hypothesize that by incorporating the structure information of the sources, we can build a more accurate and reliable anomaly localization model. Below, we introduce graph-guided JSPCA, which effectively encodes the structure information in the anomaly localization framework.

To achieve the goal of smoothness of features, we add an extended $l_2$ (Tikhonov) regularization factor on the graph-Laplacian-regularized matrix norm of the $p-k$ PCs. This is an extension of the $l_2$-norm regularized Laplacian on a single vector in [33]. With this addition, we obtain the following optimization problem:

$$\min_{\mathbf{A}, \mathbf{B}} \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T\|_F^2 + \lambda_1 \|\mathbf{W} \circ \mathbf{B}\|_{1,2} + \frac{1}{2}\lambda_2 \mathrm{tr}\left((\mathbf{W} \circ \mathbf{B})^T \mathbf{L} (\mathbf{W} \circ \mathbf{B})\right) \quad \text{s.t. } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{p \times p}, \qquad (9)$$

where $\mathbf{L}$ is the Laplacian of a graph that captures the correlation structure of the sources [33].

In Fig. 4, we show a comparison of applying JSPCA and GJSPCA to the data shown in Fig. 1. Both JSPCA and GJSPCA correctly localize the abnormal sources 2, 3, and 7. Comparing JSPCA and GJSPCA, we observe that in GJSPCA the entry values corresponding to the three abnormal sources 2, 3, and 7 are closer (i.e., smoothness in the feature space). In the raw data, we observe that sources 2, 3, and 7 share an increasing trend. The smoothness reflects this shared trend and helps highlight the abnormal source 7. As evaluated in our experimental study, GJSPCA outperforms JSPCA; we believe that the additional structure information utilized in GJSPCA helps.

The same observation is also shown in Fig. 5. Comparing JSPCA and GJSPCA, we find that JSPCA assigns higher anomaly scores to sources 2 and 3 but a lower score to source 7, whereas GJSPCA has a smoothing effect on the abnormal scores and assigns similar scores to the three sources. The similar scores demonstrate the effect of the smooth regularization term induced by the graph Laplacian. The smoothness also sheds light on why GJSPCA slightly outperforms JSPCA in anomaly localization in our detailed experimental evaluation.

4.4 Extension with KLE

A limitation of PCA is that it only considers the spatial correlation and ignores the temporal correlation. As an extension of PCA, KLE was introduced to solve this problem in [14] by taking both spatial and temporal correlation into consideration. In [14], Brauckhoff et al. claimed that by extending PCA to KLE, they stabilized the anomaly detection performance and reduced the sensitivity of PCA to changes in the number of PCs representing the normal subspace [12]. Since JSPCA and GJSPCA are based on PCA, they both suffer from the same problem identified in [12].

In this section, we extend our regularization framework to KLE; the resulting methods are called JSKLE and GJSKLE, respectively. Our contributions are to formalize a regularized JSPCA with KLE for localization and to design efficient optimization algorithms to solve the objective with KLE. Our goal is to stabilize localization performance and reduce its sensitivity. This advantage will be illustrated in our experimental studies. For the details of the derivation of KLE and its connection with ordinary PCA, refer to the appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.176.


Fig. 5. Comparing different anomaly localization methods. From left to right: PCA, sparse PCA, JSPCA, and GJSPCA.


However, it is nontrivial to adapt the regularization framework proposed in (7) and (9) to the expanded data matrix $\mathbf{X}'$, because the data stream from each source has been extended from a vector to a matrix. The model parameters corresponding to each source also become a matrix, namely $\mathbf{B} = [\mathbf{B}_1^T, \mathbf{B}_2^T, \ldots, \mathbf{B}_p^T]^T$, where $\mathbf{B}_i$ is an $N \times pN$ matrix. The top $k$ PCs of $\mathbf{B}$ representing the normal subspace in regular PCA become $kN$ PCs after the KLE extension. Similarly, the abnormal subspace is the remaining $(p-k)N$ PCs of $\mathbf{B}$. More specifically, we consider the following optimization problem as the objective of JSKLE:

$$\min_{\mathbf{A}, \mathbf{B}} \frac{1}{2}\|\mathbf{X}' - \mathbf{X}'\mathbf{B}\mathbf{A}^T\|_F^2 + \lambda_1 \sum_{j=1}^{p} \|\mathbf{W}_j \circ \mathbf{B}_j\|_F \quad \text{s.t. } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{pN \times pN}, \qquad (10)$$

where $\mathbf{W}_j \in \{0, 1\}^{N \times pN}$ is the $j$th matrix block of $\mathbf{W}^T = [\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_p]$, similar to (6), with the first $kN$ columns being 0s and the rest being 1s:

$$\mathbf{W}_j = \begin{bmatrix} 0 & \cdots & 0 & 1 & \cdots & 1 \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & \cdots & 1 \end{bmatrix}.$$

For GJSKLE, we have to adjust the structured trace regularization component for the extended data. Since each source has been extended to multiple streams, we take average values across the $N$ extended streams and make the average values smooth according to the network topology. More formally, we consider the following objective:

$$\min_{\mathbf{A}, \mathbf{B}} \frac{1}{2}\|\mathbf{X}' - \mathbf{X}'\mathbf{B}\mathbf{A}^T\|_F^2 + \lambda_1 \sum_{j=1}^{p} \|\mathbf{W}_j \circ \mathbf{B}_j\|_F + \frac{1}{2N}\lambda_2 \mathrm{tr}\left((\mathbf{W} \circ \mathbf{B})^T \mathbf{P}^T \mathbf{L} \mathbf{P} (\mathbf{W} \circ \mathbf{B})\right) \quad \text{s.t. } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{pN \times pN}, \qquad (11)$$

where $\mathbf{P} \in \{0, 1\}^{p \times pN}$ is used to sum each block of $\mathbf{B}$ and is defined as

$$\mathbf{P} = \begin{bmatrix} 1 \cdots 1 & 0 \cdots 0 & \cdots & 0 \cdots 0 \\ 0 \cdots 0 & 1 \cdots 1 & \cdots & 0 \cdots 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 \cdots 0 & 0 \cdots 0 & \cdots & 1 \cdots 1 \end{bmatrix}.$$

In Fig. 6, we show the PC space computed from JSKLE and GJSKLE. There are two PCs representing the normal subspace, and the rest represent the abnormal subspace. Both JSKLE and GJSKLE highlight the abnormal sources, while GJSKLE shows a smoothing effect on the three abnormal sources 2, 3, and 7.

For JSKLE and GJSKLE, the definition of the abnormal score is slightly different from that of JSPCA and GJSPCA. Suppose the abnormal subspace is given by $\mathbf{V}^{(3)T} = [\mathbf{V}^{(3)}_1, \mathbf{V}^{(3)}_2, \ldots, \mathbf{V}^{(3)}_p]$ (the remaining $(p-k)N$ columns of $\mathbf{B}$ from (10) or (11)); the anomaly score for source $i$, $i = 1, \ldots, p$, is

$$\phi_i = \frac{\left\|\mathbf{V}^{(3)}_i\right\|_1}{(p-k)N}, \qquad (12)$$

where $\mathbf{V}^{(3)}_i$ is the $i$th matrix block of $\mathbf{V}^{(3)}$. Abnormal scores computed by JSKLE and GJSKLE are shown in Fig. 6. JSKLE and GJSKLE perform similarly to JSPCA and GJSPCA, but they are insensitive to the number of PCs representing the normal subspace, as will be shown in our experimental studies.

4.5 Optimization Algorithms

We present our optimization technique to solve (7), (9), (10), and (11) based on accelerated gradient descent [34] and a projected gradient scheme [35]. Since (10) and (11) are similar to (7) and (9), the following discussion focuses on (7) and (9). The solutions for (10) and (11) can be obtained by the same procedure with only minor changes in calculating the gradient and the gradient projection.

The proposed algorithm is expected to run in a single thread on a single-core machine. The most time-consuming components are the function evaluation and the gradient calculation at each step. However, both the function evaluation and the gradient calculation can be written in a certain "summation form" across training samples and hence follow the statistical query model. Learning algorithms that follow the statistical query model can be easily parallelized with the MapReduce programming paradigm, as discussed in [36]. Parallelizing learning algorithms is not the focus of this paper; hence, we do not discuss it here. For a detailed discussion of the statistical query model and of utilizing MapReduce to support learning algorithms within the statistical query model, see [36].

Although (7) and (9) are not jointly convex in $\mathbf{A}$ and $\mathbf{B}$, they are convex in $\mathbf{A}$ and $\mathbf{B}$ individually. The algorithm solves for $\mathbf{A}$ and $\mathbf{B}$ iteratively and achieves a local optimum. Due to the space constraint, we provide our optimization algorithm in the appendix, available in the online supplemental material.

A given B: If $\mathbf{B}$ is fixed, we obtain the optimal $\mathbf{A}$ analytically. Ignoring the regularization part, (7) and (9) degenerate to

$$\min_{\mathbf{A}} \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T\|_F^2 \quad \text{s.t. } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{p \times p}. \qquad (13)$$


Fig. 6. From left to right: PC space for JSKLE and GJSKLE, and abnormal scores for JSKLE and GJSKLE.


The solution is obtained by a reduced-rank form of the Procrustes rotation. We compute the SVD of $\mathbf{G}\mathbf{B}$ to obtain the solution, where $\mathbf{G} = \mathbf{X}^T\mathbf{X}$ is the Gram matrix:

$$\mathbf{G}\mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^T, \quad \mathbf{A} = \mathbf{U}\mathbf{V}^T. \qquad (14)$$

Solutions in the form of the Procrustes rotation are widely discussed; see [13], for example, for a detailed discussion.
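This closed-form A-step is straightforward to implement. A minimal sketch under our own naming (the paper itself provides no code):

```python
import numpy as np

def update_A(G, B):
    """A-step of the alternating scheme: with B fixed, the orthogonal
    A minimizing the reconstruction error is the reduced-rank
    Procrustes rotation A = U V^T, where U D V^T is the SVD of G B
    and G = X^T X is the Gram matrix, as in (14)."""
    U, _, Vt = np.linalg.svd(G @ B, full_matrices=False)
    return U @ Vt
```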

B given A: If $\mathbf{A}$ is fixed, we consider (9) only, since (7) is a special case of (9) when $\lambda_2 = 0$. Now the optimization problem becomes

$$\min_{\mathbf{B}} \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T\|_F^2 + \lambda_1 \|\mathbf{W} \circ \mathbf{B}\|_{1,2} + \frac{1}{2}\lambda_2 \mathrm{tr}\left((\mathbf{W} \circ \mathbf{B})^T \mathbf{L} (\mathbf{W} \circ \mathbf{B})\right). \qquad (15)$$

Equation (15) can be rewritten as $\min_{\mathbf{B}} F(\mathbf{B}) \stackrel{\text{def}}{=} f(\mathbf{B}) + R(\mathbf{B})$, where $f(\mathbf{B})$ takes the smooth part of (15),

$$f(\mathbf{B}) = \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T\|_F^2 + \frac{1}{2}\lambda_2 \mathrm{tr}\left((\mathbf{W} \circ \mathbf{B})^T \mathbf{L} (\mathbf{W} \circ \mathbf{B})\right), \qquad (16)$$

and $R(\mathbf{B})$ takes the nonsmooth part, $R(\mathbf{B}) = \lambda_1 \|\mathbf{W} \circ \mathbf{B}\|_{1,2}$. It is easy to verify that (16) is a convex and smooth function over $\mathbf{B}$, and the gradient of $f$ is $\nabla f(\mathbf{B}) = \mathbf{G}(\mathbf{B} - \mathbf{A}) + \lambda_2 \mathbf{L}(\mathbf{W} \circ \mathbf{B})$.

Considering the minimization of the smooth function $f(\mathbf{B})$ using the first-order gradient descent method, it is well known that the gradient step has the following update at step $i+1$ with step size $1/L_i$:

$$\mathbf{B}_{i+1} = \mathbf{B}_i - \frac{1}{L_i}\nabla f(\mathbf{B}_i). \qquad (17)$$

In [37], [34], it has been shown that the gradient step (17) can be reformulated as a linear approximation of the function $f$ at the point $\mathbf{B}_i$ regularized by a quadratic proximal term, $\mathbf{B}_{i+1} = \mathrm{argmin}_{\mathbf{B}} f_{L_i}(\mathbf{B}, \mathbf{B}_i)$, where

$$f_{L_i}(\mathbf{B}, \mathbf{B}_i) = f(\mathbf{B}_i) + \langle \mathbf{B} - \mathbf{B}_i, \nabla f(\mathbf{B}_i) \rangle + \frac{L_i}{2}\|\mathbf{B} - \mathbf{B}_i\|_F^2. \qquad (18)$$

Based on this relationship, we combine (18) and $R(\mathbf{B})$ to formalize the generalized gradient update step:

$$Q_{L_i}(\mathbf{B}, \mathbf{B}_i) = f_{L_i}(\mathbf{B}, \mathbf{B}_i) + \lambda_1 \|\mathbf{W} \circ \mathbf{B}\|_{1,2}, \quad q_{L_i}(\mathbf{B}_i) = \mathrm{argmin}_{\mathbf{B}} Q_{L_i}(\mathbf{B}, \mathbf{B}_i). \qquad (19)$$

The insight of such a formalization is that by exploiting the structure of the regularizer $R(\cdot)$, we can easily solve the optimization in (19), and the convergence rate is then the same as that of the gradient descent method. In this paper, we use accelerated gradient descent [34] to handle the smooth part and a projected gradient scheme [35] to tackle the nonsmooth part. Due to the space constraint, we do not discuss the detailed procedures but provide them in the appendix, available in the online supplemental material.
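The paper defers the solution of (19) to its appendix. For intuition, a standard closed form for a generalized gradient step with an $l_1/l_2$ penalty is row-wise group soft thresholding; the sketch below is that textbook computation under our own naming, not the authors' code:

```python
import numpy as np

def generalized_gradient_step(B, grad, L, lam1, k):
    """One update q_L(B) for F(B) = f(B) + lam1 * ||W o B||_{1,2}:
    take a gradient step on the smooth part f with step size 1/L,
    then apply the proximal operator of the group lasso penalty.
    Since the mask W only penalizes the last p-k columns, each row
    of that block is shrunk as a group; the first k columns pass
    through unchanged."""
    C = B - grad / L                     # gradient step on f
    out = C.copy()
    tail = C[:, k:]                      # penalized (abnormal) block
    norms = np.linalg.norm(tail, axis=1, keepdims=True)
    shrink = np.maximum(0.0, 1.0 - (lam1 / L) / np.maximum(norms, 1e-12))
    out[:, k:] = shrink * tail           # row-wise group soft threshold
    return out
```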

GJSPCA in Algorithm 2. Note that JSPCA is a special case of GJSPCA, we obtain the algorithm for JSPCA by setting

2 ¼  0. For JSKLE and GJSKLE, the only changes are thegradient of smooth parts in the objective (10), (11) andprojected gradient given by (17) in the appendix, availablein the online supplemental material. Given data matrix X 2Rn p and the number of PCs representing normal subspacek   and regularization parameters   1; 2, GJSPCA optimizestwo matrix variables alternatively and returns the matrix  B

composed of ordinary PCs representing normal subspaceand joint sparse PCs representing the abnormal subspace.

Algorithm 2. Graph Joint Sparse PCA (GJSPCA)
1: Input: $\mathbf{X}$, $k$, $\lambda_1$, $\lambda_2$, and max_iter.
2: Output: $\mathbf{B}$.
3: $\mathbf{A} := \mathbf{I}_{p \times p}$; $\mathbf{G} := \mathbf{X}^T\mathbf{X}$;
4: for iter = 1 to max_iter do
5:   Compute $\mathbf{B}$ given $\mathbf{A}$ using accelerated gradient descent and gradient projection as shown in the appendix;
6:   Compute $\mathbf{A}$ given $\mathbf{B}$ via (14);
7:   if (converged) then
8:     break;
9:   end if
10: end for
11: return $\mathbf{B}$;

5 EXPERIMENTAL STUDIES

We have conducted extensive experiments with three real-world data sets to evaluate the performance of JSPCA and GJSPCA on anomaly localization. We implemented our own versions of three state-of-the-art network-level anomaly localization methods: SNN [22], structured sparse learning (SSL) [21], and EEC [23], since no executables were provided by the original authors. We implemented all methods in Matlab and performed all experiments on a desktop machine with 6 GB of memory and an Intel Core i7 2.66-GHz CPU.

5.1 Data Sets

We used four real-world data sets from different application domains. For each data set, we singled out several intervals with anomalies. The anomalies are either labeled in the original data or manually labeled by ourselves when no labeling is provided. Note that we are only interested in the intervals where anomalies really exist, because we focus on localizing anomalies. We used a sliding window with fixed size $L$ and offset $L/2$ to create multiple data windows from the given intervals. The sliding window moves forward with the offset $L/2$ until it reaches the end of the intervals. We ran all methods on each data window to evaluate and compare their performances.

To run GJSPCA, we calculated the pairwise correlation between any two sources within the window. We produced a correlation graph for the data streams with a correlation threshold $\delta$: if the correlation between two sources is greater than $\delta$, we connect the two sources with an edge. This construction is meaningful because, for highly correlated data, streams influence each other, and such influence has been shown to be critical for better anomaly localization, as evaluated in our experimental studies.
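A sketch of this graph construction and the resulting Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$ used in (9); the use of the absolute correlation and the function name are our assumptions, since the text only says "greater than" the threshold:

```python
import numpy as np

def correlation_laplacian(X, delta):
    """Build the graph Laplacian for GJSPCA from one data window
    X (n x p): connect two sources whose absolute pairwise Pearson
    correlation exceeds delta, then return L = D - A."""
    R = np.corrcoef(X, rowvar=False)        # p x p correlation matrix
    A = (np.abs(R) > delta).astype(float)   # thresholded adjacency
    np.fill_diagonal(A, 0.0)                # no self-loops
    D = np.diag(A.sum(axis=1))              # degree matrix
    return D - A
```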


Below, we briefly discuss the data collection and preprocessing procedures for the data sets. In Table 2, we list the intervals we selected, the dimensionality of the network data streams, the sliding window size $L$, and the total number of data windows $W$ for each data set. For the KDD99 intrusion data set, $T$ is the number of connections and $p$ is the number of features.

The stock indices data set: The stock indices data set includes eight stock market index streams from eight countries: Brazil (Brazil Bovespa), Mexico (Bolsa IPC), Argentina (MERVAL), USA (S&P 500 Composite), Canada (S&P TSX Composite), Hong Kong (Hang Seng), China (SSE Composite), and Japan (NIKKEI 225). Each stock market index stream contains 2,396 time stamps recording the daily stock price indices from 1 January 2001 to 5 March 2010. Since this data set has no ground truth, we manually labeled all the daily indices for the selected intervals. In our labeling, we followed the criteria listed in [38], where small turbulence and comovements of most markets are considered normal, while dramatic price changes or significant deviation from the comovement trend (e.g., one index goes up while the others in the market drop) are considered abnormal.

The Sun SPOT sensor data set: We collected a sensor data set in a car trial for transport chain security validation using seven wireless Sun Small Programmable Object Technologies (SPOTs). Each SPOT contains a three-axis accelerometer. In our data collection, the seven Sun SPOTs were fixed in separate boxes and loaded on the back seat of a car. Each Sun SPOT recorded the magnitude of acceleration along the x-, y-, and z-axes with a sample rate of 390 ms. We simulated a few abnormal events, including box removal and replacement, rotation, and flipping. The overall acceleration $\sqrt{x^2 + y^2 + z^2}$ was used to detect the designed anomalous events.

The motor current data set: The motor current data are the current observations generated by the state-space simulations available at the UCR Time Series Archive [39]. The anomalies are simulated machinery failures in different components of a machine. The current value was observed under 21 different motor operating conditions, including one healthy operating mode and 20 faulty modes. For each motor operating condition, 20 time series were recorded with a length of 1,500 samples. Therefore, there are 20 normal time series and 400 abnormal time series altogether. In our evaluation, we randomly extracted 20 time series out of the 420 with length 1,500; ten are from the normal series and the rest are from the abnormal series.

KDDCup 99 intrusion detection data set: The KDDCup99 intrusion detection data set is obtained from the UCI repository [40]. The 10 percent training data set, consisting of 494,021 connection records, is used. Each connection can be classified as normal traffic or one of 22 different classes of attacks. All attacks fall into four main categories: DOS, Remote-to-local, User-to-root, and Probing (Probe). For each connection, 41 features are recorded, including seven discrete features and 34 continuous features. Since our algorithm operates on continuous features, the discrete features, such as protocol (TCP/UDP/ICMP), service type (http/ftp/telnet/...), and TCP status flag (SF/REJ/...), are mapped into distinct integers from 0 to $W-1$ ($W$ is the number of states of a specific discrete feature). For three features spanning a very large range, namely "duration," "src bytes," and "dst bytes," a logarithmic scale is applied to reduce the range. Finally, all 41 features are linearly scaled to the range [0, 1]. The task of anomaly localization on the intrusion detection data set is to identify the set of features most relevant to a specific anomaly, which is similar to feature selection.
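A sketch of this preprocessing pipeline; the function name and the column-index arguments are our assumptions, as the paper does not give code:

```python
import numpy as np

def preprocess_kdd(records, discrete_cols, log_cols):
    """Preprocess KDD99 records as described above: map each discrete
    feature to integers 0..W-1, log-scale the three wide-range
    features ('duration', 'src bytes', 'dst bytes'), then linearly
    scale all features to [0, 1]."""
    X = np.array(records, dtype=object)
    out = np.zeros(X.shape, dtype=float)
    for j in range(X.shape[1]):
        if j in discrete_cols:
            # Categorical values -> codes 0..W-1.
            _, codes = np.unique(X[:, j], return_inverse=True)
            out[:, j] = codes
        else:
            out[:, j] = X[:, j].astype(float)
    for j in log_cols:
        out[:, j] = np.log1p(out[:, j])         # compress large ranges
    mins, maxs = out.min(axis=0), out.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    return (out - mins) / span                  # min-max scale to [0, 1]
```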

5.2 Model Evaluation

For evaluation, since our focus is anomaly localization, we did not evaluate anomaly detection, although our method is able to do both. We used standard ROC curves and the area under the ROC curve (AUC) to evaluate anomaly localization performance. There is no training phase because our framework is unsupervised. Below, we introduce the details of the construction of the ROC curves.

As defined in (8), a higher abnormal score indicates a higher probability that the source is abnormal, which is the same convention as that of the baseline methods [23], [22], [21]. For a fair comparison, we compared the normalized abnormal scores across methods. The reason for normalization is that the anomaly scores generated by the baseline methods have different orders of magnitude. We use the term "anomaly score" to refer to the normalized abnormal score in the following analysis.

For each data window, the abnormal score vector $\tilde{\boldsymbol{\phi}} = [\tilde{\phi}_1, \ldots, \tilde{\phi}_p]^T$ was generated and compared with a cutoff threshold in $[0, 1]$ to separate abnormal sources from innocent sources. We performed the same procedure for all the data windows and finally obtained a prediction matrix of size $w \times p$, where $w$ is the number of data windows and $p$ is the number of sources. Each entry in the prediction matrix is 0 or 1, indicating whether the source is normal or abnormal. Comparing the prediction matrix with the ground truth resulted in a pair of true-positive rate (TPR) and false-positive rate (FPR), where TPR is the total number of correctly detected abnormal sources over the total number of abnormal sources, and FPR is the total number of incorrectly detected abnormal sources over the total number of normal sources in the $W$ windows. By changing the threshold, we obtained the ROC curve and the AUC value.
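This construction is easy to reproduce; a minimal sketch under our own naming, with the score and ground-truth matrices as inputs:

```python
import numpy as np

def roc_points(scores, truth, thresholds):
    """Compute (FPR, TPR) pairs from a w x p matrix of normalized
    anomaly scores and a 0/1 ground-truth matrix of the same shape,
    sweeping cutoff thresholds over [0, 1] as described above."""
    fprs, tprs = [], []
    n_abnormal = max(int(truth.sum()), 1)
    n_normal = max(int((1 - truth).sum()), 1)
    for t in thresholds:
        pred = (scores > t).astype(int)
        tp = int(np.sum((pred == 1) & (truth == 1)))
        fp = int(np.sum((pred == 1) & (truth == 0)))
        tprs.append(tp / n_abnormal)
        fprs.append(fp / n_normal)
    return np.array(fprs), np.array(tprs)
```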

For the network traffic data set, we evaluated our method qualitatively, because there is no ground truth about which features contribute to the observed anomaly, and there is no way to label them manually.


TABLE 2. Characteristics of Data Sets

D: data sets. D1: stock indices, D2: sensor, D3: motor current, D4: network traffic. $T$: total number of time stamps; $p$: dimensionality of the network data streams; $I$: total number of intervals; Indices: starting and ending points of the abnormal intervals; $W$: total number of data windows; $L$: sliding window size; -: not applicable.


For each kind of anomaly, we show the abnormal score of each feature and analyze it with some prior knowledge, such as the cause of a specific attack and how the attack affects the 41 features. To better demonstrate the effectiveness of JSPCA and GJSPCA, we also compare our results with those obtained from other feature selection methods, such as information gain [41] and SVM [42], on the KDDCUP 99 data set.

5.3 Anomaly Localization Performance

We have two parameters to tune in JSPCA: $\lambda_1$, controlling the sparsity, and $k$, the dimension of the normal subspace. GJSPCA has two more parameters: $\lambda_2$, controlling the smoothness, and $\delta$, the correlation threshold used to construct the correlation graph. For the baseline methods, we need to select the number of neighbors $k$ for SNN, the regularization parameter $\lambda$ for SSL, and the number of clusters $c$ for EEC. We first performed a grid search for each method to identify the optimal parameters and then compared the performance. The performances of the different methods depend on the parameter selection; we evaluate the sensitivity of our results in the next section.

For each data set, we tuned $\lambda_1$ and $\lambda_2$ within $\{2^{-8}, 2^{-7}, \ldots, 2^{8}\}$ and $\delta$ from 0.1 to 0.9. $k$ was tuned from 1 to 4 for the stock market and sensor data, and from 2 to 7 for the motor current data. All the ranges were set by empirical knowledge. Our empirical study showed that the performance did not change significantly as the parameters varied over a wide range, which reduced the parameter search space significantly.

Table 5 lists the best parameter combinations for JSPCA and GJSPCA. For SNN, we tuned the number of neighbors $k$ in the range 2-6 (for the stock index and sensor data) and in the range 2-10 (for the motor current data). For the EEC method, the number of clusters $c$ was tuned between 2 and 4. For SSL, we tuned $\lambda$ in the same range as $\lambda_1$ in JSPCA.

In Fig. 7, we show the performance of the methods on the three data sets. JSPCA and GJSPCA clearly outperform the other three methods: the AUC values of JSPCA and GJSPCA are both above 0.85 on all three data sets, while those of EEC, SNN, and SSL are around 0.5-0.6. Compared with JSPCA, GJSPCA is slightly better, which supports our hypothesis on the importance of incorporating the structure information of network data streams into anomaly localization. SNN clearly outperforms EEC on the sensor data and is comparable with EEC on the other two data sets. SSL outperforms SNN and EEC on two data sets.

We also did a case study in which PCA, SPCA, JSPCA, and GJSPCA are compared on a selected time interval. As shown in Fig. 5, PCA is not able to identify the abnormal sources, and SPCA fails to localize source 7 and introduces many false positives when the threshold is poorly selected. To further support the argument that PCA and SPCA are inadequate for anomaly localization, we ran an experiment on the stock indices data and calculated the AUC. The AUC values of PCA and SPCA are 0.0667 and 0.6703, respectively, much lower than that of JSPCA. We do not extend this experiment to the other data sets, since the two methods are clearly not competitive.

We also tested the KLE extension of the localization methods; Table 6 lists the best parameter combinations. In Fig. 8, we show the performance of JSKLE and GJSKLE in comparison with JSPCA and GJSPCA for $N = 2$. From the figure, we observe that the KLE extension does not outperform JSPCA and GJSPCA on anomaly localization, at the expense of introducing more computational complexity due to the data matrix expansion. However, the KLE extension stabilizes the localization performance, as shown in Section 5.5.
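The text does not spell out the expansion at this point, but one plausible construction, assuming the KLE extension replaces each of the $p$ streams with $N$ time-lagged copies, is sketched below; this grows a window from $L \times p$ to $(L - N + 1) \times Np$ and illustrates where the extra computational cost comes from.

```python
import numpy as np

def kle_expand(X: np.ndarray, N: int) -> np.ndarray:
    """X: one data window of shape (L, p). Returns the temporally expanded
    matrix of shape (L - N + 1, N * p) by stacking N lagged copies.
    Illustrative only; not necessarily the authors' exact construction."""
    L, p = X.shape
    blocks = [X[offset : L - N + 1 + offset, :] for offset in range(N)]
    return np.hstack(blocks)

X = np.random.randn(20, 5)       # toy window: L = 20 time stamps, p = 5 streams
print(kle_expand(X, N=2).shape)  # (19, 10); N = 1 recovers the plain window
```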

5.4 Feature Selection Performance

As mentioned earlier, anomaly localization on the KDD CUP 99 intrusion detection data set serves as a feature relevance analysis: localizing abnormal data streams amounts to identifying the features most related to a specific anomaly. More specifically, our algorithm aims to identify a set of relevant features among all 41 features for each type of attack. The features are indexed and listed in Table 3.
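Reading off the relevant features from the abnormal scores is then a simple thresholding step; a minimal sketch, assuming a hypothetical length-41 score vector for one attack type and an illustrative cutoff:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(41)          # stand-in abnormal scores for one attack (e.g., DOS)

cutoff = 0.8                     # illustrative relevance cutoff
relevant = np.flatnonzero(scores >= cutoff) + 1  # 1-based feature indexes (Table 3)
print("features most relevant to this attack:", relevant.tolist())
```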


Fig. 7. ROC curves and AUC for different methods on three data sets. From left to right: ROC for the stock indices data, ROC for the sensor data, ROC for the MotorCurrent data, and AUC for the three ROC plots.

Fig. 8. ROC curves for the KLE extension methods on three data sets. From left to right: ROC for the stock indices data, ROC for the sensor data, and ROC for the MotorCurrent data.


In Fig. 9, we show the abnormal scores for the 41 features under a DOS attack computed by SNN, EEC, and GJSPCA, respectively. Since the four joint sparse methods provide similar abnormal scores, we show only the result of GJSPCA in Fig. 9. Features 5, 6, 23, 24, 32, and 33 are the most relevant for the DOS attack, which is reasonable since DOS attacks by nature involve many connections to some host(s) in a very short period of time. In Table 4, we summarize the most relevant features for each attack found by our method GJSPCA. Our result is consistent with the relevant features found by Mukkamala and Sung [43] using SVM.

5.5 Parameter Selection

In this section, we evaluated the sensitivity of our methods to the modeling parameters. To do so, we selected one parameter at a time and systematically changed its value while fixing the others at their optimal values. Although our approaches have more parameters than the other methods, the sensitivity analysis shows that the performance of our methods is remarkably stable over a wide range of parameters. Next, we show the sensitivity study on the stock indices data set for the parameters $\lambda_1$, $\lambda_2$, $\epsilon$, and $k$; similar results are observed on the other two data sets.
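The one-at-a-time protocol just described can be sketched as follows; `evaluate_auc` is a hypothetical evaluator (stubbed in the usage line) that fits the model with a full parameter setting and returns its AUC.

```python
def sensitivity_sweep(optimal, grids, evaluate_auc):
    """For each named parameter, sweep its grid while holding the other
    parameters at their optimal values; return one AUC curve per parameter."""
    curves = {}
    for name, grid in grids.items():
        params = dict(optimal)       # start from the optimal setting
        curves[name] = []
        for value in grid:
            params[name] = value     # vary only this parameter
            curves[name].append((value, evaluate_auc(**params)))
    return curves

# Usage with a stub evaluator; plug in the real model evaluation instead.
curves = sensitivity_sweep(
    {"lam1": 1.0, "lam2": 1.0, "eps": 0.5, "k": 2},
    {"eps": [0.1 * i for i in range(11)], "k": [1, 2, 3, 4]},
    evaluate_auc=lambda **p: 0.9,
)
```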

In Fig. 10, we show the stability of GJSPCA as $\lambda_1$ changes. We observe that the AUC is quite stable over a wide range of $\lambda_1$; a similar phenomenon is observed when changing $\lambda_2$. In the middle part of Fig. 10, we performed a sensitivity analysis on the parameter $\epsilon$. We observe that the AUC remains stable for $\epsilon \in [0.15, 0.6]$. When $\epsilon = 0$, the graph is a complete graph and the smoothness regularization penalizes the loadings of each source across the PCs toward being similar to each other; hence, a very low $\epsilon$ leads to worse performance. On the other hand, when $\epsilon = 1$, the graph is just a set of isolated sources, the structure information is missing, and therefore the performance is not optimal.
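A sketch of the graph construction that $\epsilon$ controls, assuming (as one plausible reading) that sources $i$ and $j$ are connected when their absolute correlation reaches $\epsilon$; at $\epsilon = 0$ this yields the complete graph and at $\epsilon = 1$ essentially isolated sources, matching the behavior described above.

```python
import numpy as np

def correlation_graph(X: np.ndarray, eps: float) -> np.ndarray:
    """X: (n_timestamps, p) data window. Returns a p-by-p 0/1 adjacency
    matrix connecting sources whose |correlation| is at least eps."""
    C = np.corrcoef(X, rowvar=False)   # p-by-p correlation matrix
    A = (np.abs(C) >= eps).astype(int)
    np.fill_diagonal(A, 0)             # no self-loops
    return A
```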

An important parameter in PCA-based anomaly detection methods is $k$, the number of PCs spanning the normal subspace. In [12], Ringberg et al. claimed that the anomaly detection performance is sensitive to $k$. This claim is confirmed in the second subfigure of Fig. 10, where the overall AUC gradually decreases from 0.96 to 0.72 as $k$ changes from 1 to 3 and then increases to 0.77 at $k = 4$. Compared with GJSPCA, GJSKLE significantly stabilizes the localization performance when $\epsilon$ (the threshold for deriving the network topology) and $k$ (the dimension of the normal subspace) change. As shown in Fig. 11, when $\epsilon$ changes from 0 to 1 with step size 0.1, the AUC increases to its optimum of 0.94 at $\epsilon = 0.5$ and then decreases by 3 percent to its minimum of 0.91 at $\epsilon = 1$. Furthermore, the AUC remains above 0.9 for $k \in [1, 4]$.


Fig. 9. Anomaly localization comparison of SNN, EEC, and GJSPCA on the network intrusion data set (DOS attack).

TABLE 4
Most Relevant Features for Different Attacks (JSPCA)

TABLE 3
Feature Indexes in the KDD 99 Intrusion Detection Data Set

TABLE 5
Optimal Parameter Combinations on Three Data Sets

J: JSPCA, GJ: GJSPCA.

TABLE 6
Optimal Parameter Combinations on Three Data Sets

JK: JSKLE, GJK: GJSKLE.


JSKLE and GJSKLE involve one more parameter: the temporal offset $N$. To test the sensitivity to $N$, we repeated the KLE experiments with $N$ ranging from 1 to 5 on the finance data set. Note that (G)JSPCA is a special case of (G)JSKLE with $N = 1$. The result is shown in Fig. 12: as $N$ changes, the AUC is very stable, and the difference between the best case ($N = 1$) and the worst case ($N = 5$) is just 0.07. It may appear that $N = 1$ (which degenerates to (G)JSPCA) is better than the other settings. However, selecting $N = 2$ stabilizes the AUC of GJSKLE when $\epsilon$ and $k$ change, as shown in Fig. 11 compared with GJSPCA in Fig. 10.

6 CONCLUSIONS AND FUTURE WORK

Previous work on PCA-based anomaly detection claimed that PCA cannot be used for anomaly localization. We proposed two novel approaches, JSPCA and GJSPCA, for anomaly detection and localization in network data streams. By enforcing joint sparsity on the PCs and incorporating the structure information of the network via regularization, we significantly extended the applicability of PCA-based techniques to localization. Moreover, we developed JSKLE and GJSKLE based on multidimensional KLE, which considers both the spatial and temporal domains of data streams to stabilize localization performance. Our experimental studies on three real-world data sets demonstrate the effectiveness of our approach. Our future work will focus on two directions: 1) how to efficiently and effectively select model parameters, and 2) how to extend our approach to kernel PCA.

ACKNOWLEDGMENTS

This work was supported by the Office of Naval Research Award (N00014-07-1-1042) and the US National Science Foundation under Grant No. 0845951.

REFERENCES

[1] S. Budhaditya, D.-S. Pham, M. Lazarescu, and S. Venkatesh, "Effective Anomaly Detection in Sensor Networks Data Streams," Proc. IEEE Int'l Conf. Data Mining (ICDM '09), pp. 722-727, 2009.
[2] M. Isaac, B. Raul, E. Gerard, and G. Moises, "On-Line Fault Diagnosis Based on the Identification of Transient Stages," Proc. 20th European Symp. Computer Aided Process Eng., 2010.
[3] J. Silva and R. Willett, "Detection of Anomalous Meetings in a Social Network," Proc. 42nd Ann. Conf. Information Sciences and Systems (CISS '08), pp. 636-641, 2008.
[4] J. Zhang, Q. Gao, and H. Wang, "Anomaly Detection in High-Dimensional Network Data Streams: A Case Study," Proc. IEEE Int'l Conf. Intelligence and Security Informatics, pp. 251-253, June 2008.
[5] H. Nguyen, Y. Tan, and X. Gu, "PAL: Propagation-Aware Anomaly Localization for Cloud Hosted Distributed Applications," Proc. Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, 2011.
[6] N. Xu et al., "A Wireless Sensor Network for Structural Monitoring," Proc. Int'l Conf. Embedded Networked Sensor Systems, pp. 13-24, 2004.
[7] C. Franke and M. Gertz, "Orden: Outlier Region Detection and Exploration in Sensor Networks," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 1075-1078, 2009.
[8] X. Liu, X. Wu, H. Wang, R. Zhang, J. Bailey, and K. Ramamohanarao, "Mining Distribution Change in Stock Order Streams," Proc. Int'l Conf. Data Eng. (ICDE), pp. 105-108, 2010.
[9] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing Network-Wide Traffic Anomalies," Proc. ACM SIGCOMM, pp. 219-230, 2004.
[10] L. Huang, M.I. Jordan, A. Joseph, M. Garofalakis, and N. Taft, "In-Network PCA and Anomaly Detection," Proc. Advances in Neural Information Processing Systems (NIPS), pp. 617-624, 2006.


Fig. 11. Sensitivity analysis of GJSKLE on the stock indices data set. From left to right: $\epsilon$, the dimension of the normal subspace, $\lambda_1$, and $\lambda_2$.

Fig. 12. Sensitivity analysis of GJSKLE on the stock indices data set with respect to $N$.

Fig. 10. Sensitivity analysis of GJSPCA on the stock indices data set. From left to right: $\epsilon$, the dimension of the normal subspace, $\lambda_1$, and $\lambda_2$.


[11] A. Lakhina, M. Crovella, and C. Diot, "Mining Anomalies Using Traffic Feature Distributions," Proc. ACM SIGCOMM, pp. 217-228, 2005.
[12] H. Ringberg, A. Soule, J. Rexford, and C. Diot, "Sensitivity of PCA for Traffic Anomaly Detection," Proc. ACM SIGMETRICS Int'l Conf. Measurement and Modeling of Computer Systems (SIGMETRICS '07), pp. 109-120, 2007.
[13] H. Zou, T. Hastie, and R. Tibshirani, "Sparse Principal Component Analysis," J. Computational and Graphical Statistics, vol. 15, no. 2, pp. 265-286, 2006.
[14] D. Brauckhoff, K. Salamatian, and M. May, "Applying PCA for Traffic Anomaly Detection: Problems and Solutions," Proc. IEEE INFOCOM, pp. 2866-2870, 2009.
[15] X. Gu and H. Wang, "Online Anomaly Prediction for Robust Cluster Systems," Proc. Int'l Conf. Data Eng. (ICDE), pp. 1000-1011, 2009.
[16] J.-G. Lee, J. Han, and X. Li, "Trajectory Outlier Detection: A Partition-and-Detect Framework," Proc. Int'l Conf. Data Eng. (ICDE), pp. 140-149, 2008.
[17] D. Brauckhoff, X. Dimitropoulos, A. Wagner, and K. Salamatian, "Anomaly Extraction in Backbone Networks Using Association Rules," Proc. Ninth ACM SIGCOMM Conf. Internet Measurement, pp. 28-34, 2009.
[18] E. Keogh, J. Lin, and A. Fu, "Hot Sax: Efficiently Finding the Most Unusual Time Series Subsequence," Proc. Fifth IEEE Int'l Conf. Data Mining, pp. 226-233, 2005.
[19] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[20] A. Lung-Yut-Fong, C. Levy-Leduc, and O. Cappe, "Distributed Detection/Localization of Change-Points in High-Dimensional Network Traffic Data," Statistics and Computing, vol. 22, pp. 485-496, 2009.
[21] T. Ide, A.C. Lozano, N. Abe, and Y. Liu, "Proximity-Based Anomaly Detection Using Sparse Structure Learning," Proc. SIAM Int'l Conf. Data Mining, pp. 97-108, 2009.
[22] T. Ide, S. Papadimitriou, and M. Vlachos, "Computing Correlation Anomaly Scores Using Stochastic Nearest Neighbors," Proc. Seventh IEEE Int'l Conf. Data Mining (ICDM '07), pp. 523-528, 2007.
[23] S. Hirose, K. Yamanishi, T. Nakata, and R. Fujimaki, "Network Anomaly Detection Based on Eigen Equation Compression," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '09), pp. 1185-1194, 2009.
[24] J. Zhang, Q. Gao, H.H. Wang, Q. Liu, and K. Xu, "Detecting Projected Outliers in High-Dimensional Data Streams," Proc. Int'l Conf. Database and Expert Systems Applications, pp. 629-644, 2009.
[25] S. Yang and W. Zhou, "Anomaly Detection on Collective Moving Patterns: Manifold Learning Based Analysis of Traffic Streams," Proc. IEEE Third Int'l Conf. Social Computing (SocialCom), pp. 704-707, 2011.
[26] H. Cao, Y. Zhou, L. Shou, and G. Chen, "Attribute Outlier Detection over Data Streams," Proc. Int'l Conf. Database Systems for Advanced Applications, vol. 2, pp. 216-230, 2010.
[27] R. Jiang, H. Fei, and J. Huan, "Anomaly Localization by Joint Sparse PCA in Wireless Sensor Networks," Proc. Fourth Int'l Workshop Knowledge Discovery from Sensor Data, 2010.
[28] R. Jiang, H. Fei, and J. Huan, "Anomaly Localization for Network Data Streams with Graph Joint Sparse PCA," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 886-894, 2011.
[29] P.H. dos Santos Teixeira and R.L. Milidiu, "Data Stream Anomaly Detection through Principal Subspace Tracking," Proc. ACM Symp. Applied Computing (SAC '10), pp. 1609-1616, 2010.
[30] M. Yuan and Y. Lin, "Model Selection and Estimation in Regression with Grouped Variables," J. Royal Statistical Soc., Series B, vol. 68, pp. 49-67, 2006.
[31] H. Zou and T. Hastie, "Regularization and Variable Selection via the Elastic Net," J. Royal Statistical Soc., Series B, vol. 67, pp. 301-320, 2005.
[32] R. Jenatton, G. Obozinski, and F. Bach, "Structured Sparse Principal Component Analysis," Proc. 13th Int'l Conf. Artificial Intelligence and Statistics (AISTATS), 2010.
[33] H. Fei and J. Huan, "Boosting with Structure Information in the Functional Space: An Application to Graph Classification," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), 2010.
[34] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Univ. Press, 2004.
[36] C.-T. Chu, S.K. Kim, Y.-A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. Advances in Neural Information Processing Systems (NIPS), pp. 281-288, 2006.
[37] D.P. Bertsekas, Nonlinear Programming, second ed. Athena Scientific, 1999.
[38] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, article 15, 2009.
[39] E. Keogh and T. Folias, "The UCR Time Series Data Mining Archive," http://www.cs.ucr.edu/eamonn/TSDMA/index.html, 2002.
[40] A. Frank and A. Asuncion, "UCI Machine Learning Repository," http://archive.ics.uci.edu/ml, 2010.
[41] H.G. Kayacik, A.N. Zincir-Heywood, and M.I. Heywood, "Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99," Proc. Third Ann. Conf. Privacy, Security and Trust (PST), 2005.
[42] S. Mukkamala and A.H. Sung, "Feature Ranking and Selection for Intrusion Detection Systems Using Support Vector Machines," Proc. Second Digital Forensic Research Workshop, 2002.
[43] S. Mukkamala and A.H. Sung, "Feature Selection for Intrusion Detection Using Neural Networks and Support Vector Machines," J. Transportation Research Board of the Nat'l Academies, 2003.

Ruoyi Jiang received the bachelor's degree in electrical engineering from the Nanjing University of Post and Telecommunication, China, in 2008 and the master's degree in computer science from the University of Kansas in May 2012. She is now working as a data scientist at Adometry. Her research interests include anomaly detection, anomaly localization, and social network analysis. She has served as an external/invited reviewer for leading data mining conferences and journals, including IEEE ICDM, ACM SIGKDD, ACM CIKM, and the ACM Transactions on Knowledge Discovery from Data.

Hongliang Fei received the bachelor's degree in computer science from the Harbin Institute of Technology, China, in 2007 and the PhD degree in computer science from the University of Kansas in 2012 with honors. He is now working as a research scientist at IBM T.J. Watson Research. His research focuses on leveraging structured input/output information from data to enhance supervised and unsupervised learning algorithms, including sparse learning, feature selection, anomaly detection, and multitask learning. He received the ICDM '11 Best Student Paper Award and serves as a reviewer for leading journals including the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, and Pattern Recognition.

Jun Huan received the PhD degree in computer science from the University of North Carolina at Chapel Hill in 2006 (with Wei Wang, Jan Prins, and Alexander Tropsha). He is an associate professor in the Electrical Engineering and Computer Science Department at the University of Kansas. He holds courtesy appointments at the Information and Telecommunication Technology Center, Bioinformatics Center, and Bioengineering Program, all KU research organizations. He received the US National Science Foundation Faculty Early Career Development Award in 2009. He has published more than 60 peer-reviewed papers in leading conferences and journals and has served on the program committees of leading international conferences including ACM SIGKDD, ACM CIKM, IEEE ICDE, and IEEE ICDM.

