sketch-based network-wide trafï¬c anomaly detection

Sketch-based Network-wide Traffic AnomalyDetection

Yang Liu, Linfeng Zhang, and Yong GuanDepartment of Electrical and Computer Engineering

Iowa State University, Ames, Iowa 50011Email: yangl, zhanglf, [email protected]

F

Abstract

Internet has become an essential part of the daily life for billions of users worldwide, who are using a large variety of

network services and applications everyday. However, there have been serious security problems and network failures

that are hard to resolve, for example, Botnet attacks, polymorphic worm/virus spreading, DDoS, and flash crowds. To

address many of these problems, we need to have a network-wide view of the traffic dynamics, and more importantly,

be able to detect traffic anomaly in a timely manner. Existing network measurement and monitoring solutions often

suffer scalability problems caused by overly large processing, space, or communication overhead. In this paper, we

propose to develop sketch-based algorithms for network-wide anomaly detection that are able to detect both high-profile

and coordinated low-profile traffic anomalies as an outlier in the regular traffic patterns. Our approach is based on the

spatial analysis by using traffic measurements from multiple monitors. Spatial analysis methods have been proved to

be effective in detecting network-wide traffic anomalies that are not detectable at a single monitor. To our knowledge,

Principle Component Analysis (PCA) is the best-known spatial detection method for the coordinated low-profile traffic

anomalies. However, existing PCA-based solutions have scalability problems in that they require O(m2n) running time

and O(mn) space to analyze traffic measurements from m aggregated traffic flows within a sliding window of the length

n, which makes it often infeasible to be deployed for monitoring large-scale high-speed networks. We propose two novel

sketch-based algorithms for PCA-based traffic anomaly detection in a distributed fashion. Our algorithm can archive

O(w log n) running time and O(w log2n) space at local monitors, and O(m2 log n) running time and O(m log n) space at

Network Operation Center, where w denotes the maximum number of traffic flows at a single local monitor. Additionally,

our algorithm can protect the privacy of traffic measurements for Internet Service Providers.

1

Sketch-based Network-wide Traffic AnomalyDetection

1 INTRODUCTION

The Internet has become an essential part of the daily life for billions of users worldwide. People areusing and relying on a large variety of services built on the top of the Internet, such as web browsing,online banking, shopping, entertainment, VoIP, Video on demand, auction, social networks, etc. However,everyday we are still reading news stories about major security breaches, new polymorphic worm/virusspreading, identity theft, Botnet activity, DDoS or phishing emails. To address many of these problems(e.g. DDoS, Botnet, worm/virus, etc.), we need to have a network-wide view of the traffic dynamics,and more importantly, be able to detect the traffic anomaly in a timely manner. Otherwise, the failure ofdoing so may cause catastrophic damages or unwanted results with impacts affecting online business,public safety, homeland security, personal privacy, the economy and the society at large.

Traffic anomalies can occur due to a variety of problems. Firstly, security threats like DDoS, worms,and Botnets, can generate extremely large-volume anomalous traffic. Secondly, unusual events can causetraffic anomalies, like equipment failures, vendor implementation errors, and software bugs. Thirdly,abnormal user behaviors can change the traffic patterns, for example, flash crowds, non-malicious largefile transfers, etc. In the early days, traffic anomalies often involve unusual large-volume traffic, i.e.high-profile traffic, which are mainly caused by traditional DoS, worm, or flash crowds. In recent years,new threats like Botnets introduce low-profile but in a coordinated manner, which only generate a smallamount of traffic but follow specific coordinated traffic patterns. Besides these, there are also some trafficanomalies that are low-profile and non-coordinated, e.g. Black mails and spam voice IP calls.

R1R2

R3

R4

R5

Network Operation

Center 1LOCAL

MONITOR

User

R6

R7 R8R9

Server

AS1

AS2

Network Operation

Center 2

LOCAL

MONITORLOCAL

MONITOR

LOCAL

MONITOR

LOCAL

MONITOR

LOCAL

MONITOR

LOCAL

MONITOR

Fig. 1. A Distributed Framework for Network Measurement and Monitoring

For the purpose of addressing problems like intrusion detection, fault detection and recovery, andQoS provisions, many ISPs have chosen to use a distributed architecture for the network monitoringas shown in Fig.1, which consists of local monitors and Network Operation Centers (NOC). In thisframework, local monitors collect data from routers and other network devices, perform some processingat or close to the data sources, and then transfer their measurements to NOCs. NOCs are responsible formining characteristics of interest from collected measurements, and identifying the problems and the rootsthereof. Many of such measurements from these system are also the data source for the traffic anomalydetection. Monitoring and detecting network-wide traffic anomalies has been and is still challenging forthe following reasons: Firstly, the Internet traffic exhibits huge fluctuations and long range dependence [1],which makes traffic anomalies often be hidden by large volumes of the normal traffic. Secondly, trafficanomalies show an extreme diversity and new varieties of traffic anomalies are emerging everyday.Thirdly, ISPs want to detect traffic anomalies when they are still at a low-profile volume in order toreduce the damage as much and early as possible. Last but not the least, there are many systems where

2

data, computing, and other resources are distributed and cannot be transported to a center for variousreasons, e.g. low bandwidth, security, privacy, and load balancing issues.

The spatial analysis like Principal Component Analysis (PCA) [2] has been verified to be an effectivemethod for the traffic anomaly detection. But it introduces several challenges for applying PCA onlinein practice [3]. This method only classifies specific time intervals as anomalies, but cannot identifyresponsible ones. Local monitors must send their data to the NOC periodically, which could causethe scalability problems. PCA requires a singular value decomposition (SVD) of a n × m matrix. Thecomputation complexity of SVD is O(nm2) and the space requirement is O(nm), which would become abottleneck to perform PCA in high-spend networks. Because the bandwidth to a NOC may be limited,local monitors cannot send data at a high frequency. To address this problem, the NOC must set a longenough period for each monitor to update their measurements.

1.1 Our Contribution

In this paper, we propose sketch-based algorithms for the traffic anomaly detection based on network-wide traffic measurements, which can significant improve the computation and storage overhead for thePCA-based methods [2], [4]–[7]. In our algorithms, each local monitor only maintains a series of sketchesfor each traffic flow rather than the raw measurements, which requires O(w log n) running time andO(w log2 n) space, where w denotes the maximum possible number of traffic flows at a local monitor.Traffic measurements are projected into random selected sketches that are sent to the NOC, and keepthe original data in secret. It is very difficult to learn other information about network traffic from thesketches but traffic anomalies. The sketch computation and update can be done either at the local monitoror at the NOC with the consideration about the privacy protection and the communication cost. NOCcan run PCA-based detection methods by using collected sketches with O(m2 log n) running time andO(m log n) space. Our main contributions can be summarized as follows:

1) Our algorithm is efficient in both the running time and the space requirement.2) Our algorithm is flexible for ISP to balance the computation and the storage in a distributed

measurement and monitoring system.3) Our algorithm can detect both high-profile and coordinated low-profile traffic anomalies as an

outlier in the regular traffic patterns like the PCA-based methods.4) Our algorithm provides an additional benefit, i.e. the privacy protection of traffic measurements.5) Our algorithm can also keep the communication cost below a upper bound l when the NOC only

allows a long updating period due to the limited communication bandwidth.

1.2 Related work

Traffic anomaly detection has become an important issue for the network management in the Internet,which has obtained considerable research interests. For the high-profile traffic anomalies, researcherscan apply some signal analysis methods on the traffic measurements from a single monitor to detecttraffic anomalies [8]. To deal with the low-profile coordinated traffic anomalies, Lakhina et al. [2], [5]proposed a PCA-based detection method by utilizing traffic measurements from multiple links. Li et al.[7] aggregated traffic flows into sketch subspaces and detected traffic anomalies based on PCA. Due to thehigh communication cost in Lakhina’s method, Huang et al. [9] designed a local algorithm to filter dataat the local monitor in order to avoid excessive use of the network-wide communication. A local monitorwill send its data to the NOC only if the local error exceeds a user-specified tolerance. Furthermore, ageneral framework of detection methods [10] and a distributed multi-dimensional indexing system oftraffic measurements [11] have been proposed for the traffic anomaly detection problem. Recently, severalnew methods are introduced to detect traffic anomalies based on various features in the network traffic[12]–[14]. Chhabra et al. [13] used the generalized quantile sets (GQSs) to identify a set of candidateanomalies at each local monitor. Then a local monitor communicates its detection results with otherlocal monitors to finally detect traffic anomalies. Kline et al. [14] utilized Bayes Net to identify potentialanomalous traffic from traffic volumes and correlations between ingress/egress packet and bit rates.

3

The rest of this paper is structured as following. We formalize the traffic anomaly detection problemin Sec.2. Next we present two detection algorithms in Sec.3 and Sec.4, respectively, where the secondone is an advanced version of the first one that improves the algorithm at local monitors. The theoreticalanalysis of these two methods are also given in the same sections, respectively. We evaluate our algorithmby using the data from Abilene Observatory Data Collections [15] in Sec.6. We conclude our paper withfuture work in Sec.7.

2 PROBLEM DEFINITION

2.1 Background

The Internet is a global system of interconnected computer networks, which can provide data interchang-ing by using the standardized Internet Protocol Suite (TCP/IP). The computer networks are organizedinto several autonomous systems (AS), each of which is independently operated by an Internet ServiceProvider (ISP). The success of the Internet mainly owes to the end-to-end principle, which results in asimple network infrastructure. All the data transported by the Internet are divided into IP packets, andeach packet is forwarded hop-by-hop by routers. There are a source address and a destination addressin each packet’s header, which are used by routers to determine the forwarding path from the sourceto the destination. Inter-domain routing protocol (BGP) is used to forward IP packets among differentASs. And there are more than one intra-domain routing protocol that can be used in a single AS. Thecommunication between two computers is controlled by a transmission protocol like TCP, which createsan individual end-to-end traffic flow.

Due to the exponential increase in terms of the number of users and applications, it has become notfeasible to maintain statistics for each individual end-to-end traffic flow if not impossible. Thus, ISPs oftenaggregate end-to-end traffic flows at different levels, such as origin autonomous systems, ingress links,applications, etc. For example, ISPs can use the origin-destination (OD) flow, defined as all packets thatenter the network at one origin router and exits at another destination router.

Local monitors collect traffic measurements on aggregated traffic flows in real time. A traffic measure-ment, denoted by s, can contain one or several traffic features, e.g.

s = c1(IP), c2(Port), c3(Size),where ck(·) denotes a function on IP addresses, TCP/UDP ports, packet size, or other traffic features.And ck(·) is computed over packets within a time interval, which can be the count, the entropy, or otherquantities of the traffic features.

2.2 Traffic Anomaly Definition

Fig. 2. DoS Traffic Flood c©CAIDA

A high-profile traffic anomaly often means an unusual large volume of traffic from one or multiplesources to one or multiple destinations. As a simple example, we show the packet counts in a campus

4

link in Fig.2 from CAIDA [16]. There was a spike around 11:00 a.m. in both inbound and outboundtraffic, which was caused by a flood of incoming TCP ACK packets directed to a campus computer.Flash crowds is another example of common traffic anomalies, which refer to large unexpected trafficspikes towards particular web-sites due to associated user behaviors. Usually, sudden events of greatinterest trigger flash crowds, for example the CNN broadcast on the terrorist attacks of September 11,2001 [17]. The increase in the request rate is dramatic, but relatively short in duration. On September 11,CNN served over 132 million pages, and the traffic was almost doubling every 7 minutes.

Usually, a local monitor can detect such big changes in the traffic pattern, but it is difficult to identifypotential traffic anomalies that arise from small fluctuations in thousands of traffic flows. For the low-profile coordinated traffic anomalies, some computers attempt to compromise vulnerable hosts, propagatemarlacious software, or operate some botnets, which only involve small-volume traffic flows as shown inFig. 3 including the traffic volumes in four OD flows on the Abilene network [15]. Such traffic anomalieslike the Botnet [18] cannot be detected at a single local monitor and therefore require a network-widetraffic analysis.

5700 5750 5800 5850 59000

1

2x 10

8

Volume ATLA − CHIC

5700 5750 5800 5850 59000

5

10

15x 10

7

Volume CHIC − KANS

5700 5750 5800 5850 59000

2

4

6x 10

7

Volume CHIC − SALT

5700 5750 5800 5850 59000

5

10

15x 10

7

Volume

Time

SEAT − SALT

Fig. 3. An example of coordinated traffic anomalies from the Abilene network

In a large-scale network, ISPs can utilize traffic measurements from multiple monitors to detect trafficanomalies that cannot be detected at a single monitor. An observation, denoted by x = (s1, . . . , sm)T , isdefined as a vector of the measurements from multiple monitors, where sj denotes a measurement of thej-th flow and m denotes the number of traffic flows. Although each sj may be a normal measurementindividually, they can only be observed concurrently when there are some traffic anomalies occurring. Ingeneral, a traffic anomaly can be detected as an observation that doesn’t belong to an normal observationset Ω, which contains all observations corresponding to the normal traffic.

The easiest way to check whether an observation x belongs to Ω is to compare x with each elementin Ω, which may require a large amount of time to process all possible observations. In order to detecttraffic anomalies in nearly real time, we have to investigate the properties of normal traffic, and try to findsome quantities that are different enough to distinguish normal traffic and anomalous traffic. The anomalydistance is defined as such a distance norm that the normal observations in Ω are close to each other witha high probability. We can pick a typical normal observation x0 ∈ Ω, and rank every observation basedon the anomaly distance between itself and the typical observation x0. Then the probability distribution

5

of the anomaly distance on normal observations can be computed, which will be used to detect trafficanomalies with a statistical confidence.

Definition 1: Given an observation set Ω and an anomaly distance dΩ(·, ·), an observation x is a trafficanomaly at (1 − ) confidence level if

P (x′ : dΩ(x′,x0) > dΩ(x,x0)) ≤ (1)

where P (·) denotes the probability distribution of the anomaly distance, and is a real number between0 and 1.

In this paper, we take the detection of traffic volume anomalies as an example to verify our method,which is a specific case study of the general problem. In our algorithm, we do not specify any aggregationmethod, and assume that the aggregation method is given by ISPs who can balance the computationoverhead and their requirements. And we have no knowledge about the statistical properties of the traffic.As a consequence, we are given a previous measurement set Ω, and estimate the probability distributionof the anomaly distance based on it. The traffic volume is defined as the total size of all IP packets ina traffic flow within a single time interval. Here xij denotes the traffic volume of the j-th flow at thei-th time interval. The measurement set Ω contains all recent observations within a sliding window ofthe length n from m traffic flows. And the probability density distribution of the anomaly distance isdenoted by fd(x). For simplicity, we use dΩ(x) to denote the anomaly distance between x and x0.

Definition 2: Given the measurement set Ω, an observation xi measured at the i-th time interval is atraffic anomaly at 1 − confidence level, if

dΩ(xi) > δ (2)

where δ denotes the threshold of the anomaly distance assigned to the observation set Ω, which can beobtained by solving the following equation,

∫ ∞

δ

fd(x)dx = . (3)

2.3 Objective of this research

Our algorithm aims at detecting traffic anomalies with continuous traffic measurements updating withoutretrieving previous data in high-speed networks. To archive this goal, we must compute the anomalydistance for sampled observations based on the measurement set Ω with the consideration of computationand space constraints. Firstly, the computation should be very efficient at both the local monitors and theNOC. Secondly, the algorithm is implemented in a distributed system which requires careful considerationabout the storage and communication overhead. Thirdly, it should support continuous update to adjustthe detection method due to the evaluation of the traffic anomaly. Last but not the least, a privacyprotection mechanism makes ISPs be able to share traffic measurements with each other, which canprovide a network-wide detection of the traffic anomaly.

3 SIMPLE SKETCH METHOD (SSM)

Pa

cke

t Stre

am

Local Monitor Network Operation Center

Sketch

Computation

PCA

Computation

Alarm?

Ag

gre

ga

tion

Volume

Counter

Traffic

Volume

Sketch

Computing

Anomaly

Distance

Computing

Threshold

d > ?

Fig. 4. System model of simple sketch method

6

The system model of the simple sketch method is shown in Fig.4, which utilizes Random Projection[19]–[21] to reduce the computation complexity. In this method, the measurement set Ω is converted intoa compact representation, i.e. the sketches. The sketches require smaller storage space and therefore canbe maintained in the memory. The computation of the threshold δ and the anomaly distance dΩ(xi)

are based on the sketches rather than the original measurement set Ω. There are five modules and thealgorithm in each module is described in the following section. Then we give an example to showthe dection procedure of the SSM. At last, we provide the theory for the SSM method to detect trafficanomalies and the analysis of its computation complexity and the error bound.

3.1 Algorithm

3.1.1 Volume Counter

ISP implements a aggregation method and reports a pair (FlowID,Size) to the volume counter, whereSize denotes the packet size and FlowID is the index of the traffic flow. The volume counter maintainsa list of buckets for each flow. A bucket Uij stores the traffic volume xij of the j-th flow at the i-th timeinterval. An array of buckets contains all traffic volumes within a sliding window of the length n for eachflow. When a pair (FlowID,Size) with FlowID = j comes at the i-th time interval, the correspondingbucket Uij will be increased by Size. When a time interval ends, we just delete the last bucket and createa new bucket for the new time interval. The local monitor reports

yij = xij − xtj (4)

at the time interval t, which possibly contains anomalous traffic to the NOC, where xtj denotes the meanof traffic volumes within the sliding window at the current time interval t,

xtj =1

n

t∑

i=t−n+1

xij . (5)

3.1.2 Sketch Computation

The sketch is a compact representation of a sequence of traffic measurements in order to computeinterested statistics efficiently [22], [23]. It is an extension of the Random Projection which projects an-dimensional vector into a random-selected l-dimensional sketch with a set of random vectors. Afterthe projection, the distance between the original vectors can be preserved with the benefit that thesize l of the sketches is much less than the dimension n of the original vectors, i.e. l = O(log n). Ateach local monitor, we compute an l-dimensional sketch of the traffic volume for each flow. And allmonitors share the same random numbers for the sketch computation. At each time interval, we usesome pseudo random generators to compute l independent and identically-distributed (i.i.d.) randomnumbers, denoted by ri1, . . . , ril, from a specific probability distribution F . Here, F could be the standardnormal distribution, or some simple distributions that are easy to generate random numbers [24]. Theproperty of the probability distribution F is closely related to the detection accuracy. The sketch zkj iscomputed as,

zkj =1√l

t∑

i=t−n+1

rik(xij − xtj) =1√l

t∑

i=t−n+1

rikyij (6)

for i = t − n + 1, . . . , t, j = 1, . . . ,m, and k = 1, . . . , l.

3.1.3 Principle Component Analysis

NOC gets all sketches zkj : k = 1, . . . , l, j = 1, . . . ,m from local monitors, and organizes them into al×m matrix Z. PCA is applied on Z and treats each row as a point in Rm and each column as a variable.PCA performs a coordinate rotation that aligns the transformed axes with the directions that make theprojections of the row vectors on each axis get as large variance as possible. Principal components arethe unit vectors along these axes. The first principal component of Z, denoted by a1, can be found as,

a1 = arg max‖x‖=1

‖Zx‖ (7)

7

where arg max stands for the vector x = (x1, . . . , xm)T that satisfies ‖x‖ = 1 and makes the function‖Zx‖ get the maximum value. Here, ‖x‖ stands for the Euclidean norm

‖x‖ =√

x21 + · · · + x2

m. (8)

With the first r− 1 principal components, i.e. a1, . . . ,ar−1, the r-th principal component ar can be foundby subtracting the first r − 1 principal components from Z,

ar = arg max‖x‖=1

‖(Z −r−1∑

j=1

ZajaTj )x‖. (9)

The standard procedure to find principle components is the singular value decomposition (SVD). Apair of vectors a ∈ Rm and b ∈ Rl are singular vectors of Z, if Za = λb and bTZ = λaT , where λ is areal number denoting the corresponding singular value. And the principle components, i.e. a1, . . . ,am,are one of the pairs of singular vectors. Usually, the corresponding singular values of each principlecomponent are ordered, i.e. λ1 ≥ λ2 ≥ · · · ≥ λm ≥ 0. Then the matrix Z can be written as

Z =∑

j

λjbjaTj . (10)

3.1.4 Anomaly Distance Computation

Given the principle components, i.e. a1, . . . ,am, we choose the last few principle components to computethe anomaly distance of an observation yi = (yi1, . . . , yim)T ,

dZ(yi) =

√√√√

m∑

j=r+1

(aTj yi)2 (11)

where r is an integer, r < m. Because the first r principle components can capture the main pattern inthe traffic, a normal observation should reside in the subspace of a1, . . . ,ar, and the last m− r principlecomponents are assumed to contain only random fluctuations. Thus, the probability distribution of theanomaly distance can be approximated by a distribution of a sum of chi-square random variables, becauseaT

j yi follows the normal distribution.Before computing anomaly distance, we need to determine the number of principal components which

are corresponding to random fluctuations in the traffic volumes. The last m − r principle componentsare chosen to compute anomaly distance, where r usually refers as the size of the normal subspace.There are several techniques which can be applied to determine the size of the normal space, such askσ-heuristic, Cattell’s Scree Test, and so forth. Here, we give a brief introduction about the 3σ-heuristic[2]. The projection of the matrix Z on the j-th principle component, i.e. Zaj , is examined one by one.When a projection is found that the value of an element in Zaj exceeds 3σj from the mean, where σj isthe standard deviation, this and all remaining principle components are selected.

3.1.5 Threshold Computation

The threshold computation is based on the fault detection in multivariate process control [25]. Becausea1, . . . ,am is an orthonormal set of vectors, we get

‖yi‖ =

√√√√

m∑

j=1

(aTj yi)2. (12)

Let Q = [a1, . . . ,ar ], and then

dZ(yi) =

√√√√

m∑

j=r+1

(aTj yi)2 =

√√√√yT

i yi −r∑

j=1

(aTj yi)2 = ‖(I − QQT )yi‖. (13)

8

Therefore, the anomaly distance dZ(yi) equals to the squared prediction error (SPE) [25]. We can computea threshold δ based on the Q-statistic developed by Jackson and Mudholkar,

δ2 = ϕ1

[

c

√

2ϕ2h2

ϕ1+ 1 +

ϕ2h(h − 1)

ϕ21

]1/h

(14)

wherec = 1 − , (15)

h = 1 − 2ϕ1ϕ3

3ϕ22

, (16)

ϕk =1

(n − 1)k

m∑

j=r+1

λ2kj (k = 1, 2, 3). (17)

NOC first computes the anomaly distance according to Eq.(11). At the last step, if dZ(yi) > δ, NOCidentifies yi as a traffic anomaly.

3.2 Detection Example

5700 5750 5800 5850 59000

0.5

1

1.5

2x 10

8

ATLA − CHIC

Time

Volu

me

5700 5750 5800 5850 59000

1

2

3x 10

8

Anomaly Distance

Time

Dis

tance

Fig. 5. An example of traffic anomalies by SSM

Given an observation yi, we can identify yi as a traffic anomaly with a probability at least 1 − ifdZ(yi) > δ. Here, we use an example from the Abilene network [15] to show the detection procedure.There are 9 routers in the Abilene network, i.e. ATLA, CHIC, HOUS, KANS, LOSA. NEWY, SALT, SEAT,and WASH. We first get the traffic volume series y of each OD flow within 10 days from the VolumeCounter, e.g.

yATLA−CHIC = (−4.07 × 107, . . . , 4.54 × 107)︸︷︷︸

n=2880

. (18)

Then we compute the sketch zATLA−CHIC by using n × l random numbers rik according to Eq.(6),

zATLA−CHIC = (4.26 × 107, . . . , 8.73 × 107)︸︷︷︸

l=300

. (19)

NOC collects zO−D for all OD flow and organizes them into a matrix Z. By applying PCA on Z, we getprinciple components and the eigenvalues. Given an observation

y5857 = (2.24 × 107, . . . ,−2.42 × 108)︸︷︷︸

m=81

(20)

9

NOC computes the anomaly distance dZ(y5857) = 2.08 × 108 according to Eq.(11). The threshold δ =1.79 × 108 is computed as Eq.(14). Therefore, dZ(yi) > δ and NOC identifies the i = 5857 time intervalas a traffic anomaly.

3.3 Computation Complexity

SSM detects traffic anomalies based on the same principles as Lakhina’s method [2], which can detectlow-profile coordinated traffic anomalies. If there is an increase on several flows at the same time, we candetect that the anomaly distance will exceed the threshold. In fact, SSM is an approximation algorithmfor Lakhina’s method, which can improve the computation complexity.

Theorem 1: The computation complexity and the space requirement of SSM are both O(wn log n) atthe local monitor, where w denotes the maximum possible number of traffic flows. The computationcomplexity is O(m2 log n) and the space requirement is O(m log n) at the NOC.

Proof: At a local monitor, we first compute the mean of traffic volumes and then subtract it fromxij in order to get yij , which takes O(wn) running time. Then, yij is multiplied by the random numbersrik, which need to compute O(wn log n) productions. We calculate the sum of the productions as thesketch, that requires O(wn) running time. Therefore, the computation complexity is O(wn log n). Thelocal monitor needs to save the traffic volumes and random numbers, which requires O(wn) spaces.

At the NOC, the computation complexity of the SVD on a l × m matrix is O(m2l), which meansthat the computation complexity is at most O(m2 log n). In order to save the sketch matrix, NOC needsO(ml) = O(m log n) memory space. At each time step, NOC uses the observation y and pre-computedprinciple components to detect traffic anomalies, which only requires O(m2) running time. In general,the SSM requires O(m log2 n) running time and O(m log n) space at the NOC.

3.4 Error Bound Analysis

First of all, we give a brief introduction about PCA-based method, which was introduced by Lakhina todetect coordinated low-profile traffic anomalies. Next, we prove that SSM can bound the detection errorat a use-specified accuracy level.

3.4.1 PCA-based Anomaly Detection

In Lakhina’s method, all traffic volume xij from m traffic flows within the sliding window of the lengthn are organized into a n × m matrix X, which is adjusted to a matrix Y with zero column mean, i.e.yij = xij − xtj . PCA is applied on the matrix Y, and the SVD of Y is denoted by

Y =m∑

j=1

ηjujvTj , (21)

where vj is the principal component and ηj is the corresponding singular value.Lakhina found that the traffic volumes of multiple flows had a low intrinsic dimensionality, which

means that the normal traffic can effectively reside in a r-dimensional subspace with r ≪ m. An adjustedobservation yi = (yi1, . . . , yim)T can be decomposed into normal and abnormal subspaces,

yi = yi,normal + yi,anomaly (22)

whereyi,normal = PPT yi, yi,anomaly = (I − PPT )yi, (23)

with P = [v1, . . . ,vr ]. The size of the normal subspace, denoted by r, is determined by the 3σ-heuristic.The traffic observation is classified as a normal traffic if

‖yi,anomaly‖ ≤ Q, (24)

10

where Q is defined as the same as Eq.(14),

Q2 = φ1

c

√

2φ2h20

φ1+ 1 +

φ2h0(h0 − 1)

φ21

1/h0

, (25)

where

h0 = 1 − 2φ1φ3

3φ22

, (26)

φk =m∑

j=r+1

σ2kj (k = 1, 2, 3), (27)

and σj is the standard deviation of the projection of the measurements on the j-th principal component,which can be estimated as

σj =1√

n − 1‖Yvj‖ =

ηj√n − 1

. (28)

3.4.2 Error Bound for SSM

In this section, we explain why SSM can compute principal components based on the sketch matrix Z

and further detect anomalies based on them like Lakhina’s method. All proofs assume that the standardnormal distribution is used for the sketch computation. Let R be the n× l random matrix, which consistsof the random number rik from the standard normal distribution. According to the sketch computationin Eq.(6), we have

zj =1√lRT yj and Z =

1√lRT Y, (29)

where yj and zj are a column in Y and Z, respectively. The vector zj is also called the random projectionof yj , which has the following properties [20].

Lemma 1: Let R be a n× l random matrix from the standard normal distribution and zj = 1√lRTyj . We have

• E(‖zj‖2) = ‖yj‖2;

• P(|‖zj‖2 − ‖yj‖2| ≥ ε‖yj‖2) < 2e−(ε2−ε3) l

4 for ∀ε > 0.

Proof:

E(‖zj‖2) = E(l∑

k=1

z2kj)

= E

l∑

k=1

(1√l

t∑

i=t−n+1

rikyij)2

=1

l

l∑

k=1

t∑

i=t−n+1

y2ijE(r2

ik) +t−1∑

i=t−n+1

t∑

i′=i+1

2yijyi′jE(rik)E(ri′k)

=t∑

i=t−n+1

y2ij

= ‖yj‖2 (30)

Let Wk =√

l‖yj‖zkj = 1

‖yj‖∑t

i=t−n+1 rikyij , which is a standard normal variable. We define

W =l

‖yj‖2‖zj‖2 =

l∑

k=1

W 2k (31)

11

It follows the chi-square distribution χ2l . Therefore,

P (‖zj‖2 ≥ (1 + ε)‖yj‖2) = P (W ≤ (1 + ε)k)

= P (eθW ≥ e(1+ε)lθ)

≤ E(eθW )

e(1+ε)lθ(32)

using Markov’s inequality.

P (‖zj‖2 ≥ (1 + ε)‖yj‖2) ≤ Πlk=1E(eθWk)

e(1+ε)lθ

=

(

E(eθW 2

1 )

e(1+ε)θ

)l

(33)

Because W1 follows the standard normal distribution,

E(eθW 2

1 ) =1√

1 − 2θ. (34)

The above equation holds for any θ < 1/2. Thus we get,

P (W ≥ (1 + ε)l) ≤(

e−2(1+ε)θ

1 − 2θ

)k/2

(35)

The optimal choice of θ is ε/2(1 + ε). So we get,

P (W ≥ (1 + ε)l) ≤ (

(1 + ε)e−ε)k/2< e−(ε2−ε3) l

4 (36)

Similarly,

P (W ≤ (1 − ε)l) ≤ ((1 + ε)e−ε)k/2

< e−(ε2−ε3) l

4 (37)

According to the above two equations, we get P(|‖zj‖2 − ‖yj‖2| ≥ ε‖yj‖2) < 2e−(ε2−ε3) l

4

Besides the standard normal distribution, there are several probability distributions which have beenproposed for the random projection. Alon intorduced the tug-of-war algorithm [24], where the randommatrix R is generated from the probability distribution

rik =

−1 with probability 1/2+1 with probability 1/2

(38)

Later, Achlioptas [19] gave a more efficient algorithm, i.e. the sparse random projection with

rik =√

s

−1 with probability 1/2s0 with probability 1 − 1/s

+1 with probability 1/2s(39)

where s is an integer. In the sparse random projection, only 1s of the data need to be processed. Recently,

very sparse random projection has been recommended by Li et. al. [21], which uses R of entries in−1, 0, 1 with probability 1

2√

n, 1 − 1√

n, 1

2√

n. For the sparse random projection, we have the following

properties [19], [21],Lemma 2: Let R be a n× l random matrix with entries in −1, 0, 1 with probabilities 1/2s, 1 − 1/s, 1/2s

and zj = 1√lRTyj . For ∀ε > 0, we have

• E(‖zj‖2) = ‖yj‖2;

• P(|‖zj‖2 − ‖yj‖2| ≥ ε‖yj‖2) < 2e−(ε2/2−ε3/3) l

2 .

Proof: It is easy to verify thatE(‖zj‖2) = ‖yj‖2 (40)

The second part has not finished yet.

12

In the following part, we can use either convetional random projection or sparse random projection,both of which give the same result. We will not distingush two kinds of random projections. For theSVD, i.e. Z =

∑

j λjbjaTj and Y =

∑

j ηjujvTj , the singular values are approximately preserved.

Lemma 3: If l > C log nε2 for a large enough constant C and an arbitrary positive constant ε,

(1 − ε)r∑

j=1

η2j ≤

r∑

j=1

λ2j ≤ (1 + ε)

r∑

j=1

η2j (41)

for ∀r with the probability 1 − 2e−C

4log n .

Proof: Because λ21, . . . , λ

2r are the first r largest eigenvalues of the matrix ZTZ and v1, . . . ,vm are an

orthonormal set of vectors, we haver∑

j=1

λ2j ≥

r∑

j=1

vTj (ZTZ)vj

=r∑

j=1

1

lvT

j YTRRT Yvj

=r∑

j=1

1

lη2

j uTj RRT uj

=r∑

j=1

η2j ‖

1√lRTuj‖2. (42)

According to the properties of Random Projection, we have

r∑

j=1

λ2j ≥

r∑

j=1

η2j (1 − ε)‖uj‖2

= (1 − ε)r∑

j=1

η2j . (43)

We also know that

λ2j = aT

j ZTZaj

=1

laT

j YTRRT Yaj

= ‖ 1√lRT (Yaj)‖2

≤ (1 + ε)‖Yaj‖2. (44)

Because η21 , . . . , η

2r are the first r largest eigenvalues of the matrix YTY and a1, . . . ,am are an orthonormal

set of vectors, we have

r∑

j=1

η2j ≥

r∑

j=1

aTj (YTY)aj

=r∑

j=1

‖Yaj‖2

≥r∑

j=1

1

1 + ελ2

j . (45)

Therefore, we have2∑

j=1

λ2j ≤ (1 + ε)

r∑

j=1

η2j . (46)

13

Next, we want to bound the error of the covariance matrix in order to get a good approximation ofthe anomaly distance. According to the fact that principal components consists of a orthonormal set ofvectors and ‖Y‖2

F =∑m

j=1 η2j , we can easily easily get the following result.

Lemma 4: Let V = YTY and A = ZTZ. If l > C log nε2 for a large enough constant C , then with the probability

1 − 2e−C

4log n, we have

‖V − A‖F ≤√

2ε‖Y‖2F (47)

where ‖X‖F =√∑m

i=1

∑nj=1 x2

ij is the Frobenius norm of a matrix X ∈ Rn×m.

Proof: First, because a1, . . . ,am are an orthonormal set of vectors,

‖V − A‖2F =

m∑

j=1

‖(V − A)aj‖2. (48)

For each j = 1, . . . , r, we have

‖(V − A)aj‖2 = aTj VT Vaj + aT

j ATAaj

−aTj VTAaj − aT

j ATVaj

= ‖Vaj‖2 + λ4j − 2λ2

jaTj Vaj. (49)

Because

λ2j = aT

j Aaj

=1

laT

j YTRRT Yaj

= ‖ 1√lRT (Yaj)‖2. (50)


(1 − ε)‖Yaj‖2 ≤ ‖ 1√lRT (Yaj)||2 ≤ (1 + ε)‖Yaj‖2 (51)

with a high probability. According to ‖Yaj‖2 = aTj Vaj , we have

aTj Vaj ≥

1

1 + ελ2

j . (52)

Because a1, . . . ,am are an orthonormal set of vectors, we have∑m

j=1 ‖Vaj‖ =∑m

j=1 η4j . Therefore,

‖V − A‖2F =

m∑

j=1

(

η4j + λ4

j − 2λ2ja

Tj Vaj

)

≤m∑

j=1

(

η4j + λ4

j −2

1 + ελ4

j

)

. (53)

Second, v1, . . . ,vm are also an orthonormal set of vectors,

‖V − A‖2F =

m∑

j=1

‖(V − A)vj‖2. (54)

For each j = 1, . . . ,m,

‖(V − A)vj‖2 = vTj VTVvj + vT

j ATAvj

−vTj VT Avj − vjAvj

= η4j + ‖Avj‖2 − 2η2

j vTj Avj . (55)

14

We also have

vTj Av = vT

j ZTZvj =1

lvT

i YTRRT vj

= η2j ‖

1√lRTuj‖2. (56)


‖ 1√lRTuj‖2 ≥ (1 − ε)‖uj‖2 = (1 − ε). (57)

Because v1, . . . ,vm are also an orthonormal set of vectors, we have∑m

j=1 ‖Avj‖ =∑m

j=1 λ4j . Therefore,

we have

‖V − A‖2F =

m∑

j=1

(

η4j + λ4

j − 2η2j v

Tj Avj

)

≤m∑

j=1

(

η4j + λ4

j − 2(1 − ε)η4j

)

. (58)

Finally, based on Eq.(53) and Eq.(58), we get

‖V − A‖2F ≤

m∑

j=1

ε

(

η4j +

1

1 + ελ4

j

)

≤ εm∑

j=1

(

λ4j + η4

j

)

. (59)

Based on Lemma 3 and ‖Y‖2F =

∑mj=1 η2

j , we have the following result.

‖V −A‖2F ≤ ε

m∑

j=1

(

λ4j + η4

j

)

≤ ε

m∑

j=1

λ2j

2

+

m∑

j=1

η2j

2

≤ ε((1 + ε)2 + 1)

m∑

j=1

η2j

2

= 2ε(1 + ε + ε2/2)‖Y‖4F

≈ 2ε‖Y‖4F . (60)

Therefore,‖V − A‖F ≤

√2ε‖Y‖2

F

According to Eq. (47), we also get a perturbation error bound for the eigenvalues from the Mirsky’stheorem [26], √

√√√

1

m

m∑

j=1

(λ2j − η2

j )2 ≤ ‖V − A‖F ≤

√2ε‖Y‖2

F . (61)

Based on Lemma 3 and Lemma 4, we know that φk can be approximated by ϕk up to the multiplicativefactor (1 ± ǫk/2). Therefore, the threshold Q can be approximated by δ up to the multiplicative factor(1 ± ǫ1/2).

In the following part, we want to prove that dY (y) can be approximated by dZ(y). The column spaceof a matrix M is the subspace spanned by the columns, which is denoted by R(M) = Mx : x ∈ Rm.The set of all eigenvalues of the matrix M is denoted by L(M) = λ : Mx = λx,∃x 6= 0. And Θ denotes

15

the canonical angle between two subspaces, Θ(M,N ) = sin−1 Σ, where M and N are r-dimensionalsubspaces of Rm. The columns of their orthogonal bases can be transformed by a unitary matrix to

I

0

0

and

Γ

Σ

0

if 2r ≤ m, or

I 0

0 I

0 0

and

Γ 0

0 I

Σ 0

if 2r > m. (62)

The matrix permutation theorem can be written as following [27].Lemma 5: Matrix Perturbation: Let M have the spectral resolution

(

UT1

UT2

)

M (U1 U2) =

(

L1 0

0 L2

)

, (63)

where (U1,U2) is unitary with U1 ∈ Rn×r. Let B ∈ Rn×r have orthonormal columns, and for anysymmetric H of order r, let E = MB− BH. If ν = min |L(L2) − L(H)| > 0, then we have

‖ sin Θ[R(U1),R(B)]‖F ≤ ‖E‖F

ν. (64)

We apply the matrix permutation theorem to the covariance matrices A and V, and get an error boundfor the anomaly distance.

Theorem 2: If l > C log nε2 for a large enough constant C , then

|dZ(y) − dY (y)| ≤ 2√

ε

|η2r+1 − η2

r |‖Y‖2

F ‖y‖ (65)

with the probability 1 − 2e−C

4log n.

Proof: Let Q = [a1, · · · ,ar ], Qc = [ar+1, · · · ,am ], P = [v1, · · · ,vr ], and Pc = [vr+1, · · · ,vm ]. Wehave the following spectral resolutions,

(

QT

QTc

)

A (QQc) =

(

Λ1 0

0 Λ2

)

, (66)

(

PT

PTc

)

V (PPc) =

(

M1 0

0 M2

)

, (67)

where Λ1 = diag(λ21, . . . , λ

2r), Λ2 = diag(λ2

r+1, . . . , λ2m), M1 = diag(η2

1 , . . . , η2r ), and M2 = diag(η2

r+1, . . . , η2m).

Here diag(·) denotes a diagonal matrix. Let E = VQ − QΛ1. Because QΛ1 = AQ, we have

E = VQ − AQ. (68)

According to Lemma 5, we have

‖ sin Θ[R(P),R(Q)]‖F ≤ ‖E‖F

ν=

‖V −A‖F

ν(69)

where ν = |η2r+1−λ2

r| ≈ |η2r+1−η2

r |. The project matrices of R(P) and R(Q) are PPT and QQT , respectively.Then, according to Ref. [27], we have

‖PPT − QQT‖F =√

2‖ sin Θ[R(P),R(Q)]‖F

≤√

2‖V − A‖F

ν. (70)

16

Then we get

|dZ(y) − dY (y)| = |‖(I −QQT )y‖ − ‖(I − PPT )y‖|≤ ‖(I − QQT )y − (I − PPT )y‖= ‖(PPT − QQT )y‖≤ ‖PPT − QQT‖F ‖y‖

≤√

2‖V − A‖F

ν‖y‖

≤ 2√

ε

|η2r+1 − η2

r |‖Y‖2

F ‖y‖. (71)

Therefore, the anomaly distance can be approximated up to the multiplicative factor (1 ± ε1/2).

4 ADVANCED SKETCH METHOD (ASM)

Pa

cke

t Stre

am

Local Monitor Network Operation Center

Variance

Histograms

PCA

Computation

Alarm?

Ag

gre

ga

tion

Volume

Counter

Traffic

Volume

Sketch

Computing

Anomaly

Distance

Computing

Threshold

d > ?

Fig. 6. System model of advanced sketch method

If the length of the sliding window is so long that a local monitor can not hold all traffic measurementsin the memory, we need to find an alternative method to compute the sketch online. In the ASM, weutilize a variance estimation algorithm to maintain an approximation of the sketch in order to reduce thecomputation complexity and the space requirement at a local monitor. The architecture of the advancedsketch method is shown in Fig.6. We use Variance Histograms (VH) to maintain an approximation ofthe sketch for each flow, which is a modification of a variance estimation algorithm [28]. In this ASM,the Volume Counter module maintains only a bucket for each traffic flow at the current time interval.The Sketch Computation module is replaced by Variance Histograms. All other modules are the sameas the SSM. We first introduce the Variance Histograms and then give the sketch computation algorithmat each local monitor. At last, we use the same example as the SSM to show the detection procedure inthe ASM.

4.1 Algorithm

4.1.1 Variance Histograms

A Variance Histogram (VH) contains a list of buckets for each traffic flow, which are maintained bythe variance estimation algorithm in Fig. 7. The traffic volume xij at each time interval is treatedas a data element for the variance computation in this section. Given a sequence of data elementsx(t−n+1)j , . . . , xtj, the variance is defined as

Vtj =t∑

i=t−n+1

(xij − xtj)2 (72)

where xtj = 1n

∑ti=t−n+1 xij is the mean of data elements. A bucket Bpj contains the following statistics

information for a subsequence of traffic volumes xij .

• τpj : time stamp;• npj : total number of data elements in the subsequence;

17

Step1: Check the time stamp of the last bucket BNj

if τNj ≤ t − ndelete BNj ;

endifStep2: Create a new bucket B1j

τ1j = t; n1j = 1; µ1j = xtj ; V1j=0;for k = 1, . . . , l

Z1kj = xtjrtk; R1kj = rtk;endfor

Step3: Traverse the bucket list to merge bucketsp = 1; BB = B1j ;while B(p+2)j exists.

BA = B(p+1)j ∪ B(p+2)j ;if nA + nB > n/2

return

endifif nA ≤ ε

10nB and VA∪B − VB ≤ ε5VB

delete B(p+2)j ; B(p+1)j = BA;else

p = p + 1; BB = BB ∪ Bpj ;endif

endwhile

Fig. 7. Procedures for updating VH

• µpj : mean of data elements in the subsequence;• Vpj : variance of data elements in the subsequence;• Zpkj : sum of xijrik for all xij in the subsequence;• Rpkj : sum of the corresponding rik.

The algorithm starts with an empty list of buckets and updates the list of buckets with three steps asshown in Fig.7. First, when a new data element xtj comes, the current time stamp is updated to t. Wecheck the oldest bucket BNj and delete it if it is expired, where N denotes the number of buckets inthe list. Second, the new element constitutes a new bucket B1j and each old bucket Bpj becomes B(p+1)j

for p = 1, . . . , N . Last, we check whether there are qualified pairs of buckets that can be merged. LetBA = B(p+1)j ∪ B(p+2)j and BB = ∪p

q=1Bqj . We merge two adjacent buckets B(p+1)j and B(p+2)j if andonly if they satisfy the following merging rules.

• Rule 1: VA∪B − VB ≤ ε5VB .

• Rule 2: nA ≤ ε10nB .

• Rule 2: nA + nB ≤ n/2.

When two adjacent buckets Bpj and Bqj merge into a new bucket B(p∪q)j , the merged bucket’s timestamp is set to be the time stamp of the older one, and the merged bucket’s statistics information can becalculated as follow,

n(p∪q)j = npj + nqj (73)

µ(p∪q)j =npjµpj + nqjµqj

npj + nqj(74)

V(p∪q)j = Vpj + Vqj +npjnqj

npj + nqj(µpj − µqj)

2 (75)

Z(p∪q)kj = Zpkj + Zqkj (76)

R(p∪q)kj = Rpkj + Rqkj (77)

18

Header Header Header

Aggregation

(FlowID, Size)

Packet stream

(SrcIP, DstIP, Size,…)

rtk rtlrt1xtj

Uj

FlowID = j

Uj

xtjxtjxtjt 1 0

Z1kj Z1ljZ11jτ1j n1j µ1j V1j

rt1xtj rtkxtj rtlxtj

B1j B2j BNjBpj B(p+1)j B(p+2)j

R11j R1kj

BB BABAUB

R1lj

Fig. 8. Sketch computation with VH

Let Ball,j = ∪Np=1Bpj denote the bucket by merging all buckets together, and V = Vall,j be the estimated

variance. We get the following result [28].Lemma 6: Variance Histogram maintains a ε-approximate variance,

(1 − ε)V ≤ V ≤ V, (78)

with O(1ε log n) space and O(1) running time.

4.1.2 Sketch Computation Algorithm

At a local monitor, we implement a VH for each traffic flow and n pseudo random number generatorsshared by all traffic flows among local monitors. The architecture for the sketch computation at a localmonitor is shown in Fig. 8. The volume counter only uses a bucket to maintain the traffic volume at thecurrent time interval t for each traffic flow. When a time interval ends, the volume counter reports thetraffic volume xtj to the Variance Histogram V Hj . The V Hj updates its buckets as shown in Fig. 7. Ateach time interval, we can compute an approximation of the sketch as,

zkj =1√l(Zall,kj − nall,jµall,jRall,kj). (79)

where nall,j , µall,j , Zall,kj , and Rall,kj are the elements in Ball,j = ∪Np=1Bpj .

4.2 Detection Example

NOC collects the sketches zkj from local monitors, and follows the same procedure in the SSM toidentify traffic anomalies. Here, we use the same example in Section 3.2. The sketch of the measurementyATLA−CHIC in Eq.(18) is

zATLA−CHIC = (1.00 × 108, . . . ,−2.04 × 108) (80)

First, NOC organize zOrigin−Destanition into a matrix Z. Second, NOC uses the same method as the SSMto compute principle components and eigenvalues. Last, the anomaly distance and the threshold arecomputed according to Eq.(11) and Eq.(14), respectively.

δ = 1.82 × 108, dZ(y5857) = 2.11 × 108. (81)

Therefore, we get dZ(yi) > δ and thus identify i = 5857 as a time interval containing traffic anomalies,which gets the same result as the SSM.

19

xtjx(t-1)jx(t-n+1)j x(t- j+1)j

B1jB2jBNj

n

j

B(N+1)j

x(t-n)j

Fig. 9. An illustration of sketch approximation

4.3 Computation Complexity

ASM follows the same procedure as SSM to detect traffic anomalies at the NOC, which has the samecomputation complexity and the space requirement as SSM. Because the variance estimation algorithmonly needs O(1

ε log n) space and O(1) running time, we have the following theorem.Theorem 3: ASM requires O(w log n) running time and (w log2 n) space at the local monitor.

Proof: According to the variance estimation algorithm [28], VH only needs O(1) running time toupdate the variance buckets. But we also need to update the sketches Zpkj and the random numbersRpkj . Therefore, ASM needs O(l) running time to update the variance histograms. A bucket needs O(l)space to storage the statistics information and we have at most O(log n) buckets for each traffic flow. Ingeneral, the local monitor need O(w log2 n) space and O(w log n) running time, because l = O(log n).

4.4 Error Bound Analysis

The approximated sketch zkj is a sketch of a subsequence of the traffic volumes within the sliding windowas shown in Fig.9. We organize zkj into a l × m matrix Z. Then we have the following result.

Lemma 7: Let A = ZT Z. If l > C log nε2 for a large enough constant C , then

‖A −V‖F ≤ 2√

ε‖Y‖2F (82)


4log n.

Proof: The variance estimation algorithm maintain the variance V of a subsequence of the dataelements in the sliding window of the size n, which have the property,

(1 − ε)V < V < V (83)

according to Ref. [28]. Let

yj = (

n︷︸︸︷

0, . . . , 0, y(t−∆j+1)j , . . . , ytj︸︷︷︸

∆j

)T . (84)

Because V =∑t

i=t−n+1(xij − xtj)2 = ‖yj‖2 and V =

∑ti=t−∆j+1(xij − xtj)

2 = ‖yj‖2, then we have

‖yj − yj‖2 = |V − V | < εV = ε‖yj‖2. (85)

According to Eq.(79), we have

zkj =1√l

t∑

i=t−∆j+1

(xij − xtj)rik

=1√lrkyj. (86)

where rk = (r(t−n+1)k, . . . , rtk). Therefore,

zj − zj =1√lR(yj − y). (87)

where zj and zj are the j-th column in Z and Z, respectively. We apply the properties of the RandomProjection on the vector zj − zj ,

‖zj − z‖2 ≤ (1 + ε)‖yj − y‖2. (88)

20

Based on Eq.(85) and Eq.(88), we have

‖zj − zj‖2 ≤ ε‖yj‖2. (89)

‖Z − Z‖2F < ε‖Y‖2

F . (90)

Because A = ZTZ, we have

‖A − A‖2F = ‖ZT Z − ZTZ‖2

F

= ‖ZT Z − ZTZ + ZTZ − ZTZ‖2F

≤ ‖ZT Z − ZTZ‖2F + ‖ZT Z − ZTZ‖2

F

≤ 2ε‖Y‖4F . (91)

According to Lemma 4, we have

‖A− V‖2F = ‖A − A + A− V‖2

F

≤ ‖A − A‖2F + ‖A − V‖2

F

≤ 4ε‖Y‖4F (92)

Using Lemma 7, we can bound the threshold δ in ASM. The anomaly distance dZ(yi) can be boundedby similar method in Theorem 2.

Theorem 4: If l > C log nε2 for a large enough constant C , then

|dZ(y) − dY (y)| ≤ 2√

2ε

|η2r+1 − η2

r |‖Y‖2

F ‖y‖ (93)


4log n.

Proof: Let Z =∑

j λjbjaTj , and we have Q = [ a1, · · · , ar ], Qc = [ ar+1, · · · , am ]. The spectral resolution

of the matrix A can be written as,(

QT

QTc

)

A(

Q Qc

)

=

(

Λ1 0

0 Λ2

)

, (94)

where Λ1 = diag(λ21, . . . , λ

2r) and Λ2 = diag(λ2

r+1, . . . , λ2m).

Let E = VQ − QΛ1 where V = YTY as before. Because QΛ1 = AQ, we have

E = VQ − AQ. (95)

According to Lemma 5 (Matrix Permutation Theorem), we have

‖ sin Θ[R(P),R(Q)]‖F ≤ ‖E‖F

ν=

‖V − A‖F

ν(96)

where ν = |η2r+1−λ2

r| ≈ |η2r+1−η2

r |. The project matrices of R(P) and R(Q) are PPT and QQT , respectively.Then, according to Ref. [27], we have

‖PPT − QQT ‖F =√

2‖ sin Θ[R(P),R(Q)]‖F

≤√

2‖V − A‖F

ν. (97)

21

Then we get

|dZ(y) − dY (y)| = |‖(I − QQT )y‖ − ‖(I − PPT )y‖|≤ ‖(I − QQT )y − (I − PPT )y‖= ‖(PPT − QQT )y‖≤ ‖PPT − QQT‖F ‖y‖

≤√

2‖V − A‖F

ν‖y‖

≤ 2√

2ε

|η2r+1 − η2

r |‖Y‖2

F ‖y‖. (98)

5 DISCUSSIONS

We can bound the error of the estimated threshold and the anomaly distance in terms of ηj and Y.Therefore, our algorithms are an approximation of Lakhina’s method, and the accuracy of our algorithmsdepends on the properties of the covariance matrix V of the traffic measurements. In fact, the PCA-baseddetection method can have a high false alarm rate when all the eigenvalues of V are close to each other.Fortunately, Lakhina observed that the above situations were rare in the traffic volume measurements[6]. There are only a few eigenvalues which are much larger than 0, and the other are close to 0. In thissituation, the values of the threshold and the anomaly distance only have a small fluctuation in bothSSM and ASM. Therefore, our algorithms can detect the traffic anomalies with a low false negative rate.

Random projection method has be proposed for privacy preserving distributed data mining [29]. IfNOC doesn’t know the sequence of the random number rik for the sketch computation, there is no wayto guess the traffic volume series. Although NOC may know rik, NOC doesn’t know the length dueto the VH algorithm. Also, local monitors can introduce an additional permutation ǫ′ into the randomnumbers, e.g. r′ik = rik + ǫ′wik where wik is an independent random number from the standard normaldistribution. Then NOC needs to solve a set of linear equations,

z′j = Ryj (99)

where z′j = R′yj and R′ = R + ǫ′W (W is a matrix with entries wik). The local monitor can specify theparameter ǫ′ in order to increase the estimation error of the traffic volume yij .

For the communication cost, supposing the NOC set the period of updating traffic measurements asT , Lakhina’s method needs to send a vector of measurements of the length T , and our algorithm sendsa vector of the length l. Because l is regardless of T , our algorithm can reduce the communication costby l/T if T > l. If the NOC has enough bandwidth, the sketch computation can also be done at the NOCside. Then our algorithm can use the same communication cost as Lakhina’s method.

6 EXPERIMENTAL EVALUATION

In the evaluation, we want to determine the size of the sketches which is enough to get a good approx-imation of Lakhina’s method, because a smaller size of the sketches means that our algorithms requireless computation resource. Our algorithm is very useful when the time length of the traffic measurementsis so long that the local monitors don’t have enough space to save them. Thus we only implement theASM in the following evaluation.

Abilene Observatory Data Collections [15] are used as the data set to evaluate the performance ofour algorithm. Abilene is the Internet2 backbone network, which spans the continental USA. We use thedata collected by Juniper’s J-Flow tool between 06/09/2008 and 06/29/2008. We use both BGP and ISISrouting information to aggregate packets into OD flows.

When a packet arrives, we first update a temporary list which saves the total traffic volume of eachOD flow in every five-minutes interval. In this way, we can construct the traffic volume time series.We first apply Lakhina’s method to detect anomalies, and then use these detected anomalies as the real

22

anomalies to evaluate the detection accuracy of our method. We compute both Type I errors and TypeII errors with different size of the sketches.

Type I =number of false anomalies

total number of true normal observations, Type II =

number of false normal observations

total number of true anomalies(100)

0 200 400 600 800 10000

0.2

0.4

0.6

0.8

1

Length of the sketch

Err

ors

Type I error, n = 10 days

Type II error, n = 10 days

Type I error, n = 14 days

Type II error, n = 14 days

Fig. 10. Detection Errors in Abilene Data

We check the eigenvalues of the measurement matrix Y, and choose the size of the normal subspaceas r = 6 which is proper for our data. We check each observation just after the sliding window, andshow Type I errors and Type II errors of the last week in Fig.10. We find that both type I errors andtype II errors decrease quickly at the beginning and then reach a nearly optimal value. If the size l ofthe sketch is more than 100, there is no remarkable decrease in the mean of errors and only the varianceof the errors becomes smaller.

Because the number of true anomalies is much less than the number of true normal observations, theoptimal type II error is higher than the optimal type I error. Because there is always some randomnessin the data and the length of the sketch should be less than the length of the sliding window, we cannotreduce type I and type II errors to zero in practice.

7 CONCLUSION AND FUTURE WORK

In this paper, we study the network-wide traffic anomaly detection problem. Our algorithm archivesO(w log n) running time and O(w log2 n) space at local monitors. The NOC could run PCA-based detectionmethod with O(m2 log n) running time and O(m log n) space. Our algorithm also make the ISPs be ableto implement the detection method by paying careful consideration about the privacy protection, thecommunication cost, and other resources over a distributed computing environment. In the future, wewill study the detection of traffic anomalies by using various traffic features like the communicationpatterns.

ACKNOWLEDGMENT

This project has benefited from the use of measurement data collected on the Internet2 network as partof the Internet2 Observatory Project.

REFERENCES

[1] K. Park and W. Willinger, Self-Similar Network Traffic and Performance Evaluation. New York: John Wiley & Sons, Inc., 2000.[2] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” SIGCOMM Comput. Commun. Rev.,

vol. 34, no. 4, pp. 219–230, 2004.[3] H. Ringberg, A. Soule, J. Rexford, and C. Diot, “Sensitivity of pca for traffic anomaly detection,” SIGMETRICS ’07, pp.

109–120, 2007.

23

[4] Y. Huang, N. Feamster, A. Lakhina, and J. J. Xu, “Diagnosing network disruptions with network-wide analysis,”SIGMETRICS ’07, pp. 61–72, 2007.

[5] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” SIGCOMM ’05, pp. 217–228,2005.

[6] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft, “Structural analysis of network trafficflows,” SIGMETRICS ’04/Performance ’04, pp. 61–72, 2004.

[7] X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaccone, and A. Lakhina, “Detection and identification of networkanomalies using sketch subspaces,” IMC ’06, pp. 147–152, 2006.

[8] P. Barford, J. Kline, D. Plonka, and A. Ron, “A signal analysis of network traffic anomalies,” IMW ’02, pp. 71–82, 2002.[9] L. Huang, X. L. Nguyen, M. Garofalakis, J. Hellerstein, M. Jordan, A. Joseph, and N. Taft, “Communication-efficient online

detection of network-wide anomalies,” INFOCOM ’07, pp. 134–142, 2007.[10] Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan, “Network anomography,” IMC ’05, pp. 30–30, 2005.[11] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, W. Hong, and G. Iannaccone, “Mind: A distributed multi-dimensional

indexing system for network diagnosis,” INFORCOM ’06, pp. 1–12, 2006.[12] T. Ahmed, M. Coates, and A. Lakhina, “Multivariate online anomaly detection using kernel recursive least squares,”

INFOCOM ’07, pp. 625–633, 2007.[13] P. Chhabra, C. Scott, E. Kolaczyk, and M. Crovella, “Distributed spatial anomaly detection,” INFOCOM ’08, pp. 1705–1713,

2008.[14] J. Kline, S. Nam, P. Barford, D. Plonka, and A. Ron, “Traffic anomaly detection at fine time scales with bayes nets,” ICIMP

’08, pp. 37–46, 2008.[15] “Abilene observatory data collections,” www.internet2.edu/observatory/.[16] “Caida,” www.caida.org.[17] W. LeFebvre, “Cnn.com: Facing a world crisis,” http://www.tcsa.org/lisa2001/cnn.txt, 2001.[18] P. Barford and V. Yegneswaran, “An inside look at botnets,” Advances in Information Security, vol. 27, pp. 171–191, 2007.[19] D. Achlioptas, “Database-friendly random projections: Johnson-lindenstrauss with binary coins,” Journal of computer and

System Sciences, vol. 66, no. 4, pp. 671–687, 2003.[20] S. S. Vempala, The Random Projection Method. Rhode Island: American Mathematical Society, 2004.[21] P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” KDD’ 06, pp. 287–296, 2006.[22] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-based change detection: methods, evaluation, and applications,”

IMC ’03, pp. 234–247, 2003.[23] C. C. Aggarwal, Data Streams: Models and Algorithms. New York: Springer, 2007.[24] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy, “Tracking join and self-join sizes in limited storage,” PODS ’99, pp.

10–20, 1999.[25] J. E. Jackson and G. S. Mudholkar, “Control procedures for residuals associated with principal component analysis,”

Thechnometrics, pp. 341–349, 1979.[26] R. Bhatia, Perturbation Bounds for Matrix Eigenvalues. Philadelphia: SIAM, 2007.[27] G. Stewart and J. guang Sun, Matrix perturbation theory. Boston: Academic Press, 1990.[28] L. Zhang and Y. Guan, “Variance estimation over sliding windows,” PODS ’07, pp. 225–232, 2007.[29] K. Liu and J. Ryan, “Random projection-based multiplicative data perturbation for privacy preserving distributed data

mining,” IEEE Trans. on Knowl. and Data Eng., vol. 18, no. 1, pp. 92–106, 2006.

sketch-based network-wide trafï¬c anomaly detection

Documents