

Temporal Structure Learning for Clustering Massive Data Streams in Real-Time∗

Michael Hahsler†  Margaret H. Dunham‡

Abstract

This paper describes one of the first attempts to model the temporal structure of massive data streams in real-time using data stream clustering. Recently, many data stream clustering algorithms have been developed which efficiently find a partition of the data points in a data stream. However, these algorithms disregard the information represented by the temporal order of the data points in the stream, which for many applications is an important part of the data stream. In this paper we propose a new framework called Temporal Relationships Among Clusters for Data Streams (TRACDS) which allows us to learn the temporal structure while clustering a data stream. We identify, organize and describe the clustering operations which are used by state-of-the-art data stream clustering algorithms. Then we show that by defining a set of new operations to dynamically transform Markov Chains with states representing clusters, we can efficiently capture temporal ordering information. This framework allows us to preserve temporal relationships among clusters for any state-of-the-art data stream clustering algorithm with only minimal overhead.

To investigate the usefulness of TRACDS, we evaluate the improvement of TRACDS over pure data stream clustering for anomaly detection using several synthetic and real-world data sets. The experiments show that TRACDS is able to considerably improve the results even if we introduce a high rate of incorrect time stamps, which is typical of real-world data streams.

Keywords: Data stream, clustering, temporal structure, Markov chain

1 Problem Specification

Algorithms for clustering data streams [4, 5, 8, 13, 23, 24, 26–28] have focused on many characteristics of stream data (e.g., limited storage but a potentially unbounded data stream, a single pass over the data, near real-time processing, concept drift), but the fact that data arrives in a temporal order, which is perhaps one of the most important aspects of a data stream, is typically not directly used.

∗ To appear in the SIAM Conference on Data Mining 2011 (SDM11).
† Southern Methodist University, [email protected]
‡ Southern Methodist University, [email protected]


Figure 1: Stream Clustering: (a) Partitioning of a data stream using standard (data stream) clustering neglects the temporal aspect of the data. (b) With TRACDS, temporal relationships between clusters are learned dynamically as an evolving Markov chain (transitions between clusters are represented by arcs).

Figure 1(a) illustrates what happens during data stream clustering. The data stream is represented by the ordered sequence of shapes in the upper half of the figure. Different shapes represent events that are similar enough to be put in the same cluster. In the left lower half, the larger symbols represent the clusters, annotated with the number of events assigned to each cluster. Since the data volume produced by a data stream is typically unbounded, it is infeasible to store each event assigned to a cluster. Rather, cluster-wide summaries called cluster features or synopses that include descriptive statistics for a cluster (mean, variance, etc.) are stored. During this summarization, any timestamp available for events is either lost or treated as if it were any other attribute. For example, when in the stream in Figure 1 a Hexagon event occurs, the next event is a Circle event 4/5 = 80% of the time (ignoring the last event); a Circle is followed by a Hexagon 100% of the time; and a Triangle always follows a Triangle. Although the temporal order of events is one of the salient concepts of stream data, this ordering relationship of the clusters is lost completely after clustering.

One of the major applications using data stream clustering is rare event (anomaly) detection. When temporal order information is discarded during cluster formation, we might sacrifice important indicators to detect or predict rare events (e.g., intrusion in a computer network, or flooding, heat waves, hurricanes, and eruptions of volcanoes based on climate data). For example, for intrusion detection in a large computer network we may observe a user change from behavior type A to behavior type B, both represented by clusters labeled as non-suspicious behavior. However, the transition from A to B might be extremely unusual and itself indicate an intrusion event. This can only be detected if the temporal structure of the data is preserved in the clustering.

We argue that temporal and ordering aspects should be considered an integral part of clustering events in data streams for many applications. However, we are not aware of research that combines clustering with preserving the temporal structure in the data stream while meeting the requirements for data stream processing (a single pass over the data, storing only synopses for clusters, etc.).

In this paper we present the TRACDS framework, which provides a transparent way to add temporal ordering information, in the form of a dynamically changing Markov Chain, to data stream clustering algorithms (see Figure 1(b)). The framework generalizes the ideas developed by Dunham et al. for the Extensible Markov Model (EMM) [11] to data stream clustering. In this paper we identify a set of clustering operations which is sufficient to describe state-of-the-art data stream clustering algorithms, and we develop a complete set of TRACDS operations to efficiently manipulate Markov Chains to learn temporal information while clustering. This clean separation between clustering and TRACDS operations makes the framework directly applicable to any state-of-the-art data stream clustering algorithm. We will show that the framework can be implemented to add only minimal overhead to the data stream clustering algorithm, and thus it is suitable to handle massive data streams.

While data stream clustering and Markov chains are well-known techniques, this paper provides two novel contributions.

1. This paper is one of the first to attempt to model the temporal structure of massive data streams in real-time.

2. We combine data stream clustering and Markov chains which are dynamically changed by a set of operations in a new and efficient way.

This paper is organized as follows. We start with related work in Section 2. In Section 3 we formally introduce the TRACDS framework. In Section 4 we use several synthetic and real-world data sets to analyze the improvement of TRACDS over pure clustering for anomaly detection. Finally, Section 5 discusses directions for future work.

2 Related Work

2.1 Data Stream Clustering. Clustering, the grouping of similar objects, has been around since the beginning of human existence. It was not until the 1800s, however, that formal algorithms were developed. Excellent surveys of clustering techniques have been published that summarize these developments (e.g., [14]). Traditional clustering is often viewed as a partitioning or segmentation of a static data set where the order of observation has no relevance. Dynamic clustering techniques were developed to incrementally cluster data and take into account the temporal nature of data by specifically looking at how the clusters change as data arrives [15].

With the advent of the streaming data concept, many clustering researchers began to adapt clustering techniques to streaming data. The salient features of data streams are that data continues to arrive and that it is impractical to keep all of the data. Data stream management techniques look at various strategies (such as time windows and snapshots) to handle unbounded streams. Clustering techniques must be incremental and usually consider some sort of dynamic updating of the clusters. Barbara [6] identified the requirements for clustering data streams to include compactness, "fast, incremental processing," and identification of outliers.

An early algorithm called STREAM [13] partitions the data stream into segments, clusters each segment individually by solving the k-medians problem, and then iteratively reclusters the resulting centers to obtain a final clustering. Since data streams can potentially grow unbounded, data stream clustering algorithms do not store all data points but rather use a summary or synopsis consisting of aggregate statistics to store information about each cluster. The idea was introduced in the non-data stream clustering algorithm BIRCH [30], where the summaries were called cluster feature vectors. The data stream clustering algorithm CluStream [4] introduces micro-clusters represented by summary statistics. Micro-clusters are handled by the algorithm online. However, to create a final clustering, micro-clusters have to be merged based on their summary statistics, resulting in a simpler clustering. As this reclustering is performed offline, it can be done with any regular clustering algorithm.

An important feature of data streams is that their structure may change over time. Most of the following clustering algorithms handle change by using an exponential fading model to reduce the weight of older data points. Exponential fading is used since it can easily be applied directly to most summary statistics. DenStream [8] maintains micro-clusters in real time and uses a variant of the density-based GDBSCAN [21] to produce a final clustering for users. HPStream [5] finds clusters that are well defined in different subsets of the dimensions of the data. WSTREAM [23] uses kernel density estimation to find rectangular windows to represent clusters. The windows can move, contract, expand and be merged over time. E-Stream [27] adds cluster splitting by maintaining histograms for each cluster and dimension. While the popular density-based clustering method OPTICS [18] is not suitable for data streams, a variant called OpticsStream [24] can be used to visualize the clustering structure and structure changes in data streams. Recently, the density-based data stream clustering algorithms D-Stream [26] and MR-Stream [28] were developed. D-Stream uses an online component to map each data point into a predefined grid and then uses an offline component to cluster the grid based on density. MR-Stream facilitates the discovery of clusters at multiple resolutions by using a grid of cells that can dynamically be subdivided into more cells using a tree data structure. Recently, data stream clustering algorithms for massive data [2] and uncertain data [3] have also been introduced.

All these approaches center on finding clusters of data points given some notion of similarity, but they neglect the temporal ordering structure of the data, which might be crucial to understanding the underlying data.

2.2 Incorporating Temporal Order. There have been several research efforts to incorporate temporal order information into clustering. We briefly review some of these concepts prior to introducing what we propose for TRACDS. The term temporal clustering is generally used to mean applying clustering to time series data [29]. Typically, either several time series are clustered to find sets of similar series, or subsequences are clustered to find similar parts in time series [16]. Since we are interested in clustering individual data points while preserving the temporal order, neither of these approaches is applicable. The temporal structure between data points in time series is typically modeled by auto-regressive models (e.g., ARIMA), which is more difficult for multivariate data [25] and typically does not honor the restrictions for data streams (e.g., a single pass over the data). We are not aware of research which combines auto-regressive models with clustering massive multivariate data streams while preserving temporal structure efficiently.

Evolutionary clustering [9] considers the problem of clustering data over time with the goal to trade off the two potentially conflicting criteria of representing the current data as faithfully as possible by the clustering while preventing dramatic shifts in the clustering from one timestep to the next. Similarly, the MONIC framework [22] uses the term cluster transitions to refer to the fact that a cluster may change over time (e.g., change its density, move, or merge with another cluster). While these approaches deal with the fact that there is a need to detect and evaluate these changes, they still largely ignore the order information inherent in the data stream.

C-TREND [1] captures the temporal ordering concept among clusters, with transitions between clusters obtained based on predefined temporal partitions. It tries to tackle a problem similar to what we address in this paper; however, it suffers from many restrictions. In particular, C-TREND is not suited for data streams since a fixed number of partitions must be created before classical clustering algorithms, which need access to all data, are applied. Also, only transitions between partitions in time are identified, and all transitions between clusters within each partition are ignored.

In the following we introduce TRACDS, a framework which is able to efficiently capture temporal order information for data stream clustering.

3 Introduction to TRACDS

Although TRACDS does not implement data stream clustering itself, we have to introduce the clustering concepts before moving on to our framework. Clustering is typically thought of in terms of partitioning a data set consisting of observations into several (typically k, a predefined number) groups of similar observations, where observations in different groups are less similar. Formally, a clustering can be defined as:

Definition 3.1. (Clustering) A clustering ζ is a partitioning of a data set D into k subsets C1, C2, ..., Ck called clusters such that

(1) Ci ∩ Cj = ∅ for all i ≠ j,

(2) ⋃_{i=1}^{k} Ci ⊆ D, and

(3) the value of a specified cost function fc(ζ) is minimized (typically by a heuristic).

The requirement that clusters do not share data points means that we deal with crisp and not soft partitions, where a data point could be assigned to several partitions, typically with varying degrees of membership. In the above definition we do not require that all data points in D are assigned to a cluster. Some data points might be labeled as outliers and stay unassigned.

Data stream clustering algorithms take into account that for many applications data is arriving continuously and that the number of clusters may not be known in advance. To deal with the fact that data streams may produce more data than is practical to store, data stream clustering algorithms work with synopses for clusters (sometimes called clustering features or CFs) instead of keeping all the data points.

Definition 3.2. (Data Stream Clustering) At each point in time t a data stream clustering ζt is a partitioning, as defined in Definition 3.1, of Dt, the data seen thus far, into k components. However, instead of all data points assigned to clusters C1, C2, ..., Ck, only synopses ~c1, ~c2, ..., ~ck are stored, and k is allowed to change over time. The synopses ~ci with i = 1, 2, ..., k contain summary information about the size, distribution and location of the data points in Ci.
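To make such a synopsis concrete, here is a minimal sketch in the style of BIRCH's cluster feature vectors (count, linear sum, sum of squares); the class name and fields are our own illustration, not a specific algorithm's data structure. Additivity makes merging two synopses trivial:

    import numpy as np

    class ClusterFeature:
        """Synopsis ~c = (n, LS, SS): count, linear sum, sum of squares.
        Mean and variance are recoverable, and two synopses merge by
        simply adding their fields (the additivity property)."""

        def __init__(self, d):
            self.n = 0
            self.LS = np.zeros(d)   # per-dimension linear sum
            self.SS = np.zeros(d)   # per-dimension sum of squares

        def add(self, x):           # absorb one data point (length-d array)
            self.n += 1
            self.LS += x
            self.SS += x ** 2

        def merge(self, other):     # combine two clusters' synopses
            self.n += other.n
            self.LS += other.LS
            self.SS += other.SS

        def center(self):
            return self.LS / self.n

        def variance(self):
            return self.SS / self.n - self.center() ** 2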

Still, data stream clustering algorithms only partition the data, and temporal aspects (e.g., order or timestamps) are not preserved in the clustering (with the exception that old data might be removed or down-weighted to reflect concept drift). We propose to model the temporal structure between clusters as an evolving Markov Chain (MC), which at each point in time represents a regular time-homogeneous MC but which is updated using a set of well defined operations when new data is available. In the following we restrict the discussion to first order evolving MCs. First order MCs work well as an approximation for many applications; however, as for regular MCs, it is possible to extend the idea to higher order models [17].

A (first order) discrete parameter Markov Chain [19] is a special case of a Markov Process in discrete time and with a discrete state space. It is characterized by a sequence of random variables {Xt} = ⟨X1, X2, X3, ...⟩ with t being the time index. All random variables share the same domain dom(Xt) = S = {s1, s2, ..., sk}, a set called the state space. The Markov property states that for a first order model the next state is only dependent on the current state. Formally,

P(Xt+1 = st+1 | Xt = st, ..., X1 = s1) = P(Xt+1 = st+1 | Xt = st),

where s1, ..., st, st+1 ∈ S.

For simplicity we use, where appropriate, the notation aij = P(Xt+1 = sj | Xt = si), i, j = 1, 2, ..., k, for the transition probabilities. A time-homogeneous MC can be represented as a k×k transition matrix A = (aij) containing the transition probabilities from each state to all other states. Another representation is a graph with the states as vertices and the arcs labeled with transition probabilities. Transition probabilities can easily be estimated from the observed transition counts cij (transitions from state si to sj) using the maximum likelihood method by aij = cij/ni, where ni = Σ_{j=1}^{k} cij.

MCs are very useful to keep track of temporal information, using the Markov property as a relaxation. With a MC it is easy to predict the probability of future states or to predict missing values based on the temporal structure of the data. It is also easy to calculate the probability of a new sequence of length l given a MC as the product of transition probabilities:

P(Xl = sl, Xl−1 = sl−1, ..., X1 = s1) = P(X1 = s1) ∏_{i=1}^{l−1} P(Xi+1 = si+1 | Xi = si)
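Both estimation and sequence scoring are a few lines of code; the following is a minimal sketch (function names are ours) of the maximum likelihood estimate aij = cij/ni and the product formula above:

    import numpy as np

    def transition_matrix(C):
        """MLE of A = (a_ij) from transition counts: a_ij = c_ij / n_i,
        with n_i the i-th row sum; rows without transitions stay zero."""
        n = C.sum(axis=1, keepdims=True)
        return np.divide(C, n, out=np.zeros_like(C, dtype=float), where=n > 0)

    def sequence_probability(A, states, p_start):
        """P(X_1 = s_1) times the product of A[s_i, s_{i+1}] along the sequence."""
        p = p_start[states[0]]
        for i, j in zip(states, states[1:]):
            p *= A[i, j]
        return p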

Data streams typically contain dimensions with continuous data and/or discrete dimensions with a large number of domain values [2]. In addition, the data may continue to arrive, resulting in a large number of observations. For the MC we use in TRACDS, data points have to be mapped onto a manageable number of states. This mapping is already done by the data stream clustering algorithm in use: it assigns each point to a cluster (or micro-cluster), which is represented by a state in the MC.

Definition 3.3. (TRACDS) TRACDS is defined at each point in time t as a tuple T = (S, C, sc) (we omit subscript t for readability), where S is the set of k states, C is the k×k transition count matrix specifying the MC over the states in S, and sc ∈ S keeps track of the current state, i.e., the state to which the last observation was assigned. Given a data stream clustering ζ, TRACDS has the following properties:

(1) At each point in time t there is a one-to-one correspondence between the clusters in ζ and the states S.

(2) The transition probability aij estimated from the transition counts in C represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.

(3) T is created online in parallel to the data stream clustering ζ.

(4) An empty clustering with no data points is represented by an empty C with S = ∅, and we define sc to be ε to indicate that there is no current state.

Page 5: Temporal Structure Learning for Clustering Massive Data ... · zSouthern Methodist University, mhd@lyle.smu.edu Figure 1: Stream Clustering: (a) Partitioning of a data stream using

In order to satisfy the properties in Definition 3.3, T has to evolve over time to reflect all changes to the clustering. Note that since k in the clustering can change over time, the number of states in T also has to adapt. In the following we identify all operations that data stream clustering algorithms perform and present a set of well defined operations, called TRACDS operations, which are used to update T accordingly.

3.1 Data Stream Clustering Operations. The MONIC framework [22] deals with the evolution of clusters over time. It is not suitable for data streams as it is not online and does not support relationships between clusters at a given point in time. However, it presents a technique which is independent of the used clustering algorithm and in that sense is similar to TRACDS. MONIC identifies so-called external and internal cluster transitions reflecting the change from a clustering at time point t to a later clustering at t + 1 (e.g., cluster survives, cluster is absorbed, cluster moves). Although these cluster transitions reflect the changes of clusters between two points in time, they are a good starting point to identify the typical building blocks (which we call clustering operations) of data stream clustering algorithms. Such building blocks are, for example, adding a new incoming data point to an existing cluster or creating a new cluster. Formally, a clustering operation is defined as:

Definition 3.4. (Clustering Operation) A clustering operation is defined as a function

ζt+1 = q(ζt, x),

which is used by the data stream clustering algorithm to update the clustering given additional information x (a new data point, the index of the cluster to be deleted, etc.).

Any data stream clustering algorithm can be described as a sequence of such clustering operations that are triggered by the specifics of the clustering algorithm itself. We identify the necessary clustering operations for state-of-the-art data stream clustering algorithms in Table 1. In the following we define these operations. Some clustering operations are triggered by the data stream clustering algorithm when a new data point arrives from the data stream. The two typical operations are:

• qassign(ζ, x): Assign the new data point x to an existing cluster. The clustering algorithm uses the cluster summaries in ζ to find the appropriate cluster i and then updates ~ci.

• qcreate(ζ, x): Create a new cluster. At some point (e.g., if assigning a new data point to an existing cluster is not appropriate) a new empty cluster represented by ~ck+1 is added to ζ and k is incremented by one.

Several operations can be triggered by other events, for example, by a clean-up process which is scheduled at regular intervals, or by the clustering algorithm when it runs out of memory. These include:

• qremove(ζ, x): Remove a cluster. Here x is i, the index of the cluster to be removed. In this case the associated summary ~ci is removed from ζ and k is decremented by one.

• qmerge(ζ, x): Merge two clusters. Here input x contains i and j, the indices of the two clusters to be merged. First a new merged cluster is created by combining the two summaries ~ci and ~cj in an appropriate way. Then the two original clusters are removed.

• qfade(ζ, x): Fade the cluster structure. This adapts the cluster structure over time by reducing the influence of older data points. The input x is empty in this operation as it is a clustering-wide function. Since only a summary and not the original data points are available, fading has to be done on this summary information by updating each summary ~ci for i = 1, 2, ..., k. Typically (see [5, 8, 23, 24]) a decay function f(t) = 2^(−λt) is used to specify the weight of data points added t timesteps in the past. This fading can be done iteratively on summary statistics if they exhibit the properties of additivity and temporal multiplicity defined in [5].

• qsplit(ζ, x): Split a cluster. Given i, an input cluster index, two new clusters j and l with appropriate summaries ~cj and ~cl are created. Subsequently, cluster i is removed.

Note that the actions of merging and fading imply that the clustering summaries themselves should be additive so that they can be combined during the merge operation. This was a salient feature of the clustering features proposed in BIRCH. If not additive, then some known function must exist to combine them. Also notice that a reclustering operation, which is used in many stream clustering algorithms, is accomplished by a series of merges. The split operation is currently only supported by E-Stream [27], to split a large cluster up into several smaller and denser clusters.

The exact procedure of how and why operations like deleting and merging are executed is defined by the data stream clustering algorithm in use.

Page 6: Temporal Structure Learning for Clustering Massive Data ... · zSouthern Methodist University, mhd@lyle.smu.edu Figure 1: Stream Clustering: (a) Partitioning of a data stream using

Operation  CluStream  HPStream  DenStream  WSTREAM  OpticsStream  E-Stream  D-Stream  MR-Stream
qassign    x          x         x          x        x             x         x         x
qcreate    x          x         x          x        x             x         x         x
qremove    x          x         x          x        x             x         x         x
qmerge     offline              offline    x        offline       x         offline   offline
qfade                 x         x          x        x             x         x         x
qsplit                                                            x

Table 1: Clustering operations used by data stream clustering algorithms (marked with x). "Offline" in row qmerge indicates that merging is only used as an offline reclustering step.

For TRACDS it is only important that we can define operations that keep T consistent with the clustering.

3.2 TRACDS Operations. As indicated earlier, TRACDS operations are triggered by the stream clustering operations stated above. Each clustering operation triggers a unique TRACDS operation which updates T. As x is used to indicate the input to the stream clustering operations, we use y to indicate the input to the TRACDS operations. y is uniquely determined by the clustering operation.

Definition 3.5. (TRACDS Operation) A TRACDS operation is a function

Tt+1 = r(Tt, y),

which is used to update the TRACDS data structure using information y provided by a clustering operation (defined above).

In the following we define, for each clustering operation q, a unique TRACDS operation r. We use the same subscript to identify which TRACDS operation corresponds to which clustering operation.

• rassign(T, y): Record a state transition. y identifies si, the state corresponding to the cluster the new data point was assigned to in ζ. If the current state is known (sc ≠ ε), update C by setting csc,si ← csc,si + 1. Finally, we set the current state to the new state, sc ← si.

• rcreate(T, y): Create a new state. y is empty in this case. To represent the new cluster, we add a state sk+1 to T by S ← S ∪ {sk+1}. We enlarge C by a row and a column for this state, and finally, we set the current state to the newly added state, sc ← sk+1.

• rremove(T, y): Remove a state. To remove the state si identified in y, let S ← S \ {si} and remove row i and column i of the transition count matrix C. This deletes the state. If sc is si, set the current state to ε (no current state).

• rmerge(T, y): Merge two states. To merge two states si and sj, given in y, into a new state sm, the following steps are performed:

1. Create the new state sm (see rcreate above).

2. Create the outgoing and incoming arcs for sm: for all l ∈ {1, 2, ..., k} let cml ← cil + cjl and clm ← cli + clj.

3. Delete the old states si and sj (see rremove above).

4. If sc is either si or sj, then set the current state to the new state, sc ← sm.

• rfade(T, y): Fade the transition probabilities which represent the cluster structure. y is empty in this case. The fading strategy used on the cluster synopses by the data stream clustering algorithm must also be used on the transition count matrix C, resulting in a fading effect consistent with the clustering. Clustering algorithms typically use exponential decay f(t) = 2^(−λt), which is multiplicative, so applying

Ct+1 = 2^(−λ) Ct

repeatedly results in the desired compounded fading effect.

• rsplit(T, y): Split states. y contains si, where i is the index of the cluster to be split. As with fading, the splitting strategy used must be consistent with the one implemented by the clustering algorithm. Since only synopsis information is available for the clustering as well as for the transition counts, only a heuristic can be used here (e.g., assign the transition counts proportionally to the number of observations assigned to each new cluster).
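Taken together, these operations translate almost directly into code. The following is a minimal Python sketch of T = (S, C, sc) covering r_assign, r_create, r_remove and r_fade (class and method names are ours, not the rEMM API). One deliberate simplification: r_create here leaves the current state untouched, so that the subsequent r_assign records the transition from the previous cluster into the newly created one, matching Example 1 (Table 2) below.

    import numpy as np

    class TRACDS:
        """Sketch of T = (S, C, sc): the states are cluster indices
        0..k-1, C is the k x k transition count matrix, and cur is
        the current state (None plays the role of epsilon)."""

        def __init__(self):
            self.C = np.zeros((0, 0))
            self.cur = None                  # epsilon: no current state yet

        def r_create(self):
            """Add a state for a newly created cluster; return its index."""
            k = self.C.shape[0]
            self.C = np.pad(self.C, ((0, 1), (0, 1)))  # grow by one row/column
            return k

        def r_assign(self, i):
            """Record the transition from the current state to state i."""
            if self.cur is not None:
                self.C[self.cur, i] += 1
            self.cur = i

        def r_remove(self, i):
            """Delete state i together with its row and column of counts."""
            self.C = np.delete(np.delete(self.C, i, axis=0), i, axis=1)
            if self.cur == i:
                self.cur = None              # back to epsilon
            elif self.cur is not None and self.cur > i:
                self.cur -= 1                # higher state indices shift down

        def r_fade(self, lam):
            """Fade all counts consistently with the clustering: C <- 2^-lam C."""
            self.C *= 2.0 ** -lam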

The following example illustrates the use of some of the TRACDS operations.

Page 7: Temporal Structure Learning for Clustering Massive Data ... · zSouthern Methodist University, mhd@lyle.smu.edu Figure 1: Stream Clustering: (a) Partitioning of a data stream using

Table 2: Sequence of operations for Example 1.

Cluster      TRACDS      Manipulation       sc
assignment   operation   of C
(initial)                C is 0×0           ε
1            rcreate     expand C to 1×1
             rassign     no manipulation    1
2            rcreate     expand C to 2×2
             rassign     c1,2 ← c1,2 + 1    2
3            rcreate     expand C to 3×3
             rassign     c2,3 ← c2,3 + 1    3
2            rassign     c3,2 ← c3,2 + 1    2
3            rassign     c2,3 ← c2,3 + 1    3
4            rcreate     expand C to 4×4
             rassign     c3,4 ← c3,4 + 1    4
4            rassign     c4,4 ← c4,4 + 1    4
2            rassign     c4,2 ← c4,2 + 1    2
3            rassign     c2,3 ← c2,3 + 1    3
4            rassign     c3,4 ← c3,4 + 1    4

Example 1. (Create a TRACDS) A data stream clustering algorithm starts with an empty clustering ζ and we also start with an empty TRACDS data structure T with no states (S = ∅) and the starting state sc set to ε, indicating that there is no state yet. We assume that the clustering algorithm assigns ten incoming data points to the clusters 1, 2, 3, 2, 3, 4, 4, 2, 3, 4. This sequence of assignments triggers the 14 TRACDS operations shown in Table 2. The table shows the assumed cluster assignments by the clustering algorithm, the executed TRACDS operations, and the manipulations of C and sc. We have 10 operations to assign points to clusters and 4 operations to create the 4 needed clusters/states. Creating new clusters/states increases the size of C, and adding a data point increases the counts inside the matrix. The transition count matrix C resulting from the 14 operations is shown in Figure 2(a). A graph representation of the transition count matrix is shown in Figure 2(b). The nodes are labeled with the cluster/state labels 1–4, and larger nodes represent clusters with more observations. The arcs represent transitions, with heavier arcs indicating a higher transition count.

3.3 Implementation and Complexity of TRACDS. TRACDS operations can be implemented separately from the clustering operations. We only need a very light-weight interface through which we can observe what operations the clustering algorithm executes, plus minimal additional information (e.g., the cluster index for a new data point). Adding such an interface to existing clustering algorithms is straightforward.
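As an illustration (our own sketch, not the rEMM interface), such a coupling can be as small as a set of callbacks that mirror each clustering operation onto the TRACDS sketch from Section 3.2:

    # Hypothetical observer: the clustering algorithm calls these hooks,
    # and each hook forwards the operation to a TRACDS instance.
    class TRACDSObserver:
        def __init__(self, tracds):
            self.t = tracds

        def on_create(self):         # q_create was executed
            self.t.r_create()

        def on_assign(self, i):      # q_assign put a point into cluster i
            self.t.r_assign(i)

        def on_remove(self, i):      # q_remove deleted cluster i
            self.t.r_remove(i)

        def on_fade(self, lam):      # q_fade with decay rate lambda
            self.t.r_fade(lam)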

(a)

        1  2  3  4
    1   0  1  0  0
    2   0  0  3  0
    3   0  1  0  2
    4   0  1  0  1

Figure 2: (a) Transition count matrix C and (b) graph representation of the Markov Chain for Example 1. (The graph, not reproduced here, has the states 1–4 as nodes and arcs weighted by the transition counts.)
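Replaying Example 1 against the sketch from Section 3.2 reproduces the count matrix of Figure 2(a) (clusters 1–4 are stored as states 0–3):

    t = TRACDS()
    created = set()
    for cluster in [1, 2, 3, 2, 3, 4, 4, 2, 3, 4]:
        if cluster not in created:   # q_create fires for an unseen cluster
            created.add(cluster)
            t.r_create()
        t.r_assign(cluster - 1)      # q_assign fires for every data point
    print(t.C)                       # rows/columns 0..3 = clusters 1..4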


The space and time complexity of maintaining TRACDS depends on the data structure used to store the transition count matrix C. A suitable data structure is a two-dimensional array with k′ ≥ k columns/rows where we allow columns and the corresponding rows to be marked as currently unused. On such a data structure the most often used operation, assigning a new data point, can be done in constant time. Deleting clusters, creating new clusters and merging clusters takes O(k). We only have to reorganize the data structure, at O(k²) cost, if no more unused rows/columns are available. Note that most data stream clustering algorithms make sure that k does not grow unbounded, which reduces or might even avoid the need for the more expensive reorganization operation. Fading is also a more expensive operation since we have to go through all k² cells. However, a strategy similar to the one used in [28] can be used to significantly reduce the burden: store a timestamp of when the last fading was performed on a count and then only apply the compounded fading when a transition is updated.

Space requirements are O(k′²), where k′ ≥ k is the chosen size for C. However, since for certain types of data C might be very sparsely populated, it is possible to use a list-based data structure to reduce the space requirements at the expense of the time complexity of the operations.

The computational needs of TRACDS directly depend on the number and type of clustering operations executed by the underlying data stream clustering algorithm. Our experiments show that the time spent on TRACDS operations is negligible compared to the clustering operations (see the next section).

4 Evaluation

In this section we evaluate the improvement of TRACDS compared to standard stream clustering. Although TRACDS is potentially useful for other applications, we focus here on the improvement for anomaly detection. Anomaly detection is an important and well-researched stream mining task. It is the basis for many applications like anomaly detection in weather-related data and intrusion detection in computer networks. Although many anomaly detection techniques have been proposed (see [10] for a current survey), we only consider a data stream clustering based approach here since we are only interested in analyzing whether TRACDS can improve over standard clustering.

As the baseline for unsupervised anomaly detection via clustering, we follow the simple approach by Eskin et al. [12]. Clustering with a fixed distance threshold around a center is used to get local density estimates, and all members assigned to a cluster i are classified as outliers if the density in the cluster is low, i.e., ni < δc, where ni is the number of points assigned to cluster i and δc is a suitable threshold. This approach was found to perform favorably compared to more complicated and computationally expensive approaches using support vector machines and k-Nearest Neighbor [12].

For TRACDS we only use the temporal information given by the transition probabilities in the MC in T. TRACDS also has access to cluster sizes, which might further improve results, but we ignore them here since we want to concentrate only on the captured temporal order information. We classify each data point as an anomaly if the transition probability from cluster i, the cluster of the previous data point in the stream, to the cluster j of the current point is below a pre-specified threshold, i.e., aij < δT.
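Expressed as code (a sketch with our own function and parameter names), the two decision rules compared in this section are simply:

    def outlier_by_density(n, i, delta_c):
        """Clustering baseline [12]: flag a point whose cluster i is
        sparse, i.e., n_i < delta_c."""
        return n[i] < delta_c

    def outlier_by_transition(A, i, j, delta_T):
        """TRACDS: flag a point if the transition from the previous
        point's cluster i to its own cluster j is rare, a_ij < delta_T."""
        return A[i, j] < delta_T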

As the data stream clustering algorithm we use the threshold nearest neighbor (tNN) algorithm, which we implemented together with TRACDS in the R extension package rEMM¹. This simple data stream clustering algorithm is very similar to the one used in [12]. It represents clusters as synopses containing the position and the size of each cluster. It assigns a new data point to an existing cluster if the point is within a fixed threshold of the cluster's center. If a data point cannot be assigned to any existing cluster, a new cluster is created for it. After assignment the data point is discarded. Note that the clustering algorithm used here is only of secondary importance: TRACDS only uses the clustering results as input, and we evaluate whether TRACDS can improve the results of pure clustering. Other data stream clustering algorithms which might produce better results for anomaly detection will also improve the accuracy of TRACDS.

¹ http://CRAN.R-project.org/package=rEMM
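The tNN idea fits in a few lines. Below is our own minimal sketch (not the rEMM code), assuming Euclidean distance and synopses that store just a center and a count; for simplicity the center stays where the cluster was created:

    import numpy as np

    def tnn_assign(centers, counts, x, threshold):
        """Assign x to the nearest cluster within `threshold`, or create
        a new cluster at x. Returns the index of the chosen cluster."""
        if centers:
            d = [np.linalg.norm(x - c) for c in centers]
            i = int(np.argmin(d))
            if d[i] <= threshold:
                counts[i] += 1               # update synopsis; x is discarded
                return i
        centers.append(np.asarray(x, dtype=float))  # new cluster at x
        counts.append(1)
        return len(centers) - 1

Feeding the returned index to r_assign (after r_create for new clusters) is all that is needed to maintain T alongside the clustering.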

4.1 Synthetic data. We first use several synthetic data sets to evaluate our approach and analyze sensitivity to data dimensionality and to imperfections in the temporal structure, e.g., caused by data points arriving out of order, incorrect time stamps, or a weak temporal structure in the data. To make the experiments in this paper reproducible, we included the data generator in package rEMM. The data generation process creates a data set consisting of k clusters in roughly [0, 1]^d. For simplicity, the data points for each cluster are drawn from a multivariate normal distribution given a random mean and a random variance/covariance matrix for each cluster. The temporal aspect is modeled by a fixed subsequence through the k clusters of length n. In each step, the next data point comes, with transition probability pt, from a randomly chosen other cluster, and otherwise from the same cluster; thus we can create slowly or fast changing data. For the complete sequence, the subsequence is repeated l times, creating a fixed temporal structure. The data set of size N = nl is generated by drawing a data point from the cluster corresponding to each position in the sequence. To introduce imperfections in the temporal sequence (i.e., incorrect time stamps) we swap two consecutive observations with probability ps. Note that each swap of two observations influences three transitions, so even a rather low setting of ps can distort the temporal structure significantly.

Anomalies are introduced by replacing data points in the data set, with probability pa, by randomly chosen points in [0, 1]^d. These points potentially lie far away from clusters and can then be easily detected. However, they might also fall within existing clusters and therefore be hard to detect. Since these anomalies violate the temporal structure of the data, TRACDS can be used for detection.
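A condensed sketch of such a generator (our own illustration; package rEMM contains the actual one; we use diagonal covariances and allow the "other" cluster to coincide with the current one for brevity):

    import numpy as np

    def generate_stream(k=10, d=2, n=100, l=100, p_t=0.5, p_s=0.0,
                        p_a=0.01, rng=None):
        rng = rng or np.random.default_rng()
        means = rng.uniform(0, 1, size=(k, d))
        sds = rng.uniform(0.01, 0.05, size=k)   # assumed per-cluster spread
        # Fixed subsequence of length n: with prob. p_t jump to a
        # randomly chosen cluster, otherwise stay in the current one.
        seq = [int(rng.integers(k))]
        for _ in range(n - 1):
            seq.append(int(rng.integers(k)) if rng.random() < p_t else seq[-1])
        labels = np.tile(seq, l)                # repeat the subsequence l times
        X = means[labels] + rng.normal(size=(n * l, d)) * sds[labels, None]
        for i in range(n * l - 1):              # imperfect time stamps
            if rng.random() < p_s:
                X[[i, i + 1]] = X[[i + 1, i]]
        anomaly = rng.random(n * l) < p_a       # replace points by uniform noise
        X[anomaly] = rng.uniform(0, 1, size=(int(anomaly.sum()), d))
        return X, anomaly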

An example of the synthetic data generated by the procedure described above and used in the experiments below is shown in Figure 3. The clusters have different shapes and densities, and most are not well separated. Most of the anomalies are far away from the clusters, but several are close to or even within clusters of regular data, so it is expected that detecting these anomalies, if possible at all, is a hard task. Figure 4 shows the Receiver Operating Characteristic (ROC) curves [20] with false positive rate (FPR) and true positive rate (TPR) for the two anomaly detection approaches. The ROC curve is formed by connecting the FPR/TPR combinations obtained using different values for the two algorithms' thresholds, δc and δT. It can clearly be seen that TRACDS improves the results over simple clustering, resulting in a larger area under the ROC curve (AUC).
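Computing these curves is standard; a sketch with scikit-learn, assuming a boolean vector `truth` of injected anomalies and `scores` where larger means more anomalous (e.g., −aij for the TRACDS rule):

    from sklearn.metrics import roc_curve, auc

    fpr, tpr, _ = roc_curve(truth, scores)  # sweeps over all thresholds
    print("AUC:", auc(fpr, tpr))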


Page 9: Temporal Structure Learning for Clustering Massive Data ... · zSouthern Methodist University, mhd@lyle.smu.edu Figure 1: Stream Clustering: (a) Partitioning of a data stream using


Figure 3: Example of synthetic data (k = 10, d = 2) used in the first experiment below (first row in Table 3). Anomalies are shown as Xs.


Figure 4: Averaged ROC curves for 10 samples used in the first experiment below (first row in Table 3).

For the following experiments we create several data sets with k = 10 clusters and an anomaly probability pa = 0.01. The other parameters (n, N, d, pt and ps) are varied in the study. We always create 10 data sets with identical parameters, run clustering and TRACDS on each, average the ROC curves, and then report the AUC and the improvement (in percent) of TRACDS over clustering. For the tNN clustering we use Euclidean distance and choose the threshold automatically as the first quartile of the distance distribution in a sample of the data set.

Table 3 summarizes the results for slowly changing data. The probability pt that a new data point will not belong to the cluster of the previous data point is 50%.

Table 3: Slowly changing data (n = 100 and pt = 0.5).

d    ps   N       AUC_clust  AUC_TRACDS  Improvement
2    0.0  10,000  0.751      0.855       13.8%
2    0.2  10,000  0.753      0.836       11.0%
5    0.0  10,000  0.859      0.936        9.0%
5    0.2  10,000  0.875      0.929        6.1%
10   0.0  10,000  0.904      0.957        5.9%
10   0.2  10,000  0.897      0.942        5.0%

Table 4: Fast changing data (n = 100, pt = 1).

d    ps   N       AUC_clust  AUC_TRACDS  Improvement
2    0.0  10,000  0.723      0.795       9.9%
2    0.2  10,000  0.744      0.774       3.9%
5    0.0  10,000  0.845      0.890       5.4%
5    0.2  10,000  0.846      0.873       3.2%
10   0.0  10,000  0.877      0.909       3.7%
10   0.2  10,000  0.870      0.881       1.2%

Table 5: No temporal structure (n = N and pt = 1).

d    ps   N       AUC_clust  AUC_TRACDS  Improvement
2    0.0  10,000  0.728      0.746       2.4%
5    0.0  10,000  0.816      0.826       1.2%
10   0.0  10,000  0.841      0.845       0.4%

Table 6: Real-world data sets.

Name    N        outliers  d   AUC_clust  AUC_TRACDS  Impr.
KDD-99  250,000  10        38  0.893      0.972        8.9%
16S     402      43        64  0.516      0.696       34.9%

For d = 2, the AUC for TRACDS is 13.8% larger than for clustering. Introducing temporal errors by swapping data points with a probability of 20% reduces the improvement to 11%. Increasing the dimensionality of the data to d = 5 and d = 10 results in overall higher AUC for both clustering and TRACDS; however, the relative advantage of TRACDS is reduced. This happens because increasing the dimensionality increases the volume of the unit hypercube exponentially, which reduces the chance that a random point falls inside a cluster and is thus hard to identify by clustering.

For Table 4 we changed pt to one, so it is less likely that two consecutive data points come from the same cluster, resulting in faster changing data. This change only affects the temporal structure and thus has little influence on the results from clustering alone. However, it reduces the advantage of TRACDS significantly, to 9.9% (3.9% for ps = 0.2) at d = 2. Again, AUC increases with the dimensionality of the data but the relative advantage of TRACDS decreases.

Finally, in Table 5, we set the subsequence length n = N, which means that there is no repeating temporal structure in the data. TRACDS performs very similarly to clustering. Outlier clusters are still detected by TRACDS since the probability of a transition to these clusters is smaller than to the regular clusters.


4.2 Real-world data. To investigate whether the temporal structure learned by TRACDS is useful for real-world applications we use two data sets. The first data set is the well-known 1999 KDD Cup intrusion detection data set (KDD-99) obtained from the UCI Machine Learning Repository². It contains network traffic data aggregated at the connection level. We consider the first 250,000 connections in the data set, with roughly 85% normal connections; the remaining connections constitute 10 intrusions of different types (e.g., buffer overflow, perl, smurf, loadmodule, neptune). We use the 38 numeric attributes and set the clustering threshold equal to the median distance between the data points in a sample of the data. Then we use the same procedure as for the synthetic data to create ROC curves. The ROC curves are shown in Figure 5 and the area under the ROC curves is reported in Table 6. TRACDS improves the area by almost 9%, indicating that the learned temporal structure preserves important signals for intrusion detection.

The second data set consists of genetic sequences. Such sequences are not temporal data streams; however, due to the large volume of data that has to be processed (human DNA contains about 3 billion base pairs), using efficient data stream mining techniques which only need one pass over the data is becoming increasingly important. Here we use 16S rRNA sequences, an important marker for phylogenetic studies, for the two phylogenetic classes Alphaproteobacteria and Mollicutes. The data is preprocessed by counting the occurrences of triplets of nucleotides (over 4 distinct bases) in windows of length 100, resulting in data points with 64 counts. The preprocessed data set is included in package rEMM. From a set of 30 sequences for each class we randomly choose the data points of 24 sequences of Alphaproteobacteria (normal cases) and then add 3 sequences of Mollicutes (anomalies) at random positions. We cluster the data and compare again the AUC. We repeat this procedure 10 times and report the averages in Figure 6 and Table 6. Finding the anomalies here is a very hard task since many subsequences of both phylogenetic classes are very similar. This results in an AUC for simple clustering close to 0.5; for some thresholds the clustering approach even produces worse results than picking anomalies randomly. However, the sequence information retained by TRACDS improves the results significantly, by almost 35%.
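A sketch of this preprocessing under our reading of the description (whether windows overlap and how ambiguous bases are treated are assumptions):

    from itertools import product

    TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]  # 4^3 = 64
    INDEX = {t: i for i, t in enumerate(TRIPLETS)}

    def triplet_counts(seq, window=100):
        """One 64-dimensional count vector per non-overlapping window."""
        points = []
        for start in range(0, len(seq) - window + 1, window):
            w = seq[start:start + window]
            counts = [0] * 64
            for i in range(len(w) - 2):        # all triplets inside the window
                idx = INDEX.get(w[i:i + 3])    # skip ambiguity codes (e.g., N)
                if idx is not None:
                    counts[idx] += 1
            points.append(counts)
        return points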

Finally, we look at the execution time of TRACDS. We report here the execution time on the KDD-99 data.

² http://archive.ics.uci.edu/ml/


Figure 5: ROC curves for the KDD-99 data set.


Figure 6: Averaged ROC curves for 10 runs of the 16S rRNA data set.

We run the experiment using our R implementation on an Intel Core 2 (1.5 GHz) machine. Figure 7 shows that the execution time for TRACDS (including the underlying clustering algorithm) increases linearly with the number of data points. Compared with the execution time of just the clustering algorithm, TRACDS only increases the execution time by a negligible amount. As discussed above, the most important operations (insert data point, create new state) take O(1) for TRACDS, while the clustering algorithm has to compute the distances of each new data point to all clusters using O(kd) time, where d is the dimensionality of the data.

Page 11: Temporal Structure Learning for Clustering Massive Data ... · zSouthern Methodist University, mhd@lyle.smu.edu Figure 1: Stream Clustering: (a) Partitioning of a data stream using


Figure 7: Execution time on KDD-99 data set.

5 Conclusion and Future Work

Current work in the temporal clustering field concentrates on the evolution of data streams and how the clustering can follow or detect such changes. In this paper we addressed a completely different problem: the temporal (or order) relationship between clusters. We presented TRACDS, a framework which can be used to efficiently learn, online, a model of a massive data stream's temporal structure.

We systematically evaluated TRACDS for anomaly detection on synthetic data with the result that if there is a strong temporal structure in the data, TRACDS can improve accuracy significantly. Experiments on real-world data support these findings and show that the TRACDS framework only minimally increases runtime.

Since this is one of the first papers dealing with modeling the temporal order structure of massive data streams, there are many directions for further research and applications:

• We will study the use of different data stream clustering algorithms as the base for TRACDS. Since the algorithms have different strategies for assigning data points, and for merging, removing and fading clusters, we need to evaluate the impact on performance in terms of runtime and accuracy for different applications and data structure choices.

• We plan to investigate the use of higher order Markov models to learn a more detailed temporal structure. The space complexity for storing the complete transition matrix increases exponentially with the order of the chain. Also, difficulties with getting reliable transition probability estimates for higher order models are expected. However, we can deal with these problems. For example, fading transition counts and then removing low counts can help avoid the estimation problem; it can be seen as removing noise, and the resulting transition matrix can be relatively sparse and thus can be stored in a more compact way. Also, we can resort to using lower order transition probabilities for transitions where not enough data is available to reliably estimate the higher order probabilities (see variable order Markov chains [7]).

• It is straightforward to calculate the probability of future states given the current state and a Markov chain. The Markov chain learned by TRACDS can thus be used to predict the cluster a future data point will belong to (see the sketch after this list). An example application is to impute missing values in a stream by identifying the cluster to which a data point most likely would have been assigned given the temporal structure, and then using the cluster's center as the imputed value.

• Since TRACDS learns a temporal model in the form of a Markov chain, we can evaluate dissimilarities between data streams by using dissimilarities between the learned models. Applied to genetic sequences, this will lead to new computationally efficient approaches to sequence clustering and sequence classification for high-volume genetic sequence data based on TRACDS models.
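For the prediction direction above, the computation is a one-liner once the counts are normalized; a sketch (function name ours, reusing the conventions of the earlier TRACDS sketch):

    import numpy as np

    def next_cluster_distribution(C, current, steps=1):
        """P(cluster of the data point `steps` ahead | current cluster),
        from the MLE transition matrix A and its matrix powers."""
        n = C.sum(axis=1, keepdims=True)
        A = np.divide(C, n, out=np.zeros_like(C, dtype=float), where=n > 0)
        return np.linalg.matrix_power(A, steps)[current]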

Acknowledgments

This work is supported in part by the U.S. National Science Foundation under contract number IIS-0948893.

References

[1] G. Adomavicius and J. Bockstedt. C-TREND: Temporal cluster graphs for identifying and visualizing trends in multiattribute transactional data. IEEE Transactions on Knowledge and Data Engineering, 20(6):721–735, June 2008.

[2] C. Aggarwal. A framework for clustering massive-domain data streams. In IEEE 25th International Conference on Data Engineering (ICDE '09), pages 102–113, March–April 2009.

[3] C. Aggarwal and P. Yu. A framework for clustering uncertain data streams. In IEEE 24th International Conference on Data Engineering (ICDE 2008), pages 150–159, 2008.

[4] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proceedings of the International Conference on Very Large Data Bases (VLDB '03), pages 81–92, 2003.

[5] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB '04), pages 852–863, 2004.


[6] D. Barbara. Requirements for clustering data streams. SIGKDD Explorations, 3(2):23–27, 2002.

[7] R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385–421, 2004.

[8] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 328–339. SIAM, 2006.

[9] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 554–560. ACM, 2006.

[10] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):1–58, 2009.

[11] M. H. Dunham, Y. Meng, and J. Huang. Extensible Markov model. In Proceedings IEEE ICDM Conference, pages 371–374. IEEE, November 2004.

[12] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Data Mining for Security Applications. Kluwer, 2002.

[13] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.

[14] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.

[15] Y. Kambayashi, T. Hayashi, and S. Yajima. Dynamic clustering procedures for bibliographic data. In Proceedings of the ACM SIGIR Conference, pages 90–99, June 1981.

[16] E. Keogh, J. Lin, and W. Truppel. Clustering of time series subsequences is meaningless: Implications for previous and future research. In ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, page 115. IEEE Computer Society, 2003.

[17] M. Kijima. Markov Processes for Stochastic Modeling. Stochastic Modeling Series. Chapman & Hall/CRC, 1997.

[18] H.-P. Kriegel, P. Kroger, and I. Gotlibovich. Incremental OPTICS: Efficient computation of updates in a hierarchical cluster ordering. In Data Warehousing and Knowledge Discovery, volume 2737 of Lecture Notes in Computer Science, pages 224–233. Springer, 2003.

[19] E. Parzen. Stochastic Processes. Society for Industrial Mathematics, 1999.

[20] F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In D. Heckerman, H. Mannila, and D. Pregibon, editors, Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 43–48, Newport Beach, CA, August 1997. AAAI Press.

[21] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169–194, 1998.

[22] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pages 706–711, 2006.

[23] D. Tasoulis, N. Adams, and D. Hand. Unsupervised clustering in streaming data. In IEEE International Workshop on Mining Evolving and Streaming Data, Sixth IEEE International Conference on Data Mining (ICDM 2006), pages 638–642, December 2006.

[24] D. K. Tasoulis, G. Ross, and N. M. Adams. Visualising the cluster structure of data streams. In Advances in Intelligent Data Analysis VII, Lecture Notes in Computer Science, pages 81–92. Springer, 2007.

[25] R. S. Tsay, D. Peña, and A. E. Pankratz. Outliers in multivariate time series. Biometrika, 87(4):789–804, 2000.

[26] L. Tu and Y. Chen. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery from Data, 3(3):1–27, 2009.

[27] K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai. E-Stream: Evolution-based technique for stream clustering. In ADMA '07: Proceedings of the 3rd International Conference on Advanced Data Mining and Applications, pages 605–615. Springer-Verlag, Berlin, Heidelberg, 2007.

[28] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and K. Zhang. Density-based clustering of data streams at multiple resolutions. ACM Transactions on Knowledge Discovery from Data, 3(3):1–28, 2009.

[29] T. Warren Liao. Clustering of time series data – a survey. Pattern Recognition, 38(11):1857–1874, November 2005.

[30] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103–114. ACM, 1996.