

Data & Knowledge Engineering 62 (2007) 438–458

www.elsevier.com/locate/datak

Warping the time on data streams

Paolo Capitani, Paolo Ciaccia *

DEIS, University of Bologna, Viale Risorgimento, 2, 40136 Bologna, Italy

Received 12 August 2006; accepted 12 August 2006; available online 17 October 2006

Abstract

Continuously monitoring through time the correlation/distance of multiple data streams is of interest in a variety of applications, including financial analysis, video surveillance, and mining of biological data. However, distance measures commonly adopted for comparing time series, such as Euclidean and Dynamic Time Warping (DTW), either are known to be inaccurate or are too time-consuming to be applied in a streaming environment. In this paper we propose a novel DTW-like distance measure, called Stream-DTW (SDTW), which unlike DTW can be efficiently updated at each time step. We formally and experimentally demonstrate that SDTW speeds up the monitoring process by a factor that grows linearly with the size of the window sliding over the streams. For instance, with a sliding window of 512 samples, SDTW is about 600 times faster than DTW. We also show that SDTW is a tight approximation of DTW, with errors never exceeding 10%, and that it consistently outperforms approximations developed for the case of static time series.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Data streams; Dynamic time warping; Sliding window

1. Introduction

Management of data streams has recently emerged as one of the most challenging extensions of database technology. The proliferation of sensor networks, as well as the availability of massive amounts of streaming data related to telecommunications traffic monitoring, web-click logs, geophysical measurements and many others, has motivated the investigation of new methods for their modelling, storage, and querying. In particular, continuously monitoring through time the correlation of multiple data streams is of interest in order to detect similar behaviors of stock prices, for video surveillance applications, synchronization of biological signals and, more in general, mining of temporal patterns [1,2].

Previous works dealing with the problem of detecting when two or more streams exhibit a high correlation in a certain time interval have tried to extend techniques developed for (static) time series to the streaming environment. In particular, Zhu and Shasha [3], by adopting a sliding window model and the Euclidean distance as a measure of correlation (low distance = high correlation), have been able to monitor in real-time

0169-023X/$ - see front matter © 2006 Elsevier B.V. All rights reserved.

doi:10.1016/j.datak.2006.08.012

* Corresponding author.
E-mail addresses: [email protected] (P. Capitani), [email protected] (P. Ciaccia).


up to 10,000 streams on a PC. However, it is known that for time-varying data a much better accuracy can be obtained if one uses the Dynamic Time Warping (DTW) distance [4]. Since DTW can compensate for stretches along the temporal axis, it provides a way to optimally align time series that matches the user's intuition of similarity much better than the Euclidean distance does, as demonstrated several times (see, e.g., [5] for some novel DTW applications and [6] for a recent DTW-based approach to shape matching). Further, although not a metric, DTW can be indexed [7], which allows this distance to be applied also in the case of large time series archives.

Unfortunately, when considering streams the benefits of DTW seem to vanish since, unlike the Euclidean distance, it cannot be efficiently updated. The basic reason is that the (optimal) alignment one has established at time t is not guaranteed to still be optimal at time t + 1, thus forcing the DTW to be recomputed from scratch at each time step, with a complexity that, for a sliding window of size n, varies between O(n) and O(n²), depending on how specific global constraints on valid alignments are set (see Section 2 for details).

Given this unpleasant state of affairs, in this paper we propose the novel SDTW (Stream-DTW) distance measure to be used in place of DTW for the purpose of continuously monitoring the distance between two or more streams. SDTW preserves the ability of DTW to compensate for stretches along the temporal axis, yet, unlike DTW, is efficiently updatable. More precisely, the work required at each time step can vary from a minimum of O(1) to a maximum of O(n), depending on the global constraints on alignments. In both cases this represents a remarkable O(n) speed-up over DTW computation. Our experiments on five real-world datasets show that SDTW is indeed a very good approximation of DTW, with the error never exceeding 10%. We also demonstrate that other approximations of DTW, proposed for the static case, incur much higher errors and have a highly variable behavior over the datasets.

Besides being applicable as a valid alternative to DTW for monitoring purposes, SDTW can also be used as an effective filtering tool to speed up the computation of the result of DTW-based queries in a streaming environment. This holds since we prove that SDTW always lower bounds the DTW. For this querying scenario we provide some preliminary results showing that SDTW is particularly effective in limiting the number of false alarms.

The rest of the paper is organized as follows. In Section 2 we provide the necessary background on the DTW distance. Section 3 highlights the limits of applying the DTW distance in a streaming environment and presents a simple extension of the well-known LB_Keogh lower-bounding DTW approximation. In Section 4 we introduce the novel SDTW distance measure and prove some of its basic properties. Section 5 provides experimental evidence of the efficiency and the accuracy of SDTW. Section 6 reviews the relevant related works, and Section 7 concludes and suggests directions for future research activity.

2. Dynamic time warping

We start with some basic definitions related to the static case, i.e., for real-valued time series of finite length. Let R ≡ R_1^n and S ≡ S_1^n be two time series of length n, and let R_i (S_i) be the i-th sample of R (resp. S). A common way to compare two time series is by computing their Euclidean distance (L2) [8]:

L2(R_1^n, S_1^n) = √( Σ_{i=1}^n (R_i - S_i)² )    (1)

It is important to realize that L2 (and similar metrics as well, such as the Manhattan (L1) and the "max" (L∞) distances) only compares corresponding samples, thus it does not allow for local non-linear stretches along the temporal axis.¹ As a consequence, two time series might lead to a high L2 value even when they are very similar, which can have negative effects on common mining tasks, such as classification and clustering [10]. This well-known problem is solved by the Dynamic Time Warping (DTW) distance [4]. The key idea of DTW is that any point of a series can be (forward and/or backward) aligned with multiple points of the other series that lie in different temporal positions, so as to compensate for temporal shifts.

¹ Observe that the problem of global or uniform stretching is quite different, see [9].


Fig. 1. Computing the DTW distance with a Sakoe–Chiba band of width b = 2: (a) optimal alignments; (b) the d matrix of pairwise sample distances; (c) the D cumulative distance matrix; highlighted cells constitute the optimal warping path W_opt.


The definition of DTW is based on the notion of warping path. Let d be the n × n matrix of pairwise squared distances between samples of R and S, d[i, j] = (R_i - S_j)². A warping path W = ⟨w_1, w_2, ..., w_K⟩ is a sequence of K (n ≤ K ≤ 2n - 1) matrix cells, w_k = [i_k, j_k] (1 ≤ i_k, j_k ≤ n), such that:

Boundary conditions: w_1 = [1, 1] and w_K = [n, n], i.e., W starts in the lower-left cell and ends in the upper-right cell;
Continuity: given w_{k-1} = [i_{k-1}, j_{k-1}] and w_k = [i_k, j_k], then i_k - i_{k-1} ≤ 1 and j_k - j_{k-1} ≤ 1. This ensures that the cells of the warping path are adjacent;
Monotonicity: given w_{k-1} = [i_{k-1}, j_{k-1}] and w_k = [i_k, j_k], then i_k - i_{k-1} ≥ 0 and j_k - j_{k-1} ≥ 0, with at least one strict inequality. This forces W to progress over time.

Any warping path W defines an alignment between R and S and, consequently, a cost to align the two series. The (squared) DTW distance is the minimum of such costs, i.e., the cost of the optimal warping path, W_opt:

DTW(R_1^n, S_1^n)² = min_W { Σ_{[i_k, j_k] ∈ W} d[i_k, j_k] } = Σ_{[i_k, j_k] ∈ W_opt} d[i_k, j_k]    (2)

The DTW distance can be recursively computed using an O(n²) dynamic programming approach that fills the cells of a cumulative distance matrix D using the following recurrence relation (see also Fig. 1):

D[i, j] = d[i, j] + min{ D[i-1, j-1], D[i-1, j], D[i, j-1] }   (1 ≤ i, j ≤ n)    (3)

and then setting DTW(R_1^n, S_1^n) = √D[n, n].

Note that √D[i, j] = DTW(R_1^i, S_1^j), ∀i, j, since the DTW distance is well-defined even when the two series have different lengths [4].

In practical applications, warping paths are commonly subject to some global constraints, in order to prevent pathological alignments. The most commonly used of such constraints is the Sakoe–Chiba band [11] of width b, which forces warping paths to deviate no more than b steps from the matrix diagonal (see Fig. 1(b)). It is worth noting that, besides reducing the complexity of computing DTW to O(nb), rather surprisingly a band of limited width b leads to better results in classification tasks, even when b is a small fraction (say, 3–4%) of n [5].
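For concreteness, the banded recurrence of Eq. (3) can be sketched in a few lines of Python (our illustration, not the authors' code; 0-based indices, and the full O(nb) computation rather than any incremental scheme):

```python
import math

def dtw(R, S, b):
    """DTW with a Sakoe-Chiba band of width b: fill the cumulative
    matrix D of Eq. (3) only for cells with |i - j| <= b."""
    n = len(R)
    INF = float("inf")
    D = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - b), min(n, i + b + 1)):
            cost = (R[i] - S[j]) ** 2          # d[i, j]
            if i == 0 and j == 0:
                D[i][j] = cost                  # boundary condition w_1 = [1, 1]
            else:
                D[i][j] = cost + min(
                    D[i - 1][j - 1] if i > 0 and j > 0 else INF,
                    D[i - 1][j] if i > 0 else INF,
                    D[i][j - 1] if j > 0 else INF,
                )
    return math.sqrt(D[n - 1][n - 1])
```

With b = n - 1 the band covers the whole matrix and the classical O(n²) DTW is obtained.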

3. Data streams: what is the problem?

Let us now turn to a streaming environment, in which R and S are possibly infinite sequences of samples. The problem we address here is how to efficiently and accurately monitor through time the distance between R and S. In doing this we make two basic assumptions, which serve to precisely define the scenario:


(1) We assume that at each time step a new sample for both streams becomes available, i.e., streams are synchronized. This hypothesis is not really restrictive and can be relaxed at the price of unnecessarily complicating the presentation. In case one of the two streams, say R, is sampled at a higher frequency than the other, i.e., S, we might either undersample R or interpolate S's samples.

(2) We consider that, at any time step t, only the last n points of the streams, namely the subsequences R_{t-n+1}^t and S_{t-n+1}^t, are significant. This is the well-known sliding window model, which allows the monitoring activity to focus only on the more recent elements of the streams. Indeed, in the majority of real-world applications in which one needs to produce results in real-time, looking only at what has recently happened is far more important than tracking the whole streams' history [12,13].
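The sliding window model itself is easy to realize with a bounded buffer; the following sketch (a hypothetical helper, not part of the paper's algorithms) keeps only the last n samples of a stream:

```python
from collections import deque

class SlidingWindow:
    """Retains only the last n samples of a stream."""
    def __init__(self, n):
        self.buf = deque(maxlen=n)   # older samples are evicted automatically

    def push(self, sample):
        self.buf.append(sample)

    def contents(self):
        return list(self.buf)
```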

A naïve approach to continuously monitor the distance between R and S would be to compute at time t the distance between R_{t-n+1}^t and S_{t-n+1}^t from scratch, that is, without considering the computation performed at previous time steps. This approach has a limited applicability, since it immediately breaks down in computationally-demanding environments, where sampling rates are high, sliding windows are large, and/or many (hundreds, thousands of) streams have to be monitored [3]. Indeed, in all these scenarios one needs a method that is not only accurate in comparing streams but also efficient, that is, able to exploit previously performed computations.

As anticipated, the DTW distance is not efficiently updatable, which is also to say that the work to be performed at each time step is O(nb), for a window of size n and a band of width b. Let W_opt(t) be the optimal warping path for subsequences R_{t-n+1}^t and S_{t-n+1}^t, and W_opt(t+1) be the one for R_{t-n+2}^{t+1} and S_{t-n+2}^{t+1}. Since, in the limit case, the two optimal warping paths might share no common cell (alignment) at all, it is clear that W_opt(t) is of no help in determining W_opt(t+1), which implies that W_opt(t+1) has to be computed from scratch. This is true at each time step.

Given this unpleasant state of things, we are left with the problem of devising a method for continuous data streams that satisfies both of the following requirements:

Accuracy: It should be a good approximation of DTW. This is because of the success that DTW has already demonstrated for the static case (i.e., time series).
Efficiency: It should be fast to update, i.e., its complexity should be at most O(b). Note that this is the best one can achieve with a DTW-like distance measure, since at each time step 2b + 1 distances between samples necessarily have to be computed. In particular, if b does not vary with n, the update complexity should be independent of the sliding window size, n.

Note that the DTW distance trivially satisfies the first requirement but not the second, whereas the oppositeis true for the Euclidean distance and similar metrics.

3.1. Extending approximation techniques proposed for the static case

One possible approach to satisfying both of the above-stated requirements is to extend to the streaming environment (lower-bounding) approximations of the DTW distance that have been developed for the static case, where they are used to speed up DTW-based queries over large time series databases [14,15,7]. In this setting we are given a (query) time series Q and a search threshold ε, and want to retrieve all the time series S in the database for which DTW(Q, S) ≤ ε holds. The idea of such methods is to introduce a lower-bounding distance measure, d_lb, and to use it to discard all series S such that d_lb(Q, S) > ε, thus without performing the (costly) DTW(Q, S) computation. This works since d_lb(Q, S) > ε implies DTW(Q, S) > ε if d_lb lower bounds DTW. The DTW computation needs to be executed only for those series that survive the filtering step. The effectiveness of this filter & refine approach critically depends on two contrasting factors [16,17]: d_lb should be cheap to compute and, at the same time, it should have a good accuracy, so as to limit the number of false alarms, i.e., the number of series for which both d_lb(Q, S) ≤ ε and DTW(Q, S) > ε hold.
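The filter & refine scheme can be sketched as follows (hypothetical helper names; d_lb stands for any lower bound of the exact distance dtw):

```python
def filter_and_refine(Q, database, eps, d_lb, dtw):
    """Return all series S with dtw(Q, S) <= eps, using the lower
    bound d_lb to skip the costly dtw computation when possible."""
    results = []
    for S in database:
        if d_lb(Q, S) > eps:
            continue              # safe prune: d_lb <= dtw implies dtw > eps
        if dtw(Q, S) <= eps:      # refine survivors; false alarms drop here
            results.append(S)
    return results
```

The scheme is correct for any d_lb that never exceeds dtw; its efficiency depends on how tight d_lb is.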

The best method known so far to approximate the DTW from below is due to Keogh [7], and consists in constructing an envelope, Env(Q), around Q, after which a Euclidean-like distance between Env(Q) and S is computed. More in detail, let Up(Q) and Low(Q) be two time series defined as follows (see Fig. 2(a)):


Fig. 2. (a) The envelope of Q when b = 4; (b) Keogh's lower-bounding distance. In the example: LB_Keogh(Env(Q), S) = 0.416.


Up_i(Q) = max{ Q_j | j ∈ [max{1, i-b}, min{n, i+b}] }
Low_i(Q) = min{ Q_j | j ∈ [max{1, i-b}, min{n, i+b}] }

Thus, Up_i(Q) (Low_i(Q)) is simply the maximum (respectively, the minimum) of the Q_j values in the interval [i-b, i+b], centered in i and of width 2b+1. The definition is then completed so as to properly take into account border effects. Consequently, Env(Q) = (Up(Q), Low(Q)) encloses all the allowed stretches of Q. Finally, the (squared) LB_Keogh distance between Env(Q) and S is defined as

LB_Keogh(Env(Q), S)² = Σ_{i=1}^n  { (S_i - Up_i(Q))²   if S_i > Up_i(Q)
                                    { 0                  if S_i ∈ [Low_i(Q), Up_i(Q)]    (4)
                                    { (Low_i(Q) - S_i)²  if Low_i(Q) > S_i

and corresponds to the shaded area in Fig. 2(b). We can easily adapt Keogh's method to deal with streams as follows. Let LB_Keogh(Env(R_{t-n+1}^t), S_{t-n+1}^t) be the lower-bounding distance between the envelope of R_{t-n+1}^t and S_{t-n+1}^t. Since in our scenario the two streams R and S play a symmetric role, we can similarly compute LB_Keogh(Env(S_{t-n+1}^t), R_{t-n+1}^t) and then, in order to improve accuracy, define the symmetric LB_Keogh lower-bounding distance as the maximum of the two measures (see Fig. 3):

LB_Keogh^symm(R_{t-n+1}^t, S_{t-n+1}^t) = max{ LB_Keogh(Env(R_{t-n+1}^t), S_{t-n+1}^t), LB_Keogh(Env(S_{t-n+1}^t), R_{t-n+1}^t) }    (5)
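Eqs. (4) and (5) translate almost literally into code; the sketch below (our illustration, with a naive O(nb) envelope computation rather than a streaming update of the first and last b values) mirrors the definitions:

```python
def envelope(Q, b):
    """Up(Q) and Low(Q): running max/min of Q over the window [i-b, i+b]."""
    n = len(Q)
    up = [max(Q[max(0, i - b): min(n, i + b + 1)]) for i in range(n)]
    low = [min(Q[max(0, i - b): min(n, i + b + 1)]) for i in range(n)]
    return up, low

def lb_keogh(Q, S, b):
    """LB_Keogh of Eq. (4): the part of S lying outside Env(Q)."""
    up, low = envelope(Q, b)
    total = 0.0
    for s, u, l in zip(S, up, low):
        if s > u:
            total += (s - u) ** 2      # S above the upper envelope
        elif s < l:
            total += (l - s) ** 2      # S below the lower envelope
    return total ** 0.5

def lb_keogh_symm(R, S, b):
    """Symmetric version of Eq. (5)."""
    return max(lb_keogh(R, S, b), lb_keogh(S, R, b))
```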

Note that when updating LB_Keogh^symm, only the first b and the last b values of the two envelopes can possibly change upon arrival of the new stream samples. Depending on the width b of the band and on the specific dataset, the quality of the approximation of LB_Keogh with respect to DTW is known to vary wildly (see, e.g., [7,18]). Thus, as our experiments also confirm (see Section 5), using LB_Keogh (as well as its symmetric version) in place of DTW for monitoring purposes is

Fig. 3. The symmetric LB_Keogh^symm lower-bounding distance: (a) the envelope of R; (b) the envelope of S. In the example: LB_Keogh^symm(R, S) = 1.032.


not advisable, since it only satisfies the second of our requirements (fast to update) but not the first one (good approximation of DTW).

4. The Stream-DTW distance

The key to satisfying both requirements of accuracy and efficiency starts from a couple of simple observations. First, in order to obtain a high accuracy, a DTW-like style of computation should be preserved. Note that this is not the case with LB_Keogh^symm, since the original LB_Keogh from which it is derived was designed as a lower bound also able to support exact indexing of DTW. Second, in order to save computational resources, one should realize that the major source of difficulty in updating the DTW distance lies in its boundary conditions. To see why this is the case, consider again the optimal warping paths W_opt(t) and W_opt(t+1). Since W_opt(t+1) necessarily has to start from cell [t-n+2, t-n+2], in order to reuse at least part of the computation performed to determine W_opt(t), this path should pass through that cell. However, this is very unlikely to be the case.

The first step towards the definition of a new, DTW-like distance measure suitable for continuous streams, which we call Stream-DTW (SDTW), consists in observing that, even if one cannot be confident that an optimal warping path passes through a single given cell, it is easy to guarantee that it passes through one of a (suitably chosen) set of cells. This is formalized by the following preliminary definition (see also Fig. 4).

Definition 1 (Frontier). A frontier F is any set of cells of the cumulative distance matrix such that for any warping path W it holds W ∩ F ≠ ∅. The ω-frontier (or lower-left frontier) anchored in cell [i, i] is the set of 2b+1 cells

ω[i, i] = { [i, i], [i, i+1], ..., [i, i+b], [i+1, i], ..., [i+b, i] }.

The ρ-frontier (or upper-right frontier) anchored in cell [i, i] is the set of 2b+1 cells

ρ[i, i] = { [i, i], [i, i-1], ..., [i, i-b], [i-1, i], ..., [i-b, i] }.
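For illustration, the two frontier sets can be enumerated as follows (hypothetical helpers; clipping against the matrix borders is omitted):

```python
def omega_frontier(i, b):
    """Lower-left frontier anchored in cell [i, i]: 2b+1 cells."""
    return [(i, i)] + [(i, i + k) for k in range(1, b + 1)] \
                    + [(i + k, i) for k in range(1, b + 1)]

def rho_frontier(i, b):
    """Upper-right frontier anchored in cell [i, i]: 2b+1 cells."""
    return [(i, i)] + [(i, i - k) for k in range(1, b + 1)] \
                    + [(i - k, i) for k in range(1, b + 1)]
```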

Now we define two relaxed versions of DTW, which will be used in the definition of our new SDTW distance measure.

Definition 2 (Boundary-relaxed DTW). Given streams R and S, let ts and te be two generic time instants. We define:

• The start-relaxed DTW (DTW_ω) between subsequences R_{ts}^{te} and S_{ts}^{te} is the value of D_ω[te, te], where D_ω is the start-relaxed cumulative distance matrix initialized as follows:

D_ω[ts, ts+j] = d[ts, ts+j],   D_ω[ts+j, ts] = d[ts+j, ts]   (0 ≤ j ≤ b)    (6)

and then completed using the recurrence relation defined in Eq. (3).

• The start-end-relaxed DTW (DTW_ωρ) between subsequences R_{ts}^{te} and S_{ts}^{te} is

DTW_ωρ(R_{ts}^{te}, S_{ts}^{te}) = min{ DTW_ω(R_{ts}^i, S_{ts}^j) | [i, j] ∈ ρ[te, te] }    (7)
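A direct, non-incremental reading of Definition 2 in Python (our sketch; times are 1-based as in the text, and the band of width b is applied as in Section 2):

```python
INF = float("inf")

def start_relaxed_matrix(R, S, ts, te, b):
    """D_omega over times [ts:te] (1-based, inclusive): cells on the
    omega-frontier anchored in [ts, ts] take the raw pairwise distances
    (Eq. (6)); all other in-band cells follow the recurrence of Eq. (3)."""
    m = te - ts + 1
    D = [[INF] * m for _ in range(m)]
    for i in range(m):
        for j in range(max(0, i - b), min(m, i + b + 1)):
            cost = (R[ts + i - 1] - S[ts + j - 1]) ** 2
            if i == 0 or j == 0:
                D[i][j] = cost                       # relaxed start boundary
            else:
                D[i][j] = cost + min(D[i-1][j-1], D[i-1][j], D[i][j-1])
    return D

def dtw_omega(D):
    """DTW_omega: the value D_omega[te, te]."""
    return D[-1][-1]

def dtw_omega_rho(D, b):
    """DTW_omega_rho (Eq. (7)): minimum of D_omega over the rho-frontier
    anchored in the upper-right cell."""
    m = len(D)
    r = min(b, m - 1)
    cand = [D[m - 1][m - 1]]
    cand += [D[m - 1][m - 1 - k] for k in range(1, r + 1)]
    cand += [D[m - 1 - k][m - 1] for k in range(1, r + 1)]
    return min(cand)
```

Note how the start relaxation lets a path ignore a bad first sample: for R = [9, 1, 2] and S = [1, 1, 2] the start-relaxed distance is 0, because the path may begin from the off-diagonal frontier cell.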


Fig. 4. (a) Distance matrix; (b) start-relaxed DTW; (c) start-end-relaxed DTW.



Fig. 5. How SDTW works (SDTW² = α - (β - γ) + δ).


Note that the DTW_ω computation proceeds as with DTW, yet all the cells in the lower-left frontier anchored in ts have as value the distance between the corresponding samples (see Fig. 4(b)). This is to say that warping paths can start from any cell in ω[ts, ts]. When te < ts we conventionally set DTW_ω(R_{ts}^{te}, S_{ts}^{te}) = 0.

For computing DTW_ωρ(R_{ts}^{te}, S_{ts}^{te}) one also relaxes the end-boundary condition. In practice one can still use the same start-relaxed matrix D_ω used for computing DTW_ω(R_{ts}^{te}, S_{ts}^{te}), and then just look at the minimum value on the upper-right frontier anchored in [te, te] (see Fig. 4(c)).

The basic rationale underlying the computation of the SDTW distance is to split, using frontiers, the optimal warping path of the DTW into two distinct pieces (one of them possibly null at some time instants): the first piece starts by spanning the whole current window of size n and then, at each time step, progressively shrinks; the second piece starts empty and then progressively grows. After exactly n time steps everything starts again. For each of the two pieces of W_opt, the SDTW provides an accurate approximation.

More in detail, assume without loss of generality that the streams' analysis starts at time t0 = 1, i.e., when the first streams' samples become available.² The computation of SDTW is performed by segmenting the positive time axis into a sequence of non-overlapping time intervals, [(k-1)n+1 : kn], k = 1, 2, ..., and then associating to the k-th interval a corresponding start-relaxed cumulative distance matrix, denoted D_ω^k. In other terms, D_ω^k is the matrix that is initialized with the values in the lower-left frontier anchored in [(k-1)n+1, (k-1)n+1] (see Definition 2).

In order to understand how such start-relaxed matrices are exploited in the SDTW computation, it has first to be observed that, for any active window [ts : te], with te = kn + i (k ≥ 1, 0 ≤ i < n) and ts = te - n + 1 = (k-1)n + 1 + i, the window intersects either only the time interval [(k-1)n+1 : kn] (if i = 0) or the two time intervals [(k-1)n+1 : kn] and [kn+1 : (k+1)n] (if i ≠ 0) (see also Fig. 5). Intuitively, as the following definition makes precise, for computing the value of SDTW(R_{ts}^{te}, S_{ts}^{te}) only values from matrices D_ω^k and D_ω^{k+1} are needed, in the general case.

Definition 3 (Stream-DTW). Let te = kn + i, where k ≥ 1 and 0 ≤ i < n, and let ts = te - n + 1 = (k-1)n + 1 + i. The Stream-DTW (SDTW) distance between subsequences R_{ts}^{te} and S_{ts}^{te} is defined as

SDTW(R_{(k-1)n+1+i}^{kn+i}, S_{(k-1)n+1+i}^{kn+i}) = √( α - (β - γ) + δ )    (8)

where

² In Section 4.2 we relax this assumption and consider arbitrary values of t0.


• α = DTW_ωρ(R_{(k-1)n+1}^{kn}, S_{(k-1)n+1}^{kn}) (from matrix D_ω^k)
• β = DTW_ω(R_{(k-1)n+1}^{(k-1)n+1+i}, S_{(k-1)n+1}^{(k-1)n+1+i}) (from matrix D_ω^k)
• γ = d(R_{(k-1)n+1+i}, S_{(k-1)n+1+i})
• δ = DTW_ω(R_{kn+1}^{kn+i}, S_{kn+1}^{kn+i}) (from matrix D_ω^{k+1}).

4.1. Formal results

In this section we prove three basic facts about the SDTW distance, namely: SDTW lower bounds the DTW distance (Theorem 1), it has optimal update complexity (Theorem 2), and its space overhead is linear in the window size (Theorem 3).

Theorem 1 (Lower bound). The SDTW distance is a lower bound of DTW.

Proof. Refer to Fig. 5. Let W_opt be the optimal warping path for aligning R_{(k-1)n+1+i}^{kn+i} and S_{(k-1)n+1+i}^{kn+i}, and consider the frontiers ρ[kn, kn] and ω[kn+1, kn+1]. Consider the first part of W_opt, call it W_opt,k, that ends in a cell of ρ[kn, kn], and the second part of W_opt, call it W_opt,k+1, that starts from a cell in ω[kn+1, kn+1]. We claim that α - (β - γ) lower bounds the DTW contribution, call it DTW_k, corresponding to W_opt,k, and that δ lower bounds the component, call it DTW_{k+1}, corresponding to W_opt,k+1. Since the squared DTW distance between the two subsequences is ≥ DTW_k + DTW_{k+1}, this will prove the result.

[α - (β - γ) ≤ DTW_k]. Consider the optimal start-relaxed path, call it W_β, corresponding to β and ending in cell [ts, ts], and the path W_opt,k, which shares cell [ts, ts] with W_β. Counting just once the contribution of cell [ts, ts], i.e., γ, we end up with a total cost given by β - γ + DTW_k for going from ω[(k-1)n+1, (k-1)n+1] to ρ[kn, kn]. From the definition of DTW_ωρ, this cannot be less than α, which proves the assertion.

[δ ≤ DTW_{k+1}]. Immediate from the definition of the start-relaxed DTW. □

The above proof details how we can approximate the DTW from below by splitting the optimal warping path into two pieces. The two frontiers we use for this purpose are by no means the only possible ones; in particular, as Fig. 5 shows, a part of W_opt could traverse some cells, after leaving ρ[kn, kn] and before entering ω[kn+1, kn+1], that SDTW does not consider at all. We can prove that our arguments still apply should we replace ω[kn+1, kn+1] with the ρ[kn+1, kn+1] frontier.

Unlike simpler approximations such as LB_Keogh, the computation of SDTW shares with DTW the need to compute an optimal warping path, even if on a start-relaxed cumulative distance matrix. This alone suggests that the very first time we compute the SDTW distance we have to pay a cost of O(nb). However, as the following theorem shows, SDTW can be efficiently updated.

Theorem 2 (Update complexity). The SDTW distance can be updated in time O(b) at any time step, which is optimal for any distance that must compare streams' samples using a band of width b.

Proof. Denote with α', β', γ', and δ' the new values, at time step te' = te + 1 = kn + i + 1, of the four terms in Eq. (8). Recall that D_ω^k is the start-relaxed cumulative distance matrix for the time interval [(k-1)n+1 : kn], and that D_ω^{k+1} is the one for the interval [kn+1 : (k+1)n]. Let d^k and d^{k+1} be the corresponding matrices storing the distances between samples of R and S.

The steps needed to update the value of SDTW include the computation of the distance values for the two new samples of R and S, and the extension of matrix D_ω^{k+1} up to the frontier ρ[te', te'] = ρ[te+1, te+1]. Both steps require O(b) time.

Consider now the case when te' = kn + i + 1, with 0 ≤ i < n - 1. We have:

• α' = α, since this term does not depend on i.
• By definition of the start-relaxed DTW and of D_ω^k, it is β' = D_ω^k[(k-1)n+1+i+1, (k-1)n+1+i+1], i.e., computing β' costs O(1). The same is clearly true for γ' = d(R_{(k-1)n+1+i+1}, S_{(k-1)n+1+i+1}).
• Finally, δ' = D_ω^{k+1}[kn+i+1, kn+i+1], again with cost O(1).


When i = n - 1, it is te' = kn + (n-1) + 1 = (k+1)n, and we have β' = γ' and δ' = 0 (by definition of the start-relaxed DTW). The new SDTW value reduces to α' = DTW_ωρ(R_{kn+1}^{(k+1)n}, S_{kn+1}^{(k+1)n}), which, given matrix D_ω^{k+1}, can be computed in O(2b+1) = O(b) time. □

The next theorem concerns the memory requirements of SDTW.

Theorem 3 (Space complexity). The SDTW distance can be computed using O(n + b) = O(n) space.

Proof. Refer to Fig. 5 and observe that, starting from time te = kn + 1, all information in matrices d^z and D_ω^z, with 1 ≤ z < k, is no longer needed to determine future SDTW values. This immediately bounds the required space to O(nb). In other terms, from the very definition of SDTW and as Fig. 5 also suggests, at each time step a naïve implementation would require at most two distance matrices and two cumulative distance matrices.

Consider now Definition 3 and the proof of Theorem 2. To reduce the space to O(n) it is sufficient to observe that, when te ∈ [kn+1 : (k+1)n], all one needs to know about the d^k and D_ω^k matrices in order to correctly compute the SDTW and to update it is just the difference of the corresponding diagonal elements (i.e., the (β - γ) terms) and the (single) α value. This accounts for the n term in O(n + b). Finally, the b term is due to the need of storing the last ρ-frontier of matrix D_ω^{k+1}, which is used to compute and update the δ term. □

It should be observed that the storage required by SDTW is of the same order as that needed for updating the Euclidean distance. On the other hand, continuously updating the DTW distance would require O(nb) space, which is O(n^2) if b grows linearly with n.
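For comparison, the constant-per-step update of the (squared) Euclidean distance over a sliding window can be sketched as follows. This is a minimal illustration of the standard running-sum technique, not code from the paper; taking the square root of each yielded value gives the Euclidean distance proper.

```python
from collections import deque

def euclidean_stream(r_samples, s_samples, n):
    """Yield the squared Euclidean distance between the last n samples of
    two streams, updated in O(1) time and O(n) space per time step."""
    window = deque()          # per-step squared differences inside the window
    total = 0.0               # running sum of squared differences
    for r, s in zip(r_samples, s_samples):
        d = (r - s) ** 2
        window.append(d)
        total += d
        if len(window) > n:   # slide: drop the contribution of the oldest pair
            total -= window.popleft()
        if len(window) == n:
            yield total
```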

4.2. Shifting the start of computation

Unlike other distance measures, the value of the SDTW distance between any subsequences of R and S depends on the time the computation started. Intuitively, this is the case since the values of the four terms that make up the SDTW distance (see Eq. (8)) vary according to how the current window [ts : te] is split by the two intervals [(k − 1)n + 1 : kn] and [kn + 1 : (k + 1)n] (corresponding to matrices D^x_k and D^x_{k+1}, respectively). The following definition helps to characterize this phenomenon.

Definition 4 (Offset and split time). Let t0 be the time step at which the SDTW analysis begins. The offset of the [ts : ts + n − 1] window of size n with respect to t0 is defined as

Offset([ts : ts + n − 1], t0) = (ts − t0) mod n. (9)

The split time, st, of window [ts : ts + n − 1] is the unique time value such that

Offset([st : st + n − 1], t0) = 0, st ∈ [ts : ts + n − 1]. (10)

Notice that split times are exactly those instants at which the SDTW algorithm begins to compute a new start-relaxed cumulative distance matrix. Consequently, the window [ts : ts + n − 1] has a 0-offset with respect to t0 iff its split time coincides with ts, i.e., st = ts.
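Definition 4 translates directly into code. A minimal sketch (the function names are ours):

```python
def offset(ts, n, t0):
    """Offset of the window [ts : ts + n - 1] with respect to start time t0 (Eq. (9))."""
    return (ts - t0) % n

def split_time(ts, n, t0):
    """The unique st in [ts : ts + n - 1] whose window has zero offset (Eq. (10))."""
    return ts + (-offset(ts, n, t0)) % n
```

For instance, with t0 = 1 and n = 4 the split times are 1, 5, 9, ..., matching the reference case 1, n + 1, 2n + 1, ... discussed below.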

To ease understanding, in Section 4 we set t0 = 1, so that the split times were 1, n + 1, 2n + 1, and so on. This is the reference case depicted in Fig. 6(a), where we also highlight how the four components of SDTW, namely α, β, γ, and δ, depend on the position of the split time within the current window.

In the general case in which the SDTW computation starts at time t0 = zn + j, with z ≥ 0 and 0 ≤ j < n − 1, two relevant cases might arise, depending on the value of j. When j = 1, from Eq. (9) and the properties of the mod operator, we immediately have that

Offset([ts : te], zn + 1) = Offset([ts : te], 1) ∀z, ∀ts (11)

which means that the split times do not change with respect to the reference case t0 = 1 (see also Fig. 6(b)). This is no longer true when j ≠ 1 (see Fig. 6(c)), i.e.,

Offset([ts : te], zn + j) ≠ Offset([ts : te], 1) ∀z, ∀ts (12)

If the split times do not change, then it follows from the very definition of SDTW that, for any position of the sliding window, the values of the four components of the SDTW distance will not change with respect

Fig. 6. (a) Reference case: the SDTW computation starts as soon as data points are available (t0 = 1); (b) the SDTW computation starts at time t0 = zn + 1; (c) the SDTW computation starts at t0 = zn + j, j ≠ 1. In (a) and (b) the split times, represented by vertical bars, are the same, whereas this is no longer true in (c).


to the reference case t0 = 1. This is also to say that SDTW(R_ts^te, S_ts^te) is not influenced at all by a delay of zn time steps in the beginning of the computation. On the other hand, when the split times do change, SDTW(R_ts^te, S_ts^te) might change as well. In the limit case, SDTW(R_ts^te, S_ts^te) might assume n distinct values, one for each position of the split time. In Section 5 we will show that this does not influence the good accuracy that SDTW attains in approximating the DTW distance.

5. Experimental results

In this section we present experimental results aiming to evaluate the actual performance of SDTW, both in terms of efficiency and of quality of approximation. For reference purposes, we contrast SDTW with the LB^symm_Keogh method presented in Section 3.1. For fairness, we note that LB_Keogh (of which LB^symm_Keogh is a direct extension) was not created to be a tight approximation of DTW, since it was designed, unlike our SDTW, to be a lower bound allowing index-based processing of DTW queries on time series databases.

All experiments are performed on a 1.60 GHz Intel Pentium 4 CPU with 512 MB of main memory, running the Windows 2000 OS.

We use five different real-world datasets, obtained from the UCR Time Series Data Mining Archive (TSDMA) [19] (see Fig. 7 for samples from these datasets):


Fig. 7. Samples from the five datasets used in the experiments.


• Random walk (RW): classical random walk data, generated according to the equation S_i = S_{i−1} + rand(−0.5, 0.5);

• Burstin (BI): this dataset includes 12 hours of data from a small region of the sky, where Gamma Ray bursts were reported during that period;

• Fluid dynamics (FD): this stream represents one of the signals collected at a frequency of 500 Hz from boundary layer experiments in the fluid dynamics research domain;

• Network (NET): this dataset reports the Round Trip Time (RTT) delay of a sequence of packets sent from the University of California at Riverside to Carnegie Mellon University;

• EEG (EEG): 128 Hz electroencephalographic data acquired at the Department of Physiology (University of Bologna).
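The RW generation rule above can be reproduced in a few lines. A sketch under our own choices of start value and random source:

```python
import random

def random_walk(length, start=0.0):
    """Generate a random-walk stream: S_i = S_{i-1} + rand(-0.5, 0.5)."""
    stream = [start]
    for _ in range(length - 1):
        # each step perturbs the previous sample by a uniform value in [-0.5, 0.5]
        stream.append(stream[-1] + random.uniform(-0.5, 0.5))
    return stream
```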

For each of the above datasets we simulate the acquisition of new samples by reading from memory a new data sample at fixed time intervals (sampling rate). In order to generate multiple streams of the same kind, we partition each dataset into non-overlapping parts, and then take each of such parts as a different stream. Unless otherwise stated, all results we present are averaged over 32 streams, that is, 496 pairs of streams for each dataset.
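The stream-generation setup can be sketched as follows; note that 32 streams indeed yield 32·31/2 = 496 unordered pairs. The partitioning helper is our own illustration, not the authors' code:

```python
from itertools import combinations

def make_streams(dataset, num_streams):
    """Partition a dataset into non-overlapping parts, one per stream."""
    part = len(dataset) // num_streams
    return [dataset[i * part:(i + 1) * part] for i in range(num_streams)]

streams = make_streams(list(range(3200)), 32)
pairs = list(combinations(range(32), 2))   # all unordered pairs of streams
```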

5.1. Quality of approximation

Our first experiments aim to verify that SDTW is indeed a good approximation of the DTW distance, and that it consistently outperforms the LB^symm_Keogh bound. We define the tightness of a DTW approximation as the ratio [7]


Tightness = (value of the approximation) / (value of the DTW distance)

Since both SDTW and LB^symm_Keogh are lower bounds of the DTW distance, their tightness stays in the [0, 1] range, with values closer to 1 denoting a higher quality of the approximation. As can be observed in Table 1, the tightness of SDTW is very stable over the different datasets, and always higher than 0.9. This is not the case for LB^symm_Keogh, whose tightness never reaches 0.8, and is as low as 0.277 for the Burstin dataset. Averaging over all the five datasets, the tightness of SDTW is 56.7% higher than that of LB^symm_Keogh.

Fig. 8 shows a sample trace of DTW, SDTW, and LB^symm_Keogh distances on a pair of EEG streams, considering 128 different positions of the sliding window. Note that, among the five considered datasets, EEG is the one on which the tightness of SDTW is the worst. Nonetheless, SDTW is able to closely track DTW variations along the whole time course of the monitoring process. As to LB^symm_Keogh, it is apparent that it consistently underestimates the true DTW distance, and that its output trace contains much less detail than those of DTW and SDTW.
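Measured over a trace of sliding-window positions, the tightness figures reported in Table 1 are averages of per-window ratios. A minimal sketch (our formulation; the zero-denominator guard is our addition):

```python
def average_tightness(approx_values, dtw_values):
    """Average tightness of a lower bound over a trace of window positions:
    mean of approximation/DTW ratios, each in [0, 1] for a lower bound."""
    ratios = [a / d for a, d in zip(approx_values, dtw_values) if d > 0]
    return sum(ratios) / len(ratios)
```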

Fig. 9(a) and (b) show how the tightness of SDTW varies when either n or b changes, respectively. Note that in both figures the tightness axis starts from 0.75. As expected, when the ratio b/n becomes smaller the tightness improves, since the approximation introduced by the relaxation of the boundary conditions has a diminishing influence. However, even when b is as high as 0.25n, the tightness of SDTW marks a remarkable 80% (see the RW curves in Fig. 9(a) and (b)).

Table 1
Tightness of SDTW and LB^symm_Keogh

                RW     BI     FD     NET    EEG    Average
SDTW            0.948  0.945  0.937  0.936  0.907  0.935
LB^symm_Keogh   0.778  0.277  0.720  0.671  0.536  0.596

Window size n = 128, band width b = 8.

Fig. 8. Sample traces of DTW, SDTW, and LB^symm_Keogh. EEG dataset, window size n = 128, band width b = 8.

Fig. 9. Tightness of SDTW: (a) vs. window size (b = 8); (b) vs. band width (n = 128).


The intuition that the tightness mostly depends on the b/n ratio, rather than on the absolute values of b and n, is confirmed by the results shown in Fig. 10, in which the ratio b/n is kept constant at 1/16 while increasing the sliding window size. Note that on each dataset tightness values vary in a range of at most 0.03.

Fig. 10. Tightness of SDTW; the ratio b/n is kept constant (b/n = 1/16).

Fig. 11. How the SDTW tightness varies with respect to the offset of the sliding window for the datasets under consideration. n = 128, b = 8.


Table 2
Tightness variability of SDTW due to different offset values

            RW     BI     FD     NET    EEG    Average
Average     0.954  0.944  0.934  0.934  0.915  0.936
Max         0.981  0.964  0.974  0.966  0.954  0.968
Min         0.931  0.935  0.893  0.889  0.902  0.910
Max−min     0.051  0.029  0.081  0.078  0.051  0.058

Window size n = 128, band width b = 8.


5.1.1. Offset influence

In the following experiment we investigate the influence of the window offset on the tightness of SDTW. As explained in Section 4.2, changing the start time of the computation, t0, might influence the values of the four components making up the SDTW distance measure, and as such this can lead to different values for a same pair of subsequences R_ts^te and S_ts^te. The results we present in Fig. 11 are obtained by repeating n times the tightness experiments for the setting n = 128 and b = 8, keeping fixed the position of the sliding window and each time varying the value of t0 (t0 = 1, . . . , n − 1). This covers all possible offset cases.

As expected, SDTW values, and consequently SDTW tightness, are influenced by the offset. However, this intrinsic variability always stays limited and does not degrade the overall quality of approximation. For instance, in the worst case, corresponding to the FD dataset, the tightness varies by at most 0.081, ranging from 0.893 to 0.974. Table 2 shows details for all the five datasets. Each row in the table is a specific tightness statistic taken over all offset values, whereas the last column shows the corresponding statistic averaged over all the datasets.

5.1.2. Classification results

The following experiments aim to demonstrate that the high tightness of SDTW also translates into a high quality of results. In particular, we consider using SDTW for classification tasks. We start with three labelled datasets from the UCR TSDMA:3

• Trace: this 4-class dataset is a subset of the Transient Classification Benchmark (trace project) [20]. It is a synthetic dataset designed to simulate instrumentation failures in a nuclear power plant. The full dataset consists of 16 classes (50 instances in each class) and each instance has 4 features. The Trace subset only uses the second feature of classes 2 and 6, and the third feature of classes 3 and 7. Hence, this dataset contains 200 instances, 50 for each class. All instances are linearly interpolated to have the same length of 275 data points and are z-normalized;

• Trace-2: this dataset is a 2-class subset of the Trace dataset, as used in [18], since it only uses the third feature of classes 3 and 6;

• Gun-Point: this dataset comes from the video surveillance domain and has two classes, each containing 100 instances. All instances were created using one female actor and one male actor in a single session. The two classes are:
– Gun-Draw: the actors have their hands by their sides. They draw a replicate gun from a hip-mounted holster, point it at a target for approximately one second, then return the gun to the holster, and their hands to their sides;
– Point: the actors have their hands by their sides. They point with their index fingers at a target for approximately one second, and then return their hands to their sides.
Each instance has a length of 150 data points and is z-normalized.

For each dataset we perform a "leave-one-out" 1-nearest-neighbor classification, using SDTW, DTW, and the Euclidean (L2) distance. Since in this case we are not dealing with streaming data, what we show for

3 The description of the datasets is taken almost verbatim from the index file accompanying the TSDMA distribution CD.


SDTW are just the effects of using a start-relaxed cumulative distance matrix (on which SDTW computation is based) in place of the exact cumulative distance matrix (which is the one used by the DTW distance).

Fig. 12(a) clearly shows that using SDTW in place of DTW introduces no appreciable variation in classification performance. In particular, on both the Trace and Trace-2 datasets both distance measures are able to correctly classify all instances, while for the Gun-Point dataset SDTW only introduces one classification error more than DTW (3 rather than 2 errors over the 200 instances).
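The evaluation protocol here is the standard leave-one-out 1-nearest-neighbor scheme, which works with any distance function. A distance-agnostic sketch (ours, not the authors' code):

```python
def loo_1nn_accuracy(instances, labels, dist):
    """Leave-one-out 1-nearest-neighbor classification accuracy
    under an arbitrary distance function `dist`."""
    correct = 0
    for i, x in enumerate(instances):
        # nearest neighbor among all other instances
        j = min((k for k in range(len(instances)) if k != i),
                key=lambda k: dist(x, instances[k]))
        correct += labels[j] == labels[i]
    return correct / len(instances)
```

Plugging in Euclidean, DTW, or SDTW as `dist` reproduces the three columns of Fig. 12.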

Having shown that SDTW performs similarly to DTW on "easy" classification tasks, we now consider its behavior over four "harder" datasets (namely 50Words, OSU Leaf, Lightning-7, and Fish), which we downloaded from the UCR Time Series Classification/Clustering page [21]. For each dataset, we classify elements in the testing set by using the label of the closest element in the training set. For both DTW and SDTW we use the optimal band width values reported in [21], where further details on these datasets are available. Performance of the three distances over the individual datasets is shown in Fig. 12(b). Even here, the performance of SDTW stays quite close to that of DTW, with both methods significantly outperforming the Euclidean distance. On average, the classification accuracy of DTW is 69.5% and that of SDTW is 67.9%, whereas L2 can correctly classify only 57.4% of the elements.

5.2. Efficiency

Our first experiment concerning efficiency aims to measure how many streams we can monitor in real time with the methods under analysis. More precisely, for a given number of pairs of streams to be compared to each other, we vary the sampling rate and plot the value beyond which we cannot report results before the next stream samples have to be read. Thus, working, say, at 1000 Hz means that all the distance values between the pairs of streams are reported no later than 1 ms.
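The plotted quantity is simply the reciprocal of the total per-step computation time; as a sketch (our formulation):

```python
def max_sampling_rate(per_step_cost_s):
    """Highest sampling rate (Hz) at which all per-step distance updates
    finish before the next samples arrive."""
    return 1.0 / per_step_cost_s

# e.g. if updating all monitored pairs takes 1 ms per step,
# results can be reported at rates up to about 1000 Hz
```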

Fig. 13 shows results when the sliding window includes n = 128 samples and the width of the Sakoe–Chiba band is b = 8. It is evident that SDTW outperforms DTW by about two orders of magnitude. For instance, at 100 Hz SDTW can monitor up to about 100 pairs of streams, whereas DTW can only handle 1 pair.

Fig. 12. Classification accuracy over three "easy" (a) and four "hard" (b) datasets.

Fig. 13. Maximum sampling rate vs. number of pairs of streams; n = 128, b = 8. Results are averaged over all the 5 datasets.


The next figures aim to better analyze how the SDTW efficiency gain depends on the size of the sliding window and on the width of the Sakoe–Chiba band. For each experiment we show two tightly related measures, namely the maximum sampling rate and the speed-up of SDTW with respect to DTW, defined as


Speed-up = (cost of a single DTW computation) / (cost of a single SDTW computation)

In all the experiments we consider 6 pairs of streams to be monitored.

Fig. 14 refers to the case where the size of the sliding window grows, whereas b is kept constant. Since the update complexity of SDTW is O(b), thus independent of n, in this scenario the maximum sampling rate of SDTW stays constant. On the other hand, DTW computation pays the price of starting from scratch at each time step, thus its maximum sampling rate is inversely proportional to n. Overall, the speed-up of SDTW grows linearly with n, as expected.

In Fig. 15 we keep fixed the size of the sliding window and increase the band width. In this case the speed-up stays almost constant at a value that only depends on the window size (n = 128 in the figure). Consequently, for both distance measures the maximum sampling rate decreases as 1/b.

Finally, Fig. 16 shows the case in which both the size of the sliding window and b grow, in such a way that the ratio b/n stays constant (in the figure, b/n = 1/16). In this scenario the update complexity of DTW is O(nb) = O(n^2), whereas that of SDTW is O(b) = O(n). This is perfectly reflected in the experimental results, in which again we observe a linear speed-up. Note that although both DTW and SDTW have to reduce their maximum sampling rate when the window size increases, the reduction of SDTW goes as 1/n, whereas that of DTW goes as 1/n^2.
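The O(nb) from-scratch cost that DTW pays at each step corresponds to filling a banded cumulative distance matrix for every window position. A textbook sketch of Sakoe–Chiba-banded DTW (a generic baseline, not the paper's SDTW; squared point distance is our choice):

```python
import math

def banded_dtw(r, s, b):
    """DTW distance between equal-length sequences r and s, restricted to a
    Sakoe-Chiba band of half-width b: O(n*b) cells are filled per window."""
    n = len(r)
    INF = math.inf
    D = [[INF] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # only cells within the band |i - j| <= b are considered
        for j in range(max(1, i - b), min(n, i + b) + 1):
            cost = (r[i - 1] - s[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][n]
```

Recomputing this at every time step for every monitored pair is precisely what makes plain DTW a factor n slower than the O(b)-per-step SDTW update.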

Fig. 14. Variable window size, b = 8. (a) Maximum sampling rate (averaged over all the datasets); (b) SDTW speed-up.

Fig. 15. Variable band width, n = 128. (a) Maximum sampling rate; (b) SDTW speed-up.


Fig. 16. Variable window size, b = n/16. (a) Maximum sampling rate; (b) SDTW speed-up.


5.3. Pruning power

Our last experiment aims to shed some light on the effects of the tightness in the case one is interested in performing DTW-based queries on streams. More precisely, we consider the task of determining whether the DTW distance falls below a given threshold value ε, and use either LB^symm_Keogh or SDTW to filter out irrelevant subsequences. Recall that, according to Theorem 1, SDTW can be safely used for this purpose since it is a lower bound of DTW. To measure the performance of DTW approximations we consider their pruning power, defined as [7]


Pruning Power = 1 − (number of DTW computations with the filter) / (number of DTW computations without the filter)

Clearly, the higher the pruning power, the smaller the number of (costly) DTW computations that have to be performed, and the better the overall performance. Note that for a fixed query selectivity, which measures the fraction of cases in which the DTW distance falls under the threshold, increasing the pruning power also reduces the number of false alarms.
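In a filter & refine setting the definition translates directly: a DTW computation is needed only for candidates whose lower bound does not exceed the threshold. A sketch (variable names are ours):

```python
def pruning_power(lower_bounds, dtw_values, eps):
    """Pruning power of a lower-bounding filter for the query DTW <= eps."""
    total = len(dtw_values)                           # DTW calls without the filter
    refined = sum(lb <= eps for lb in lower_bounds)   # candidates surviving the filter
    return 1 - refined / total
```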

Table 3 reports the pruning power of LB^symm_Keogh and SDTW for the different datasets. The experimental setting is the same as in Table 1 (496 pairs of streams, n = 128 and b = 8) and ε is chosen so as to have a 1% selectivity, that is, in 1% of the cases the DTW distance is under the threshold. Note that this limits the maximum pruning power to 0.99. It is evident that the good tightness of SDTW pays off, in that for all datasets SDTW prunes almost all irrelevant subsequences. In particular, for the Fluid dynamics (FD) dataset no false alarm at all occurs. This and Random walk (RW) are the only datasets for which LB^symm_Keogh exhibits pruning capabilities somewhat comparable to those of SDTW, whereas in the other cases its effectiveness is seriously compromised. Overall, the pruning power of SDTW translates into a negligible 0.8% average percentage of false alarms, whereas this value jumps to 41.1% for LB^symm_Keogh.

We conclude this section with some observations concerning this filter & refine query processing scenario. We recall that this is just a possible application of the SDTW distance, which in this paper we have introduced as an alternative to DTW for monitoring purposes.

Table 3
Pruning power of SDTW and LB^symm_Keogh

                RW     BI     FD     NET    EEG    Average
SDTW            0.982  0.974  0.990  0.984  0.978  0.982
LB^symm_Keogh   0.929  0.000  0.946  0.740  0.280  0.579

Window size n = 128, band width b = 8, DTW selectivity = 1%.


Although the above results indeed show that SDTW is able to consistently limit the number of false alarms, one should be aware that long bursts of streams' subsequences for which the DTW distance needs to be computed are always possible, even with an "ideal" approximation that attains the maximum possible pruning power (i.e., no false alarms at all). What should one expect in this scenario? To make things precise, assume that streams are sampled at a frequency that does not allow DTW computation at each time step for all pairs of streams, whereas the sampling rate stays below the maximum one tolerated by SDTW. In particular, SDTW computation on all pairs of streams takes only a fraction, say load < 1, of the period between two time steps. For instance, for a sampling rate of 100 Hz a load = 0.7 means that all SDTW computations require 0.7/100 s (7 ms), and that for 3 ms the system is idle. If at time t we need to perform one or more DTW computations, we can use the idle time. This will introduce no delay at all if the number of subsequences surviving the filtering step does not exceed a certain value, which depends on both the cost of a single DTW computation and on 1 − load. If this is not the case, some DTW distance will be computed during the next time interval, thus in turn delaying the upcoming SDTW computations, and so on.
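Under the stated assumptions, the queueing effect can be illustrated with a toy backlog recurrence. This is entirely our sketch (arbitrary time units, e.g. milliseconds per step), not an analysis from the paper:

```python
def backlog_trace(idle_per_step, dtw_cost, refinements_per_step):
    """Trace the DTW-work backlog across time steps, assuming refinements
    can only be served during the idle fraction of each step."""
    backlog, trace = 0.0, []
    for k in refinements_per_step:
        # new DTW work arrives, idle time absorbs as much backlog as it can
        backlog = max(0.0, backlog + k * dtw_cost - idle_per_step)
        trace.append(backlog)
    return trace

trace = backlog_trace(idle_per_step=3, dtw_cost=2, refinements_per_step=[0, 3, 0, 0])
# a step with 3 surviving candidates creates 6 units of DTW work against
# 3 idle units, i.e. a backlog that the next step's idle time absorbs
```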

Although the analysis of this scenario is beyond the scope of the paper, it is important to observe that, for any fixed amount of memory available to store unprocessed streams' samples, one might experience bursts whose length leads to memory saturation, thus forcing the adoption of some load shedding technique [22]. It is therefore an interesting problem to precisely understand how the number and the characteristics of the streams might affect the distribution of delays and the probability of memory saturation.

6. Related works

There is a large body of literature on data stream systems that is only marginally relevant here, for which we refer the reader to [12] and [23]. The works most closely related to ours are [3] and [24]. The StatStream system by Zhu and Shasha [3] was designed to speed up the real-time computation of statistics over a large number of streams. In particular, for monitoring the correlation between pairs of streams, StatStream uses an approximate Euclidean distance based on the first m (m ≪ n) coefficients of the Discrete Fourier Transform (DFT) of streams' subsequences [16]. From the results in [25], the DFT coefficients can be updated in constant time at each time step,4 which is a key to the high performance level achieved by StatStream. As argued and demonstrated several times, the Euclidean distance can lead to inaccurate results. Further, the DFT approximation only works well for streams whose energy concentrates in the low frequencies, which is not always the case [26]. Similar considerations apply to the Stardust system by Bulut and Singh [24], which uses a wavelet-based approximate Euclidean distance and improves over StatStream for detecting correlations above a given threshold.

Monitoring data streams to look for the occurrence of specific patterns is also an important problem that has been the subject of some recent works [24,27]. The base case in which each pattern is independently checked against incoming data is considered in [24], whereas [27] defines an envelope (much like that used by LB_Keogh) over a set of (similar) patterns. Since the distance between the envelope and the current stream subsequence lower bounds the Euclidean distance between the subsequence and any of the patterns in the envelope, this method can quickly discard many irrelevant patterns at once.

As far as we know, the usage of the Dynamic Time Warping distance in a streaming environment has not been considered before. Although many works adopt the DTW distance (see, e.g., [14,15,7,28,29]), all of them consider the problem of searching for close matches to a query in large, static time series archives. Interestingly, extending the technique developed in [29] for speeding up DTW-based nearest neighbor queries to work over streams is left as "a promising but very challenging direction" (quoted from [29]).

The work by Wong and Wong [30] presents a method for supporting DTW-based subsequence queries, where one wants to find all the subsequences S_i^j in the database within distance ε from a query Q. The starting point of the method in [30] is that a point of S_i^j must be aligned with at least one point of a subsequence Q_x^y of Q. This is somewhat similar to our "frontier" concept (Definition 1). However, while we use

4 Here the assumption is that the number of DFT coefficients, m, is independent of n, the size of the sliding window.


the frontier to split the optimal warping path of DTW into two parts, Wong and Wong build on the above observation in a completely different way, which results in a generalization of LB_Keogh to the case of subsequence queries.

7. Conclusions

In this paper we have proposed a novel method, called SDTW (Stream-DTW), to compare data streams. SDTW, unlike the well-known Dynamic Time Warping (DTW) distance, can be efficiently updated when a window slides over the streams. Our experimental results confirm the theoretical analysis and show that the efficiency improvement of SDTW over DTW grows linearly with the size of the sliding window, thus making it possible to compute a DTW-like distance measure even on (very) large sliding windows. We have also shown that extensions of methods developed for the static case, i.e., when time series are known in advance and are used to speed up the evaluation of DTW-based queries, cannot match SDTW in terms of accuracy.

While in this paper we have focused on the problem of monitoring the distance of streams and proposed SDTW as an alternative to DTW, we have also shown that the good accuracy of SDTW, together with its lower-bounding property, makes SDTW an ideal candidate for implementing the filtering step if one wants to perform DTW-based queries on streams.

Our results open a series of interesting research directions and possibilities. An efficiently updatable DTW-like distance measure makes it possible to perform on streams several analyses developed only for the static case, such as those related to mining tasks. For instance, SDTW could be compared to the Euclidean distance in clustering and classification tasks, as already done with static time series. Our preliminary results along this direction indeed show that using SDTW in place of DTW leads to no appreciable difference in the quality of classification. It would also be interesting to investigate whether the basic techniques we have used to approximate the DTW distance, while preserving its very spirit, could be applied to other complex distances proposed for the comparison of time-varying signals that cannot be easily extended to a streaming scenario. One of them is the Fréchet distance [31].

Finally, a major issue that remains to be addressed concerns the definition of a method able to compute a DTW-like distance on streams that need to be normalized on-line. Indeed, working with unnormalized data may have negative effects on the significance of any similarity measure, as demonstrated for static time series [32]. Extending our method to normalized data does not seem to be an easy task. Intuitively, the problem with normalization is that a data sample no longer has a fixed value, since at each time step the normalization constants (mean and standard deviation) may change. Reusing computations performed at previous time steps then becomes a major challenge which is worth addressing.

References

[1] W. Lin, M.A. Orgun, G.J. Williams, An overview of temporal data mining, in: Proceedings of the first Australian Data MiningWorkshop (ADM’02), Canberra, Australia, 2002, pp. 83–90.

[2] J.F. Roddick, K. Hornsby, M. Spiliopoulou, An updated bibliography of temporal, spatial, and spatio-temporal data miningresearch, in: Temporal, Spatial, and Spatio-Temporal Data Mining: First International Workshop (TSDM 2000), Lyon, France,2000, pp. 147–164.

[3] Y. Zhu, D. Shasha, StatStream: statistical monitoring of thousands of data streams in real time, in: Proceedings of the 28thInternational Conference on Very Large Data Bases (VLDB’02), Hong Kong, China, 2002, pp. 358–369.

[4] D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in: AAAI 1994 Workshop on KnowledgeDiscovery in Databases, Seattle, WA, USA, 1994, pp. 359–370.

[5] C.A. Ratanamahatana, E.J. Keogh, Making Time-Series Classification More Accurate Using Learned Constraints, in: Proceedings ofthe Fourth SIAM International Conference on Data Mining (SDM’04), Lake Buena Vista, FL, USA, 2004.

[6] I. Bartolini, P. Ciaccia, M. Patella, WARP: accurate retrieval of shapes using phase and time warping distance, IEEE Transactions onPattern Analysis and Machine Intelligence 27 (1) (2005) 142–147.

[7] E. Keogh, Exact indexing of dynamic time warping, in: Proceedings of the 28th International Conference on Very Large Data Bases(VLDB’02), Hong Kong, China, 2002, pp. 406–417.

[8] B.-K. Yi, C. Faloutsos, Fast time sequence indexing for arbitrary Lp norms, in: Proceedings of the 26th International Conference onVery Large Data Bases (VLDB 2000), Cairo, Egypt, 2000, pp. 385–394.


[9] E. Keogh, T. Palpanas, V.B. Zordan, D. Gunopulos, M. Cardle, Indexing large human-motion databases, in: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB’04), Toronto, Canada, 2004, pp. 780–791.

[10] S. Chu, E.J. Keogh, D. Hart, M. Pazzani, Iterative deepening dynamic time warping for time series, in: Proceedings of the Second SIAM International Conference on Data Mining (SDM’02), Arlington, VA, USA, 2002.

[11] H. Sakoe, S. Chiba, A dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1) (1978) 43–49.

[12] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in: Proceedings of the 21st Symposium on Principles of Database Systems (PODS’02), Madison, WI, USA, 2002, pp. 1–16.

[13] S. Babu, J. Widom, Continuous queries over data streams, SIGMOD Record 30 (3) (2001) 109–120.

[14] B.-K. Yi, H.V. Jagadish, C. Faloutsos, Efficient retrieval of similar time sequences under time warping, in: Proceedings of the 14th International Conference on Data Engineering (ICDE’98), Orlando, FL, USA, 1998, pp. 201–208.

[15] S.-W. Kim, S. Park, W.W. Chu, An index-based approach for similarity search supporting time warping in large sequence databases, in: Proceedings of the 17th International Conference on Data Engineering (ICDE’01), Heidelberg, Germany, 2001, pp. 607–614.

[16] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: Proceedings of the 4th International Conference on Foundations of Data Organizations and Algorithms (FODO’93), Chicago, IL, USA, 1993, pp. 69–84.

[17] P. Ciaccia, M. Patella, Searching in metric spaces with user-defined and approximate distances, ACM Transactions on Database Systems 27 (4) (2002) 398–437.

[18] E.J. Keogh, C.A. Ratanamahatana, Exact indexing of dynamic time warping, Knowledge and Information Systems 7 (3) (2005) 358–386.

[19] E.J. Keogh, T. Folias, The UCR Time Series Data Mining Archive, Available at URL <http://www.cs.ucr.edu/~eamonn/TSDMA/index.html>, 2002.

[20] D. Roverso, Multivariate temporal classification by windowed wavelet decomposition and recurrent neural networks, in: Proceedings of the 3rd ANS International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface Technologies, Washington, DC, USA, 2000, pp. 527–538.

[21] E.J. Keogh, X. Xi, L. Wei, C.A. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage, Available from: <http://www.cs.ucr.edu/~eamonn/time_series_data/>, 2006.

[22] N. Tatbul, U. Cetintemel, S.B. Zdonik, M. Cherniack, M. Stonebraker, Load shedding in a data stream manager, in: Proceedings of the 29th International Conference on Very Large Data Bases (VLDB’03), Berlin, Germany, 2003, pp. 309–320.

[23] L. Golab, M.T. Ozsu, Issues in data stream management, SIGMOD Record 32 (2) (2003) 5–14.

[24] A. Bulut, A.K. Singh, A unified framework for monitoring data streams in real time, in: Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 2005, pp. 44–55.

[25] D.Q. Goldin, P.C. Kanellakis, On similarity queries for time-series data: constraint specification and implementation, in: Proceedings of the First International Conference on Principles and Practice of Constraint Programming (CP’95), Cassis, France, 1995, pp. 137–153.

[26] R. Cole, D. Shasha, X. Zhao, Fast window correlations over uncooperative time series, in: Proceedings of the 11th International Conference on Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, USA, 2005, pp. 743–749.

[27] L. Wei, E.J. Keogh, H.V. Herle, A. Mafra-Neto, Atomic wedgie: efficient query filtering for streaming time series, in: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 2005, pp. 490–497.

[28] Y. Zhu, D. Shasha, Warping indexes with envelope transforms for query by humming, in: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, 2003, pp. 181–192.

[29] Y. Sakurai, M. Yoshikawa, C. Faloutsos, FTW: fast similarity search under the time warping distance, in: Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’05), Baltimore, MD, USA, 2005, pp. 326–337.

[30] T.S.F. Wong, M.H. Wong, Efficient subsequence matching for sequences databases under time warping, in: Proceedings of the 7th International Database Engineering and Applications Symposium (IDEAS 2003), Hong Kong, China, 2003, pp. 139–148.

[31] S. Brakatsoulas, D. Pfoser, R. Salas, C. Wenk, On map-matching vehicle tracking data, in: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB’05), Trondheim, Norway, 2005, pp. 853–864.

[32] E. Keogh, S. Kasetty, On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Mining and Knowledge Discovery 7 (4) (2003) 349–371.

Paolo Capitani got a ‘‘Laurea’’ degree in Biology from the University of Bologna. Afterward, he focused his research interests in the field of computer science and he received a Ph.D. in Electronics, Computer Science and Telecommunications working on data streams and data mining of biological signals. He is currently collaborating with the Department of Electronics, Computer Science and Systems and with the Department of Human and General Physiology at the University of Bologna.


Paolo Ciaccia has a ‘‘Laurea’’ degree in Electronic Engineering (1985) and a Ph.D. in Electronic and Computer Engineering (1992), both from the University of Bologna, Italy, where he has been Full Professor of Information Systems since 2000. He has published more than 80 refereed papers in major international journals (including IEEE TKDE, IEEE TSE, IEEE TPAMI, ACM TODS, ACM TOIS, IS, DKE, and Biological Cybernetics) and conferences (including VLDB, ACM-PODS, EDBT, ICDE, and CIKM). He was PC chair of the 2002 edition of the SEBD Italian conference on advanced data bases. His current research interests include similarity and preference-based query processing, image retrieval, P2P systems, and data streams. He is one of the designers of the M-tree, an index for metric data used by many multimedia and data mining research groups in the world. He is a member of IEEE.
