Data & Knowledge Engineering 62 (2007) 438–458
www.elsevier.com/locate/datak
Warping the time on data streams
Paolo Capitani, Paolo Ciaccia *
DEIS, University of Bologna, Viale Risorgimento, 2, 40136 Bologna, Italy
Received 12 August 2006; accepted 12 August 2006; available online 17 October 2006
Abstract
Continuously monitoring through time the correlation/distance of multiple data streams is of interest in a variety of applications, including financial analysis, video surveillance, and mining of biological data. However, distance measures commonly adopted for comparing time series, such as Euclidean and Dynamic Time Warping (DTW), either are known to be inaccurate or are too time-consuming to be applied in a streaming environment. In this paper we propose a novel DTW-like distance measure, called Stream-DTW (SDTW), which unlike DTW can be efficiently updated at each time step. We formally and experimentally demonstrate that SDTW speeds up the monitoring process by a factor that grows linearly with the size of the window sliding over the streams. For instance, with a sliding window of 512 samples, SDTW is about 600 times faster than DTW. We also show that SDTW is a tight approximation of DTW, with errors never exceeding 10%, and that it consistently outperforms approximations developed for the case of static time series.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Data streams; Dynamic time warping; Sliding window
1. Introduction
Management of data streams has recently emerged as one of the most challenging extensions of database technology. The proliferation of sensor networks, as well as the availability of massive amounts of streaming data related to telecommunications traffic monitoring, web-click logs, geophysical measurements and many others, has motivated the investigation of new methods for their modelling, storage, and querying. In particular, continuously monitoring through time the correlation of multiple data streams is of interest in order to detect similar behaviors of stock prices, for video surveillance applications, for the synchronization of biological signals and, more in general, for the mining of temporal patterns [1,2].
Previous works dealing with the problem of detecting when two or more streams exhibit a high correlation in a certain time interval have tried to extend techniques developed for (static) time series to the streaming environment. In particular, Zhu and Shasha [3], by adopting a sliding window model and the Euclidean distance as a measure of correlation (low distance = high correlation), have been able to monitor in real-time up to 10,000 streams on a PC. However, it is known that for time-varying data a much better accuracy can be obtained if one uses the Dynamic Time Warping (DTW) distance [4]. Since DTW can compensate for stretches along the temporal axis, it provides a way to optimally align time series that matches the user's intuition of similarity much better than the Euclidean distance does, as demonstrated several times (see, e.g., [5] for some novel DTW applications and [6] for a recent DTW-based approach to shape matching). Further, although not a metric, DTW can be indexed [7], which allows this distance to be applied also in the case of large time series archives.

0169-023X/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.datak.2006.08.012
* Corresponding author. E-mail addresses: [email protected] (P. Capitani), [email protected] (P. Ciaccia).
Unfortunately, when considering streams the benefits of DTW seem to vanish since, unlike the Euclidean distance, it cannot be efficiently updated. The basic reason is that the (optimal) alignment one has established at time t is not guaranteed to still be optimal at time t + 1, thus forcing the DTW to be recomputed from scratch at each time step, with a complexity that, for a sliding window of size n, varies between O(n) and O(n²), depending on how specific global constraints on valid alignments are set (see Section 2 for details).

Given this unpleasant state of affairs, in this paper we propose the novel SDTW (Stream-DTW) distance measure to be used in place of DTW for the purpose of continuously monitoring the distance between two or more streams. SDTW preserves the ability of DTW to compensate for stretches along the temporal axis, yet, unlike DTW, is efficiently updatable. More precisely, the work required at each time step can vary from a minimum of O(1) to a maximum of O(n), depending on the global constraints on alignments. In both cases this represents a remarkable O(n) speed-up over DTW computation. Our experiments on five real-world datasets show that SDTW is indeed a very good approximation of DTW, with the error never exceeding 10%. We also demonstrate that other approximations of DTW, proposed for the static case, incur much higher errors and have a highly variable behavior over the datasets.

Besides being applicable as a valid alternative to DTW for monitoring purposes, SDTW can also be used as an effective filtering tool to speed up the computation of the result of DTW-based queries in a streaming environment. This holds since we prove that SDTW always lower bounds the DTW. For this querying scenario we provide some preliminary results showing that SDTW is particularly effective in limiting the number of false alarms.

The rest of the paper is organized as follows. In Section 2 we provide the necessary background on the DTW distance. Section 3 highlights the limits of applying the DTW distance in a streaming environment and presents a simple extension of the well-known LB_Keogh lower-bounding DTW approximation. In Section 4 we introduce the novel SDTW distance measure and prove some of its basic properties. Section 5 provides experimental evidence of the efficiency and the accuracy of SDTW. Section 6 reviews the relevant related works and Section 7 concludes and suggests directions for future research.
2. Dynamic time warping
We start with some basic definitions related to the static case, i.e., for real-valued time series of finite length. Let R ≡ R_1^n and S ≡ S_1^n be two time series of length n, and let R_i (S_i) be the i-th sample of R (resp. S). A common way to compare two time series is by computing their Euclidean distance (L2) [8]:

L2(R_1^n, S_1^n) = sqrt( Σ_{i=1}^{n} (R_i − S_i)² )   (1)

It is important to realize that L2 (and similar metrics as well, such as the Manhattan (L1) and the "max" (L∞) distances) only compares corresponding samples, thus it does not allow for local non-linear stretches along the temporal axis.¹ As a consequence, two time series might lead to a high L2 value even when they are very similar, which can have negative effects on common mining tasks, such as classification and clustering [10]. This well-known problem is solved by the Dynamic Time Warping (DTW) distance [4]. The key idea of DTW is that any point of a series can be (forward and/or backward) aligned with multiple points of the other series that lie in different temporal positions, so as to compensate for temporal shifts.

¹ Observe that the problem of global or uniform stretching is quite different; see [9].
Fig. 1. Computing the DTW distance with a Sakoe–Chiba band of width b = 2: (a) optimal alignments; (b) the d matrix of pairwise sample distances; (c) the D cumulative distance matrix; highlighted cells constitute the optimal warping path W_opt.
The definition of DTW is based on the notion of warping path. Let d be the n × n matrix of pairwise squared distances between samples of R and S, d[i,j] = (R_i − S_j)². A warping path W = ⟨w_1, w_2, …, w_K⟩ is a sequence of K (n ≤ K ≤ 2n − 1) matrix cells, w_k = [i_k, j_k] (1 ≤ i_k, j_k ≤ n), such that:

Boundary conditions: w_1 = [1,1] and w_K = [n,n], i.e., W starts in the lower-left cell and ends in the upper-right cell;
Continuity: given w_{k−1} = [i_{k−1}, j_{k−1}] and w_k = [i_k, j_k], then i_k − i_{k−1} ≤ 1 and j_k − j_{k−1} ≤ 1. This ensures that the cells of the warping path are adjacent;
Monotonicity: given w_{k−1} = [i_{k−1}, j_{k−1}] and w_k = [i_k, j_k], then i_k − i_{k−1} ≥ 0 and j_k − j_{k−1} ≥ 0, with at least one strict inequality. This forces W to progress over time.

Any warping path W defines an alignment between R and S and, consequently, a cost to align the two series. The (squared) DTW distance is the minimum of such costs, i.e., the cost of the optimal warping path, W_opt:

DTW(R_1^n, S_1^n)² = min_W { Σ_{[i_k,j_k] ∈ W} d[i_k, j_k] } = Σ_{[i_k,j_k] ∈ W_opt} d[i_k, j_k]   (2)

The DTW distance can be recursively computed using an O(n²) dynamic programming approach that fills the cells of a cumulative distance matrix D using the following recurrence relation (see also Fig. 1):

D[i,j] = d[i,j] + min{D[i−1, j−1], D[i−1, j], D[i, j−1]}  (1 ≤ i, j ≤ n)   (3)

and then setting DTW(R_1^n, S_1^n) = sqrt(D[n,n]).
Note that sqrt(D[i,j]) = DTW(R_1^i, S_1^j), ∀i,j, since the DTW distance is well-defined even when the two series have different lengths [4].

In practical applications, warping paths are commonly subject to some global constraints, in order to prevent pathological alignments. The most commonly used of such constraints is the Sakoe–Chiba band [11] of width b, which forces warping paths to deviate no more than b steps from the matrix diagonal (see Fig. 1(b)). It is worth noting that, besides reducing the complexity of computing DTW to O(nb), rather surprisingly a band of limited width b leads to better results in classification tasks, even when b is a small fraction (say, 3–4%) of n [5].
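As a concrete sketch of the recurrence in Eq. (3) combined with a Sakoe–Chiba band, the following Python function computes the banded DTW distance between two equal-length windows. This is our own minimal illustration (function and variable names are not from the paper), not the authors' implementation:

```python
import math

def dtw(r, s, b):
    """DTW distance between equal-length series r and s, with a
    Sakoe-Chiba band of width b: cells farther than b steps from the
    diagonal are treated as unreachable (infinite cost)."""
    n = len(r)
    assert len(s) == n
    INF = float("inf")
    # D mirrors the cumulative distance matrix of Eq. (3) (0-based here).
    D = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - b), min(n, i + b + 1)):
            d = (r[i] - s[j]) ** 2  # pairwise squared distance d[i, j]
            if i == 0 and j == 0:
                D[i][j] = d  # boundary condition: paths start in cell [1, 1]
            else:
                prev = min(
                    D[i - 1][j - 1] if i > 0 and j > 0 else INF,
                    D[i - 1][j] if i > 0 else INF,
                    D[i][j - 1] if j > 0 else INF,
                )
                D[i][j] = d + prev
    return math.sqrt(D[n - 1][n - 1])
```

With b = n − 1 the band is inactive and this is plain O(n²) DTW; with a fixed band the loop body runs O(nb) times, matching the complexity discussed above.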
3. Data streams: what is the problem?
Let us now turn to a streaming environment, in which R and S are possibly infinite sequences of samples. The problem we address here is how to efficiently and accurately monitor through time the distance between R and S. In doing this we make two basic assumptions, which serve to precisely define the scenario:
(1) We assume that at each time step a new sample for both streams becomes available, i.e., the streams are synchronized. This hypothesis is not really restrictive and can be relaxed at the price of unnecessarily complicating the presentation. In case one of the two streams, say R, is sampled at a higher frequency than the other, i.e., S, we might either undersample R or interpolate S's samples.
(2) We consider that, at any time step t, only the last n points of the streams, namely the subsequences R_{t−n+1}^t and S_{t−n+1}^t, are significant. This is the well-known sliding window model, which allows the monitoring activity to focus only on the more recent elements of the streams. Indeed, in the majority of real-world applications in which one needs to produce results in real-time, looking only at what has recently happened is far more important than tracking the whole streams' history [12,13].
A naïve approach to continuously monitor the distance between R and S would be to compute at time t the distance between R_{t−n+1}^t and S_{t−n+1}^t from scratch, that is, without considering the computation performed at previous time steps. This approach has a limited applicability, since it immediately breaks down in computationally demanding environments, where sampling rates are high, sliding windows are large, and/or many (hundreds, thousands) streams have to be monitored [3]. Indeed, in all these scenarios one needs a method that is not only accurate in comparing streams but also efficient, that is, able to exploit previously performed computations.
As anticipated, the DTW distance is not efficiently updatable, which is to say that the work to be performed at each time step is O(nb), for a window of size n and a band of width b. Let W_opt(t) be the optimal warping path for subsequences R_{t−n+1}^t and S_{t−n+1}^t, and W_opt(t+1) be the one for R_{t−n+2}^{t+1} and S_{t−n+2}^{t+1}. Since, in the limit case, the two optimal warping paths might share no common cell (alignment) at all, it is clear that W_opt(t) is of no help in determining W_opt(t+1), which implies that W_opt(t+1) has to be computed from scratch. This is true for each time step.
Given this unpleasant state of things, we are left with the problem of devising a method for continuous data streams that satisfies both of the following requirements:

Accuracy: It should be a good approximation of DTW. This is because of the success that DTW has already demonstrated for the static case (i.e., time series).
Efficiency: It should be fast to update, i.e., its complexity should be at most O(b). Note that this is the best one can achieve with a DTW-like distance measure, since at each time step 2b + 1 distances between samples necessarily have to be computed. In particular, if b does not vary with n, the update complexity should be independent of the sliding window size, n.

Note that the DTW distance trivially satisfies the first requirement but not the second, whereas the opposite is true for the Euclidean distance and similar metrics.
3.1. Extending approximation techniques proposed for the static case
One possible approach to satisfy both above-stated requirements is to extend to the streaming environment (lower-bounding) approximations of the DTW distance that have been developed for the static case, where they are used to speed up DTW-based queries over large time series databases [14,15,7]. In this setting we are given a (query) time series Q and a search threshold ε, and want to retrieve all the time series S in the database for which DTW(Q,S) ≤ ε holds. The idea of such methods is to introduce a lower-bounding distance measure, d_lb, and to use it to discard all series S such that d_lb(Q,S) > ε, thus without performing the (costly) DTW(Q,S) computation. This works since d_lb(Q,S) > ε implies DTW(Q,S) > ε if d_lb lower bounds DTW. The DTW computation needs to be executed only for those series that survive the filtering step. The effectiveness of this filter & refine approach critically depends on two contrasting factors [16,17]: d_lb should be cheap to compute and, at the same time, it should have a good accuracy, so as to limit the number of false alarms, i.e., the number of series for which both d_lb(Q,S) ≤ ε and DTW(Q,S) > ε hold.
The best known method for approximating the DTW from below is due to Keogh [7], and consists in constructing an envelope, Env(Q), around Q, after which a Euclidean-like distance between Env(Q) and S is computed. More in detail, let Up(Q) and Low(Q) be two time series defined as follows (see Fig. 2(a)):
Fig. 2. (a) The envelope of Q when b = 4; (b) Keogh's lower-bounding distance. In the example: LB_Keogh(Env(Q), S) = 0.416.
442 P. Capitani, P. Ciaccia / Data & Knowledge Engineering 62 (2007) 438–458
Up_i(Q) = max{Q_j | j ∈ [max{1, i−b}, min{n, i+b}]}
Low_i(Q) = min{Q_j | j ∈ [max{1, i−b}, min{n, i+b}]}

Thus, Up_i(Q) (Low_i(Q)) is simply the maximum (respectively, the minimum) of the Q_j values in the interval [i−b, i+b], centered in i and of width 2b + 1. The definition is completed so as to properly take into account border effects. Consequently, Env(Q) = (Up(Q), Low(Q)) encloses all the allowed stretches of Q. Finally, the (squared) LB_Keogh distance between Env(Q) and S is defined as

LB_Keogh(Env(Q), S)² = Σ_{i=1}^{n} term_i,  with

  term_i = (S_i − Up_i(Q))²   if S_i > Up_i(Q)
         = 0                  if S_i ∈ [Low_i(Q), Up_i(Q)]
         = (Low_i(Q) − S_i)²  if Low_i(Q) > S_i   (4)
and corresponds to the shaded area in Fig. 2(b).

We can easily adapt Keogh's method to deal with streams as follows. Let LB_Keogh(Env(R_{t−n+1}^t), S_{t−n+1}^t) be the lower-bounding distance between the envelope of R_{t−n+1}^t and S_{t−n+1}^t. Since in our scenario the two streams R and S play a symmetric role, we can similarly compute LB_Keogh(Env(S_{t−n+1}^t), R_{t−n+1}^t) and then, in order to improve accuracy, define the symmetric LB_Keogh lower-bounding distance as the maximum of the two measures (see Fig. 3):

LB^symm_Keogh(R_{t−n+1}^t, S_{t−n+1}^t) = max{LB_Keogh(Env(R_{t−n+1}^t), S_{t−n+1}^t), LB_Keogh(Env(S_{t−n+1}^t), R_{t−n+1}^t)}   (5)
Note that, when updating LB^symm_Keogh, only the first b and the last b values of the two envelopes can possibly change upon arrival of the new stream samples.

Depending on the width b of the band and on the specific dataset, the quality of the approximation of LB_Keogh with respect to DTW is known to vary wildly (see, e.g., [7,18]). Thus, as our experiments also confirm (see Section 5), using LB_Keogh (as well as its symmetric version) in place of DTW for monitoring purposes is not advisable, since it only satisfies the second of our requirements (fast to update) but not the first one (good approximation of DTW).

Fig. 3. The symmetric LB^symm_Keogh lower-bounding distance: (a) the envelope of R; (b) the envelope of S. In the example: LB^symm_Keogh(R, S) = 1.032.
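The envelope and the LB_Keogh computation of Eqs. (4) and (5) can be sketched in Python as follows; this is a minimal rendering with names of our choosing, assuming equal-length windows:

```python
import math

def envelope(q, b):
    """Up(q) and Low(q): running max/min of q over the interval
    [i - b, i + b], clipped at the series borders."""
    n = len(q)
    up = [max(q[max(0, i - b):min(n, i + b + 1)]) for i in range(n)]
    low = [min(q[max(0, i - b):min(n, i + b + 1)]) for i in range(n)]
    return up, low

def lb_keogh(q, s, b):
    """Square root of the summation in Eq. (4): only the parts of s
    falling outside Env(q) contribute to the sum."""
    up, low = envelope(q, b)
    total = 0.0
    for si, u, l in zip(s, up, low):
        if si > u:
            total += (si - u) ** 2
        elif si < l:
            total += (l - si) ** 2
    return math.sqrt(total)

def lb_keogh_symm(r, s, b):
    """Symmetric version of Eq. (5): the max of the two one-sided bounds."""
    return max(lb_keogh(r, s, b), lb_keogh(s, r, b))
```

Note how cheap the per-sample test is; this is exactly why the symmetric bound is fast to update, but also why its accuracy can vary wildly, as discussed in the text.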
4. The Stream-DTW distance
The key to satisfying both requirements of accuracy and efficiency starts from a couple of simple observations. First, in order to obtain a high accuracy, a DTW-like style of computation should be preserved. Note that this is not the case with LB^symm_Keogh, since the original LB_Keogh from which it is derived was designed as a lower bound also able to support exact indexing of DTW. Second, in order to save computational resources, one should realize that the major source of difficulty in updating the DTW distance lies in its boundary conditions. To see why this is the case, consider again the optimal warping paths W_opt(t) and W_opt(t+1). Since W_opt(t+1) necessarily has to start from cell [t−n+2, t−n+2], in order to reuse at least part of the computation performed to determine W_opt(t), this path should pass through such a cell. However, this is very unlikely to be the case.
The first step towards the definition of a new, DTW-like, distance measure suitable for continuous streams, which we call Stream-DTW (SDTW), consists in observing that, even if one cannot be confident that an optimal warping path passes through a single given cell, it is easy to guarantee that it passes through one of a (suitably chosen) set of cells. This is formalized by the following preliminary definition (see also Fig. 4).
Definition 1 (Frontier). A frontier F is any set of cells of the cumulative distance matrix such that, for any warping path W, W ∩ F ≠ ∅. The ω-frontier (or lower-left frontier) anchored in cell [i,i] is the set of 2b + 1 cells

ω[i,i] = {[i,i], [i,i+1], …, [i,i+b], [i+1,i], …, [i+b,i]}.

The ρ-frontier (or upper-right frontier) anchored in cell [i,i] is the set of 2b + 1 cells

ρ[i,i] = {[i,i], [i,i−1], …, [i,i−b], [i−1,i], …, [i−b,i]}.
Now we define two relaxed versions of DTW, which will be used in the definition of our new SDTW distance measure.

Definition 2 (Boundary-relaxed DTW). Given streams R and S, let ts and te be two generic time instants. We define:

• The start-relaxed DTW (DTW_ω) between subsequences R_{ts}^{te} and S_{ts}^{te} is the value of D^ω[te, te], where D^ω is the start-relaxed cumulative distance matrix initialized as follows:

D^ω[ts, ts+j] = d[ts, ts+j],  D^ω[ts+j, ts] = d[ts+j, ts]  (0 ≤ j ≤ b)   (6)

and then completed using the recurrence relation defined in Eq. (3).

• The start-end-relaxed DTW (DTW_ωρ) between subsequences R_{ts}^{te} and S_{ts}^{te} is

DTW_ωρ(R_{ts}^{te}, S_{ts}^{te}) = min{DTW_ω(R_{ts}^{i}, S_{ts}^{j}) | [i,j] ∈ ρ[te, te]}   (7)
Fig. 4. (a) Distance matrix; (b) start-relaxed DTW; (c) start-end-relaxed DTW.
Fig. 5. How SDTW works (SDTW² = α − (β − γ) + δ).
Note that DTW_ω computation proceeds as with DTW, yet all the cells in the lower-left frontier anchored in ts have as value the distance between the corresponding samples (see Fig. 4(b)). This is to say that warping paths can start from any cell in ω[ts, ts]. When te < ts we conventionally set DTW_ω(R_{ts}^{te}, S_{ts}^{te}) = 0.

For computing DTW_ωρ(R_{ts}^{te}, S_{ts}^{te}) one also relaxes the end-boundary condition. In practice one can still use the same start-relaxed D^ω matrix used for computing DTW_ω(R_{ts}^{te}, S_{ts}^{te}), and then just look at the minimum value on the upper-right frontier anchored in [te, te] (see Fig. 4(c)).

The basic rationale underlying the computation of the SDTW distance is to split, using frontiers, the optimal warping path of the DTW into two distinct pieces (one of them possibly null at some time instants): the first piece starts by spanning the whole current window of size n and then, at each time step, progressively shrinks; the second piece starts empty and then progressively grows. After exactly n time steps everything starts again. For each of the two pieces of W_opt the SDTW provides an accurate approximation.

More in detail, assume without loss of generality that the streams' analysis starts at time t0 = 1, i.e., when the first streams' samples are available.² The computation of SDTW is performed by segmenting the positive time axis into a sequence of non-overlapping time intervals, [(k−1)n+1 : kn], k = 1, 2, …, and then associating to the kth interval a corresponding start-relaxed cumulative distance matrix, denoted D_k^ω. In other terms, D_k^ω is the matrix that is initialized with the values in the lower-left frontier anchored in [(k−1)n+1, (k−1)n+1] (see Definition 2).

In order to understand how such start-relaxed matrices are exploited in SDTW computation, it has first to be observed that, for any active window [ts:te], with te = kn + i (k ≥ 1, 0 ≤ i < n) and ts = te − n + 1 = (k−1)n + 1 + i, the window intersects either only the time interval [(k−1)n+1 : kn] (if i = 0) or the two time intervals [(k−1)n+1 : kn] and [kn+1 : (k+1)n] (if i ≠ 0) (see also Fig. 5). Intuitively, as the following definition makes precise, for computing the value of SDTW(R_{ts}^{te}, S_{ts}^{te}) only values from matrices D_k^ω and D_{k+1}^ω are needed, in the general case.
Definition 3 (Stream-DTW). Let te = kn + i, where k ≥ 1 and 0 ≤ i < n, and let ts = te − n + 1 = (k−1)n + 1 + i. The Stream-DTW (SDTW) distance between subsequences R_{ts}^{te} and S_{ts}^{te} is defined as

SDTW(R_{(k−1)n+1+i}^{kn+i}, S_{(k−1)n+1+i}^{kn+i}) = sqrt( α − (β − γ) + δ )   (8)

where

• α = DTW_ωρ(R_{(k−1)n+1}^{kn}, S_{(k−1)n+1}^{kn}) (from matrix D_k^ω)
• β = DTW_ω(R_{(k−1)n+1}^{(k−1)n+1+i}, S_{(k−1)n+1}^{(k−1)n+1+i}) (from matrix D_k^ω)
• γ = d(R_{(k−1)n+1+i}, S_{(k−1)n+1+i})
• δ = DTW_ω(R_{kn+1}^{kn+i}, S_{kn+1}^{kn+i}) (from matrix D_{k+1}^ω).

² In Section 4.2 we relax this assumption and consider arbitrary values of t0.
4.1. Formal results
In this section we prove three basic facts about the SDTW distance, namely: SDTW lower bounds the DTW distance (Theorem 1); it has optimal update complexity (Theorem 2); and its space overhead is linear in the window size (Theorem 3).
Theorem 1 (Lower bound). The SDTW distance is a lower bound of DTW.
Proof. Refer to Fig. 5. Let W_opt be the optimal warping path for aligning R_{(k−1)n+1+i}^{kn+i} and S_{(k−1)n+1+i}^{kn+i}, and consider the frontiers ρ[kn, kn] and ω[kn+1, kn+1]. Consider the first part of W_opt, call it W_opt,k, that ends in a cell of ρ[kn, kn], and the second part of W_opt, call it W_opt,k+1, that starts from a cell in ω[kn+1, kn+1]. We claim that α − (β − γ) lower bounds the DTW contribution, call it DTW_k, corresponding to W_opt,k, and that δ lower bounds the component, call it DTW_k+1, corresponding to W_opt,k+1. Since the squared DTW distance between the two subsequences is ≥ DTW_k + DTW_k+1, this will prove the result.

[α − (β − γ) ≤ DTW_k]. Consider the optimal start-relaxed path, call it W_β, corresponding to β and ending in cell [ts, ts], and the path W_opt,k, which shares cell [ts, ts] with W_β. Counting just once the contribution of cell [ts, ts], i.e., γ, we end up with a total cost given by β − γ + DTW_k for going from ω[(k−1)n+1, (k−1)n+1] to ρ[kn, kn]. From the definition of DTW_ωρ, this cannot be less than α, which proves the assertion.

[δ ≤ DTW_k+1]. Immediate from the definition of start-relaxed DTW. □

The above proof details how we can approximate the DTW from below by splitting the optimal warping path into two pieces. The two frontiers we use for this purpose are by no means the only possible ones; in particular, as Fig. 5 shows, after leaving ρ[kn, kn] and before entering ω[kn+1, kn+1], a part of W_opt could traverse some cells that SDTW does not consider at all. We can prove that our arguments are still applicable should we replace ω[kn+1, kn+1] with the ρ[kn+1, kn+1] frontier.

Unlike simpler approximations such as LB_Keogh, the computation of SDTW shares with DTW the need of computing an optimal warping path, even if on a start-relaxed cumulative distance matrix. This alone suggests that the very first time we compute the SDTW distance we have to pay a cost of O(nb). However, as the following theorem shows, SDTW is amenable to efficient updates.
Theorem 2 (Update complexity). The SDTW distance can be updated in time O(b) at any time step, which is optimal for any distance measure that must compare streams' samples using a band of width b.

Proof. Denote with α′, β′, γ′, and δ′ the new values, at time step te′ = te + 1 = kn + i + 1, of the four terms in Eq. (8). Recall that D_k^ω is the start-relaxed cumulative distance matrix for time interval [(k−1)n+1 : kn], and that D_{k+1}^ω is the one for the interval [kn+1 : (k+1)n]. Let d_k and d_{k+1} be the corresponding matrices storing distances between samples of R and S.

The steps needed to update the value of SDTW include the computation of distance values for the two new samples of R and S, and the extension of matrix D_{k+1}^ω up to frontier ρ[te′, te′] = ρ[te+1, te+1]. Both steps require O(b) time.

Consider now the case when te′ = kn + i + 1, with 0 ≤ i < n − 1. We have:

• α′ = α, since this term does not depend on i.
• By definition of start-relaxed DTW and of D_k^ω, it is β′ = D_k^ω[(k−1)n+1+i+1, (k−1)n+1+i+1], i.e., computing β′ costs O(1). The same is clearly true for γ′ = d(R_{(k−1)n+1+i+1}, S_{(k−1)n+1+i+1}).
• Finally, δ′ = D_{k+1}^ω[kn+i+1, kn+i+1], again with cost O(1).

When i = n − 1, it is te′ = kn + (n−1) + 1 = (k+1)n and we have β′ = γ′ and δ′ = 0 (by definition of start-relaxed DTW). The new SDTW value reduces to α′ = DTW_ωρ(R_{kn+1}^{(k+1)n}, S_{kn+1}^{(k+1)n}), which, given matrix D_{k+1}^ω, can be computed in O(2b+1) = O(b) time. □
The next theorem concerns the memory requirements of SDTW.

Theorem 3 (Space complexity). The SDTW distance can be computed using O(n + b) = O(n) space.

Proof. Refer to Fig. 5 and observe that, starting from time te = kn + 1, all information in matrices d_z and D_z^ω, with 1 ≤ z < k, is no longer needed to determine future SDTW values. This immediately bounds the required space to O(nb). In other terms, from the very definition of SDTW, and as Fig. 5 also suggests, at each time step a naïve implementation would require at most two distance matrices and two cumulative distance matrices.

Consider now Definition 3 and the proof of Theorem 2. To reduce the space to O(n) it is sufficient to observe that, when te ∈ [kn+1 : (k+1)n], all one needs to know about the d_k and D_k^ω matrices in order to correctly compute the SDTW and to update it is just the differences of corresponding diagonal elements (i.e., the (β − γ) terms) and the (single) α value. This accounts for the n term in O(n + b). Finally, the b term is due to the need of storing the last ρ-frontier of matrix D_{k+1}^ω, which is used to compute and update the δ term. □

It should be observed that the storage required by SDTW is of the same order as that needed for updating the Euclidean distance. On the other hand, continuously updating the DTW distance would require O(nb) space, which is O(n²) if b grows linearly with n.
4.2. Shifting the start of computation
Unlike other distance measures, the value of the SDTW distance between any subsequences of R and S depends on the time at which the computation started. Intuitively, this is the case since the values of the four terms that make up the SDTW distance (see Eq. (8)) vary according to how the current window [ts:te] is split by the two intervals [(k−1)n+1 : kn] and [kn+1 : (k+1)n] (corresponding to matrices D_k^ω and D_{k+1}^ω, respectively). The following definition helps to characterize this phenomenon.

Definition 4 (Offset and split time). Let t0 be the time step at which the SDTW analysis begins. The offset of the window [ts : ts+n−1] of size n with respect to t0 is defined as

Offset([ts : ts+n−1], t0) = (ts − t0) mod n.   (9)

The split time, st, of window [ts : ts+n−1] is the unique time value such that

Offset([st : st+n−1], t0) = 0,  st ∈ [ts : ts+n−1].   (10)

Notice that split times are exactly those instants at which the SDTW algorithm begins to compute a new start-relaxed cumulative distance matrix. Consequently, the window [ts : ts+n−1] has a 0-offset with respect to t0 iff its split time coincides with ts, i.e., st = ts. To ease understanding, in Section 4 we set t0 = 1, so that the split times were 1, n+1, 2n+1, and so on. This is the reference case depicted in Fig. 6(a), where we also highlight how the four components of SDTW, namely α, β, γ, and δ, depend on the position of the split time within the current window.
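Eqs. (9) and (10) amount to simple modular arithmetic; a two-function Python sketch (names ours):

```python
def offset(ts, n, t0):
    """Offset of window [ts : ts+n-1] with respect to start time t0 (Eq. (9))."""
    return (ts - t0) % n

def split_time(ts, n, t0):
    """The unique st in [ts : ts+n-1] with zero offset (Eq. (10)): the
    instant at which a new start-relaxed matrix is begun."""
    return ts + (t0 - ts) % n
```

For example, with t0 = 1 and n = 4 the split times are 1, 5, 9, and so on; the window [3:6] has offset 2 and split time 5.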
In the general case in which the SDTW computation starts at time t0 = zn + j, with z ≥ 0 and 0 ≤ j < n − 1, two relevant cases might arise, depending on the value of j. When j = 1, from Eq. (9) and the properties of the mod operator, we immediately have that

Offset([ts : te], zn + 1) = Offset([ts : te], 1)  ∀z, ∀ts   (11)

which means that the split times do not change with respect to the reference case t0 = 1 (see also Fig. 6(b)). This is no longer true when j ≠ 1 (see Fig. 6(c)), i.e.,

Offset([ts : te], zn + j) ≠ Offset([ts : te], 1)  ∀z, ∀ts   (12)
If the split times do not change, then it follows from the very definition of SDTW that, for any position of the sliding window, the values of the four components of the SDTW distance will not change either with respect to the reference case t_0 = 1. This is also to say that SDTW(R_{t_s}^{t_e}, S_{t_s}^{t_e}) is not influenced at all by a delay of zn time steps in the beginning of the computation. On the other hand, when the split times do change, SDTW(R_{t_s}^{t_e}, S_{t_s}^{t_e}) might change as well. In the limit case, SDTW(R_{t_s}^{t_e}, S_{t_s}^{t_e}) might assume n distinct values, one for each position of the split time. In Section 5 we will show that this does not affect the good accuracy that SDTW attains in approximating the DTW distance.

Fig. 6. (a) Reference case: the SDTW computation starts as soon as data points are available (t_0 = 1); (b) the SDTW computation starts at time t_0 = zn + 1; (c) the SDTW computation starts at t_0 = zn + j, j ≠ 1. In (a) and (b) the split times, represented by the vertical bars, are the same, whereas this is no longer true in (c).
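The invariance in Eq. (11) can be checked mechanically. The exact form of Eq. (9) is given earlier in the paper; the sketch below assumes it has the natural modular form Offset([t_s : t_e], t_0) = (t_s − t_0) mod n, which is an assumption, not a quotation of the paper's definition:

```python
def offset(ts: int, t0: int, n: int) -> int:
    """Window offset w.r.t. computation start t0 (window size n).
    ASSUMED modular form of Eq. (9): (ts - t0) mod n."""
    return (ts - t0) % n

n = 128
# Starting at t0 = z*n + 1 leaves every offset unchanged w.r.t. t0 = 1 (Eq. (11)) ...
same = all(offset(ts, z * n + 1, n) == offset(ts, 1, n)
           for z in range(4) for ts in range(1, 1000))
# ... whereas t0 = z*n + j with j != 1 changes every offset (Eq. (12)).
changed = all(offset(ts, z * n + 5, n) != offset(ts, 1, n)
              for z in range(4) for ts in range(1, 1000))
print(same, changed)  # True True
```

Under this assumed form, shifting t_0 by a multiple of n cancels out modulo n, which is exactly why the split times coincide in Fig. 6(a) and (b).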
5. Experimental results
In this section we present experimental results aimed at evaluating the actual performance of SDTW, both in terms of efficiency and of quality of approximation. For reference purposes, we contrast SDTW with the LB^symm_Keogh method presented in Section 3.1. For fairness, we note that LB_Keogh (of which LB^symm_Keogh is a direct extension) was not conceived as a tight approximation of DTW: unlike our SDTW, it was designed as a lower bound enabling index-based processing of DTW queries on time series databases.
All experiments are performed on a 1.60 GHz Intel Pentium 4 CPU with 512 MB of main memory, running the Windows 2000 OS.
We use five different real-world datasets, obtained from the UCR Time Series Data Mining Archive (TSDMA) [19] (see Fig. 7 for samples from such datasets):
Fig. 7. Samples from the five datasets used in the experiments.
• Random walk (RW): classical random walk data, generated according to the equation S_i = S_{i−1} + rand(−0.5, 0.5);
• Burstin (BI): this dataset includes 12 hours of data from a small region of the sky, where Gamma Ray bursts were reported during that period;
• Fluid dynamics (FD): this stream represents one of the signals collected at a frequency of 500 Hz from boundary layer experiments in the fluid dynamics research domain;
• Network (NET): this dataset reports the Round Trip Time (RTT) delay of a sequence of packets sent from the University of California at Riverside to Carnegie Mellon University;
• EEG (EEG): 128 Hz electroencephalographic data acquired at the Department of Physiology (University of Bologna).
For each of the above datasets we simulate the acquisition of new samples by reading a new data sample from memory at fixed time intervals (the sampling rate). In order to generate multiple streams of the same kind, we partition each dataset into non-overlapping parts, and then take each such part as a different stream. Unless otherwise stated, all results we present are averaged over 32 streams, that is, 496 pairs of streams for each dataset.
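The partitioning scheme above can be sketched as follows; the dataset contents and the helper name `make_streams` are illustrative, but the pair count follows from the text (32 streams give C(32, 2) = 496 unordered pairs):

```python
from itertools import combinations

def make_streams(dataset, num_streams=32):
    """Partition a dataset into non-overlapping parts; each part is a stream."""
    part = len(dataset) // num_streams
    return [dataset[i * part:(i + 1) * part] for i in range(num_streams)]

data = list(range(32 * 1000))  # stand-in for a real dataset
streams = make_streams(data)
pairs = list(combinations(range(len(streams)), 2))
print(len(streams), len(pairs))  # 32 496
```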
5.1. Quality of approximation
Our first experiments aim to verify that SDTW is indeed a good approximation of the DTW distance, and that it consistently outperforms the LB^symm_Keogh bound. We define the tightness of a DTW approximation as the ratio [7]
Tightness = (value of the approximation) / (value of the DTW distance)
Since both SDTW and LB^symm_Keogh are lower bounds of the DTW distance, their tightness stays in the [0, 1] range, with values closer to 1 denoting a higher quality of approximation. As can be observed in Table 1, the tightness of SDTW is very stable over the different datasets, and always higher than 0.9. This is not the case for LB^symm_Keogh, whose tightness never reaches 0.8, and is as low as 0.277 for the Burstin dataset. Averaging over all five datasets, the tightness of SDTW is 56.7% higher than that of LB^symm_Keogh.

Fig. 8 shows a sample trace of DTW, SDTW, and LB^symm_Keogh distances on a pair of EEG streams, considering 128 different positions of the sliding window. Note that, among the five considered datasets, EEG is the one on which the tightness of SDTW is the worst. Nonetheless, SDTW is able to closely track DTW variations over the whole course of the monitoring process. As to LB^symm_Keogh, it is apparent that it consistently underestimates the true DTW distance, and that its output trace contains much less detail than those of DTW and SDTW.
In Fig. 9(a) and (b) we show how the tightness of SDTW varies when either n or b changes, respectively. Note that in both figures the tightness axis starts from 0.75. As expected, when the ratio b/n becomes smaller the tightness improves, since the approximation introduced by the relaxation of the boundary conditions has a diminishing influence. However, even when b is as high as 0.25n, the tightness of SDTW still reaches a remarkable 0.80 (see the RW curves in Fig. 9(a) and (b)).
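The tightness statistic used throughout these experiments can be sketched as below; the numeric trace values are purely illustrative, not measurements from the paper:

```python
def tightness(approx: float, dtw: float) -> float:
    """Tightness of a lower-bounding DTW approximation [7]; lies in [0, 1]."""
    return approx / dtw

# Hypothetical (approximation, exact DTW) values along one sliding-window trace.
trace = [(9.1, 10.0), (18.5, 20.0), (28.2, 30.0)]
avg = sum(tightness(a, d) for a, d in trace) / len(trace)
print(round(avg, 3))  # 0.925
```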
Table 1
Tightness of SDTW and LB^symm_Keogh

                 RW     BI     FD     NET    EEG    Average
SDTW             0.948  0.945  0.937  0.936  0.907  0.935
LB^symm_Keogh    0.778  0.277  0.720  0.671  0.536  0.596

Window size n = 128, band width b = 8.
Fig. 8. Sample traces of DTW, SDTW, and LB^symm_Keogh. EEG dataset, window size n = 128, band width b = 8.
Fig. 9. Tightness of SDTW: (a) vs. window size, b = 8; (b) vs. band width, n = 128.
The intuition that the tightness mostly depends on the b/n ratio, rather than on the absolute values of b and n, is confirmed by the results shown in Fig. 10, in which the ratio b/n is kept constant at 1/16 while the sliding window size increases. Note that on each dataset the tightness values vary in a range of at most 0.03.
Fig. 10. Tightness of SDTW; the ratio b/n is kept constant (b/n = 1/16).
Fig. 11. How the SDTW tightness varies with respect to the offset of the sliding window, for the five datasets under consideration. n = 128, b = 8.
Table 2
Tightness variability of SDTW due to different offset values

          RW     BI     FD     NET    EEG    Average
Average   0.954  0.944  0.934  0.934  0.915  0.936
Max       0.981  0.964  0.974  0.966  0.954  0.968
Min       0.931  0.935  0.893  0.889  0.902  0.910
Max-min   0.051  0.029  0.081  0.078  0.051  0.058

Window size n = 128, band width b = 8.
5.1.1. Offset influence
In the following experiment we investigate the influence of the window offset on the tightness of SDTW. As explained in Section 4.2, changing the start time of the computation, t_0, might influence the values of the four components making up the SDTW distance measure, and as such it can lead to different values for a same pair of subsequences R_{t_s}^{t_e} and S_{t_s}^{t_e}. The results we present in Fig. 11 are obtained by repeating the tightness experiments n times for the setting n = 128 and b = 8, keeping the position of the sliding window fixed and each time varying the value of t_0 (t_0 = 1, …, n − 1). This covers all possible offset cases.

As expected, SDTW values, and consequently SDTW tightness, are influenced by the offset. However, this intrinsic variability always stays limited and does not degrade the overall quality of approximation. For instance, in the worst case, corresponding to the FD dataset, the tightness varies by at most 0.081, ranging from 0.893 to 0.974. Table 2 shows details for all five datasets. Each row in the table is a specific tightness statistic taken over all offset values, whereas the last column shows the corresponding statistic averaged over all the datasets.
5.1.2. Classification results
The following experiments aim to demonstrate that the high tightness of SDTW also translates into a high quality of results. In particular, we consider using SDTW for classification tasks. We start with three labelled datasets from the UCR TSDMA:³

• Trace: this 4-class dataset is a subset of the Transient Classification Benchmark (trace project) [20]. It is a synthetic dataset designed to simulate instrumentation failures in a nuclear power plant. The full dataset consists of 16 classes (50 instances in each class) and each instance has 4 features. The Trace subset only uses the second feature of classes 2 and 6, and the third feature of classes 3 and 7. Hence, this dataset contains 200 instances, 50 for each class. All instances are linearly interpolated to the same length of 275 data points and are z-normalized;
• Trace-2: this dataset is a 2-class subset of the Trace dataset, as used in [18], which only uses the third feature of classes 3 and 6;
• Gun-Point: this dataset comes from the video surveillance domain and has two classes, each containing 100 instances. All instances were created using one female actor and one male actor in a single session. The two classes are:
  – Gun-Draw: the actors have their hands by their sides. They draw a replica gun from a hip-mounted holster, point it at a target for approximately one second, then return the gun to the holster and their hands to their sides;
  – Point: the actors have their hands by their sides. They point with their index fingers at a target for approximately one second, and then return their hands to their sides.

Each instance has a length of 150 data points and is z-normalized.
For each dataset we perform a ‘‘leave-one-out’’ 1-nearest-neighbor classification, using SDTW, DTW, and the Euclidean (L2) distance. Since in this case we are not dealing with streaming data, what we show for SDTW are just the effects of using a start-relaxed cumulative distance matrix (on which the SDTW computation is based) in place of the exact cumulative distance matrix (the one used by the DTW distance).

³ The description of the datasets is taken almost verbatim from the index file accompanying the TSDMA distribution CD.
Fig. 12(a) clearly shows that using SDTW in place of DTW introduces no appreciable variation in classification performance. In particular, on both the Trace and Trace-2 datasets both distance measures are able to correctly classify all instances, while on the Gun-Point dataset SDTW introduces only one more classification error than DTW (3 rather than 2 errors over the 200 instances).
Having shown that SDTW performs similarly to DTW on ‘‘easy’’ classification tasks, we now consider its behavior over four ‘‘harder’’ datasets (namely 50Words, OSU Leaf, Lightning-7, and Fish), which we downloaded from the UCR Time Series Classification/Clustering page [21]. For each dataset, we classify elements in the testing set by using the label of the closest element in the training set. For both DTW and SDTW we use the optimal band width values reported in [21], where further details on these datasets are available. The performance of the three distances over the individual datasets is shown in Fig. 12(b). Here too, the performance of SDTW stays quite close to that of DTW, with both methods significantly outperforming the Euclidean distance. On average, the classification accuracy of DTW is 69.5% and that of SDTW is 67.9%, whereas L2 can correctly classify only 57.4% of the elements.
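The leave-one-out protocol used above can be sketched as follows; the toy instances, labels, and the Euclidean stand-in for DTW/SDTW are illustrative, not the paper's actual data:

```python
def nn_classify(distance, instances, labels):
    """Leave-one-out 1-nearest-neighbor classification accuracy."""
    correct = 0
    for i, x in enumerate(instances):
        # Nearest neighbor among all other instances under the given distance.
        j = min((k for k in range(len(instances)) if k != i),
                key=lambda k: distance(x, instances[k]))
        correct += labels[j] == labels[i]
    return correct / len(instances)

# Toy equal-length instances; Euclidean distance stands in for DTW/SDTW here.
euclidean = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
X = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
y = ['A', 'A', 'B', 'B']
print(nn_classify(euclidean, X, y))  # 1.0
```

Swapping the `distance` argument is all that distinguishes the three classifiers compared in Fig. 12.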
5.2. Efficiency
Our first experiment concerning efficiency aims to measure how many streams we can monitor in real time with the methods under analysis. More precisely, for a given number of pairs of streams to be compared to each other, we vary the sampling rate and plot the value beyond which we cannot report results before the next stream samples have to be read. Thus, working, say, at 1000 Hz means that all the distance values between the pairs of streams are reported within 1 ms.
Fig. 13 shows the results when the sliding window includes n = 128 samples and the width of the Sakoe–Chiba band is b = 8. It is evident that SDTW outperforms DTW by about two orders of magnitude. For instance, at 100 Hz SDTW can monitor up to about 100 pairs of streams, whereas DTW can only handle 1 pair.
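The baseline cost that DTW pays in this experiment can be made concrete with a generic Sakoe–Chiba-banded DTW sketch; this assumes equal-length windows and squared point costs with a final square root (one common convention, not necessarily the paper's exact formulation), and it is this O(n·b) computation that a DTW-based monitor must rerun from scratch at every time step:

```python
import math

def banded_dtw(r, s, b):
    """DTW between equal-length sequences, restricted to a Sakoe-Chiba band
    of half-width b. Two rolling rows keep memory at O(n)."""
    n = len(r)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # path must start at the (1, 1) cell
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        for j in range(max(1, i - b), min(n, i + b) + 1):
            cost = (r[i - 1] - s[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return math.sqrt(prev[n])

print(banded_dtw([0, 0, 1, 2], [0, 1, 1, 2], b=1))  # 0.0
```

SDTW avoids this recomputation by updating the start-relaxed matrix in O(b) per step, which is the source of the speed-up measured below.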
Fig. 12. Classification accuracy over three ‘‘easy’’ (a) and four ‘‘hard’’ (b) datasets.
Fig. 13. Maximum sampling rate vs. number of pairs of streams; n = 128, b = 8. Results are averaged over all five datasets.
The next figures analyze in more detail how the SDTW efficiency gain depends on the size of the sliding window and on the width of the Sakoe–Chiba band. For each experiment we show two tightly related measures, namely the maximum sampling rate and the speed-up of SDTW with respect to DTW, defined as
Speed-up = (cost of a single DTW computation) / (cost of a single SDTW computation)
In all the experiments we consider 6 pairs of streams to be monitored.

Fig. 14 refers to the case where the size of the sliding window grows, whereas b is kept constant. Since the update complexity of SDTW is O(b), thus independent of n, in this scenario the maximum sampling rate of SDTW stays constant. On the other hand, DTW computation pays the price of starting from scratch at each time step, thus its maximum sampling rate is inversely proportional to n. Overall, the speed-up of SDTW grows linearly with n, as expected.

In Fig. 15 we keep the size of the sliding window fixed and increase the band width. In this case the speed-up stays almost constant at a value that only depends on the window size (n = 128 in the figure). Consequently, for both distance measures the maximum sampling rate decreases as 1/b.

Finally, Fig. 16 shows the case in which the size of the sliding window grows and so does b, so that the ratio b/n stays constant (in the figure, b/n = 1/16). In this scenario the update complexity of DTW is O(nb) = O(n²), whereas that of SDTW is O(b) = O(n). This is perfectly reflected in the experimental results, in which we again observe a linear speed-up. Note that although both DTW and SDTW have to reduce their maximum sampling rate when the window size increases, the reduction of SDTW goes as 1/n, whereas that of DTW goes as 1/n².
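The cost model behind these curves can be sketched as below; the per-cell constant `c` is hypothetical, used only to show that the speed-up reduces to n when b/n is fixed:

```python
def max_sampling_rate(cost_per_update_s: float, pairs: int) -> float:
    """Highest rate (Hz) at which all pairwise distances can be
    reported before the next samples arrive."""
    return 1.0 / (cost_per_update_s * pairs)

c = 1e-7   # hypothetical seconds per visited matrix cell
pairs = 6  # as in the experiments
for n in (128, 256, 512):
    b = n // 16
    rate_dtw = max_sampling_rate(c * n * b, pairs)   # O(n*b) recomputation
    rate_sdtw = max_sampling_rate(c * b, pairs)      # O(b) incremental update
    print(n, round(rate_sdtw / rate_dtw))            # speed-up equals n
```

The constant c cancels in the ratio, which is why the measured speed-up depends on n alone and not on the machine.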
Fig. 14. Variable window size, b = 8. (a) Maximum sampling rate (averaged over all the datasets); (b) SDTW speed-up.
Fig. 15. Variable band width, n = 128. (a) Maximum sampling rate; (b) SDTW speed-up.
Fig. 16. Variable window size, b = n/16. (a) Maximum sampling rate; (b) SDTW speed-up.
5.3. Pruning power
Our last experiment aims to shed some light on the effects of tightness when one is interested in performing DTW-based queries on streams. More precisely, we consider the task of determining whether the DTW distance falls below a given threshold value ε, and use either LB^symm_Keogh or SDTW to filter out irrelevant subsequences. Recall that, according to Theorem 1, SDTW can safely be used for this purpose since it is a lower bound of DTW. To measure the performance of DTW approximations we consider their pruning power, defined as [7]
Pruning Power = 1 − (number of DTW computations with the filter) / (number of DTW computations without the filter)
Clearly, the higher the pruning power, the lower the number of (costly) DTW computations that have to be performed, and the better the overall performance. Note that, for a fixed query selectivity, which measures the fraction of cases in which the DTW distance falls under the threshold, increasing the pruning power also reduces the number of false alarms.
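The pruning-power statistic, together with the false-alarm count mentioned above, can be sketched as follows; the lower-bound and DTW traces are hypothetical values, not the paper's measurements:

```python
def pruning_power(lower_bounds, eps):
    """Fraction of DTW computations avoided by a lower-bound filter [7]:
    DTW must be computed only when the bound does not exceed eps."""
    return 1.0 - sum(lb <= eps for lb in lower_bounds) / len(lower_bounds)

# Hypothetical (lower bound, exact DTW) pairs from one monitoring trace.
lbs = [5.0, 12.0, 3.0, 9.0, 20.0]
dtws = [6.0, 13.0, 4.0, 11.0, 21.0]
eps = 10.0
# A false alarm: the bound lets a subsequence through, but its DTW exceeds eps.
false_alarms = sum(lb <= eps < d for lb, d in zip(lbs, dtws))
print(pruning_power(lbs, eps), false_alarms)  # 0.4 1
```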
Table 3 reports the pruning power of LB^symm_Keogh and SDTW for the different datasets. The experimental setting is the same as in Table 1 (496 pairs of streams, n = 128, and b = 8), and ε is chosen so as to have a 1% selectivity, that is, in 1% of the cases the DTW distance is under the threshold. Note that this limits the maximum pruning power to 0.99. It is evident that the good tightness of SDTW pays off, in that for all datasets SDTW prunes almost all irrelevant subsequences. In particular, for the Fluid dynamics (FD) dataset no false alarms occur at all. This and Random walk (RW) are the only datasets for which LB^symm_Keogh exhibits pruning capabilities somewhat comparable to those of SDTW, whereas in the other cases its effectiveness is seriously compromised. Overall, the pruning power of SDTW translates into a negligible 0.8% average percentage of false alarms, whereas this value jumps to 41.1% for LB^symm_Keogh.

We conclude this section with some observations concerning this filter & refine query processing scenario. We recall that this is just one possible application of the SDTW distance, which in this paper we have introduced as an alternative to DTW for monitoring purposes.
Table 3
Pruning power of SDTW and LB^symm_Keogh

                 RW     BI     FD     NET    EEG    Average
SDTW             0.982  0.974  0.990  0.984  0.978  0.982
LB^symm_Keogh    0.929  0.000  0.946  0.740  0.280  0.579

Window size n = 128, band width b = 8, DTW selectivity = 1%.
Although the above results indeed show that SDTW is able to consistently limit the number of false alarms, one should be aware that long bursts of stream subsequences for which the DTW distance needs to be computed are always possible, even with an ‘‘ideal’’ approximation that attains the maximum possible pruning power (i.e., no false alarms at all). What should one expect in this scenario? To make things precise, assume that streams are sampled at a frequency that does not allow DTW computation at each time step for all pairs of streams, whereas the sampling rate stays below the maximum one tolerated by SDTW. In particular, the SDTW computation on all pairs of streams takes only a fraction, say load < 1, of the period between two time steps. For instance, for a sampling rate of 100 Hz, load = 0.7 means that all SDTW computations require 0.7/100 s (7 ms), and that for the remaining 3 ms the system is idle. If at time t we need to perform one or more DTW computations, we can use the idle time. This introduces no delay at all if the number of subsequences surviving the filtering step does not exceed a certain value, which depends both on the cost of a single DTW computation and on 1 − load. If this is not the case, some DTW distances will be computed during the next time interval, in turn delaying the upcoming SDTW computations, and so on.
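The idle-time budget described above reduces to a small computation; the per-DTW cost value is a hypothetical placeholder, while the rate and load figures match the worked example in the text:

```python
def refine_budget(rate_hz: float, load: float, dtw_cost_s: float) -> int:
    """Number of full DTW refinements that fit into the idle time of one
    sampling period without delaying the next round of SDTW updates."""
    idle_s = (1.0 - load) / rate_hz
    return int(idle_s / dtw_cost_s + 1e-9)  # epsilon guards float rounding

# The text's example: 100 Hz sampling, load = 0.7 -> 3 ms idle per period,
# with a hypothetical 1 ms per DTW refinement.
print(refine_budget(100.0, 0.7, dtw_cost_s=0.001))  # 3
```

Bursts longer than this budget spill DTW work into later periods, which is exactly the queueing effect discussed next.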
Although the analysis of this scenario is beyond the scope of the paper, it is important to observe that, for any fixed amount of memory available to store unprocessed stream samples, one might experience bursts whose length leads to memory saturation, thus forcing the adoption of some load shedding technique [22]. It is therefore an interesting problem to understand precisely how the number and the characteristics of the streams might affect the distribution of delays and the probability of memory saturation.
6. Related works
There is a large body of literature on data stream systems that is only marginally relevant here, for which we refer the reader to [12] and [23]. The works most closely related to ours are [3] and [24]. The StatStream system by Zhu and Shasha [3] was designed to speed up the real-time computation of statistics over a large number of streams. In particular, for monitoring the correlation between pairs of streams, StatStream uses an approximate Euclidean distance based on the first m (m ≪ n) coefficients of the Discrete Fourier Transform (DFT) of the streams' subsequences [16]. From the results in [25], the DFT coefficients can be updated in constant time at each time step,⁴ which is a key to the high performance level achieved by StatStream. As argued and demonstrated several times, the Euclidean distance can lead to inaccurate results. Further, the DFT approximation only works well for streams whose energy concentrates in the low frequencies, which is not always the case [26]. Similar considerations apply to the Stardust system by Bulut and Singh [24], which uses a wavelet-based approximate Euclidean distance and improves over StatStream for detecting correlations above a given threshold.
Monitoring data streams to look for occurrences of specific patterns is also an important problem that has been the subject of some recent works [24,27]. The base case in which each pattern is independently checked against incoming data is considered in [24], whereas [27] defines an envelope (very similar to that used by LB_Keogh) over a set of (similar) patterns. Since the distance between the envelope and the current stream subsequence lower bounds the Euclidean distance between the subsequence and any of the patterns in the envelope, this method can quickly discard many irrelevant patterns at once.
As far as we know, the use of the Dynamic Time Warping distance in a streaming environment has not been considered before. Although many works adopt the DTW distance (see, e.g., [14,15,7,28,29]), all of them consider the problem of searching for close matches to a query in large, static time series archives. Interestingly, extending the technique developed in [29] for speeding up DTW-based nearest neighbor queries to work over streams is left as ‘‘a promising but very challenging direction’’ (quoted from [29]).
The work by Wong and Wong [30] presents a method for supporting DTW-based subsequence queries, where one wants to find all the subsequences S_i^j in the database within distance ε from a query Q. The starting point of the method in [30] is that a point of S_i^j must be aligned with at least one point of a subsequence Q_x^y of Q. This is somewhat similar to our ‘‘frontier’’ concept (Definition 1). However, while we use the frontier to split the optimal warping path of DTW into two parts, Wong and Wong build on the above observation in a completely different way, which results in a generalization of LB_Keogh to the case of subsequence queries.

⁴ Here the assumption is that the number of DFT coefficients, m, is independent of n, the size of the sliding window.
7. Conclusions
In this paper we have proposed a novel method, called SDTW (Stream-DTW), to compare data streams. SDTW, unlike the well-known Dynamic Time Warping (DTW) distance, can be efficiently updated when a window slides over the streams. Our experimental results confirm the theoretical analysis and show that the efficiency improvement of SDTW over DTW grows linearly with the size of the sliding window, thus making it possible to compute a DTW-like distance measure even on (very) large sliding windows. We have also shown that extensions of methods developed for the static case, i.e., when time series are known in advance and are used to speed up the evaluation of DTW-based queries, cannot match SDTW in terms of accuracy.
While in this paper we have focused on the problem of monitoring the distance of streams and proposed SDTW as an alternative to DTW, we have also shown that the good accuracy of SDTW, together with its lower-bounding property, makes SDTW an ideal candidate for implementing the filtering step when one wants to perform DTW-based queries on streams.
Our results open up a series of interesting research directions and possibilities. An efficiently updatable DTW-like distance measure makes it possible to perform on streams several analyses so far developed only for the static case, such as those related to mining tasks. For instance, SDTW could be compared to the Euclidean distance in clustering and classification tasks, as already done with static time series. Our preliminary results along this direction indeed show that using SDTW in place of DTW leads to no appreciable difference in the quality of classification. It would also be interesting to investigate whether the basic techniques we have used to approximate the DTW distance, while preserving its very spirit, could be applied to other complex distances proposed for the comparison of time-varying signals that cannot easily be extended to a streaming scenario. One of them is the Fréchet distance [31].
Finally, a major issue that remains to be addressed concerns the definition of a method able to compute a DTW-like distance on streams that need to be normalized on-line. Indeed, working with unnormalized data may have negative effects on the significance of any similarity measure, as demonstrated for static time series [32]. Extending our method to normalized data does not seem to be an easy task. Intuitively, the problem with normalization is that a data sample no longer has a fixed value, since at each time step the normalization constants (mean and standard deviation) may change. Reusing computations performed at previous time steps then becomes a major challenge that is worth addressing.
References
[1] W. Lin, M.A. Orgun, G.J. Williams, An overview of temporal data mining, in: Proceedings of the First Australian Data Mining Workshop (ADM'02), Canberra, Australia, 2002, pp. 83–90.
[2] J.F. Roddick, K. Hornsby, M. Spiliopoulou, An updated bibliography of temporal, spatial, and spatio-temporal data mining research, in: Temporal, Spatial, and Spatio-Temporal Data Mining: First International Workshop (TSDM 2000), Lyon, France, 2000, pp. 147–164.
[3] Y. Zhu, D. Shasha, StatStream: statistical monitoring of thousands of data streams in real time, in: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China, 2002, pp. 358–369.
[4] D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in: AAAI 1994 Workshop on Knowledge Discovery in Databases, Seattle, WA, USA, 1994, pp. 359–370.
[5] C.A. Ratanamahatana, E.J. Keogh, Making time-series classification more accurate using learned constraints, in: Proceedings of the Fourth SIAM International Conference on Data Mining (SDM'04), Lake Buena Vista, FL, USA, 2004.
[6] I. Bartolini, P. Ciaccia, M. Patella, WARP: accurate retrieval of shapes using phase and time warping distance, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (1) (2005) 142–147.
[7] E. Keogh, Exact indexing of dynamic time warping, in: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China, 2002, pp. 406–417.
[8] B.-K. Yi, C. Faloutsos, Fast time sequence indexing for arbitrary Lp norms, in: Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), Cairo, Egypt, 2000, pp. 385–394.
[9] E. Keogh, T. Palpanas, V.B. Zordan, D. Gunopulos, M. Cardie, Indexing large human-motion databases, in: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04), Toronto, Canada, 2004, pp. 780–791.
[10] S. Chu, E.J. Keogh, D. Hart, M. Pazzani, Iterative deepening dynamic time warping for time series, in: Proceedings of the Second SIAM International Conference on Data Mining (SDM'02), Arlington, VA, USA, 2002.
[11] H. Sakoe, S. Chiba, A dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1) (1978) 43–49.
[12] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in: Proceedings of the 21st Symposium on Principles of Database Systems (PODS'02), Madison, WI, USA, 2002, pp. 1–16.
[13] S. Babu, J. Widom, Continuous queries over data streams, SIGMOD Record 30 (3) (2001) 109–120.
[14] B.-K. Yi, H.V. Jagadish, C. Faloutsos, Efficient retrieval of similar time sequences under time warping, in: Proceedings of the 14th International Conference on Data Engineering (ICDE'98), Orlando, FL, USA, 1998, pp. 201–208.
[15] S.-W. Kim, S. Park, W.W. Chu, An index-based approach for similarity search supporting time warping in large sequence databases, in: Proceedings of the 17th International Conference on Data Engineering (ICDE'01), Heidelberg, Germany, 2001, pp. 607–614.
[16] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: Proceedings of the 4th International Conference on Foundations of Data Organizations and Algorithms (FODO'93), Chicago, IL, USA, 1993, pp. 69–84.
[17] P. Ciaccia, M. Patella, Searching in metric spaces with user-defined and approximate distances, ACM Transactions on Database Systems 27 (4) (2002) 398–437.
[18] E.J. Keogh, C.A. Ratanamahatana, Exact indexing of dynamic time warping, Knowledge and Information Systems 7 (3) (2005) 358–386.
[19] E.J. Keogh, T. Folias, The UCR Time Series Data Mining Archive, available at <http://www.cs.ucr.edu/~eamonn/TSDMA/index.html>, 2002.
[20] D. Roverso, Multivariate temporal classification by windowed wavelet decomposition and recurrent neural networks, in: Proceedings of the 3rd ANS International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface Technologies, Washington, DC, USA, 2000, pp. 527–538.
[21] E.J. Keogh, X. Xi, L. Wei, C.A. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage, available at <http://www.cs.ucr.edu/~eamonn/time_series_data/>, 2006.
[22] N. Tatbul, U. Cetintemel, S.B. Zdonik, M. Cherniack, M. Stonebraker, Load shedding in a data stream manager, in: Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, 2003, pp. 309–320.
[23] L. Golab, M.T. Ozsu, Issues in data stream management, SIGMOD Record 32 (2) (2003) 5–14.
[24] A. Bulut, A.K. Singh, A unified framework for monitoring data streams in real time, in: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), Tokyo, Japan, 2005, pp. 44–55.
[25] D.Q. Goldin, P.C. Kanellakis, On similarity queries for time-series data: constraint specification and implementation, in: Proceedings of the First International Conference on Principles and Practice of Constraint Programming (CP'95), Cassis, France, 1995, pp. 137–153.
[26] R. Cole, D. Shasha, X. Zhao, Fast window correlations over uncooperative time series, in: Proceedings of the 11th International Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago, IL, USA, 2005, pp. 743–749.
[27] L. Wei, E.J. Keogh, H.V. Herle, A. Mafra-Neto, Atomic wedgie: efficient query filtering for streaming time series, in: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM'05), Houston, TX, USA, 2005, pp. 490–497.
[28] Y. Zhu, D. Shasha, Warping indexes with envelope transforms for query by humming, in: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, 2003, pp. 181–192.
[29] Y. Sakurai, M. Yoshikawa, C. Faloutsos, FTW: fast similarity search under the time warping distance, in: Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'05), Baltimore, MD, USA, 2005, pp. 326–337.
[30] T.S.F. Wong, M.H. Wong, Efficient subsequence matching for sequences databases under time warping, in: Proceedings of the 7th International Database Engineering and Applications Symposium (IDEAS 2003), Hong Kong, China, 2003, pp. 139–148.
[31] S. Brakatsoulas, D. Pfoser, R. Salas, C. Wenk, On map-matching vehicle tracking data, in: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05), Trondheim, Norway, 2005, pp. 853–864.
[32] E. Keogh, S. Kasetty, On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Mining and Knowledge Discovery 7 (4) (2003) 349–371.
Paolo Capitani received a ‘‘Laurea’’ degree in Biology from the University of Bologna. Afterward, he focused his research interests on the field of computer science, and he received a Ph.D. in Electronics, Computer Science and Telecommunications, working on data streams and data mining of biological signals. He is currently collaborating with the Department of Electronics, Computer Science and Systems and with the Department of Human and General Physiology at the University of Bologna.

Paolo Ciaccia has a ‘‘Laurea’’ degree in Electronic Engineering (1985) and a Ph.D. in Electronic and Computer Engineering (1992), both from the University of Bologna, Italy, where he has been Full Professor of Information Systems since 2000. He has published more than 80 refereed papers in major international journals (including IEEE TKDE, IEEE TSE, IEEE TPAMI, ACM TODS, ACM TOIS, IS, DKE, and Biological Cybernetics) and conferences (including VLDB, ACM-PODS, EDBT, ICDE, and CIKM). He was PC chair of the 2002 edition of the SEBD Italian conference on advanced databases. His current research interests include similarity and preference-based query processing, image retrieval, P2P systems, and data streams. He is one of the designers of the M-tree, an index for metric data used by many multimedia and data mining research groups around the world. He is a member of IEEE.