Time Series Analysis - Purdue University


Page 1

Time Series Analysis

Topics in Data Mining, Fall 2015

Bruno Ribeiro

© Bruno Ribeiro

Page 2

2

Motivation

© Bruno Ribeiro

Page 3

• Sequence of data points

• Sequence must have meaning
• Otherwise it is just a set of points

E. Keogh, L. Wei, X. Xi (Department of Computer Science & Engineering, UCR)

M. Vlachos (IBM Watson)

S.-H. Lee (Department of Anthropology, UCR)

P. Protopapas (Harvard-Smithsonian Center for Astrophysics)

LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures
Eamonn Keogh · Li Wei · Xiaopeng Xi · Michail Vlachos · Sang-Hee Lee · Pavlos Protopapas

Abstract: Shape matching and indexing is an important topic in its own right, and is a fundamental subroutine in most shape data mining algorithms. Given the ubiquity of shape, shape matching is an important problem with applications in domains as diverse as biometrics, industry, medicine, zoology and anthropology. The distance/similarity measure used for shape matching must be invariant to many distortions, including scale, offset, noise, articulation, partial occlusion, etc. Most of these distortions are relatively easy to handle, either in the representation of the data or in the similarity measure used. However, rotation invariance is noted in the literature as being an especially difficult challenge. Current approaches typically try to achieve rotation invariance in the representation of the data, at the expense of discrimination ability, or in the distance measure, at the expense of efficiency. In this work we show that we can take the slow but accurate approaches and dramatically speed them up. On real world problems our technique can take current approaches and make them four orders of magnitude faster, without false dismissals. Moreover, our technique can be used with any of the dozens of existing shape representations and with all the most popular distance measures, including Euclidean distance, Dynamic Time Warping and Longest Common Subsequence. We further show that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification.
Keywords: Shape, Indexing, Dynamic Time Warping

1. INTRODUCTION
Shape matching and indexing is an important topic in its own right, and is a fundamental subroutine in most shape data mining algorithms. Given the ubiquity of shape, shape matching is an important problem with applications in domains as diverse as biometrics, industry, medicine, zoology and anthropology. The distance/similarity measure used for shape matching must be invariant to many distortions, including scale, offset, noise, articulation, partial occlusion, etc. Figure 1 gives a visual intuition of these problems in a familiar domain, butterflies and moths. Most of these distortions are relatively easy to handle, particularly if we use the well-known technique of converting the shapes into time series as in Figure 2. However, no matter what representation is used, rotation invariance seems to be uniquely difficult to handle. For example, [20] notes "rotation is always something hard to handle compared with translation and scaling".

Figure 1: Examples of the distortions we may be interested in being invariant to when matching shapes. The left column shows drawings of insects dating back to 1734 [32]. The right column shows real insects. The flexible wingtips of Actias maenas require articulation invariance. One of the Papilio antimachus must be resized before matching. The Agrias sardanapalus need their offsets corrected in order to match. The real Papilio rutulus has a broken wing, which appears as an occlusion to shape matching algorithms. The real Sphinx Ligustri needs to be rotated to match the drawing; achieving this invariance is the focus of this work.

Many current approaches try to achieve rotation invariance in the representation of the data, at the expense of discrimination ability [28], or in the distance measure, at the expense of efficiency [1][2][3][9].


Dear Reader: This is an expanded version of our VLDB paper

Keogh et al. VLDB 2006

Google Finance

© Bruno Ribeiro

Page 4

} Time series:
◦ Sequence of data points
◦ Sequence must have meaning (e.g., organization in space)
◦ But 2D also very common

4

1D Representation of Objects Very Common

Figure 16: A group average hierarchical clustering of eight primate skulls based on the lateral view, using Euclidean distance

It is important to recall that Figure 16 shows a phenogram, not a phylogenetic tree. However, on larger scale experiments in this domain (shown in [14]) we found that large subtrees of the dendrograms did conform to the current consensus on primate evolution. While the Euclidean distance works very well on the relatively simple primate skulls, we found that considering a more (morphologically) diverse group of animals, such as all reptiles, requires DTW as a distance measure. Consider Figure 17, which shows a hierarchical clustering of a very diverse set of reptiles. As with the primates, this is not the correct phylogenetic tree for these animals; once again, however, the (uniquely colored) subtrees do correspond to current consensus on reptile evolution based on DNA analysis and/or more complete morphological studies [10][11]. Note that we are not claiming that our shape matching techniques replace or even complement classic morphometrics in zoology. The point of these experiments is that if the shape matching techniques can produce intuitive results in a domain in which we know the correct relationships by other means, this suggests that the algorithms may also produce meaningful results in shape problems for which there is more uncertainty, including projectile points (see [26] and Figure 15), petroglyphs, insect bite patterns in leaves [42], mammographic calcifications [43], etc. It has recently been claimed that shape matching methods that only look at the contours of shapes (boundary based methods) are brittle to articulation distortion [33]; however, we believe that while this may be true for certain boundary based methods (i.e. Hausdorff, Chamfer, etc.), the centroid based method we use is very robust to articulation distortions. To demonstrate this, we conducted a simple experiment/demonstration. We took three Lepidoptera, including the very similar and closely related Actias maenas and Actias philippinica, and produced a copy of each. We then took these copies and "bent" the right hindwing. The clustering of the three originals and three copies under Euclidean distance, group average linkage, is shown in Figure 18.

Figure 17: A group average hierarchical clustering of fourteen reptile skulls based on the superior view, using DTW distance

[Figure 16 dendrogram leaf labels: De Brazza monkey, De Brazza monkey (juvenile), Human, Human Ancestor (Skhul V), Red Howler Monkey, Mantled Howler Monkey, Orangutan (juvenile), Orangutan; distance axis 0 to 1200.]
[Figure 17 dendrogram labels: Iguania, Chelonia, Amphisbaenia, Alligatorinae, Crocodylidae; species shown include Cricosaura typica, Xantusia vigilis, Elseya dentata, Glyptemys muhlenbergii, Phrynosoma mcallii, Phrynosoma ditmarsi, Phrynosoma taurus, Phrynosoma douglassii, Phrynosoma hernandesi, Alligator mississippiensis, Caiman crocodilus, Crocodylus cataphractus, Tomistoma schlegelii, Crocodylus johnstoni.]

Keogh et al. VLDB 2006

NASA

Up to 2011 there have been 1,709 KDD/SIGKDD papers (including industrial papers, posters, tutorial/keynote abstracts, etc. [9]). However, the largest time series dataset considered in a SIGKDD paper was a "mere" one hundred million objects [35]. (If every such paper had been on time series, and each had looked at five hundred million objects, this would still not add up to the size of the data we consider here.) As large as a trillion is, there are thousands of research labs and commercial enterprises that have this much data. For example, many research hospitals have trillions of data points of EEG data, NASA Ames has tens of trillions of datapoints of telemetry of domestic flights, the Tennessee Valley Authority (a power company) records a trillion data points every four months, etc.

1.2 Explicit Statement of our Assumptions
Our work is predicated on several assumptions that we will now enumerate and justify.

1.2.1 Time Series Subsequences must be Normalized
In order to make meaningful comparisons between two time series, both must be normalized. While this may seem intuitive, and was explicitly empirically demonstrated a decade ago in a widely cited paper [19], many research efforts do not seem to realize this. This is critical because some speedup techniques only work on the un-normalized data; thus, the contributions of these research efforts may be largely nullified [8][28]. To make this clearer, let us consider the classic Gun/NoGun classification problem, which has been in the public domain for nearly a decade. The data, which as shown in Figure 1.center is extracted from a video sequence, was Z-normalized. The problem has a 50/150 train/test split and a DTW one-nearest-neighbor classifier achieves an error rate of 0.087. Suppose the data had not been normalized. As shown in Figure 1.left and Figure 1.right, we can simulate this by adding a tiny amount of scaling/offset to the original video. In the first case we randomly change the offset of each time series by ±10%, and in the second case we randomly change the scale (amplitude) by ±10%. The new one-nearest-neighbor classifier error rates, averaged over 1,000 runs, are 0.326 and 0.193, respectively, significantly worse than the normalized case.

Figure 1: Screen captures from the original video from which the Gun/NoGun data was culled. The center frame is the original size; the left and right frames have been scaled by 110% and 90% respectively. While these changes are barely perceptible, they double the error rate if normalization is not used. (Video courtesy of Dr. Ratanamahatana)

It is important to recognize that these tiny changes we made are completely dwarfed by changes we might expect to see in a real world deployment. The apparent scale can be changed by the camera zooming, by the actor standing a little closer to the camera, or by an actor of a different height. The apparent offset can be changed by this much by the camera tilt angle, or even by the actor wearing different shoes. While we did this experiment on a visually intuitive example, all forty-five datasets in the UCR archive increase their error rate by at least 50% if we vary the offset and scale by just ±5%. It is critical to avoid a common misunderstanding: we must normalize each subsequence before making a comparison; it is not sufficient to normalize the entire dataset.
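To make this assumption concrete, here is a minimal Python/NumPy sketch (ours, not code from either paper): every subsequence is Z-normalized independently before any distance is computed; the eps guard for near-constant subsequences is our own addition.

    import numpy as np

    def znorm(x, eps=1e-8):
        """Z-normalize a single subsequence: zero mean, unit variance."""
        x = np.asarray(x, dtype=float)
        sigma = x.std()
        if sigma < eps:                # near-constant subsequence; avoid division by zero
            return x - x.mean()
        return (x - x.mean()) / sigma

    def ed(q, c):
        """Euclidean distance between two independently Z-normalized subsequences."""
        return np.sqrt(np.sum((znorm(q) - znorm(c)) ** 2))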

1.2.2 Dynamic Time Warping is the Best Measure
It has been suggested many times in the literature that the problem of time series data mining scalability is only due to DTW's oft-touted lethargy, and that we could solve this problem by using some other distance measure. As we shall later show, this is not the case. In fact, as we shall demonstrate, our optimized DTW search is much faster than all current Euclidean distance searches. Nevertheless, the question remains: is DTW the right measure to speed up? Dozens of alternative measures have been suggested. However, recent empirical evidence strongly suggests that none of these alternatives routinely beats DTW. When put to the test on a collection of forty datasets, the very best of these measures are sometimes a little better than DTW and sometimes a little worse [6]. In general, the results are consistent with these measures being minor variants or "flavors" of DTW (although they are not typically presented this way). In summary, after an exhaustive literature search of more than 800 papers [6], we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments [6][19]. Thus, DTW is the measure to optimize (recall that DTW subsumes Euclidean distance as a special case).

1.2.3 Arbitrary Query Lengths cannot be Indexed
If we know the length of queries ahead of time we can mitigate at least some of the intractability of search by indexing the data [2][11][35]. Although to our knowledge no one has built an index for a trillion real-valued objects (Google only indexed a trillion webpages as recently as 2008), perhaps this could be done. However, what if we do not know the length of the queries in advance? At least two groups have suggested techniques to index arbitrary length queries [18][23]. Both methods essentially build multiple indexes of various lengths, and at query time search the shorter and longer indexes, "interpolating" the results to produce the nearest neighbor produced by a virtual index of the correct length. This is an interesting idea, but it is hard to imagine it is the answer to our problem. Suppose we want to support queries in the range of, say, 16 to 4096. We must build indexes that are not too different in size, say MULTINDEX-LENGTHS = {16, 32, 64, ..., 1024, 2048, 4096} (see footnote 1). However, for time series data the index is typically about one-tenth the size of the data [6][18]. Thus, we have doubled the amount of disk space we need. Moreover, if we are interested in tackling a trillion data objects we clearly cannot fit any index in the main memory, much less all of them, or any two of them. There is an underappreciated reason why this problem is so hard; it is an implication of the need for normalization discussed above. Suppose we have a query Q of length 65, and an index that supports queries of length 64. We search the index for Q[1:64] and find that the best match for it has a distance of, say, 5.17. What can we say about the best match for the full Q? The answer is surprisingly little: 5.17 is neither an upper bound nor a lower bound to the best match for Q. This is because we must renormalize the subsequence when moving from Q[1:64] to the full Q. If we do not normalize the data, the results are meaningless (cf. Section 1.2.1), even if the idea might then be faster than sequential search. However, if we normalize the data we get so little information from indexes of the wrong length that we are no better off than sequential search. In summary, there are no known techniques to support similarity search of arbitrary lengths once we have datasets in the billions.

Footnote 1: This collection of sizes is very optimistic. The step size should be at most 100, creating two orders of magnitude space overhead.

© Bruno Ribeiro

Page 5

} Given: one or more sequences
(c1, c2, …, ct)
(q1, q2, …, qt)

} Find
◦ similar sequences; forecasts
◦ patterns; clusters; outliers

5

Formal Definition

© Bruno Ribeiro

Page 6

} Matching Time Series

} Models

6

Overview

© Bruno Ribeiro

Page 7

} Matching Time Series

} Models

7

Overview

© Bruno Ribeiro

Page 8

8

Data Representation is Key

Figure 18: An experiment to demonstrate that the centroid method is reasonably articulation invariant. The gray highlighted areas have been randomly "tweaked" in a photo editing program.

As we can see, the one dimensional representation has hardly changed, and the clustering correctly groups the three pairs. We found that using DTW we can even more radically distort the shapes and achieve similar results. Given the above, why do boundary based methods have such a poor reputation for domains where articulation is a problem [33]? We believe the answer is not intrinsic to boundary based methods, but lies in the measures typically used on them, especially the Hausdorff distance and its many variants. Consider the following thought experiment. Imagine we have two identical shapes: solid automobiles. Assume that they have identical antennas protruding from their roofs. As such, the Hausdorff distance between them is zero, but if we bend the antenna in the spirit of Figure 18, we can trivially increase the Hausdorff distance to one meter. In addition to the above, the results in Table 8 already hinted at the articulation invariance of our chosen representation. Many of the datasets considered have significant articulation. For example, the face dataset considers classes with both closed-lip and laughing/yawning people, the two leaf datasets have significant amounts of articulation at the stem, and the Yoga dataset features very flexible people in various poses. Finally, we note that paper [25] uses the ideas in the conference version of this work to index hand geometries for biometrics. It is clear that the human hand has a high degree of articulation.

5.3 Main Memory Experiments
There is increasing awareness that comparing two competing approaches using only CPU time opens the possibility of implementation bias [17]. As a simple example, while the Haar wavelet transform is O(n) and the DFT is O(n log n), the DFT is much faster in the popular language Matlab, simply because it is a highly optimized subroutine. For this reason many recent papers compare approaches with some implementation-free metric [16][30][37][38]. As we noted earlier, the variable "num_steps" returned by Table 1 and Table 5 allows an implementation-free measure to compare performance.
For Euclidean distance queries we compare to brute force and Fourier (FFT) methods, which are the only competitors to also guarantee no false dismissals. The cost model for the FFT lower bound is n log n steps. If the FFT lower bound fails, we allow the approach to avail of our early abandoning techniques discussed in Section 3.
We tested on two shape datasets: a homogeneous database of 16,000 projectile point images, all of length 251, and a heterogeneous dataset consisting of all the data used in the classification experiments, plus 1,000 projectile points. In total the heterogeneous dataset contains 5,844 objects of length 1,024. To measure the performance we averaged over 50 runs, with the query object randomly chosen and removed from the dataset.
We measure the average number of steps required by each approach for a single comparison of two shapes, divided by the number of steps required by brute force. For our method, we include a startup cost of O(n²), which is the time required to build the wedges. Because the utility of early abandoning depends on the value of the best-so-far, we expect our method to do better as we see larger and larger datasets. Figure 19 shows the results on the projectile points dataset using Euclidean distance.

Figure 19: The relative performance of four algorithms on the Projectile Points dataset using the Euclidean distance measure

We can see that for small datasets our approach is slightly worse than FFT and simple Early abandon because we had to spend some time building the wedges. However, by the time we have seen 64 objects we have already broken even, and thereafter rapidly race towards beating FFT and Early abandon by one order of magnitude and Brute force by two orders of magnitude. The results on the projectile points dataset using DTW are shown in Figure 20, and are even more dramatic.

[Figure 18 labels: Actias maenas, Chorinea amazon, Actias philippinica (originals and distorted copies).]
[Figure 19/20 plot residue: legend Brute force, FFT, Early abandon, Wedge; x-axis: Number of objects in database (m), from 32 to 16,000; y-axis 0 to 1.0; panel title: Projectile Points, Euclidean.]

Figure 2: Shapes can be converted to time series. A) A bitmap of a human skull. B) The distance from every point on the profile to the center is measured and treated as the Y-axis of a time series of length n (C).

As an example of the former, the very efficient rotation invariant technique of [28] cannot differentiate between the shapes of the lowercase letters "d" and "b". As an example of the latter, the work of Adamek and Connor [1], which is state of the art in terms of accuracy or precision/recall, takes an untenable O(n³) for each shape comparison. In this work we show that we can take the slow but accurate approaches and dramatically speed them up. For example, we can take the O(n³) approach of [1] and on real world problems bring the average complexity down to O(n^1.06). This dramatic improvement in efficiency does not come at the expense of accuracy; we prove that we will always return the same answer set as the slower methods.
We achieve speedup over the existing methods in two ways: dramatically decreasing the CPU requirements, and allowing indexing. Our technique works by grouping together similar rotations, and defining an admissible lower bound to that group. Given such a lower bound, we can utilize the many search and indexing techniques known in the database community.
Our technique has the following advantages:
• There are dozens of techniques in the literature for converting shapes to time series [1][3][7][38][39][44], including some that are domain specific [5][31]. Our approach works for any of these representations.
• While there are many distance measures for shapes in the literature, Euclidean distance, Dynamic Time Warping [2][5][30][31] and Longest Common Subsequence [37] account for the majority of the literature. Our approach works for any of these distance measures.
• Our approach uses the idea of LB_Keogh lower bounding as its cornerstone. Since the introduction of this idea a few years ago [16], dozens of researchers worldwide have adopted and extended this framework for applications as diverse as motion capture indexing [18], P2P searching [13], handwriting retrieval [31], dance indexing, and query by humming and monitoring streams [40]. This widespread adoption of LB_Keogh lower bounding has ensured that it has become a mature and widely supported technology, and suggests that any contributions made here can be rapidly adopted and expanded.
• In some domains it may be useful to express rotation-limited queries. For example, in order to robustly retrieve examples of the number "8", without retrieving infinity symbols "∞", we can issue a query such as: "Find the best match to this shape allowing a maximum rotation of ±15 degrees". Our framework supports such rotation-limited queries.

The rest of this paper is organized as follows. In Section 2 we discuss background material and related work. In Section 3 we formally introduce the problem and in Section 4 we offer our solution. Section 5 offers a comprehensive empirical evaluation of both the effectiveness and efficiency of our technique. Finally, Section 6 offers some conclusions and directions for future work.

2. BACKGROUND AND RELATED WORK
The literature on shape matching is vast; we refer the interested reader to [7][36] and [44] for excellent surveys. While not all work on shape matching uses a 1D representation of the 2D shapes, an increasingly large majority of the literature does. We therefore only consider such approaches here. Note that we lose little by this omission. The two most popular measures that operate directly in the image space, the Chamfer [6] and Hausdorff [27] distance measures, require O(n² log n) time (see footnote 1), and recent experiments (including some in this work) suggest that 1D representations can achieve comparable or superior accuracy.
In essence there are three major techniques for dealing with rotation invariance: landmarking, rotation invariant features and brute force rotation alignment. We consider each below.

2.1 Landmarking
The idea of "landmarking" is to find the one "true" rotation and only use that particular alignment as the input to the distance measure. The idea comes in two flavors, domain dependent and domain independent.
In domain dependent landmarking, we attempt to find a single (or very few) fixed feature to use as a starting point for conversion of the shape to a time series. For example, in face profile recognition the most commonly used landmarks (fiducial points) are the chin or nose [5]. In limited domains this may be useful, but it requires building special purpose feature extractors. For example, even in a domain as intuitively well understood as human profiles, accurately locating the nose is a non-trivial problem, even if we discount the possibility of mustaches and glasses. Probably the only reason any progress has been made in this area is that most work reasonably assumes that faces presented in an image are likely to be upright. For shape matching in skulls, the canonical landmark is called the Frankfurt Horizontal [41], which is defined by the right and left porion (the highest point on the margin of the external auditory meatus) and the left orbitale (the lowest point on the orbital margin). However, a skull can be missing the relevant bones to determine this orientation and still have enough global information to match its shape to similar examples. Indeed, the famous Skhul V skull shown in Figure 14 is such an example. Other examples of domain dependent landmarking include [39], who use the "sharpest corner" of leaves as landmarks. This idea appears meaningful in the subset of leaf shapes they considered, but in orbicular (circular) leaves the "sharpest corner" is not well defined.
In domain independent landmarking, we align all the shapes to some cardinal orientation, typically the major axis. This approach may be useful for the limited domains in which there is a well-defined major axis, perhaps the indexing of hand tools. However

Footnote 1: More precisely, the time complexity is O(Rp log p), where p is the number of pixels in the perimeter and R is the number of rotations that need to be executed. Here p = n, and while R is a user defined parameter, it should be approximately equal to n to guarantee all rotations (up to the limit of rasterization) are considered.


Keogh et al. VLDB 2006

Keogh et al. VLDB 2006

Approximately 2/3 of the papers in the literature do (some minor variant of) this.

• State-of-the-art (SOTA): Each sequence is Z-normalized from scratch, early abandoning is used, and the LB_Keogh lower bound is used for DTW. Approximately 1/3 of the papers in the literature do (some minor variant of) this.

• UCR Suite: We use all of our applicable speedup techniques. DTW uses R = 5% unless otherwise noted. For experiments where Naive or SOTA takes more than 24 hours to finish, we terminate the experiments and present the interpolated values, shown in gray.
Where appropriate we also compare to an oracle algorithm:
• GOd's ALgorithm (GOAL) is an algorithm that only maintains the mean and standard deviation using the online O(1) incremental calculations.
It is easy to see that, short of an algorithm that precomputes and stores a massive amount of data (quadratic in m), GOAL is a lower bound on the fastest possible algorithm for either ED or DTW subsequence search with unconstrained and unknown length queries. The acronym reminds us that we would like to be as close to this goal value as possible. It is critical to note that our implementations of Naive, SOTA and GOAL are incredibly efficient and tightly optimized, and they are not "crippled" in any way. For example, had we wanted to claim spurious speedup, we could implement SOTA recursively rather than iteratively, and that would make SOTA at least an order of magnitude slower. In particular, the code for Naive, SOTA and GOAL is exactly the same code as the UCR suite, except the relevant speedup techniques have been commented out. While very detailed spreadsheets of all of our results are archived in perpetuity at [43], we present subsets of our results below. We consider wall clock time on two Intel Xeon Quad-Core E5620 2.40GHz processors with 12GB 1333MHz DDR3 ECC Unbuffered RAM (using just one core unless otherwise explicitly stated).

5.1 Baseline Tests on Random Walk We begin with experiments on random walk data. Random walks model financial data very well and are often used to test similarity search schemes. More importantly for us, they allow us to do reproducible experiments on massive datasets without the need to ship large hard drives to interested parties. We have simply archived the random number generator and the seeds used. We have made sure to use a very high-quality random number generator that has a period longer than the longest dataset we consider. In Table 2 we show the length of time it takes to search increasingly large datasets with queries of length 128. The numbers are averaged over 1000, 100 and 10 queries, respectively.

Table 2: Time taken to search a random walk dataset with |Q| = 128

              Million (Seconds)   Billion (Minutes)   Trillion (Hours)
    UCR-ED         0.034               0.22                3.16
    SOTA-ED        0.243               2.40               39.80
    UCR-DTW        0.159               1.83               34.09
    SOTA-DTW       2.447              38.14              472.80

These results show a significant difference between SOTA and UCR suite. However, this is for a very short query; what happens if we consider longer queries? As we show in Figure 10, the ratio of SOTA-DTW over UCR-DTW improves for longer queries. To reduce visual clutter we have only placed one Euclidean distance value on the figure, for queries of length 4,096. Remarkably, UCR-DTW is even faster than SOTA Euclidean distance. As we shall see in our EEG and DNA examples below, even though 4,096 is longer than any published query lengths in the literature, there is a need for even longer queries.

Figure 10: The time taken to search random walks of length 20 million with increasingly long queries, for three variants of DTW. In addition, we include just length 4,096 with SOTA-ED for reference.
It is also interesting to consider the results of the 128-length DTW queries as a ratio over GOAL. Recall that the cost for GOAL is independent of query length, and this experiment is just 23.57 seconds. The ratios for Naive, SOTA and the UCR suite are 5.27, 2.74 and 1.41, respectively. This suggests that we are asymptotically closing in on the fastest possible subsequence search algorithm for DTW. Another interesting ratio to consider is the time for UCR-DTW over UCR-ED, which is just 1.18. Thus, the time for DTW is not significantly different than that for ED, an idea which contradicts an assumption made by almost all papers on time series in the last decade (including papers by the current authors).

5.2 Supporting Long Queries: EEG
The previous section shows that we gain the greatest speedup for long queries, and here we show that such long queries are really needed. The first user of the UCR suite was Dr. Sydney Cash, who, together with author B.W., wants to search massive archives of EEG data for examples of epileptic spikes, as shown in Figure 11.

Figure 11: Query Q shown with a match from the 0.3 trillion EEG dataset.
From a single patient, S.C. gathered 0.3 trillion datapoints and asked us to search for a prototypical epileptic spike Q he created by averaging spikes from other patients. The query length was 7,000 points (0.23 seconds). Table 3 shows the results.

Table 3: Time to search 303,523,721,928 EEG datapoints, |Q| = 7,000. Note that only ED is considered here because DTW may produce false positives caused by eye blinks.

            UCR-ED       SOTA-ED
    EEG     3.4 hours    494.3 hours

This data took multiple sessions over seven days to collect, at a cost of approximately $34,000 [43], so the few hours of CPU time we required to search the data are dwarfed in comparison.

5.3 Supporting Very Long Queries: DNA
Most work on time series similarity search (and all work on time series indexing) has focused on relatively short queries, less than or equal to 1,024 data points in length. Here we show that we can efficiently support queries that are two orders of magnitude longer.

Table 4: An algorithm to convert DNA to time series

    T1 = 0
    for i = 1 to |DNAstring|
        if DNAstring_i = A, then T_{i+1} = T_i + 2
        if DNAstring_i = G, then T_{i+1} = T_i + 1
        if DNAstring_i = C, then T_{i+1} = T_i - 1
        if DNAstring_i = T, then T_{i+1} = T_i - 2
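A direct Python transcription of Table 4 (a sketch; the cumulative-walk encoding is from the table, while the function name is ours):

    def dna_to_time_series(dna: str) -> list[float]:
        """Convert a DNA string to a cumulative time series (Table 4)."""
        step = {"A": 2, "G": 1, "C": -1, "T": -2}
        series = [0.0]                      # T1 = 0
        for base in dna.upper():
            series.append(series[-1] + step[base])
        return series

    # Example: dna_to_time_series("ATGC") -> [0.0, 2.0, 0.0, 1.0, 0.0]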

[Figure 10 plot residue: curves for Naïve DTW, SOTA DTW and OPT (UCR) DTW, with SOTA ED marked for reference; y-axis in seconds (100 to 10,000); x-axis: Query Length.]
For query lengths of 4,096 (rightmost part of the graph) the times are: Naïve DTW: 24,286; SOTA DTW: 5,078; SOTA ED: 1,850; OPT DTW: 567.
[Figure 11 plot residue: query Q overlaid on the matching EEG subsequence; x-axis 0 to 7,000 samples.]
Continuous intracranial EEG: recorded with platinum-tipped silicon micro-electrode probes inserted 1.0 mm into the cerebral cortex; recordings made from 96 active electrodes, with data sampled at 30 kHz per electrode.

Rakthanmanon et al. KDD 2012

© Bruno Ribeiro

Page 9

} L1: Manhattan distance

} L2 = Euclidean distance

} L∞ = max(|c1 – q1|, …, |ct – qt|)
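A small Python/NumPy sketch (ours, not from the slides) of the three norms listed above, for two equal-length series c and q:

    import numpy as np

    def l1(c, q):        # Manhattan distance
        return np.sum(np.abs(np.asarray(c) - np.asarray(q)))

    def l2(c, q):        # Euclidean distance
        return np.sqrt(np.sum((np.asarray(c) - np.asarray(q)) ** 2))

    def linf(c, q):      # max (Chebyshev) distance
        return np.max(np.abs(np.asarray(c) - np.asarray(q)))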

9

Similarity

© Bruno Ribeiro

Page 10

} Euclidean distance is induced by the inner product in t-dimensional space

} Related to
◦ cosine similarity
◦ cross-correlation
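To make the connection concrete, here is a sketch (ours) of the standard identity for Z-normalized series of length t: ED²(c, q) = 2t(1 − corr(c, q)). Ranking by Euclidean distance on Z-normalized series is therefore equivalent to ranking by correlation (and by cosine similarity).

    import numpy as np

    def znorm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()

    rng = np.random.default_rng(0)
    c, q = rng.standard_normal(64), rng.standard_normal(64)
    cz, qz = znorm(c), znorm(q)

    ed_sq = np.sum((cz - qz) ** 2)
    corr = np.mean(cz * qz)                          # Pearson correlation of c and q
    assert np.isclose(ed_sq, 2 * len(cz) * (1 - corr))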

10

In Time Series Space

Day-t

© Bruno Ribeiro

Page 11

} Fast matching of time series

} Even faster matching of time series

11

Early Termination

4.2.2 Reordering Early Abandoning
In the previous section, we saw that the idea of early abandoning discussed in Section 4.1.3 can be generalized to the Z-normalization step. In both cases, we assumed that we incrementally compute the distance/normalization from left to right. Is there a better ordering? Consider Figure 7.left, which shows the normal left-to-right ordering in which the early abandoning calculation proceeds. In this case nine of the thirty-two calculations were performed before the accumulated distance exceeded b and we could abandon. In contrast, Figure 7.right uses a different ordering and was able to abandon earlier, with just five of the thirty-two calculations.

Figure 7: left) ED early abandoning. We have a best-so-far value of b. After incrementally summing the first nine individual contributions to the ED, we have exceeded b; thus, we abandon the calculation. right) A different ordering allows us to abandon after just five calculations.
This example shows what is obvious: on a query-by-query basis, different orderings produce different speedups. However, we want to know if there is a universal optimal ordering that we can compute in advance. This seems like a difficult question because there are n! possible orderings to consider. We conjecture that the universal optimal ordering is to sort the indices based on the absolute values of the Z-normalized Q. The intuition behind this idea is that the value at Qi will be compared to many Ci's during a search. However, for subsequence search, with Z-normalized candidates, the distribution of many Ci's will be Gaussian, with a mean of zero. Thus, the sections of the query that are farthest from the mean, zero, will on average have the largest contributions to the distance measure. To see if our conjecture is true we took the heartbeat discussed in Section 5.4 and computed its full Euclidean distance to a million other randomly chosen ECG sequences. With the conceit of hindsight we computed what the best ordering would have been. For this we simply take each Ci and sort them, largest first, by the sum of their contributions to the Euclidean distance. We compared this empirically optimal ordering with our predicted ordering (sorting the indices on the absolute values of Q) and found the rank correlation is 0.999. Note that we can use this trick for both ED and LB_Keogh, and we can use it in conjunction with the early abandoning Z-normalization technique (Section 4.2.1).
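A simplified Python sketch (ours) of the reordered early-abandoning Euclidean distance: indices are visited in decreasing |q_i| of the Z-normalized query, so large contributions accumulate first and the abandon test tends to trigger sooner.

    import numpy as np

    def reordered_ea_dist(q_znorm, c_znorm, best_so_far):
        """Early-abandoning squared ED, visiting indices by decreasing |q_i|.

        Returns the squared distance, or np.inf once it provably exceeds
        best_so_far (the current best squared distance).
        """
        order = np.argsort(-np.abs(q_znorm))    # largest-magnitude query values first
        acc = 0.0
        for i in order:
            acc += (q_znorm[i] - c_znorm[i]) ** 2
            if acc > best_so_far:               # abandon: cannot beat the best-so-far
                return np.inf
        return acc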

4.2.3 Reversing the Query/Data Role in LB_Keogh
Normally the LB_Keogh lower bound discussed in Section 4.1.2 builds the envelope around the query, a situation we denote LB_KeoghEQ for concreteness, and illustrate in Figure 8.left. This only needs to be done once, and thus saves the time and space overhead that we would need if we built the envelope around each candidate instead, a situation we denote LB_KeoghEC.

Figure 8: left) Normally the LB_Keogh envelope is built around the query (see also Figure 4.right), and the distance between C and the closer of {U, L} acts as a lower bound. right) However, we can reverse the roles such that the envelope is built around C and the distance between Q and the closer of {U, L} is the lower bound.
However, as we show in the next section, we can selectively calculate LB_KeoghEC in a "just-in-time" fashion, only if all other lower bounds fail to prune. This removes the space overhead, and as

we will see, the time overhead pays for itself by pruning more full DTW calculations. Note that in general, LB_KeoghEQ ≠ LB_KeoghEC and that on average each one is larger about half the time.

4.2.4 Cascading Lower Bounds
One of the most useful ways to speed up time series similarity search is the use of lower bounds to admissibly prune off unpromising candidates [6][11]. This has led to a flurry of research on lower bounds, with at least eighteen proposed for DTW [1][6][20][21][33][40][41][42]. In general, it is difficult to state definitively which is the best bound to use, since there is a tradeoff between the tightness of the lower bound and how fast it is to compute. Moreover, different datasets and even different queries can produce slightly different results. However, as a starting point, we implemented all published lower bounds and tested them on fifty different datasets from the UCR archive, plotting the (slightly idealized for visual clarity) results in Figure 9. Following the literature [20], we measured the tightness of each lower bound as LB(A,B)/DTW(A,B) over 100,000 randomly sampled subsequences A and B of length 256.

Figure 9: The mean tightness of selected lower bounds from the literature plotted against the time taken to compute them.
The reader will appreciate that a necessary condition for a lower bound to be useful is for it to appear on the "skyline" shown with a dashed line; otherwise there exists a faster-to-compute bound that is at least as tight, and we should use that instead. Note that the early abandoning DTW discussed in Section 4.1.4 is a special case in that it produces a spectrum of bounds, as at every stage of computation it is incrementally computing the DTW until the last computation gives the final true DTW distance. Which of the lower bounds on the skyline should we use? Our insight is that we should use all of them in a cascade. We first use the O(1) LB_KimFL, which, while a very weak lower bound, prunes many objects. If a candidate is not pruned at this stage we compute the LB_KeoghEQ. Note that as discussed in Sections 4.1.3, 4.2.1 and 4.2.2, we can incrementally compute this; thus, we may be able to abandon anywhere between O(1) and O(n) time. If we complete this lower bound without exceeding the best-so-far, we reverse the query/data role and compute LB_KeoghEC (cf. Section 4.2.3). If this bound does not allow us to prune, we then start the early abandoning calculation of DTW (cf. Section 4.1.4). Space limits preclude detailed analysis of which lower bounds prune how many candidates. Moreover, the ratios depend on the query, data and size of the dataset. However, we note the following: detailed analysis is available at [43]; lesion studies tell us that all bounds do contribute to speedup, and removing any lower bound makes search at least twice as slow; and finally, using this technique we can prune more than 99.9999% of DTW calculations for a large-scale search.
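A sketch (ours) of the cascade, with hypothetical helper functions lb_kim_fl, lb_keogh_eq, lb_keogh_ec and dtw_early_abandon standing in for the bounds named above: cheap, loose bounds are tried first, and the full DTW is computed only if every bound fails to prune.

    import numpy as np

    def cascade_prune(q, c, best_so_far, R,
                      lb_kim_fl, lb_keogh_eq, lb_keogh_ec, dtw_early_abandon):
        """Return the DTW distance if c might beat best_so_far, else np.inf."""
        if lb_kim_fl(q, c) > best_so_far:            # O(1) bound on first/last points
            return np.inf
        if lb_keogh_eq(q, c, R) > best_so_far:       # O(n) envelope around the query
            return np.inf
        if lb_keogh_ec(q, c, R) > best_so_far:       # O(n) envelope around the candidate
            return np.inf
        return dtw_early_abandon(q, c, R, best_so_far)   # O(nR), abandoned early if possible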

5. EXPERIMENTAL RESULTS
We begin by noting that we have taken extraordinary measures to ensure our experiments are reproducible. In particular, all data and code will be available in perpetuity, archived at [43]. Moreover, the site contains several videos which visualize some of the experiments in real time. We consider the following methods:
• Naive: Each subsequence is Z-normalized from scratch. The full Euclidean distance or the DTW is used at each step.

[Figure 7 residue: index visit orders 1-9 for the standard early-abandon ordering vs. the optimized ordering over Q and C.]
[Figure 8 residue: envelope {U, L} built around C (left) vs. around Q (right).]
[Figure 9 residue: y-axis is tightness of lower bound (0 to 1); x-axis is computation cost O(1), O(n), O(nR); labeled bounds include LB_KimFL, LB_KeoghEQ, max(LB_KeoghEQ, LB_KeoghEC), Early_abandoning_DTW, LB_Kim, LB_Yi, LB_Ecorner, LB_FTW, LB_PAA, DTW.]

point(s) [2][3][9]. For example, paper [2] suggests "In order to avoid evaluation of the dissimilarity measure for every possible pair of starting contour points … we propose to extract a small set of the most likely starting points for each shape." Furthermore, both the heuristic used and the number of starting points must "be adjusted to a given application", and it is not obvious how to best achieve this.
In forceful experiments on publicly available datasets it has been demonstrated that brute force rotation alignment produces the best precision/recall and accuracy in diverse domains [1][2]. In retrospect this is not too surprising. The rival techniques with rotation invariant features are all using some lossy transformation of the data. In contrast, the brute force rotation alignment techniques are using a (potentially) lossless transformation of the data. With more high quality information to use, any distance measure will have an easier time reflecting the true similarity of the original images.
The contribution of this work is to speed up these accurate but slow methods by many orders of magnitude while producing identical results.

2.4 Indexing Star Light Curves
While this paper is focused on the indexing of shapes, it has come to our attention that our techniques are ideally suited to the indexing of an important type of astronomical data known as star light curves. We would be remiss not to make this connection clear, so we briefly discuss the application and provide some experimental results below.
Globally there are myriads of telescopes covering the entire sky and constantly recording massive amounts of valuable astronomical data. Having humans supervise all observations is practically impossible; hence the increasing interest in computer aided astronomy. A star light curve is a time series of the brightness of a celestial object as a function of time. The study of light curves in astronomy is associated with the study of variability of sources. That led to the discoveries of pulsars, extrasolar planets, supernovae, and the rate of expansion of the universe, just to name a few. At the Time Series Center at the Harvard University Initiative in Innovative Computing there are more than 100 million such curves (with billions more expected by 2009); however, none of this data is currently searchable (other than by brute force search).

Figure 4: Examples of two similar star light curves.
There is a need to compare the similarity of light curves for basic astronomical research; for example, in [29] researchers discover unusual light curves worthy of further examination by finding the examples with the least similarity to other objects. There are two things which make this difficult. First is the enormous volume of data; the second is the fact that while it is possible to extract a single period of a light curve, there is no natural starting point. In order to find the similarity of two light curves it is therefore necessary to compare every possible circular shift of the data [29], which as we show below corresponds exactly to the rotation invariance matching problem for shapes in the one-dimensional representation. The astronomical community [29] has mitigated some of the CPU effort for circular-shift matching by rediscovering the convolution "trick" long known to the shape

matching community [38]. However, this technique does not help reduce disk accesses for data which does not fit in main memory, and only allows matching under the Euclidean metric.

3. ROTATION INVARIANT MATCHING
We begin by formally defining the rotation invariant matching problem. We begin by assuming the Euclidean distance, and generalize to other distance measures later. For clarity of presentation we will generally refer to "time series", which the reader will note can be mapped back to the original shapes. Suppose we have two time series, Q and C, of length n, which were extracted from shapes by an arbitrary method.

Q = q1, q2, …, qi, …, qn
C = c1, c2, …, cj, …, cn

As we are interested in large data collections, we denote a database of m such time series as Q.

Q = {Q1, Q2, …, Qm}
If we wish to compare two time series, and therefore shapes, we can use the ubiquitous Euclidean distance:

ED(Q, C) ≡ sqrt( Σ_{i=1}^{n} (qi - ci)² )

When using Euclidean distance as a subroutine in a classification or indexing algorithm, we may be interested in knowing the exact distance only when it is eventually going to be less than some threshold r. For example, this threshold can be the "range" in range search or the "best-so-far" in nearest neighbor search. If this is the case, we can potentially speed up the calculation by doing early abandoning [17].
Definition 1. Early Abandon: During the computation of the Euclidean distance, if we note that the current sum of the squared differences between each pair of corresponding data points exceeds r², then we can stop the calculation, secure in the knowledge that the exact Euclidean distance, had we calculated it, would exceed r.

While the idea of early abandoning is fairly obvious and intuitive, it is so important to our work that we illustrate it in Figure 5 and provide pseudocode in Table 1.

Figure 5: A visual intuition of early abandoning. Once the squared sum of the accumulated gray hatch lines exceeds r², we can be sure the full Euclidean distance exceeds r.

Note that the "num_steps" value returned by the optimized Euclidean distance in Table 1 is used only to tell us how useful the optimization was. If its value is significantly less than n, this suggests dramatic speedup.

Table 1: Euclidean distance optimized with early abandonment

    algorithm [dist, num_steps] = EA_Euclidean_Dist(Q, C, r)
        accumulator = 0
        for i = 1 to length(Q)              // loop over time series
            accumulator += (qi - ci)^2      // accumulate error contribution
            if accumulator > r^2            // can we abandon?
                break                       // yes: the full distance must exceed r
        dist = sqrt(accumulator)
        num_steps = i
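For concreteness, a Python sketch of the same idea (ours, not the paper's code); it returns both the distance and the number of steps actually executed:

    import math

    def ea_euclidean_dist(q, c, r):
        """Early-abandoning Euclidean distance (cf. Table 1).

        Returns (dist, num_steps); dist is only the true distance if
        num_steps == len(q), otherwise the true distance is known to exceed r.
        """
        acc = 0.0
        num_steps = 0
        for qi, ci in zip(q, c):
            acc += (qi - ci) ** 2
            num_steps += 1
            if acc > r * r:          # the full distance must exceed r; abandon
                break
        return math.sqrt(acc), num_steps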

[Figure 4 residue: the two light curves are labeled OGLE052401.70 and OGLE052357.02; x-axis 0 to 100.]
[Figure 5 residue: query Q and candidate C, with the annotation "calculation abandoned at this point".]

Rakthanmanon et al. 2012

Keogh et al. 2006

© Bruno Ribeiro

Page 12

12

But Alignment not Always Perfect

is decided on-the-fly based on the current K value. They are the values which evenly divide the ranges [1, current_K] and [current_K, max_K] into 5 intervals. Note that on average the bestSoFar value only changes log(m) times during a linear search, so this slight overhead in adjusting the parameter is not too burdensome; however, we do include this cost in all experiments in Section 5.

4.2 Lower Bounding in Index Space
True rotation invariance has traditionally been so demanding in terms of CPU time that little or no effort was made to index it (or it was indexed with the possibility of false dismissals with regard to the raw shapes [4]). As we shall see in the experiments in Section 5.2, the ideas presented in the last section produce such dramatic reductions in CPU time that it is worth considering indexing the data.
There are several possible techniques we could consider for indexing. Recent years have seen dozens of papers on indexing time series envelopes that we could attempt to leverage off [16][21][30][31][37]. The only non-trivial adaptation to be made is that instead of the query being a single envelope, it would be necessary to search for the best match to K envelopes in the wedge set W.
Note however that we do not necessarily have to use the enveloping idea in the indexing phase. So long as we can lower bound in the index space we can use an arbitrary technique to get (hopefully a small subset of) the data from disk to main memory [8], where our H-Merge can very efficiently find the distance to the best rotation.

Table 7: A Vantage Point Tree for Indexing Shapes

    Algorithm [BSF] = NNSearch(C)
        BSF.ID = null;                              // BSF is the Best-So-Far variable
        BSF.distance = infinity;
        W = convert_time_series_to_wedge_set(C);
        Search(root_Q, W, BSF);                     // invoke subroutine on the root of the index tree

    Subroutine Search(NODE, W, BSF)
        if NODE.isLeaf                              // we are at a leaf node
            for each compressed time series cT in NODE
                LB = computeLowerBound(cT, W);
                queue.push(cT, LB);                 // sorted by lower bound
            end
            while (not queue.empty()) and (queue.top().LB < BSF.distance)
                if (BSF.distance > queue.top().LB)
                    retrieve full time series Q of queue.top() from disk;
                    distance = H-Merge(Q, W, BSF.distance);   // calculate full distance
                    if distance < BSF.distance      // update the best-so-far
                        BSF.distance = distance;    // distance and location
                        BSF.ID = Q;
                    end
                end
            end
        else                                        // we are at a vantage point
            LB = computeLowerBound(VP, W);
            queue.push(VP, LB);
            if LB < (NODE.median + BSF.distance)
                Search(NODE.left, W, BSF);          // recursive search left
            else
                Search(NODE.right, W, BSF);         // recursive search right
            end
        end

One possible method to achieve this indexable lower bound is to use Fourier methods. Many authors have independently noted that transforming the signal to the Fourier space and calculating the Euclidean distance between the magnitudes of the coefficients produces a lower bound to any rotation [4][38]. We can leverage off this lower bound to use a VP-tree to index our time series, as shown in Table 7.

This technique is adapted from [38], and we refer the reader tothis work for a more complete treatment.

4.3 Generalizing to other Distance Measures
As we shall see in Section 5, the Euclidean distance is typically very effective and intuitive as a distance measure for shapes. However, in some domains it may not produce the best possible precision/recall or classification accuracy [2][30]. The problem is that even after best rotation alignment, subjectively similar shapes may produce time series that are globally similar but contain local "distortions". These distortions may correspond to local features that are present in both shapes but in different proportions. For example, in Figure 11 we can see that the larger brain case of the Lowland Gorilla changes the locations to which the brow ridge and jaw map in a time series, relative to the Mountain Gorilla.

Figure 11: The Lowland Gorilla and Mountain Gorilla are morphologically similar, but have slightly different proportions. Dynamic Time Warping can be used to align homologous features in the time series representation space.

Even if we assume that the database contains the actual object used as a query, it is possible that the two time series are distorted versions of each other. Here the distortions may be caused by camera perspective effects, differences in lighting causing shadows which appear to be features, parallax, etc. Fortunately there is a well-known technique for compensating for such local misalignments: Dynamic Time Warping (DTW) [16][30]. While DTW was invented in the context of 1D speech signals, others have noted its utility for matching shapes, including face profiles [5], hand gestures [25], leaves [30] and handwriting [31].
To align two sequences using DTW, an n-by-n matrix is constructed, where the (i-th, j-th) element of the matrix is the distance d(qi, cj) between the two points qi and cj (i.e., d(qi, cj) = (qi - cj)²). Each matrix element (i, j) corresponds to the alignment between the points qi and cj, as illustrated in Figure 12.

Mountain Gorilla, Gorilla gorilla beringei
Lowland Gorilla, Gorilla gorilla graueri

Rakthanmanon et al. KDD 2012

Page 13

} Accelerations & decelerations possible

} Compute Euclidean with accel. and decel. in mind

} Closely related to string-editing distance

13

Dynamic Time Warping (DTW)

Similar: approximate matching via string-editing distance:
d('holes', 'mole') = 2
= min # of insertions, deletions, substitutions to transform the first string into the second
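A short Python sketch (ours) of the standard edit-distance dynamic program behind the example above; the same table-filling pattern reappears in DTW below.

    def edit_distance(a: str, b: str) -> int:
        """Minimum number of insertions, deletions and substitutions turning a into b."""
        n, m = len(a), len(b)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i                      # delete all of a[:i]
        for j in range(m + 1):
            D[0][j] = j                      # insert all of b[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,        # deletion
                              D[i][j - 1] + 1,        # insertion
                              D[i - 1][j - 1] + sub)  # substitution (or match)
        return D[n][m]

    # edit_distance("holes", "mole") == 2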

Keogh et al. VLDB 2006

A warping path P is a contiguous set of matrix elements that defines a mapping between Q and C. The t-th element of P is defined as pt = (i, j)t, so we have:
P = p1, p2, …, pt, …, pT,   with n ≤ T < 2n - 1
The warping path that defines the alignment between the two time series is subject to several constraints. For example, the warping path must start and finish in diagonally opposite corner cells of the matrix; the steps in the warping path are restricted to adjacent cells (including diagonally adjacent cells); and the points in the warping path must be monotonically spaced in time. In addition to these constraints, virtually all practitioners using DTW also constrain the warping path in a global sense by limiting how far it may stray from the diagonal [16][30][31]. A typical constraint is the Sakoe-Chiba Band, which states that the warping path cannot deviate more than R cells from the diagonal [34].

Figure 12: Left) Two time series sequences with local differences. Right) To align the sequences we construct a warping matrix, and search for the optimal warping path, shown with solid squares. Note that a Sakoe-Chiba Band with width R is used to constrain the warping path.

The optimal warping path can be found in O(nR) time by dynamic programming [16]. Based on an arbitrary wedge W and the allowed warping range R, we define two new sequences, DTW_U and DTW_L:
DTW_Ui = max(Ui-R : Ui+R)
DTW_Li = min(Li-R : Li+R)

They form an additional envelope above and below the wedge, asillustrated in Figure 13.

Figure 13: The idea of bounding envelopes introduced in Figure 6 is generalized to allow DTW. A) Two time series C1 and C2. B) A time series wedge W, created from C1 and C2. C) In order to allow lower bounding of DTW, an additional envelope is created above and below the wedge. D) An illustration of LB_Keogh_DTW.

We can now define a lower bounding measure for DTW distancebetween an arbitrary query Q and the entire set of candidatesequences contained in a wedge W:

LB_Keogh_DTW(Q, W) = sqrt( Σ_{i=1}^{n} f(i) ), where

    f(i) = (qi - DTW_Ui)²   if qi > DTW_Ui
           (qi - DTW_Li)²   if qi < DTW_Li
           0                otherwise
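A Python/NumPy sketch (ours) of the wedge envelope construction and the lower bound just defined; U and L are the element-wise max and min over the time series in the wedge, and R is the Sakoe-Chiba width.

    import numpy as np

    def wedge_envelope(series, R):
        """Build U, L over a set of equal-length series, then widen them by R for DTW."""
        X = np.vstack(series)
        U, L = X.max(axis=0), X.min(axis=0)
        n = U.size
        dtw_U = np.array([U[max(0, i - R):min(n, i + R + 1)].max() for i in range(n)])
        dtw_L = np.array([L[max(0, i - R):min(n, i + R + 1)].min() for i in range(n)])
        return dtw_U, dtw_L

    def lb_keogh_dtw(q, dtw_U, dtw_L):
        """Lower bound on DTW(q, Cs) for every series Cs contained in the wedge."""
        q = np.asarray(q, dtype=float)
        above = np.where(q > dtw_U, (q - dtw_U) ** 2, 0.0)
        below = np.where(q < dtw_L, (q - dtw_L) ** 2, 0.0)
        return np.sqrt(np.sum(above + below))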

We will now prove the claim of the lower bounding.
Proposition 2: For any sequence Q of length n and a wedge W containing a set of time series C1, …, Ck of the same length n, for any global constraint on the warping path of the form j - R ≤ i ≤ j + R, the following inequality holds:
LB_Keogh_DTW(Q, W) ≤ DTW(Q, Cs), where s = 1, 2, ..., k.
Proof: Suppose we know that among the k time series C1, …, Ck, Cs has the minimal DTW distance to the query Q. We wish to prove

sqrt( Σ_{i=1}^{n} f(i) ) ≤ sqrt( Σ_{t=1}^{T} pt )

where f(i) is the piecewise term in the definition of LB_Keogh_DTW above and pt = (qi - Csj)² is the contribution of the t-th element pt = (i, j)t of the warping path between Q and Cs.

Since the terms under radicals are positive, we can square bothsides:

¦¦

d°¯

°®

­

��!� T

tst

n

iiiii

iiii

potherwise

LDTWqifLDTWqUDTWqifUDTWq

11

2

2

0_)_(_)_(

Recall that that when we stated the definition of the warping pathabove we had, P = p1, p2, …, pt, …, pT n ≤ T < 2n-1. Wetherefore have n ≤ T, so our strategy will be to show that everyterm in the left summation can be matched with some greater orequal term in the right summation.There are three cases to consider, for the moment we will justconsider the case when qi > DTW_Ui . We want to show:

$(q_i - \mathrm{DTW\_U}_i)^2 \le p_{st}$

$(q_i - \mathrm{DTW\_U}_i)^2 \le (q_i - C_{sj})^2$ (by Definition 3)

$q_i - \mathrm{DTW\_U}_i \le q_i - C_{sj}$ (since $q_i > \mathrm{DTW\_U}_i$, we can take square roots on both sides)

$-\mathrm{DTW\_U}_i \le -C_{sj}$ (subtract $q_i$ from both sides)

$C_{sj} \le \mathrm{DTW\_U}_i$ (add $\mathrm{DTW\_U}_i + C_{sj}$ to both sides)

$C_{sj} \le \max(U_{i-R} : U_{i+R})$ (by definition $\mathrm{DTW\_U}_i = \max(U_{i-R} : U_{i+R})$)

Since the query sequence Q and all the candidate sequences C1, …, Ck are of the same length and j − R ≤ i ≤ j + R, we know i − R ≤ j ≤ i + R. So we can rewrite the right side, and the inequality becomes

$C_{sj} \le \max(U_{i-R}, U_{i-(R-1)}, \ldots, U_j, \ldots, U_{i+R})$

If we remove all terms except $U_j$ from the RHS we are left with $C_{sj} \le \max(U_j)$, which is obviously true since $U_j = \max(C_{1j}, \ldots, C_{kj})$.

The case q_i < DTW_L_i yields a similar argument. The final case is simple to show, since clearly 0 ≤ (q_i − C_{sj})², because (q_i − C_{sj})² must be nonnegative. Thus we have shown that each term on the left side is matched with an equal or larger term on the right side. Our inequality holds. ■


© Bruno Ribeiro

Page 14: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} Dynamic programming: D(i, j) = matching cost up to times i and j

of sequences c and q

14

Dynamic Time Warping Computation

Complexity: O(t1 · t2), i.e., quadratic in the lengths of the sequences

Substitution cost
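A minimal sketch of the dynamic-programming computation described on this slide; the function name and the optional Sakoe-Chiba band parameter R are illustrative choices, not from the slides.

import numpy as np

def dtw_distance(q, c, R=None):
    """Dynamic time warping between sequences q and c.
    D[i, j] = cost of the best alignment of q[:i] with c[:j];
    the substitution cost is the squared difference of the matched samples.
    If R is given, the warping path is kept within |i - j| <= R (Sakoe-Chiba band)."""
    n, m = len(q), len(c)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if R is None else max(1, i - R)
        hi = m if R is None else min(m, i + R)
        for j in range(lo, hi + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2          # substitution cost
            D[i, j] = cost + min(D[i - 1, j],          # stretch q ("deceleration")
                                 D[i, j - 1],          # stretch c ("acceleration")
                                 D[i - 1, j - 1])      # diagonal step
    return np.sqrt(D[n, m])

The nested loops make the O(t1·t2) cost explicit; with the band, the inner loop shrinks to O(R) work per row.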

© Bruno Ribeiro

Page 15: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} Modify time series into a sequence of symbols

15

Quantization

acbbbccb

ac

bb

bc

cb

Allows:
• Hashing
• Suffix trees
• Simpler Markov models
• Use of more text-processing ideas

Page 16: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} Quantization: comes from the statistical term "quantiles"
◦ Use quantiles of the signal density function to replace the time series by symbols

} Widely used in signal processing

} Examples in time series mining
◦ SAX – Lin et al. 2007

16

What is Quantization?

[Figure: a time series of about 120 samples discretized into the symbols a, b, c, yielding the string abaabccbc]

Wikipedia

Keogh 2007
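A minimal sketch of quantile-based quantization, assuming a Python/NumPy setting; real SAX additionally applies piecewise aggregate approximation and uses Gaussian breakpoints, so this shows only the core idea.

import numpy as np

def quantize(x, symbols="abc"):
    """Map each sample to a symbol according to which empirical quantile
    bin of the signal it falls into."""
    k = len(symbols)
    # breakpoints at the 1/k, 2/k, ... quantiles of the signal itself
    breaks = np.quantile(x, [i / k for i in range(1, k)])
    bins = np.searchsorted(breaks, x)        # bin index 0 .. k-1 per sample
    return "".join(symbols[b] for b in bins)

x = np.sin(np.linspace(0, 4 * np.pi, 40)) + 0.1 * np.random.randn(40)
print(quantize(x))   # one symbol per sample, e.g. 'bcbba...'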

© Bruno Ribeiro

Page 17: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

17

Euclidean Distance Lower Bound

C = bbabcbac

Q = bbaccbac

dist() can be implemented using a table lookup.

Camerra et al. ICDM 2010
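A sketch of the table-lookup distance between two equal-length symbol strings; the per-pair costs below are illustrative (they follow the common 3-symbol breakpoint table in which adjacent symbols contribute 0), not values taken from the slides.

import numpy as np

# Illustrative per-symbol-pair costs: adjacent symbols cost 0, so the
# result lower-bounds the Euclidean distance on the raw series.
DIST_TABLE = {
    ("a", "a"): 0.0,  ("a", "b"): 0.0,  ("a", "c"): 0.86,
    ("b", "a"): 0.0,  ("b", "b"): 0.0,  ("b", "c"): 0.0,
    ("c", "a"): 0.86, ("c", "b"): 0.0,  ("c", "c"): 0.0,
}

def symbol_dist(C, Q):
    """Table-lookup distance between two equal-length symbol strings."""
    assert len(C) == len(Q)
    return np.sqrt(sum(DIST_TABLE[(c, q)] ** 2 for c, q in zip(C, Q)))

print(symbol_dist("bbabcbac", "bbaccbac"))  # 0.0 here: b and c are adjacent symbols, so the bound is loose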

© Bruno Ribeiro

Page 18: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} Longest Common Sub-Sequence (LCSS)

18

Matching with Errors

For brevity we omit the very minor modifications required to index LB_Keogh_DTW(Q, W); however, [37] contains the necessary modifications for both DTW and for LCSS, which is discussed below.

To facilitate later efficiency comparisons to Euclidean distance and other methods, it will be useful to define the time complexity of DTW in terms of "num_steps" as returned by Table 1 and Table 5. The variable "num_steps" is the number of real-value subtractions that must be performed, and completely dominates the CPU time, since the square root function is only performed once (and can be removed, see [17]). If we construct a full n by n warping matrix, then DTW clearly requires at least n² steps. However, as we noted above and illustrated in Figure 12, we can truncate the corners of the matrix to reduce this number to approximately nR, where R is the width of the Sakoe-Chiba Band. While nR is the number of steps for a single DTW, we expect the average number of steps to be less, because some full DTW calculations will not be needed if the lower bound test fails. Since the lower bound test requires n steps, the average number of steps when doing m comparisons should be:

$\dfrac{a\,m\,(nR) + m\,n}{m}$

where a is the fraction of the database that requires the full DTW to be calculated. Note that even this is pessimistic, since both DTW²

and LB_Keogh_DTW are implemented with early abandoning (recall Table 5). We therefore simply count the "num_steps" required by each approach and divide it by m to get the average number of steps required for one comparison.

In addition to DTW, several researchers have suggested using Longest Common SubSequence (LCSS) as a distance measure for shapes. LCSS is very similar to DTW except that while DTW insists that every point in C maps onto one (or more) point(s) in Q, LCSS allows some points to go unmatched. The intuition behind this idea in a time series domain is that subsequences may contain additions or deletions, for example an extra (or forgotten) dance move in a motion capture performance, or a missed beat in ECG data. Rather than forcing DTW to produce an unnatural alignment between two such sequences, we can use LCSS, which simply ignores parts of the time series that are too difficult to match. In the image space the missing section of the time series may correspond to a partial occlusion of an object, or to a physically missing part of the object, as shown in Figure 14.

² Note that a recursive implementation of DTW would always require nR steps; however, an iterative implementation (as used here) can potentially early-abandon with as few as R steps.
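A minimal sketch of the LCSS recurrence described above, assuming real-valued series and a matching threshold eps; the time-window constraint often used in practice is omitted, and the names are illustrative.

def lcss_length(q, c, eps=0.1):
    """Longest common subsequence length for real-valued series: two points
    match when they are within eps; points that do not match are skipped
    rather than forced into the alignment (unlike DTW)."""
    n, m = len(q), len(c)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(q[i - 1] - c[j - 1]) <= eps:
                L[i][j] = L[i - 1][j - 1] + 1            # match this pair
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # skip one point
    return L[n][m]

# A common similarity score is lcss_length(q, c) / min(len(q), len(c)).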

Figure 14: A) The famous Skhul V is generally reproduced with the missing bones extrapolated in epoxy; however, the original Skhul V (B) is missing the nose region, which means it will match a modern human (C) poorly, even after DTW alignment (inset). In contrast, LCSS alignment will not attempt to match features that are outside a "matching envelope" (heavy gray line) created from the other sequence.

Real-world examples of domains that require LCSS abound. For example, anthropologists are interested in exploring large datasets of projectile points ("arrowheads"). At the UCR Lithic Technology Lab there are over a million specimens, so indexing is required for efficient access. While anthropologists have long been interested in shape, interest in matching such objects is further fueled by the availability of computing power and by a recent movement that notes, "an increasing number of archaeologists are showing interest in employing Darwinian evolutionary theory to explain variation in the material record" [26]. Anthropologists have recently used tools from biological morphology to attempt to explain the spatial and temporal distribution of projectile points in North America. As we illustrate in Figure 15, many examples are incomplete, missing tips or tangs. LCSS can ignore such missing features to provide more robust matching.

Figure 15: Projectile points are frequently found with broken tips or tangs. Such objects require LCSS to find meaningful matches to complete specimens. From left to right: Edwards, Langtry, and Golondrina projectile points.

While we considered LCSS for generality, we will not further explain how to incorporate it into our framework. It has been


Keogh et al. 2006

© Bruno Ribeiro

Page 19: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} Helps identify commands / words

} Can be used in prediction

} Obtain more “meaningful” symbolic representation

19

Time Series Motifs

AutoPlait: Automatic Mining of Co-evolving Time Sequences

Yasuko Matsubara, Kumamoto University

[email protected]

Yasushi Sakurai, Kumamoto University

[email protected]

Christos Faloutsos, Carnegie Mellon University

[email protected]

ABSTRACT

Given a large collection of co-evolving multiple time-series, which contains an unknown number of patterns of different durations, how can we efficiently and effectively find typical patterns and the points of variation? How can we statistically summarize all the sequences, and achieve a meaningful segmentation?

In this paper we present AUTOPLAIT, a fully automatic mining algorithm for co-evolving time sequences. Our method has the following properties: (a) effectiveness: it operates on large collections of time-series, and finds similar segment groups that agree with human intuition; (b) scalability: it is linear in the input size, and thus scales up very well; and (c) AUTOPLAIT is parameter-free, and requires no user intervention, no prior training, and no parameter tuning.

Extensive experiments on 67GB of real datasets demonstrate that AUTOPLAIT does indeed detect meaningful patterns correctly, and it outperforms state-of-the-art competitors as regards accuracy and speed: AUTOPLAIT achieves near-perfect, over 95% precision and recall, and it is up to 472 times faster than its competitors.

Categories and Subject Descriptors: H.2.8 [Database management]: Database applications–Data mining

General Terms: Algorithms, Experimentation, Theory

Keywords: Time-series data, Automatic mining

1. INTRODUCTION

Given a large collection of co-evolving time series, such as motion capture and web-click logs, how can we find the typical patterns or anomalies, and statistically summarize all the sequences? In this paper we focus on a challenging problem, namely fully-automatic mining, and more specifically, we tackle the four important time-series analysis tasks, namely CAPS (Compression / Anomaly-detection / Pattern-discovery / Segmentation).

Our goal is to analyze a large collection of multiple time-series (hereafter, a "bundle" of time-series), and summarize all the sequences into a compact yet powerful representation. So, what is a good representation of time-series? In practice, real-life data has distinct multiple trends, such as the weekday/weekend patterns of



(a) AUTOPLAIT result

beaks → wings → tail feathers → claps

(b) Four basic steps of the “chicken dance”

Figure 1: AUTOPLAIT "automatically" identifies the dance steps of a motion capture clip, as well as the positions of all the cut points: original motion capture sequences and our result for the "chicken dance", which consists of four basic steps, "beaks", "wings", "tail feathers" and "claps".

web-click sequences, and the normal/abnormal patterns seen in network traffic monitoring (hereafter we refer to such a pattern as a "regime"). How can we describe all these trends and distinct patterns (i.e., regimes) in large datasets? We need a good summarization of time-series in terms of a statistical framework.

In this paper, we present AUTOPLAIT,¹ which answers the above questions and provides a good description of large collections of co-evolving sequences. Intuitively, the problem we wish to solve is as follows:

INFORMAL PROBLEM 1. Given a large collection of co-evolving time series X (i.e., a bundle), find a compact description of X, i.e.,

• find a set of segments and their cut points,
• find a set of groups (namely, regimes) of similar segments,
• automatically, and quickly.

Our algorithm automatically identifies all distinct patterns/regimes in a time-series, and spots the time-position of each variation.

Preview of our results. Figure 1 (a) shows the original time-series of the "chicken dance"² and the result we obtained with AUTOPLAIT. The motion consists of four co-evolving sequences: left/right arms and left/right legs, composed of four basic steps in the following order: "beaks", "wings", "tail feathers" and "claps" (see

¹ Available at http://www.cs.kumamoto-u.ac.jp/~yasuko/software.html
² Popular in the 1980's; please see, e.g., http://www.youtube.com/watch?v=6UV3kRV46Zs&t=49s


Matsubara et al. SIGMOD 2014

© Bruno Ribeiro

Page 20: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} Matching Time Series

} Models

20

Overview

© Bruno Ribeiro

Page 21: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

21

Linear Forecasting

© Bruno Ribeiro

Page 22: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

22

} Linear Forecasting
◦ Auto-regression: Least Squares; Recursive Least Squares

Google Finance

???

© Bruno Ribeiro

Page 23: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

23

} Remove trends (drift)
} Remove periodicity

Google Finance

Remove trends

Australian beer consumption shows periodicity (seasonality); autocorrelation function
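A minimal sketch of removing drift by first differencing and then inspecting the autocorrelation function for a seasonal lag; the synthetic series, lag values, and function name are illustrative assumptions.

import numpy as np

def detrend_and_acf(x, max_lag=40):
    """First-difference the series to remove a linear trend (drift),
    then compute the sample autocorrelation function to look for periodicity."""
    x = np.asarray(x, dtype=float)
    d = np.diff(x)                       # differencing removes a linear trend
    d = d - d.mean()
    acf = np.array([np.sum(d[:len(d) - k] * d[k:]) for k in range(max_lag + 1)])
    return d, acf / acf[0]               # normalize so acf[0] == 1

# Drifting series with a period-12 seasonal component (e.g., monthly data)
t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 12) + 0.2 * np.random.randn(200)
_, acf = detrend_and_acf(x)
print(acf[12])    # the seasonal lag shows up as a clearly positive value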

© Bruno Ribeiro

Page 24: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

24

} Interpolation (something is missing)

} (x1, …, xt)
} (y1, …, yt)

?

Faloutsos 2014

© Bruno Ribeiro

Page 25: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

25

Google Finance

noise

Similar problem to linear regression: express the unknowns as a linear function of the knowns

????

© Bruno Ribeiro

Page 26: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

26

} X[t × w] · a[w × 1] = y[t × 1]

} Over-constrained problem
◦ a is the vector of the regression coefficients
◦ X has the t values of the w independent variables
◦ y has the t values of the dependent variable

Faloutsos 2014

© Bruno Ribeiro

Page 27: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

27

} X[t × w] · a[w × 1] = y[t × 1]

time

© Bruno Ribeiro

Page 28: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

28

} a = (Xᵀ·X)⁻¹·(Xᵀ·y)

X⁺ = (Xᵀ·X)⁻¹·Xᵀ is the Moore–Penrose pseudoinverse

Or: a = X⁺·y

a is the vector that minimizes the RMSE of (y – X⋅a’)

Problems:
} Matrix X grows over time and needs matrix inversion
} O(t·w²) computation
} O(t·w) storage
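A minimal sketch of the setup and the pseudoinverse solution above: build the lag matrix X, solve a = X⁺·y, and make a one-step forecast. The function names are illustrative; np.linalg.pinv computes the Moore-Penrose pseudoinverse more stably than an explicit matrix inversion.

import numpy as np

def fit_ar(y, w):
    """Fit an order-w autoregression by least squares:
    each row of X holds the w previous values, the target is the next value."""
    y = np.asarray(y, dtype=float)
    t = len(y) - w
    X = np.column_stack([y[i:i + t] for i in range(w)])    # X is t x w
    target = y[w:]                                          # target is t x 1
    a = np.linalg.pinv(X) @ target   # a = (X^T X)^{-1} X^T y via the pseudoinverse
    return a

def forecast_next(y, a):
    """One-step-ahead forecast from the last w observed values."""
    w = len(a)
    return float(np.dot(y[-w:], a))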

© Bruno Ribeiro

Page 29: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

29

Recursive Least Squares

© Bruno Ribeiro

Page 30: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

30

Recursive Least Squares Algorithm

© Bruno Ribeiro

Page 31: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

31

Exponentially Weighted Recursive Least Squares Algorithm

© Bruno Ribeiro

Page 32: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

Recursive LS
} Needs a much smaller, fixed-size matrix: O(w × w)
} Fast, incremental computation: O(1 × w²) per new sample
} No matrix inversion

32

Comparison

Original Least Squares
} Needs a huge matrix (growing in size): O(t × w)
} Costly matrix operation: O(t × w²)
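A minimal sketch of the recursive least squares update, with an exponential forgetting factor that also covers the exponentially weighted variant named on the previous slide; the class name and default parameter values are illustrative assumptions.

import numpy as np

class RecursiveLS:
    """Recursive least squares with forgetting factor lam (lam = 1.0 gives
    ordinary RLS). Keeps only a w x w matrix P; each update costs O(w^2)
    and needs no matrix inversion."""

    def __init__(self, w, lam=0.98, delta=1000.0):
        self.a = np.zeros(w)            # current coefficient estimate
        self.P = delta * np.eye(w)      # ~ inverse covariance, large initial value
        self.lam = lam

    def update(self, x, y):
        """x: length-w regressor vector (e.g., the last w samples); y: new target."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)             # gain vector
        err = y - x @ self.a                      # prediction error with old coefficients
        self.a = self.a + k * err                 # coefficient update
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.a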

© Bruno Ribeiro

Page 33: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

33

More Advanced Models

© Bruno Ribeiro

Page 34: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

} HMMs

} Conditional Random Fields

} Linear-chain CRF

34

Graphical Models


Figure 1.3 Graphical model of an HMM-like linear-chain CRF.


Figure 1.4 Graphical model of a linear-chain CRF in which the transition scoredepends on the current observation.

1.3 Linear-Chain Conditional Random Fields

In the previous section, we have seen advantages both to discriminative modeling and to sequence modeling. So it makes sense to combine the two. This yields a linear-chain CRF, which we describe in this section. First, in Section 1.3.1, we define linear-chain CRFs, motivating them from HMMs. Then, we discuss parameter estimation (Section 1.3.2) and inference (Section 1.3.3) in linear-chain CRFs.

1.3.1 From HMMs to CRFs

To motivate our introduction of linear-chain conditional random fields, we begin by considering the conditional distribution p(y|x) that follows from the joint distribution p(y, x) of an HMM. The key point is that this conditional distribution is in fact a conditional random field with a particular choice of feature functions. First, we rewrite the HMM joint (1.8) in a form that is more amenable to generalization. This is

$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z}\exp\left\{\sum_{t}\sum_{i,j \in S} \lambda_{ij}\,\mathbf{1}_{\{y_t = i\}}\mathbf{1}_{\{y_{t-1} = j\}} \;+\; \sum_{t}\sum_{i \in S}\sum_{o \in O} \mu_{oi}\,\mathbf{1}_{\{y_t = i\}}\mathbf{1}_{\{x_t = o\}}\right\} \quad (1.13)$$

where θ = {λ_ij, μ_oi} are the parameters of the distribution, and can be any real numbers. Every HMM can be written in this form, as can be seen simply by setting λ_ij = log p(y′ = i | y = j) and so on. Because we do not require the parameters to be log probabilities, we are no longer guaranteed that the distribution sums to 1, unless we explicitly enforce this by using a normalization constant Z. Despite this added flexibility, it can be shown that (1.13) describes exactly the class of HMMs in (1.8); we have added flexibility to the parameterization, but we have not added any distributions to the family.
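A minimal sketch of the scoring implied by (1.13), assuming integer-coded states and observations; the function name and argument layout are illustrative, and Z is omitted since only the unnormalized score is shown.

import numpy as np

def hmm_as_crf_score(y, x, lam, mu):
    """Unnormalized log-score of label sequence y for observations x in the
    exponential-family form of (1.13): transition weights lam[i, j] plus
    emission weights mu[o, i]. With lam[i, j] = log p(y'=i | y=j) and
    mu[o, i] = log p(x=o | y=i), exp(score) recovers the HMM joint up to Z."""
    score = 0.0
    for t in range(1, len(y)):
        score += lam[y[t], y[t - 1]]     # transition feature 1{y_t=i} 1{y_{t-1}=j}
    for t in range(len(y)):
        score += mu[x[t], y[t]]          # observation feature 1{y_t=i} 1{x_t=o}
    return score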



Figure 1.2 Diagram of the relationship between naive Bayes, logistic regression,HMMs, linear-chain CRFs, generative models, and general CRFs.

Furthermore, even when naive Bayes has good classification accuracy, its probability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, x = (x1, x1, x2, x2, …, xK, xK). This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data. Assumptions like naive Bayes can be especially problematic when we generalize to sequence models, because inference essentially combines evidence from different parts of the model. If probability estimates at a local level are overconfident, it might be difficult to combine them sensibly. Actually, the difference in performance between naive Bayes and logistic regression is due only to the fact that the first is generative and the second discriminative; the two classifiers are, for discrete input, identical in all other respects. Naive Bayes and logistic regression consider the same hypothesis space, in the sense that any logistic regression classifier can be converted into a naive Bayes classifier with the same decision boundary, and vice versa. Another way of saying this is that the naive Bayes model (1.5) defines the same family of distributions as the logistic regression model (1.7), if we interpret it generatively as

$$p(y, \mathbf{x}) = \frac{\exp\left\{\sum_{k} \lambda_k f_k(y, \mathbf{x})\right\}}{\sum_{\tilde{y}, \tilde{\mathbf{x}}} \exp\left\{\sum_{k} \lambda_k f_k(\tilde{y}, \tilde{\mathbf{x}})\right\}} \quad (1.9)$$

This means that if the naive Bayes model (1.5) is trained to maximize the conditional likelihood, we recover the same classifier as from logistic regression. Conversely, if the logistic regression model is interpreted generatively, as in (1.9), and is trained to maximize the joint likelihood p(y, x), then we recover the same classifier as from naive Bayes. In the terminology of Ng and Jordan [2002], naive Bayes and logistic regression form a generative-discriminative pair. The principal advantage of discriminative modeling is that it is better suited to


Figures: Sutton & McCallum 2002

© Bruno Ribeiro

Page 35: Time Series Analysis - Purdue University · that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification. KeywordsShape,

35

More Generally


Figure 1.2 Diagram of the relationship between naive Bayes, logistic regression,HMMs, linear-chain CRFs, generative models, and general CRFs.


Figures: Sutton & McCallum 2002


Recommended Reading: C. Sutton & A. McCallum 2002

© Bruno Ribeiro