A Multiresolution Symbolic Representation of Time Series

Vasileios Megalooikonomou (1), Qiang Wang (1), Guo Li (1), Christos Faloutsos (2)

(1) Temple University, Philadelphia, USA; (2) Carnegie Mellon University, Pittsburgh, USA

Page 1: A Multiresolution Symbolic Representation of Time Series

A Multiresolution Symbolic Representation of Time Series

Vasileios Megalooikonomou (1), Qiang Wang (1), Guo Li (1), Christos Faloutsos (2)

(1) Temple University, Philadelphia, USA

(2) Carnegie Mellon University, Pittsburgh, USA

Page 2: A Multiresolution Symbolic Representation of Time Series

Outline

Background

Methodology

Experimental results

Conclusion

Page 3: A Multiresolution Symbolic Representation of Time Series

Introduction

Time sequence: a sequence (ordered collection) of real values: X = x1, x2, …, xn

Challenges:

• High dimensionality

• High amount of data

• Similarity metric definition

……

Page 4: A Multiresolution Symbolic Representation of Time Series

Introduction

Goal: to achieve

• High efficiency

• High accuracy

in similarity searches among time series and in discovering interesting patterns

Page 5: A Multiresolution Symbolic Representation of Time Series

Introduction

Similarity metric for time series

• Euclidean distance: most common, sensitive to shifts

• Dynamic Time Warping (DTW): improves accuracy, but is time consuming, O(n^2)

• Envelope-based DTW: improves the time complexity to O(n)
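The trade-off between the first two metrics can be illustrated with a minimal Python sketch. The function names `euclidean` and `dtw` and the sample series are illustrative assumptions, not taken from the slides. The DTW shown is the textbook quadratic dynamic program with an absolute-difference local cost; the envelope-based speedup is not implemented here.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Textbook DTW by dynamic programming: O(len(x) * len(y)) time."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # local cost (an assumption of this sketch)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same shape as a, shifted one step to the left
print(euclidean(a, b))  # 2.0
print(dtw(a, b))        # 0.0
```

The shifted copy is penalized by the Euclidean distance but not by DTW, at the price of filling a table with one cell per pair of points.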

Page 6: A Multiresolution Symbolic Representation of Time Series

Introduction

Similarity metric for time series

A more intuitive idea:

Two series should be considered similar if they have enough non-overlapping, time-ordered pairs of similar subsequences (Agrawal et al., VLDB 1995)

Page 7: A Multiresolution Symbolic Representation of Time Series

Introduction

Dimensionality reduction techniques:

• DFT: Discrete Fourier Transform

• DWT: Discrete Wavelet Transform

• SVD: Singular Value Decomposition

• APCA: Adaptive Piecewise Constant Approximation

• PAA: Piecewise Aggregate Approximation

• SAX: Symbolic Aggregate approXimation

• …

Page 8: A Multiresolution Symbolic Representation of Time Series

Introduction

Suggested solution: Multiresolution Vector Quantized (MVQ) approximation

1) Uses a ‘vocabulary’ of subsequences

2) Takes multiple resolutions into account

3) Unlike wavelets, partially ignores the ordering of ‘codewords’

4) Exploits prior knowledge about the data

5) Provides a new distance metric

Page 9: A Multiresolution Symbolic Representation of Time Series

Outline

Background

Methodology

Experimental results

Conclusion

Page 10: A Multiresolution Symbolic Representation of Time Series

Methodology

A new framework (four steps):

• Create a ‘vocabulary’ of subsequences (codebook)

• Represent time series using codewords

• Utilize multiple resolutions

• Employ a new distance metric

Page 11: A Multiresolution Symbolic Representation of Time Series

Methodology

[Figure: codebook generation (s = 16), series transformation, and series encoding. Sample encoder output shown on the slide:]

Digit string: 112100000000100012000100110000001000000012001100100000001100210000010101001100101010000100100011 ……

Symbol string (codewords a–p): c m d b c a i f a j b b m i n j j a m a i n j m h l d f k o p h c a k o o g c b l p o c c b l h l h n k k k p l c a c g k k g j h h g k g j l p ……

Page 12: A Multiresolution Symbolic Representation of Time Series

Methodology

Creating a ‘vocabulary’: frequently appearing patterns in subsequences

Q: How to create?

A: Use Vector Quantization, in particular, the Generalized Lloyd Algorithm (GLA)

Produces a codebook based on two conditions:

• Nearest neighbor condition (NNC)

• Centroid condition (CC)

Output:

A codebook with s codewords
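The two conditions can be sketched as alternating steps of a loop. This is a minimal pure-Python illustration, not the authors' implementation: the names `gla`, `subsequences`, and `sq_dist`, the random initialization, the squared-error distortion, and the use of non-overlapping windows as training vectors are all assumptions of the sketch.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gla(vectors, s, iters=20, seed=0):
    """Generalized Lloyd Algorithm sketch: returns a codebook of s codewords."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, s)]  # assumed initialization
    for _ in range(iters):
        # Nearest neighbor condition: assign each vector to its closest codeword.
        cells = [[] for _ in range(s)]
        for v in vectors:
            k = min(range(s), key=lambda i: sq_dist(v, codebook[i]))
            cells[k].append(v)
        # Centroid condition: move each codeword to the mean of its cell.
        new_codebook = []
        for k, cell in enumerate(cells):
            if cell:
                new_codebook.append([sum(col) / len(cell) for col in zip(*cell)])
            else:
                new_codebook.append(codebook[k])  # leave an empty cell's codeword as is
        if new_codebook == codebook:
            break  # converged: neither condition changes the codebook
        codebook = new_codebook
    return codebook

def subsequences(series, l):
    """Cut a series into consecutive non-overlapping windows of length l."""
    return [series[i:i + l] for i in range(0, len(series) - l + 1, l)]

train = [0, 0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 5, 5, 5, 5]
codebook = gla(subsequences(train, 4), s=2)
print(sorted(codebook))
```

On this toy series the two recurring patterns (a flat low segment and a flat high segment) become the two codewords.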

Page 13: A Multiresolution Symbolic Representation of Time Series

Methodology

Representing time series

X = x1, x2, …, xn

is encoded with a new representation

f = (f1, f2, …, fs)

(fi is the frequency of the i-th codeword in X)
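The encoding step can be sketched as follows, assuming that the series is cut into non-overlapping windows of the codeword length and that each window is mapped to its nearest codeword under squared Euclidean distance. The function name `encode` and the three-codeword codebook are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode(series, codebook):
    """Return (symbols, f): the nearest-codeword index of each window and
    the frequency vector f, where f[i] counts occurrences of codeword i."""
    l = len(codebook[0])
    symbols = []
    for start in range(0, len(series) - l + 1, l):
        window = series[start:start + l]
        symbols.append(min(range(len(codebook)),
                           key=lambda i: sq_dist(window, codebook[i])))
    f = [0] * len(codebook)
    for k in symbols:
        f[k] += 1
    return symbols, f

codebook = [[0, 0], [1, 1], [0, 1]]  # hypothetical codebook: s = 3, l = 2
x = [0, 0, 1, 1, 0.1, 0.9, 1, 1]
symbols, f = encode(x, codebook)
print(symbols)  # [0, 1, 2, 1]
print(f)        # [1, 2, 1]
```

The frequency vector has length s regardless of n, which is where the dimensionality reduction comes from; the order of the symbols is discarded.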

Page 14: A Multiresolution Symbolic Representation of Time Series

Methodology

New distance metric:

The histogram model is used to calculate similarity at each resolution level:

$$S_{HM}(q, t) = \frac{1}{1 + dis(q, t)}$$

with

$$dis(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$
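A minimal sketch of the histogram model, assuming that dis(q, t) sums, over the s codewords, the absolute difference of the two frequencies divided by one plus their sum, and that the similarity is 1 / (1 + dis(q, t)). The function names and the sample frequency vectors are illustrative.

```python
def hist_distance(fq, ft):
    """dis(q, t): sum of normalized absolute frequency differences."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(fq, ft))

def s_hm(fq, ft):
    """Histogram-model similarity in (0, 1]; equals 1 for identical histograms."""
    return 1.0 / (1.0 + hist_distance(fq, ft))

q = [1, 2, 1]  # hypothetical codeword frequencies of the query
t = [1, 0, 3]  # hypothetical codeword frequencies of a database series
print(s_hm(q, q))  # 1.0
print(s_hm(q, t))
```

The "1 +" in each denominator keeps a term defined when a codeword appears in neither series.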

Page 15: A Multiresolution Symbolic Representation of Time Series

Methodology

Time series summarization:

• High-level information (frequently appearing patterns) is more useful

• The new representation can provide this kind of information

[Figure annotation: both codeword (pattern) 3 and codeword 5 show up 2 times]

Page 16: A Multiresolution Symbolic Representation of Time Series

Methodology

Problems of frequency-based encoding:

• It cannot record the location of a subsequence

• It is hard to define an appropriate resolution (codeword length)

• It may lose global information

Page 17: A Multiresolution Symbolic Representation of Time Series

Methodology

Utilizing multiple resolutions:

Solution: encoding with multiple resolutions

The resolution levels are complementary to each other

[Figure: reconstruction of time series using different resolutions]

Page 18: A Multiresolution Symbolic Representation of Time Series

Methodology

New distance metric:

For all resolution levels, a weighted similarity metric is defined as:

$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM,i}(q, d_j)$$
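A minimal sketch of the weighted combination, assuming each series has already been encoded into one frequency vector per resolution level and that the per-level similarity is the histogram model 1 / (1 + dis). The names `s_hm` and `s_hhm` and the sample histograms are hypothetical.

```python
def s_hm(fq, ft):
    """Per-level histogram-model similarity, 1 / (1 + dis)."""
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(fq, ft))
    return 1.0 / (1.0 + dis)

def s_hhm(q_levels, d_levels, weights):
    """Weighted sum over the c resolution levels of the per-level similarity."""
    return sum(w * s_hm(fq, fd)
               for w, fq, fd in zip(weights, q_levels, d_levels))

# One histogram per level; codebook sizes may differ between levels.
q_levels = [[1, 0], [1, 1, 0], [2, 1, 1]]
d_levels = [[1, 0], [0, 1, 1], [2, 1, 1]]
print(s_hhm(q_levels, d_levels, [1, 1, 1]))  # all levels, equal weights: 2.5
print(s_hhm(q_levels, d_levels, [0, 1, 0]))  # only the second level: 0.5
```

A weight vector with a single 1 reduces the metric to single-level VQ, which is how the weight vectors in the experimental tables can be read.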

Page 19: A Multiresolution Symbolic Representation of Time Series

Methodology

Parameters of MVQ

X: Original time series, X = x1, x2, …, xn of length n

X′: Encoded form of the original time series, X′ = f′1, f′2, …, f′s

N: Number of time series in the dataset

n: Length of original time series

C: Codebook: a set of codewords {c1, …, ck, …, cs}

c: Number of resolution levels

s: Size of codebook

l: Length of codeword

Page 20: A Multiresolution Symbolic Representation of Time Series

Methodology

Parameters of MVQ

• Number of resolution levels:

$$c = \log_2(n / l_{min}) + 1$$

where l_min is the minimal codeword length

• Length of codeword (on the i-th level):

$$l = n / 2^{i-1}$$

• Size of codebook: data dependent. However, in practice, small codebooks can achieve very good results
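The first two parameters can be checked with a short sketch. It assumes the logarithm is base 2, which follows from the codeword length halving at each level, and that n / l_min is a power of two. The function name and the values n = 64 and l_min = 4 are hypothetical.

```python
import math

def resolution_levels(n, l_min):
    """Return (c, lengths): the number of levels and the codeword length per level."""
    c = int(math.log2(n / l_min)) + 1
    lengths = [n // 2 ** (i - 1) for i in range(1, c + 1)]  # l = n / 2^(i-1)
    return c, lengths

print(resolution_levels(64, 4))  # (5, [64, 32, 16, 8, 4])
```

The first level uses a single codeword spanning the whole series, and the last level uses codewords of the minimal length.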

Page 21: A Multiresolution Symbolic Representation of Time Series

Outline

Background

Methodology

Experimental results

Conclusion

Page 22: A Multiresolution Symbolic Representation of Time Series

Experiments

Datasets

• SYNDATA (control chart data): synthetic

• CAMMOUSE: 3 * 5 sequences obtained using the Camera Mouse Program

• RTT: RTT measurements from UCR to CMU with sending rate of 50 msec for a day

Page 23: A Multiresolution Symbolic Representation of Time Series

Experiments

Best Match Searching: For a given query q, the time series within the same class as the query (given our prior knowledge) form the standard set, std_set(q), and the results found by the different approaches, knn(q), are compared to this set

The matching accuracy is defined as:

$$Accuracy = \frac{|std\_set(q) \cap knn(q)|}{k} \times 100\%$$
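The accuracy measure can be sketched in a few lines, assuming k is the number of retrieved neighbors and series are identified by labels. The function name and the sample identifiers are hypothetical.

```python
def matching_accuracy(std_set, knn):
    """Percentage of the k retrieved series that belong to the query's class."""
    k = len(knn)
    return 100.0 * len(set(std_set) & set(knn)) / k

std_set = {"s1", "s2", "s3", "s4"}     # series in the same class as the query
knn = ["s2", "s9", "s4", "s1", "s7"]   # k = 5 nearest neighbors returned
print(matching_accuracy(std_set, knn))  # 60.0
```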

Page 24: A Multiresolution Symbolic Representation of Time Series

Experiments

Best Match Searching

SYNDATA

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.55
Single-level VQ   [0 1 0 0 0]     0.70
Single-level VQ   [0 0 1 0 0]     0.65
Single-level VQ   [0 0 0 1 0]     0.48
Single-level VQ   [0 0 0 0 1]     0.46
MVQ               [1 1 1 1 1]     0.83
Euclidean                         0.51

CAMMOUSE

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.56
Single-level VQ   [0 1 0 0 0]     0.60
Single-level VQ   [0 0 1 0 0]     0.44
Single-level VQ   [0 0 0 1 0]     0.56
Single-level VQ   [0 0 0 0 1]     0.60
MVQ               [1 1 1 1 1]     0.83
Euclidean                         0.58

Page 25: A Multiresolution Symbolic Representation of Time Series

Experiments

Best Match Searching

[Figure: precision-recall for different methods (a) on the SYNDATA dataset, (b) on the CAMMOUSE dataset]

Page 26: A Multiresolution Symbolic Representation of Time Series

Experiments

Clustering experiments

Given two clusterings, G = G1, G2, …, Gk (the true clusters) and A = A1, A2, …, Ak (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:

$$Sim(G, A) = \frac{\sum_{i} \max_{j} Sim(G_i, A_j)}{k}$$

with

$$Sim(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$
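A minimal sketch of the cluster similarity, assuming clusters are represented as sets of item identifiers, that each true cluster is matched to its best-overlapping result cluster, and that the pairwise similarity is twice the intersection size divided by the sum of the two cluster sizes. The function names and the toy clusterings are hypothetical.

```python
def pair_sim(g, a):
    """Sim(Gi, Aj) = 2 * |Gi ∩ Aj| / (|Gi| + |Aj|)."""
    return 2.0 * len(g & a) / (len(g) + len(a))

def cluster_sim(G, A):
    """Average, over the true clusters, of the best match among the result clusters."""
    return sum(max(pair_sim(g, a) for a in A) for g in G) / len(G)

G = [{1, 2, 3}, {4, 5, 6}]   # true clusters
A = [{1, 2}, {3, 4, 5, 6}]   # clustering produced by some method
print(cluster_sim(G, G))  # 1.0
print(cluster_sim(G, A))
```

The measure is not symmetric in general: Sim(G, A) takes the maximum over the result clusters for each true cluster, so swapping the arguments can change the value.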

Page 27: A Multiresolution Symbolic Representation of Time Series

Experiments

Clustering experiments

SYNDATA

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.69
Single-level VQ   [0 1 0 0 0]     0.71
Single-level VQ   [0 0 1 0 0]     0.63
Single-level VQ   [0 0 0 1 0]     0.51
Single-level VQ   [0 0 0 0 1]     0.49
MVQ               [1 1 1 1 1]     0.82
DFT                               0.67
SAX                               0.65
DTW                               0.80
Euclidean                         0.55

RTT

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.55
Single-level VQ   [0 1 0 0 0]     0.52
Single-level VQ   [0 0 1 0 0]     0.57
Single-level VQ   [0 0 0 1 0]     0.80
Single-level VQ   [0 0 0 0 1]     0.79
MVQ               [0 0 0 1 1]     0.81
DFT                               0.54
SAX                               0.54
DTW                               0.62
Euclidean                         0.50

Page 28: A Multiresolution Symbolic Representation of Time Series

Experiments

Summarization (SYNDATA)

Typical series:

Page 29: A Multiresolution Symbolic Representation of Time Series

Experiments

[Figure panels: First Level, Second Level]

Page 30: A Multiresolution Symbolic Representation of Time Series

Outline

Background

Methodology

Experimental results

Conclusion

Page 31: A Multiresolution Symbolic Representation of Time Series

Conclusion

• A new symbolic representation of time series

• A more meaningful similarity metric

• Improved efficiency due to the dimensionality reduction

• Nice summarization of time series

• Utilizes multiple resolutions

• Uses prior knowledge (training process)

Page 32: A Multiresolution Symbolic Representation of Time Series