A Multiresolution Symbolic Representation of Time Series

Vasileios Megalooikonomou¹, Qiang Wang¹, Guo Li¹, Christos Faloutsos²

¹ Temple University, Philadelphia, USA
² Carnegie Mellon University, Pittsburgh, USA

Outline

Background

Methodology

Experimental results

Conclusion

Introduction

Time Sequence: A sequence (ordered collection) of real values: X = x1, x2, …, xn

Challenges:

• High dimensionality

• Large volume of data

• Similarity metric definition

……

Introduction

Goal: To achieve

• High efficiency

• High accuracy

in similarity searches among time series and in discovering interesting patterns

Introduction

Similarity metric for time series

• Euclidean Distance: most common, but sensitive to shifts

• Dynamic Time Warping (DTW): improves accuracy, but is time consuming, O(n²)

• Envelope-based DTW: improves time complexity to O(n)
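The contrast between the first two metrics can be illustrated with a minimal sketch. It assumes a squared pointwise cost and the classic unconstrained DTW recurrence; the function names `euclidean` and `dtw` are illustrative, and the envelope-based variant is not shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Classic dynamic time warping, O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])

x = [0, 0, 1, 2, 1, 0, 0]
y = [0, 1, 2, 1, 0, 0, 0]  # the same bump, shifted one step to the left
print(euclidean(x, y))  # 2.0: penalizes the shift
print(dtw(x, y))        # 0.0: the warping path absorbs the shift
```

The quadratic table in `dtw` is what makes the plain algorithm expensive on long series.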

Introduction

Similarity metric for time series

A more intuitive idea:

two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al. VLDB, 1995)

Introduction

Dimensionality reduction techniques:

• DFT: Discrete Fourier Transform

• DWT: Discrete Wavelet Transform

• SVD: Singular Value Decomposition

• APCA: Adaptive Piecewise Constant Approximation

• PAA: Piecewise Aggregate Approximation

• SAX: Symbolic Aggregate approXimation

• …
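As one concrete instance of these techniques, PAA replaces each of a fixed number of equal-width segments by its mean. The sketch below is only an illustration; it assumes the series length is divisible by the number of segments, and the name `paa` is hypothetical.

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        # Simplifying assumption for this sketch: equal-width segments only.
        raise ValueError("series length must be divisible by segments")
    width = n // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]

print(paa([1, 3, 5, 7, 2, 4], 3))  # [2.0, 6.0, 3.0]
```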

Introduction

Suggested Solution: Multiresolution Vector Quantized (MVQ) approximation

1) Uses a ‘vocabulary’ of subsequences

2) Takes multiple resolutions into account

3) Unlike wavelets, partially ignores the ordering of ‘codewords’

4) Exploits prior knowledge about the data

5) Provides a new distance metric


Methodology

A new framework (four steps):

• Create a ‘vocabulary’ of subsequences (codebook)

• Represent time series using codewords

• Utilize multiple resolutions

• Employ a new distance metric

Methodology

[Figure: codebook generation (codebook size s = 16), series transformation, and series encoding. The example encodings are shown as digit sequences (1 1 2 1 0 0 0 0 …) and as sequences of codeword labels a to p (c m d b c a i f …).]

Methodology

Creating a ‘vocabulary’

[Figure: frequently appearing patterns in subsequences]

Q: How to create?

A: Use Vector Quantization, in particular, the Generalized Lloyd Algorithm (GLA)

Produces a codebook based on two conditions:

• Nearest neighbor condition (NNC)

• Centroid condition (CC)

Output:

A codebook with s codewords
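The alternation between the two conditions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-overlapping training windows, squared Euclidean distortion, and random initialization from the training vectors. The names `extract_subsequences` and `generalized_lloyd` are hypothetical.

```python
import random

def extract_subsequences(series_list, length):
    """Cut each series into non-overlapping windows (an assumption of this sketch)."""
    windows = []
    for series in series_list:
        for start in range(0, len(series) - length + 1, length):
            windows.append(list(series[start:start + length]))
    return windows

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def generalized_lloyd(vectors, s, iterations=50, seed=0):
    """Return a codebook of s codewords trained on the given vectors."""
    rng = random.Random(seed)
    # Assumed initialization: s training vectors drawn at random.
    codebook = [list(v) for v in rng.sample(vectors, s)]
    for _ in range(iterations):
        # Nearest neighbor condition: assign each vector to its closest codeword.
        cells = [[] for _ in range(s)]
        for v in vectors:
            k = min(range(s), key=lambda j: squared_distance(codebook[j], v))
            cells[k].append(v)
        # Centroid condition: move each codeword to the mean of its cell.
        updated = []
        for k in range(s):
            if cells[k]:
                updated.append([sum(col) / len(cells[k]) for col in zip(*cells[k])])
            else:
                updated.append(codebook[k])  # keep a codeword whose cell is empty
        if updated == codebook:
            break  # no codeword moved, so both conditions hold
        codebook = updated
    return codebook

training = [[0, 0, 0, 0, 10, 10, 10, 10], [0, 0, 0, 0, 10, 10, 10, 10]]
codebook = generalized_lloyd(extract_subsequences(training, 4), s=2)
```

On this toy input the two codewords settle on the flat-low and flat-high patterns, whichever vectors the initialization draws.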

Methodology

Representing time series

X = x1, x2, …, xn

is encoded with a new representation

f = (f1, f2, …, fs)

(fi is the frequency of the i-th codeword in X)
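A minimal sketch of this encoding step is given below. It assumes the series is cut into non-overlapping windows of the codeword length and that each window is mapped to its nearest codeword under squared Euclidean distance; the name `encode` is hypothetical.

```python
def encode(series, codebook):
    """Return f = (f1, ..., fs): how often each codeword is the nearest match."""
    length = len(codebook[0])
    freq = [0] * len(codebook)
    # Assumption: non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, len(series) - length + 1, length):
        window = series[start:start + length]
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], window)),
        )
        freq[best] += 1
    return freq

codebook = [[0, 0], [1, 1], [5, 5]]
print(encode([0, 0, 5, 5, 1, 1, 5, 4], codebook))  # [1, 1, 2]
```

The result has one entry per codeword, so its length depends on the codebook size s rather than on the series length n.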

Methodology

New distance metric:

The histogram model is used to calculate similarity at each resolution level:

$$S_{HM}(q, t) = \frac{1}{1 + dis(q, t)}$$

with

$$dis(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$
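A minimal sketch of the histogram model follows. It assumes the distance is the sum over codewords of |f_{i,t} - f_{i,q}| / (1 + f_{i,t} + f_{i,q}) and that the similarity is 1 / (1 + dis(q, t)); the function names are hypothetical.

```python
def histogram_distance(fq, ft):
    """dis(q, t): normalized absolute difference of codeword frequencies."""
    if len(fq) != len(ft):
        raise ValueError("frequency vectors must share the same codebook size")
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(fq, ft))

def histogram_similarity(fq, ft):
    """S_HM(q, t) = 1 / (1 + dis(q, t)); equals 1.0 for identical histograms."""
    return 1.0 / (1.0 + histogram_distance(fq, ft))

print(histogram_similarity([2, 0, 1], [2, 0, 1]))  # 1.0
print(histogram_similarity([2, 0, 1], [0, 0, 1]))  # about 0.6
```

Under these assumptions the similarity lies in (0, 1], and the "1 +" in each denominator avoids division by zero when a codeword appears in neither series.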

Methodology

Time series summarization:

• High level information (frequently appearing patterns) is more useful

• The new representation can provide this kind of information

[Figure: both codewords (patterns) 3 and 5 show up 2 times]

Methodology

Problems of frequency-based encoding:

• It cannot record the location of a subsequence

• It is hard to define an appropriate resolution (codeword length)

• It may lose global information

Methodology

Utilizing multiple resolutions:

Solution: encoding with multiple resolutions

The resolution levels are complementary to each other

[Figure: reconstruction of time series using different resolutions]

Methodology

New distance metric:

For all resolution levels, a weighted similarity metric is defined as:

$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM}^{i}(q, d_j)$$
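The weighted combination over resolution levels can be sketched as below. This assumes one frequency vector per level for each series and the per-level histogram similarity 1 / (1 + dis); the function names are hypothetical, and the weights are used as given, without normalization.

```python
def histogram_similarity(fq, ft):
    """Per-level similarity: 1 / (1 + sum |a - b| / (1 + a + b))."""
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(fq, ft))
    return 1.0 / (1.0 + dis)

def multiresolution_similarity(query_levels, target_levels, weights):
    """Weighted sum over levels of the per-level histogram similarity."""
    if not (len(query_levels) == len(target_levels) == len(weights)):
        raise ValueError("need one histogram and one weight per resolution level")
    return sum(
        w * histogram_similarity(fq, ft)
        for w, fq, ft in zip(weights, query_levels, target_levels)
    )

q = [[3, 1], [2, 1, 0, 1]]   # toy histograms of one series at two levels
t = [[3, 1], [0, 1, 0, 1]]   # another series, same codebooks
print(multiresolution_similarity(q, t, [1, 0]))  # 1.0: first level only
print(multiresolution_similarity(q, t, [1, 1]))  # about 1.6: both levels
```

A weight vector with a single 1 reduces this to single-level VQ, while an all-ones vector uses every level equally.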

Methodology

Parameters of MVQ

Symbol   Meaning
X        Original time series, X = x1, x2, …, xn, of length n
X′       Encoded form of the original time series, X′ = f′1, f′2, …, f′s
N        Number of time series in the dataset
n        Length of original time series
C        Codebook: a set of codewords {c1, …, ck, …, cs}
c        Number of resolution levels
s        Size of codebook
l        Length of codeword

Methodology

Parameters of MVQ

• Number of resolution levels: $c = \log_2(n / l_{min}) + 1$, where $l_{min}$ is the minimal codeword length

• Length of codeword (on the i-th level): $l = n / 2^{i-1}$

• Size of codebook: data dependent. However, in practice, small codebooks can achieve very good results
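The two parameter formulas can be checked with a small sketch. It assumes a base-2 logarithm (consistent with the codeword length halving at each level) and that n and l_min are powers of two; the function names are hypothetical.

```python
import math

def resolution_levels(n, l_min):
    """c = log2(n / l_min) + 1, assuming n / l_min is a power of two."""
    return int(math.log2(n / l_min)) + 1

def codeword_length(n, level):
    """l = n / 2^(level - 1) for level = 1, ..., c."""
    return n // 2 ** (level - 1)

n, l_min = 64, 4
c = resolution_levels(n, l_min)
print(c)                                                 # 5
print([codeword_length(n, i) for i in range(1, c + 1)])  # [64, 32, 16, 8, 4]
```

With these illustrative values, the first level uses the whole series as one codeword and the last level uses codewords of length l_min.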


Experiments

Datasets

• SYNDATA (control chart data): synthetic

• CAMMOUSE: 3 * 5 sequences obtained using the Camera Mouse Program

• RTT: RTT measurements from UCR to CMU with a sending rate of 50 msec for a day

Experiments

Best Match Searching: For a given query, time series within the same class as the query (given our prior knowledge) form the standard set (std_set(q) ), and the results found by different approaches (knn(q) ) are compared to this set

The matching accuracy is defined as:

$$\text{Accuracy} = \frac{|std\_set(q) \cap knn(q)|}{k} \times 100\%$$
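A minimal sketch of this accuracy measure, assuming series are referred to by identifiers and that k is the number of neighbors retrieved; the name `matching_accuracy` is hypothetical.

```python
def matching_accuracy(std_set, knn, k):
    """Percentage of the k retrieved series that belong to the query's class."""
    return 100.0 * len(set(std_set) & set(knn)) / k

std_set = {1, 2, 3, 4}   # series in the same class as the query (toy IDs)
knn = [2, 4, 9, 7]       # the k = 4 series returned by some method
print(matching_accuracy(std_set, knn, k=4))  # 50.0
```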

Experiments

Best Match Searching

SYNDATA

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.55
Single-level VQ   [0 1 0 0 0]     0.70
Single-level VQ   [0 0 1 0 0]     0.65
Single-level VQ   [0 0 0 1 0]     0.48
Single-level VQ   [0 0 0 0 1]     0.46
MVQ               [1 1 1 1 1]     0.83
Euclidean                         0.51

CAMMOUSE

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.56
Single-level VQ   [0 1 0 0 0]     0.60
Single-level VQ   [0 0 1 0 0]     0.44
Single-level VQ   [0 0 0 1 0]     0.56
Single-level VQ   [0 0 0 0 1]     0.60
MVQ               [1 1 1 1 1]     0.83
Euclidean                         0.58

Experiments

Best Match Searching

[Figure: precision-recall for different methods (a) on the SYNDATA dataset and (b) on the CAMMOUSE dataset]

Experiments

Clustering experiments

Given two clusterings, G = G1, G2, …, Gk (the true clusters) and A = A1, A2, …, Ak (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:

$$Sim(G, A) = \frac{\sum_{i} \max_{j} Sim(G_i, A_j)}{k}$$

with

$$Sim(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$
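The cluster similarity can be sketched as follows, assuming each cluster is given as a set of series identifiers, that Sim(Gi, Aj) = 2|Gi ∩ Aj| / (|Gi| + |Aj|), and that Sim(G, A) averages the best match of each true cluster over the k true clusters. The function names are hypothetical.

```python
def pair_similarity(g, a):
    """Sim(Gi, Aj) = 2 |Gi ∩ Aj| / (|Gi| + |Aj|)."""
    g, a = set(g), set(a)
    return 2 * len(g & a) / (len(g) + len(a))

def cluster_similarity(true_clusters, found_clusters):
    """Sim(G, A): average over true clusters of their best-matching found cluster."""
    k = len(true_clusters)
    return sum(
        max(pair_similarity(g, a) for a in found_clusters)
        for g in true_clusters
    ) / k

G = [{1, 2, 3}, {4, 5, 6}]
A = [{1, 2}, {3, 4, 5, 6}]   # series 3 was placed in the wrong cluster
print(cluster_similarity(G, G))  # 1.0
print(cluster_similarity(G, A))  # about 0.83
```

Note that the measure as written here is not symmetric: swapping G and A can change the value.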

Experiments

Clustering experiments

SYNDATA

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.69
Single-level VQ   [0 1 0 0 0]     0.71
Single-level VQ   [0 0 1 0 0]     0.63
Single-level VQ   [0 0 0 1 0]     0.51
Single-level VQ   [0 0 0 0 1]     0.49
MVQ               [1 1 1 1 1]     0.82
DFT                               0.67
SAX                               0.65
DTW                               0.80
Euclidean                         0.55

RTT

Method            Weight Vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.55
Single-level VQ   [0 1 0 0 0]     0.52
Single-level VQ   [0 0 1 0 0]     0.57
Single-level VQ   [0 0 0 1 0]     0.80
Single-level VQ   [0 0 0 0 1]     0.79
MVQ               [0 0 0 1 1]     0.81
DFT                               0.54
SAX                               0.54
DTW                               0.62
Euclidean                         0.50

Experiments

Summarization (SYNDATA)

Typical series: [figure not recovered]

Experiments

[Figure panels: First Level, Second Level]


Conclusion

• A new symbolic representation of time series

• A more meaningful similarity metric

• Improved efficiency due to the dimensionality reduction

• Nice summarization of time series

• Utilizes multiple resolutions

• Uses prior knowledge (training process)
