A Multiresolution Symbolic Representation of Time Series


Post on 02-Jan-2016


DESCRIPTION

A Multiresolution Symbolic Representation of Time Series. Vasileios Megalooikonomou, Qiang Wang, Guo Li, Christos Faloutsos. Presented by Rui Li. Abstract: Introducing a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation.

TRANSCRIPT

Page 1: A Multiresolution Symbolic Representation of Time Series

A Multiresolution Symbolic Representation of Time Series

Vasileios Megalooikonomou
Qiang Wang
Guo Li
Christos Faloutsos

Presented by Rui Li

Page 2: A Multiresolution Symbolic Representation of Time Series

Abstract

Introducing a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation
– MVQ keeps both local and global information about the original time series in a hierarchical mechanism
– Processing the original time series at multiple resolutions

Page 3: A Multiresolution Symbolic Representation of Time Series

Abstract (cont.)

The representation of time series is symbolic, employing key subsequences, and potentially allows the application of text-based retrieval techniques to the similarity analysis of time series.

Page 4: A Multiresolution Symbolic Representation of Time Series

Introduction

Two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar.

Page 5: A Multiresolution Symbolic Representation of Time Series

Introduction (cont.)

Instead of calculating the Euclidean distance, first extract key subsequences utilizing the Vector Quantization (VQ) technique and encode each time series based on the frequency of appearance of each key subsequence.

Then calculate similarities in terms of key subsequence matches.

Page 6: A Multiresolution Symbolic Representation of Time Series

Introduction (cont.)

Hierarchical mechanism: the original time series are processed at several different resolutions, and similarity analysis is performed using a weighted distance function combining all the resolution levels

Page 7: A Multiresolution Symbolic Representation of Time Series

Background

Much of the previous work focuses on the avoidance of false dismissals. However, in some cases the existence of too many false alarms may decrease the efficiency of retrieval.

The Euclidean distance is not always the optimal distance measure.

Page 8: A Multiresolution Symbolic Representation of Time Series

Background (cont.)

For large datasets, the computational complexity associated with the Euclidean distance calculation is a problem ( O(N*n) ).

Euclidean distance (point-based model) is vulnerable to shape transformations such as shifting and scaling.
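The vulnerability to shifting can be illustrated with a small sketch. The three toy series below are hypothetical and not taken from the presentation; they only show that a point-by-point distance can rank a shifted copy of a shape as less similar than a series with no shape at all.

```python
import math

def euclidean(a, b):
    """Point-by-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: a bump, the same bump shifted by two samples,
# and a flat line that shares no shape with either.
bump = [0, 0, 1, 2, 1, 0, 0, 0]
shifted_bump = [0, 0, 0, 0, 1, 2, 1, 0]
flat = [0, 0, 0, 0, 0, 0, 0, 0]

d_shifted = euclidean(bump, shifted_bump)
d_flat = euclidean(bump, flat)

# The shifted copy ends up farther from the original than the flat line does.
print(d_shifted > d_flat)
```

Under these assumptions the shifted bump is judged less similar than the flat line, which is the kind of behaviour that motivates comparing key subsequences instead of individual points.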

Page 9: A Multiresolution Symbolic Representation of Time Series

Background (cont.)

A new framework that utilizes high-level features is proposed
– Codebook generation
– Time series encoding
– Time series representation and retrieval

In order to keep both local and global information, use multiple codebooks with different resolutions

Page 10: A Multiresolution Symbolic Representation of Time Series

Background (cont.)

For each resolution, VQ is applied to discover the vocabulary of subsequences (codewords)
– In VQ, a codeword is used to represent a number of similar vectors.

The Generalized Lloyd Algorithm is used to produce a "locally optimal" codebook from a training set.
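A minimal sketch of the Generalized Lloyd Algorithm follows, to make the two alternating steps concrete. It is a simplification under stated assumptions: the initial codebook is passed in by the caller, squared Euclidean distance is the distortion measure, and iteration stops when the codebook no longer changes. The function names and the toy training set are illustrative, not from the presentation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, vector):
    """Index of the codeword closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(codebook[i], vector))

def generalized_lloyd(training_set, initial_codebook, max_iterations=50):
    codebook = [list(c) for c in initial_codebook]
    for _ in range(max_iterations):
        # Nearest-neighbour step: assign every training vector to a codeword.
        cells = [[] for _ in codebook]
        for vector in training_set:
            cells[nearest(codebook, vector)].append(vector)
        # Centroid step: move each codeword to the mean of its cell.
        updated = []
        for codeword, cell in zip(codebook, cells):
            if cell:
                updated.append([sum(col) / len(cell) for col in zip(*cell)])
            else:
                updated.append(codeword)  # assumption: keep an empty cell's codeword
        if updated == codebook:  # assumption: stop once nothing moves
            break
        codebook = updated
    return codebook

# Hypothetical training set with two obvious groups of 2-point segments.
training_set = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
codebook = generalized_lloyd(training_set,
                             initial_codebook=[[0.0, 0.0], [1.0, 1.0]])
print(codebook)
```

The result depends on the initial codebook, which is why the codebook is described as only locally optimal.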

Page 11: A Multiresolution Symbolic Representation of Time Series

Background (cont.)

To quantitatively measure the similarity between different time series encoded with a VQ codebook, the Histogram Model is employed.

$$S_{HM}(q, t) = \frac{1}{1 + dis(q, t)}$$

– where

$$dis(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$

– and $f_{i,t}$ and $f_{i,q}$ refer to the appearance frequency of codeword $c_i$ in time series t and q, respectively.
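The Histogram Model can be sketched in a few lines. The sketch assumes the similarity is 1 / (1 + dis(q, t)), with dis(q, t) summing, over all s codewords, the absolute frequency difference divided by one plus the two frequencies; the function names and the example frequency vectors are hypothetical.

```python
def histogram_distance(f_t, f_q):
    """dis(q, t): normalised absolute difference of codeword frequencies."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f_t, f_q))

def histogram_similarity(f_t, f_q):
    """S_HM(q, t) = 1 / (1 + dis(q, t)); equals 1.0 for identical histograms."""
    return 1.0 / (1.0 + histogram_distance(f_t, f_q))

# Hypothetical frequency vectors over a codebook of s = 3 codewords.
f_t = [3, 0, 1]
f_q = [1, 2, 1]

print(histogram_distance(f_t, f_q))
print(histogram_similarity(f_t, f_q))
```

Because each term is divided by 1 plus the two frequencies, a difference on a frequent codeword weighs less than the same difference on a rare one, and the similarity always falls in (0, 1].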

Page 12: A Multiresolution Symbolic Representation of Time Series

Proposed Method

MVQ approximation
– Partitions each time series into equi-length segments and represents each segment with the most similar key subsequence from a codebook.
– Represent each time series as the appearance frequency of each codeword in it.
– Apply at several resolutions

Page 13: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Codebook Generation
– The dataset is preprocessed

Each time series is partitioned into a number of segments each of length l, and each segment forms a sample of the training set that is used to generate the codebook.

– Each codeword corresponds to a key subsequence
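The partitioning step that builds the training set can be sketched as follows. Two points are assumptions rather than statements from the slides: segments are taken as consecutive and non-overlapping, and a trailing remainder shorter than l is dropped. The helper names and the toy dataset are illustrative.

```python
def segments(series, length):
    """Split `series` into consecutive, non-overlapping segments of `length` points.

    Assumption: a trailing remainder shorter than `length` is discarded.
    """
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, length)]

def build_training_set(dataset, length):
    """Pool the segments of every time series into one list of training samples."""
    return [seg for series in dataset for seg in segments(series, length)]

# Hypothetical dataset of two short series, segment length l = 3.
dataset = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11]]
samples = build_training_set(dataset, length=3)
print(samples)
```

Each element of `samples` is one training vector for codebook generation, so a different segment length l yields a different training set and hence a different codebook.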

Page 14: A Multiresolution Symbolic Representation of Time Series

Example 1
– Codewords of a 2-level codebook

Page 15: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Time Series Encoding
– Every time series is decomposed into segments of length l.
– For each segment, the closest codeword in the codebook is found and the corresponding index is used to represent this segment.
– The appearance frequency of each codeword is counted.

Page 16: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Time Series Encoding (cont.)
– The representation of a time series is a vector showing the appearance frequency of every codeword.

$$X = (f_1, f_2, \ldots, f_s)$$
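The encoding steps can be sketched as below: decompose the series into segments of length l, replace each segment by the index of its closest codeword, then count how often each index appears. The codebook and series are hypothetical, squared Euclidean distance is assumed as the closeness measure, and non-overlapping segments are assumed.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(series, codebook, length):
    """Return (codeword indices per segment, frequency vector X)."""
    indices = []
    for start in range(0, len(series) - length + 1, length):
        segment = series[start:start + length]
        # Closest codeword under squared Euclidean distance (an assumption).
        indices.append(min(range(len(codebook)),
                           key=lambda i: squared_distance(codebook[i], segment)))
    frequencies = [0] * len(codebook)
    for index in indices:
        frequencies[index] += 1
    return indices, frequencies

# Hypothetical codebook of s = 3 codewords, each of length l = 2.
codebook = [[0, 0], [1, 1], [0, 1]]
series = [0, 0, 1, 1, 0, 1, 1, 1]

indices, X = encode(series, codebook, length=2)
print(indices)  # symbolic form of the series
print(X)        # frequency vector (f_1, ..., f_s)
```

The index sequence is the symbolic form of the series, and the frequency vector is what the histogram-based similarity compares.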

Page 17: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Time Series Summarization
– The codewords stand for the most representative subsequences for the entire dataset.
– We can just check the appearance frequencies of the codewords and get an overview of the time series.

Page 18: A Multiresolution Symbolic Representation of Time Series

Example 2

Page 19: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Distance Measure and Multiresolution Representation
– Using only one codebook (single resolution) introduces problems

The order among the indices of codewords is not kept; some important global information is lost

The number of false alarms increases

Page 20: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Distance Measure and Multiresolution Representation (cont.)
– A hierarchical mechanism is introduced.

Several different resolutions are involved.

higher resolution → local information

lower resolution → global information
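A small sketch can show why a second, lower resolution helps. The two series below are built from the same short segments in a different order, so their histograms at the higher resolution are identical, while the histograms over longer segments differ. Both codebooks and both series are hypothetical toy data, and squared Euclidean distance is assumed for matching.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def histogram(series, codebook):
    """Codeword frequencies, with the segment length taken from the codebook."""
    length = len(codebook[0])
    counts = [0] * len(codebook)
    for start in range(0, len(series) - length + 1, length):
        segment = series[start:start + length]
        best = min(range(len(codebook)),
                   key=lambda i: squared_distance(codebook[i], segment))
        counts[best] += 1
    return counts

# Hypothetical codebooks: short segments (local) and long segments (global).
fine_codebook = [[0, 0], [1, 1]]
coarse_codebook = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]

series_a = [0, 0, 1, 1, 1, 1, 0, 0]
series_b = [0, 0, 0, 0, 1, 1, 1, 1]

fine_a, fine_b = histogram(series_a, fine_codebook), histogram(series_b, fine_codebook)
coarse_a, coarse_b = histogram(series_a, coarse_codebook), histogram(series_b, coarse_codebook)

print(fine_a == fine_b)      # ordering is invisible at the higher resolution
print(coarse_a == coarse_b)  # the lower resolution tells the two series apart
```

In this toy case a single-resolution comparison would report a perfect match, which is the false-alarm problem that the hierarchical mechanism is meant to reduce.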

Page 21: A Multiresolution Symbolic Representation of Time Series

Example 3
– Reconstruction of time series using different resolutions

Page 22: A Multiresolution Symbolic Representation of Time Series

Proposed Method (cont.)

Distance Measure and Multiresolution Representation (cont.)
– By assigning different weights to different resolutions, a weighted similarity measure (Hierarchical Histogram Model) is defined:

$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} W_i \, S_{HM,i}(q, d_j)$$
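The weighted combination can be sketched as follows, reading c as the number of resolution levels, W_i as the weight of level i, and the per-level similarity as the Histogram Model score 1 / (1 + dis) computed from that level's histograms. The weights, the histograms and the function names are hypothetical, and the choice of weights that sum to 1 is an assumption made only so the score stays in (0, 1].

```python
def histogram_similarity(f_q, f_d):
    """Per-level Histogram Model similarity, assumed to be 1 / (1 + dis)."""
    distance = sum(abs(a - b) / (1 + a + b) for a, b in zip(f_q, f_d))
    return 1.0 / (1.0 + distance)

def hierarchical_similarity(histograms_q, histograms_d, weights):
    """Weighted sum of per-level similarities over all resolution levels."""
    return sum(w * histogram_similarity(f_q, f_d)
               for w, f_q, f_d in zip(weights, histograms_q, histograms_d))

# Hypothetical histograms for a query q and a series d_j at c = 2 levels:
# level 1 (lower resolution) disagrees, level 2 (higher resolution) matches.
histograms_q = [[1, 1, 0, 0], [2, 2]]
histograms_d = [[0, 0, 1, 1], [2, 2]]
weights = [0.5, 0.5]  # assumption: equal weights summing to 1

score = hierarchical_similarity(histograms_q, histograms_d, weights)
print(score)
```

With these toy values the perfect match at the higher resolution is pulled down by the mismatch at the lower resolution, so the combined score is well below 1.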

Page 23: A Multiresolution Symbolic Representation of Time Series

Experiments

Best Matches Retrieval
– SYNDATA

6 classes; 100 time series for each class; 60 points for each time series

Page 24: A Multiresolution Symbolic Representation of Time Series

Experiments (cont.)

Best Matches Retrieval (cont.)
– CAMMOUSE

1600 points for each time series

Page 25: A Multiresolution Symbolic Representation of Time Series

Experiments (cont.)

Best Matches Retrieval (cont.)
– Comparisons with other methods

Page 26: A Multiresolution Symbolic Representation of Time Series

Experiments (cont.)

Clustering
– SYNDATA

Page 27: A Multiresolution Symbolic Representation of Time Series

Experiments (cont.)

Clustering (cont.)
– CAMMOUSE

Page 28: A Multiresolution Symbolic Representation of Time Series

Thank you!