A Multiresolution Symbolic Representation of Time Series
Vasileios Megalooikonomou (1), Qiang Wang (1), Guo Li (1), Christos Faloutsos (2)
(1) Temple University, Philadelphia, USA
(2) Carnegie Mellon University, Pittsburgh, USA
Outline
Background
Methodology
Experimental results
Conclusion
Introduction
Time Sequence: A sequence (ordered collection) of real values: X = x1, x2, …, xn
Challenges:
• High dimensionality
• High amount of data
• Similarity metric definition
……
Introduction
Goal: To achieve:
• High efficiency
• High accuracy
in similarity searches among time series and
in discovering interesting patterns
Introduction
Similarity metric for time series
• Euclidean Distance: most common, sensitive to shifts
• Dynamic Time Warping (DTW): improves accuracy, but time consuming, O(n^2)
• Envelope-based DTW: improves time complexity, O(n)
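The cost contrast between the first two measures can be sketched in Python. The function names are ours, and the DTW shown is the textbook unconstrained dynamic program with an absolute-difference local cost, which is an assumption since the slides do not specify a variant.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length series; O(n) time."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Unconstrained DTW with |a - b| local cost; O(len(x) * len(y)) time."""
    inf = float("inf")
    # prev[j] holds the best warping cost for the previous row of the table.
    prev = [0.0] + [inf] * len(y)
    for a in x:
        curr = [inf] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            curr[j] = abs(a - b) + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(y)]

# A pattern and a copy shifted by one step.
x = [0, 0, 1, 2, 1, 0]
y = [0, 1, 2, 1, 0, 0]
print(euclidean(x, y), dtw(x, y))
```

On this pair the Euclidean distance penalizes the shift, while DTW warps it away at the price of filling a quadratic table.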
Introduction
Similarity metric for time series
A more intuitive idea:
two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al. VLDB, 1995)
Introduction
Dimensionality reduction techniques:
• DFT: Discrete Fourier Transform
• DWT: Discrete Wavelet Transform
• SVD: Singular Value Decomposition
• APCA: Adaptive Piecewise Constant Approximation
• PAA: Piecewise Aggregate Approximation
• SAX: Symbolic Aggregate approXimation
• …
Introduction
Suggested Solution: Multiresolution Vector Quantized (MVQ) approximation
1) Uses a ‘vocabulary’ of subsequences
2) Takes multiple resolutions into account
3) Unlike wavelets, partially ignores the ordering of ‘codewords’
4) Exploits prior knowledge about the data
5) Provides a new distance metric
Methodology
A new framework (four steps):
• Create a ‘vocabulary’ of subsequences (codebook)
• Represent time series using codewords
• Utilize multiple resolutions
• Employ a new distance metric
Methodology
[Figure: MVQ pipeline. Codebook generation (s = 16), series transformation, and series encoding, with example encodings shown as codeword frequency vectors and as codeword strings]
Methodology
Creating a ‘vocabulary’
[Figure: Frequently appearing patterns in subsequences]
Q: How to create?
A: Use Vector Quantization, in particular, the Generalized Lloyd Algorithm (GLA)
Produces a codebook based on two conditions:
• Nearest neighbor condition (NNC)
• Centroid condition (CC)
Output:
A codebook with s codewords
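The alternation between the two conditions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the random initialization from training vectors, the squared Euclidean distortion, and the non-overlapping segmentation in `subsequences` are all assumptions.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(codebook, v):
    """Index of the codeword closest to v."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], v))

def subsequences(series, l):
    """Non-overlapping windows of length l (assumed segmentation)."""
    return [series[i:i + l] for i in range(0, len(series) - l + 1, l)]

def gla(vectors, s, iterations=20, seed=0):
    """Generalized Lloyd Algorithm sketch: returns a codebook of s codewords."""
    rng = random.Random(seed)
    # Assumed initialization: s distinct training vectors chosen at random.
    codebook = [list(v) for v in rng.sample(vectors, s)]
    for _ in range(iterations):
        # NNC: assign every training vector to its nearest codeword.
        cells = [[] for _ in range(s)]
        for v in vectors:
            cells[nearest(codebook, v)].append(v)
        # CC: move each codeword to the centroid of its cell.
        updated = []
        for old, cell in zip(codebook, cells):
            if cell:
                updated.append([sum(v[d] for v in cell) / len(cell)
                                for d in range(len(old))])
            else:
                updated.append(old)  # an empty cell keeps its old codeword
        if updated == codebook:
            break  # both conditions hold; the codebook is stable
        codebook = updated
    return codebook

training = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(sorted(gla(training, s=2)))
```

In practice the training vectors would come from `subsequences` applied to every series in the training set, once per resolution level.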
Methodology
Representing time series
X = x1, x2, …, xn
is encoded with a new representation
f = (f1, f2, …, fs)
(fi is the frequency of the i-th codeword in X)
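A small sketch of the encoding step, assuming the series is cut into non-overlapping windows of the codeword length and each window is mapped to its nearest codeword under squared Euclidean distance. The function names and the toy codebook are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode(series, codebook):
    """Return f = (f1, ..., fs): how often each codeword is the nearest match."""
    l = len(codebook[0])          # codeword length at this resolution
    f = [0] * len(codebook)
    for start in range(0, len(series) - l + 1, l):
        window = series[start:start + l]
        best = min(range(len(codebook)),
                   key=lambda i: sq_dist(codebook[i], window))
        f[best] += 1
    return f

# Toy codebook with s = 3 codewords of length 2 (illustrative values only).
codebook = [[0, 0], [1, 1], [0, 1]]
print(encode([0, 0, 1, 1, 0, 1, 1, 1], codebook))
```

The result has s entries regardless of the series length n, which is where the dimensionality reduction comes from.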
Methodology
New distance metric:
The histogram model is used to calculate similarity at each resolution level:
$$S_{HM}(q,t) = \frac{1}{1 + dis(q,t)}$$
with
$$dis(q,t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$
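The histogram model is short enough to write out directly. The sketch assumes the form S_HM(q,t) = 1 / (1 + dis(q,t)) with dis(q,t) = sum over i of |f_i,t - f_i,q| / (1 + f_i,t + f_i,q); the function names are ours.

```python
def dis(fq, ft):
    """Normalized histogram distance between two codeword-frequency vectors."""
    if len(fq) != len(ft):
        raise ValueError("histograms must use the same codebook size")
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(fq, ft))

def s_hm(fq, ft):
    """Histogram-model similarity in (0, 1]; 1 means identical histograms."""
    return 1.0 / (1.0 + dis(fq, ft))

print(s_hm([1, 2, 1], [1, 2, 1]))  # identical histograms
print(s_hm([2, 0], [0, 2]))        # no codewords in common
```

The 1 in each denominator keeps a term defined when a codeword appears in neither series.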
Methodology
Time series summarization:
• High-level information (frequently appearing patterns) is more useful
• The new representation can provide this kind of information
[Figure: Both codeword (pattern) 3 & 5 show up 2 times]
Methodology
Problems of frequency-based encoding:
• It cannot record the location of a subsequence
• It is hard to define an appropriate resolution (codeword length)
• It may lose global information
Methodology
Utilizing multiple resolutions:
Solution: encoding with multiple resolutions
The resolution levels complement each other
[Figure: Reconstruction of time series using different resolutions]
Methodology
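New distance metric:
For all resolution levels a weighted similarity metric is defined as:
$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM,i}(q, d_j)$$
where w_i is the weight of resolution level i and S_{HM,i} is the histogram-model similarity at that level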
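A sketch of the weighted combination, assuming each series is stored as a list of per-level histograms and that the per-level similarity is the histogram model S_HM = 1 / (1 + dis). The names `s_hm` and `s_hhm` are ours, and the weights are left unnormalized, as in weight vectors such as [1 1 1 1 1].

```python
def s_hm(fq, ft):
    """Histogram-model similarity at a single resolution level."""
    d = sum(abs(a - b) / (1 + a + b) for a, b in zip(fq, ft))
    return 1.0 / (1.0 + d)

def s_hhm(levels_q, levels_d, weights):
    """Weighted sum of per-level similarities over all c resolution levels."""
    if not (len(levels_q) == len(levels_d) == len(weights)):
        raise ValueError("need one histogram pair and one weight per level")
    return sum(w * s_hm(fq, fd)
               for w, fq, fd in zip(weights, levels_q, levels_d))

# Two levels: identical at the coarse level, disjoint at the fine level.
q = [[1, 0], [2, 0, 0, 0]]
d = [[1, 0], [0, 2, 0, 0]]
print(s_hhm(q, d, [1, 1]))  # both levels contribute
print(s_hhm(q, d, [1, 0]))  # single-level VQ: only the coarse level counts
```

A weight vector with a single 1 reduces the measure to single-level VQ, which is how the single-level rows in the experiments can be read.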
Methodology
Parameters of MVQ
Symbol   Meaning
X        Original time series, X = x1, x2, …, xn, of length n
X′       Encoded form of the original time series, X′ = f′1, f′2, …, f′s
N        Number of time series in the dataset
n        Length of original time series
C        Codebook: a set of codewords {c1, …, ck, …, cs}
c        Number of resolution levels
s        Size of codebook
l        Length of codeword
Methodology
Parameters of MVQ
• Number of resolution levels
c = log2(n / lmin) + 1, where lmin is the minimal codeword length
• Length of codeword (on the i-th level)
l = n / 2^(i-1)
• Size of codebook
Data dependent. However, in practice, small codebooks can achieve very good results
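The two formulas can be checked with a short calculation. The sketch assumes a base-2 logarithm (which follows from halving the codeword length at every level) and that n and lmin are powers of two; the function names are ours.

```python
import math

def num_levels(n, l_min):
    """c = log2(n / l_min) + 1, floored for lengths that are not powers of two."""
    if l_min < 1 or l_min > n:
        raise ValueError("need 1 <= l_min <= n")
    return math.floor(math.log2(n / l_min)) + 1

def codeword_length(n, i):
    """Codeword length on level i (1-based): l = n / 2^(i-1)."""
    return n // 2 ** (i - 1)

n, l_min = 64, 4
c = num_levels(n, l_min)
print(c, [codeword_length(n, i) for i in range(1, c + 1)])
```

With n = 64 and lmin = 4 this gives five levels, with the first level covering the whole series and the last using codewords of length lmin.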
Experiments
Datasets
• SYNDATA (control chart data): synthetic
• CAMMOUSE: 3 * 5 sequences obtained using the Camera Mouse Program
• RTT: RTT measurements from UCR to CMU with sending rate of 50 msec for a day
Experiments
Best Match Searching: For a given query q, the time series in the same class as the query (given our prior knowledge) form the standard set std_set(q), and the results found by different approaches, knn(q), are compared to this set.
The matching accuracy is defined as:
$$Accuracy = \frac{|std\_set(q) \cap knn(q)|}{k} \times 100\%$$
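As a worked example of the accuracy measure, assuming it is the share of the k returned neighbors that fall in the standard set, expressed as a percentage. The function name and the series identifiers are hypothetical.

```python
def matching_accuracy(std_set, knn):
    """Percentage of the k retrieved series that belong to the query's class."""
    k = len(knn)
    if k == 0:
        raise ValueError("knn must contain at least one result")
    return 100.0 * len(set(std_set) & set(knn)) / k

# Series are identified by integer ids; 2 of the 4 neighbors are in the class.
print(matching_accuracy(std_set={1, 2, 3, 4}, knn=[2, 4, 9, 7]))
```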
Experiments
Best Match Searching
SYNDATA
Method            Weight Vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.55
Single level VQ   [0 1 0 0 0]     0.70
Single level VQ   [0 0 1 0 0]     0.65
Single level VQ   [0 0 0 1 0]     0.48
Single level VQ   [0 0 0 0 1]     0.46
MVQ               [1 1 1 1 1]     0.83
Euclidean                         0.51

CAMMOUSE
Method            Weight Vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.56
Single level VQ   [0 1 0 0 0]     0.60
Single level VQ   [0 0 1 0 0]     0.44
Single level VQ   [0 0 0 1 0]     0.56
Single level VQ   [0 0 0 0 1]     0.60
MVQ               [1 1 1 1 1]     0.83
Euclidean                         0.58
Experiments
Best Match Searching
[Figure: Precision-recall for different methods (a) on SYNDATA dataset (b) on CAMMOUSE dataset]
Experiments
Clustering experiments
Given two clusterings, G = G1, G2, …, Gk (the true clusters) and A = A1, A2, …, Ak (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:
$$Sim(G, A) = \frac{\sum_{i} \max_{j} Sim(G_i, A_j)}{k}$$
with
$$Sim(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$
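The cluster similarity can be computed directly on sets of series identifiers. The sketch assumes Sim(Gi, Aj) = 2 |Gi ∩ Aj| / (|Gi| + |Aj|) and that k is the number of true clusters; the function names are ours.

```python
def pair_sim(g, a):
    """Sim(Gi, Aj): overlap between one true cluster and one found cluster."""
    return 2 * len(g & a) / (len(g) + len(a))

def cluster_sim(G, A):
    """Sim(G, A): average, over true clusters, of the best-matching found cluster."""
    return sum(max(pair_sim(g, a) for a in A) for g in G) / len(G)

G = [{1, 2}, {3, 4}]        # true clusters
A = [{1, 2, 3}, {4}]        # clusters produced by some method
print(cluster_sim(G, G))    # a perfect clustering
print(cluster_sim(G, A))    # one series assigned to the wrong cluster
```

Note that the measure is not symmetric: swapping G and A changes which side the maximum is taken over.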
Experiments
Clustering experiments

SYNDATA
Method            Weight Vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.69
Single level VQ   [0 1 0 0 0]     0.71
Single level VQ   [0 0 1 0 0]     0.63
Single level VQ   [0 0 0 1 0]     0.51
Single level VQ   [0 0 0 0 1]     0.49
MVQ               [1 1 1 1 1]     0.82
DFT                               0.67
SAX                               0.65
DTW                               0.80
Euclidean                         0.55

RTT
Method            Weight Vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.55
Single level VQ   [0 1 0 0 0]     0.52
Single level VQ   [0 0 1 0 0]     0.57
Single level VQ   [0 0 0 1 0]     0.80
Single level VQ   [0 0 0 0 1]     0.79
MVQ               [0 0 0 1 1]     0.81
DFT                               0.54
SAX                               0.54
DTW                               0.62
Euclidean                         0.50
Experiments
Summarization (SYNDATA)
[Figure: Typical series]
[Figure: First level and second level]
Conclusion
• A new symbolic representation of time series
• A more meaningful similarity metric
• Improved efficiency due to the dimensionality reduction
• Nice summarization of time series
• Utilizes multiple resolutions
• Uses prior knowledge (training process)