


Zoom-SVD: Fast and Memory Efficient Method for Extracting Key Patterns in an Arbitrary Time Range

Jun-Gi Jang
Seoul National University
[email protected]

Dongjin Choi
Seoul National University
[email protected]

Jinhong Jung
Seoul National University
[email protected]

U Kang
Seoul National University
[email protected]

ABSTRACT

Given multiple time series data, how can we efficiently find latent patterns in an arbitrary time range? Singular value decomposition (SVD) is a crucial tool to discover hidden factors in multiple time series data, and has been used in many data mining applications including dimensionality reduction, principal component analysis, recommender systems, etc. Along with its static version, incremental SVD has been used to deal with multiple semi-infinite time series data and to identify patterns in the data. However, existing SVD methods for multiple time series data analysis do not provide functionality for detecting patterns in an arbitrary time range: standard SVD requires data for all intervals corresponding to a time range query, and incremental SVD does not consider an arbitrary time range.

In this paper, we propose Zoom-SVD, a fast and memory efficient method for finding latent factors of time series data in an arbitrary time range. Zoom-SVD incrementally compresses multiple time series data block by block to reduce the space cost in the storage phase, and efficiently computes singular value decomposition (SVD) for a given time range query in the query phase by carefully stitching stored SVD results. Through extensive experiments, we demonstrate that Zoom-SVD is up to 15× faster, and requires 15× less space, than existing methods. Our case study shows that Zoom-SVD is useful for capturing past time ranges whose patterns are similar to those of a query time range.

KEYWORDS

Time range query; singular value decomposition; multiple time series data

ACM Reference Format:
Jun-Gi Jang, Dongjin Choi, Jinhong Jung, and U Kang. 2018. Zoom-SVD: Fast and Memory Efficient Method for Extracting Key Patterns in an Arbitrary Time Range. In 2018 ACM Conference on Information and Knowledge Management (CIKM'18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 11 pages. https://doi.org/10.475/123_4

1 INTRODUCTION

Given multiple time series data (e.g., measurements from multiple sensors) and a time range (e.g., 1:00 am - 3:00 am yesterday), how can we efficiently discover latent factors of the time series in the range? Revealing hidden factors in time series is important for analyzing the patterns and tendencies encoded in the time series data. Singular value decomposition (SVD) effectively finds hidden factors in data, and has been extensively utilized in many data mining applications such as dimensionality reduction [27], principal component analysis (PCA) [12, 39], data clustering [22, 34], tensor analysis [9–11, 21, 26, 31], graph mining [13–15, 36], and recommender systems [17, ?]. SVD has also been successfully applied to stream mining tasks [35, 39] in order to analyze time series data. However, methods based on standard SVD [2, 6, 30, 40] are not suitable for finding latent factors in an arbitrary time range, since these methods have an expensive computational cost and have to store all the raw data. This limitation makes it difficult to investigate the patterns of a time range in a stream environment, even though analyzing a specific past event or finding recurring patterns in time series is important [24]. A naive approach for a time range query on time series is to store all of the arrived data and apply SVD to the data, but this approach is inefficient since it requires huge storage space, and the computational cost of SVD for a long time range query is expensive.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM '18, October 22–26, 2018, Torino, Italy
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6014-2/18/10...$15.00
https://doi.org/10.475/123_4

In this paper, we propose Zoom-SVD (Zoomable SVD), an efficient method for revealing hidden factors of multiple time series in an arbitrary time range. With Zoom-SVD, users can zoom in to find patterns in a specific time range of interest, or zoom out to extract patterns in a wider time range. Zoom-SVD comprises two phases: the storage phase and the query phase. Zoom-SVD considers multiple time series as a set of blocks of a fixed length. In the storage phase, Zoom-SVD carefully compresses each block using SVD and low-rank approximation to reduce the storage cost, and incrementally updates the most recent block with newly arrived data. In the query phase, Zoom-SVD efficiently computes the SVD results in a given time range based on the compressed blocks. Through extensive experiments with real-world multiple time series data, we demonstrate the effectiveness and the efficiency of Zoom-SVD compared to other methods, as shown in Figure 1. The main contributions of this paper are summarized as follows:

• Algorithm. We propose Zoom-SVD, an efficient method for extracting key patterns from multiple time series data in an arbitrary time range.
• Analysis. We theoretically analyze the time and space complexities of our proposed method Zoom-SVD.
• Experiment. We present experimental results showing that Zoom-SVD computes time range queries up to 15× faster, and requires up to 15× less space, than other methods. We also confirm that Zoom-SVD provides the best trade-off between efficiency and accuracy.

The code and datasets for this paper are available at http://datalab.snu.ac.kr/zoomsvd. In the rest of this paper, we describe the


[Figure 1: bar charts of query time (msec) versus query range for Zoom-SVD, Randomized SVD, Tall-Skinny SVD, and Basic SVD on the Activity, Gas, and London datasets.]

Figure 1: Query time of Zoom-SVD compared to other SVD methods. The starting point ts and the ending point te are arbitrarily chosen, and we increase the time range te − ts from 10⁴ to 3.2 × 10⁵. (a) The query time of Zoom-SVD is up to 9.6× faster than that of the second best method on the Activity dataset. (b) Zoom-SVD is up to 15× faster than the second best method in the query phase of the Gas dataset. (c) Zoom-SVD is up to 12× faster than the second best method in the query phase of the London dataset.

Table 1: Symbol description.

Symbol       Description
b            Initial block size
ξ            Threshold for low-rank approximation
k            Number of singular values
k(i)         Number of singular values in the i-th block
[X; Y]       Vertical concatenation of two matrices X and Y
A            Raw multiple time series data
A(i)         i-th block of A
U(i)         Left singular vector matrix of A(i)
Σ(i)         Singular value matrix of A(i)
V(i)         Right singular vector matrix of A(i)
U(S:E)       Left singular vector matrix computed in the query phase
Σ(S:E)       Singular value matrix computed in the query phase
V(S:E)       Right singular vector matrix computed in the query phase
U            Set of left singular vector matrices U(i)
S            Set of singular value matrices Σ(i)
V            Set of right singular vector matrices V(i)
[ts, te]     Time range query
ts           Starting point of the time range query
te           Ending point of the time range query
S            Index of the block matrix containing ts
E            Index of the block matrix containing te

preliminaries and formally define the problem in Section 2, propose our method Zoom-SVD in Section 3, present experimental results in Section 4, present the case study in Section 5, discuss related work in Section 6, and conclude in Section 7.

2 PRELIMINARIES

We describe preliminaries on singular value decomposition (SVD) and incremental SVD (Sections 2.1 and 2.2). We then define the problem handled in this paper (Section 2.3). Table 1 lists the symbols used in this paper.

2.1 Singular Value Decomposition (SVD)

SVD is a decomposition method for finding latent factors in a matrix A ∈ R^{t×c}. Suppose the rank of the matrix A is r. Then, the SVD of A is represented as A = UΣV^T, where Σ is an r×r diagonal matrix whose diagonal entries are singular values. The i-th singular value σ_i is located at Σ_{i,i}, where σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_r ≥ 0. U ∈ R^{t×r} is called the left singular vector matrix (or a set of left singular vectors) of A; U = [u_1 ⋯ u_r] is a column orthogonal matrix where u_1, ⋯, u_r are the eigenvectors of AA^T. V ∈ R^{c×r} is the right singular vector matrix of A; V = [v_1 ⋯ v_r] is a column orthogonal matrix where v_1, ⋯, v_r are the eigenvectors of A^T A. Note that the singular vectors in U and V are used as hidden factors to analyze the data matrix A.
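As a quick numerical illustration of these properties (our own example, not from the paper), using NumPy:

```python
import numpy as np

# A hypothetical data matrix A in R^{t x c} with t = 4 and c = 3.
A = np.arange(12, dtype=float).reshape(4, 3)

# SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.all(s[:-1] >= s[1:])

# U and V are column orthogonal, and the columns of V are
# eigenvectors of A^T A (with eigenvalues sigma_i^2).
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(A.T @ A @ Vt.T, Vt.T @ np.diag(s ** 2))
```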

Low-rank approximation. Low-rank approximation effectively approximates the original data matrix based on SVD. The key idea of low-rank approximation is to keep the top-k highest singular values and the corresponding singular vectors, where k is a number smaller than the rank r of the original matrix. The low-rank approximation of A is represented as follows:

    A ≃ U_k Σ_k V_k^T ≡ Ã,  s.t. k < r

where the reconstruction Ã is the low-rank approximation of A, U_k = [u_1, ⋯, u_k], V_k = [v_1, ⋯, v_k], and Σ_k = diag(σ_1, ⋯, σ_k). The error of the low-rank approximation is represented as follows:

    ‖A − Ã‖²_F = ‖ ∑_{i=k+1}^{r} σ_i u_i v_i^T ‖²_F = ∑_{i=k+1}^{r} σ_i²

where ‖·‖_F is the Frobenius norm of a matrix, and r is the rank of the original matrix. The parameter k for low-rank approximation is determined by the following equation:

    k* = argmin_{1≤k≤r} k  s.t.  f(k) = (∑_{i=1}^{k} σ_i²) / (∑_{i=1}^{r} σ_i²) ≥ ξ    (1)

where ξ is a threshold between 0 and 1.
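As an illustration only (not the authors' implementation; the function name low_rank_approx is our own), the rank selection rule of Equation (1) can be sketched in Python with NumPy:

```python
import numpy as np

def low_rank_approx(A, xi=0.9):
    """Rank-k approximation of A, choosing k as the smallest rank whose
    retained singular-value energy f(k) reaches the threshold xi, as in
    Equation (1)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)   # f(k) for k = 1, ..., r
    k = int(np.searchsorted(energy, xi)) + 1      # smallest k with f(k) >= xi
    return U[:, :k], s[:k], Vt[:k, :]

# The squared Frobenius error of the approximation equals the energy of
# the discarded singular values, sum_{i=k+1}^{r} sigma_i^2.
```

Keeping U_k, Σ_k, and V_k instead of the raw block is what makes the storage phase below space-efficient.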

2.2 Incremental SVD

Incremental SVD dynamically calculates the SVD result of a matrix with newly arrived data rows. Suppose that we have the SVD result U_t, Σ_t, and V_t^T of a data matrix A_t ∈ R^{n×c} at time t. When an m×c matrix A_{t+1} arrives at time t+1, the purpose of incremental SVD is to efficiently obtain the SVD result of [A_t; A_{t+1}] based on the previous result U_t, Σ_t, and V_t^T. Note that [X; Y] denotes the vertical concatenation of two matrices X and Y. Incremental SVD is used to analyze patterns in time series data [33], and several efficient



Figure 2: Overview of Zoom-SVD. In the storage phase, Zoom-SVD stores the SVD results for length-b blocks of multiple time series data, updates the SVD result of the most recent block with newly arrived data, and discards previously arrived data. In the query phase, Zoom-SVD computes the SVD result in a given time range [ts, te] based on the small SVD results. The query phase exploits the Partial-SVD (Section 3.3.1) and Stitched-SVD (Section 3.3.2) modules to process the time range query.

methods for incremental SVD were proposed [2, 30]. This incremental SVD technique is exploited in our method to incrementally compress and store the data (see Algorithm 1 in Section 3.2).

2.3 Problem Definition

We formally define the time range query problem as follows:

Problem 1 (Time Range Query on Multiple Time Series).
• Given: a time range [ts, te], and multiple time series data represented by a matrix A ∈ R^{t×c}, where t is the length of the time dimension and c is the number of the time series,
• Find: the SVD result of the sub-matrix of A in the time range quickly, without storing all of A. The SVD result includes U ∈ R^{(te−ts+1)×k}, Σ ∈ R^{k×k}, and V ∈ R^{c×k}, where k is the rank of the sub-matrix.

Applying the standard SVD or incremental SVD for the time range query is impractical for the following reasons. Standard SVD needs to extract the sub-matrix corresponding to the time range before performing the decomposition. Iwen et al. [7] proposed a hierarchical and distributed approach for computing Σ and V, but not U, of a whole matrix A. Zadeh et al. [40] introduced Tall and Skinny SVD, which obtains Σ and V by computing the eigen-decomposition of A^T A, and then computes U using Σ, V, and A. Halko et al. [6] proposed Randomized SVD, which computes the SVD of A using randomized approximation techniques. However, such methods are inefficient because they need to compute SVDs from scratch for multiple overlapping queries. Furthermore, those methods need to keep the entire time series data A, which is practically infeasible in many streaming applications. Incremental SVD considers updates only on newly added data, and thus cannot perform SVD on a specific time range.

To address these limitations, we propose an efficient method forthe time range query in Section 3.

3 PROPOSED METHOD

We propose Zoom-SVD, a fast and space-efficient method for extracting key patterns from multiple time series data in an arbitrary time range. We first give an overview of Zoom-SVD in Section 3.1.

We describe the details of Zoom-SVD in Sections 3.2 and 3.3. Finally, we analyze Zoom-SVD's time and space complexities in Section 3.4.

3.1 Overview

Zoom-SVD efficiently extracts key patterns from multiple time series data in an arbitrary time range using SVD. The main challenges for the time range query problem (Problem 1) are as follows:

(1) Minimize the space cost. The amount of multiple time series data increases over time. How can we reduce the space while supporting time range queries?

(2) Minimize the time cost. How can we quickly compute SVD of multiple time series data in an arbitrary time range?

We address the above challenges with the following ideas:

(1) Compress multiple time series data (Section 3.2). Zoom-SVD compresses the raw data using incremental SVD, and discards the raw data in the storage phase.

(2) Avoid reconstructing the original data from the SVD results computed in the storage phase (Sections 3.3.1 and 3.3.2). We propose Partial-SVD and Stitched-SVD to compute SVD without reconstructing the original data corresponding to the query time range.

(3) Optimize the computational time of Stitched-SVD (Section 3.3.2). We optimize the performance of Stitched-SVD by reducing numerical computations using a block matrix structure.

Zoom-SVD comprises two phases: the storage phase and the query phase. In the storage phase (Algorithm 1), Zoom-SVD stores the SVD results corresponding to length-b blocks of the time series data in order to support time range queries, as shown in Figure 2. When new data arrive, Zoom-SVD incrementally updates the SVD result with the newly arrived data, block by block. In the query phase (Algorithm 2), Zoom-SVD returns the SVD result for a given time range [ts, te]. The query phase utilizes our proposed Partial-SVD and Stitched-SVD modules to process the time range query. Partial-SVD (Algorithm 3) manipulates the SVD result containing ts (or te) to match the query time range, as shown in Figure 2. Stitched-SVD (Algorithm 2) efficiently computes the SVD result between ts and te by stitching the SVD results for blocks in the time range.


3.2 Storage Phase of Zoom-SVD

Given a multiple time series stream A, the objective of the storage phase is to incrementally compress the input data and discard the original input data A to achieve space efficiency. A naive incremental SVD would update one large SVD result as data are newly added. However, this approach is impractical because the processing cost for newly added data increases over time. Also, the naive incremental SVD cannot support a time range query quickly in the query phase, because it manipulates the large SVD result stored for the total time regardless of the query time range.

The storage phase of Zoom-SVD (Algorithm 1) is designed to efficiently process newly added data and quickly support time range queries. Given multiple time series data A, the storage phase of Zoom-SVD (Algorithm 1) incrementally compresses the input data block by block using incremental SVD, and discards the original input data A to reduce the space cost. Assume the multiple time series data are represented by a matrix A ∈ R^{t×c}, where t is the time length and c is the number of time series (e.g., sensors). We conceptually divide the matrix A into length-b blocks represented by A(i) ∈ R^{b×c}, as shown in Figure 2. We then store the low-rank approximation result of each block matrix A(i), where we exploit an incremental SVD method in the process. We formally define the block matrix A(i) in Definition 1.

Definition 1 (Block matrix A(i)). Suppose a multivariate time series is A = [a_1; a_2; ⋯; a_t], where a_j ∈ R^{1×c} is the j-th row vector of A, and [;] denotes the vertical concatenation of vectors. The i-th block matrix A(i) is then represented as follows:

    A(i) ≡ [a_{b×i+1}; a_{b×i+2}; ⋯; a_{b×i+b}]

where b is the block size. In addition, A(i)_{t−1} denotes the i-th block matrix at time t−1, where i indicates the index of the most recent block, as shown in Figure 2. Note that the number of rows in A(i)_{t−1} is less than or equal to b. □

The computed SVD results U(i), Σ(i), and V(i) of each block matrix A(i) are stored as follows.

Definition 2 (Sets of SVD results U, S, and V). The sets U, S, and V store the SVD results U(i), Σ(i), and V(i) for all i, respectively. □

Note that the original time series data are discarded, and we store only the SVD results, which occupy less space than the original data. The SVD results for block matrices are used in the query phase (Algorithm 2). Now we are ready to describe the details of the storage phase.

The storage phase (Algorithm 1) compresses the multiple time series data block by block using incremental SVD to support time range queries. When new multiple time series data a_t ∈ R^{1×c} arrive at time t (line 2), we generate a new SVD result from a_t for the next block matrix A(i)_t if the SVD result of the previous block has already been stored in U, S, and V at time t−1 (lines 3 and 4). Otherwise, we have the SVD result U(i,t−1), Σ(i,t−1), and V(i,t−1) of the most recent block matrix A(i)_{t−1}, i.e., the i-th block matrix at time t−1 (the block matrix from time t−b+1 to t−1, as seen in Figure 2). We then update the

Algorithm 1 Storage phase of Zoom-SVD
Input: multiple time series data A
Output: set U of left singular vector matrices U(i), set S of singular value matrices Σ(i), and set V of right singular vector matrices V(i)
Parameters: block size b
1: Initialize: U(0,1), Σ(0,1), V(0,1) ← SVD result of a_1; block index i ← 0; time index t ← 2
2: while a_t is newly arriving do
3:   if U(i,t−1), Σ(i,t−1), V(i,t−1) do not exist then
4:     U(i,t), Σ(i,t), V(i,t) ← SVD result of a_t
5:   else
6:     U(i,t), Σ(i,t), V(i,t) ← Incremental-SVD(U(i,t−1), Σ(i,t−1), V(i,t−1), a_t)
7:   end if
8:   if the number of rows of U(i,t) is equal to b then
9:     U ← U ∪ {U(i,t)}, S ← S ∪ {Σ(i,t)}, V ← V ∪ {V(i,t)}
10:    i ← i + 1
11:  end if
12:  t ← t + 1
13: end while

SVD result into U(i,t), Σ(i,t), and V(i,t) for the new data a_t using an incremental SVD method (line 6). If the number of rows of U(i,t) is b, we put the SVD results U(i,t), Σ(i,t), and V(i,t) into U, S, and V, respectively (lines 8∼11). Equations (2) and (3) give the details of how to update the SVD result of A(i)_{t−1} for the new incoming data a_t when A(i)_{t−1} contains b−1 rows. A(i)_t is represented by a_t and the SVD result of A(i)_{t−1} in Equation (2):

    A(i)_t = [A(i)_{t−1}; a_t] ≃ [U(i,t−1)Σ(i,t−1)V^T_(i,t−1); a_t]
           = [U(i,t−1)  O_{(b−1)×1}; O_{1×k_{t−1}}  I_{1×1}] [Σ(i,t−1)V^T_(i,t−1); a_t]    (2)

where O_{x×y} is an x×y zero matrix, and I_{x×x} is an x×x identity matrix. We then perform SVD to decompose [Σ(i,t−1)V^T_(i,t−1); a_t] ∈ R^{(k_{t−1}+1)×c} into Ũ Σ̃ Ṽ^T:

    [U(i,t−1)  O_{(b−1)×1}; O_{1×k_{t−1}}  I_{1×1}] [Σ(i,t−1)V^T_(i,t−1); a_t]
    = ([U(i,t−1)  O_{(b−1)×1}; O_{1×k_{t−1}}  I_{1×1}] Ũ) Σ̃ Ṽ^T ≡ U(i,t)Σ(i,t)V^T_(i,t)    (3)

where U(i,t) = [U(i,t−1)  O_{(b−1)×1}; O_{1×k_{t−1}}  I_{1×1}] Ũ, Σ(i,t) = Σ̃, and V^T_(i,t) = Ṽ^T. Note that U(i,t) is a column orthogonal matrix since it is the product of two column orthogonal matrices. V(i,t) is also column orthogonal, and Σ(i,t) is a diagonal matrix whose diagonal entries are sorted in descending order. Hence, U(i,t), Σ(i,t), and V(i,t) are considered as the SVD result of A(i)_t by the definition of SVD [38]. The time index t can be omitted, as in U(i), Σ(i), V(i), and A(i) of Definitions 1 and 2, once the number of rows of U(i,t) reaches b.
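A minimal sketch of the single-row update in Equations (2) and (3) (our own illustration with a hypothetical function name, not the authors' code):

```python
import numpy as np

def incremental_svd_row(U, s, Vt, a):
    """Append one row `a` (shape 1 x c) to a matrix whose SVD is
    U @ diag(s) @ Vt: SVD the small stacked matrix [diag(s) @ Vt; a]
    as in Equation (3), then absorb its left factor into the block
    matrix [U, O; O, I] of Equation (2)."""
    k = len(s)
    small = np.vstack([np.diag(s) @ Vt, a])              # (k+1) x c
    U_tilde, s_new, Vt_new = np.linalg.svd(small, full_matrices=False)
    block = np.block([
        [U, np.zeros((U.shape[0], 1))],
        [np.zeros((1, k)), np.ones((1, 1))],
    ])                                                   # [U, O; O, I]
    return block @ U_tilde, s_new, Vt_new

# Feeding rows one at a time reproduces the SVD of the whole block.
rows = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(rows[:1], full_matrices=False)
for a in rows[1:]:
    U, s, Vt = incremental_svd_row(U, s, Vt, a[None, :])
```

Only the small (k+1)×c matrix is decomposed at each step, which is what keeps the per-arrival cost independent of the stream length.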

3.3 Query Phase of Zoom-SVD

Given the starting point ts and the ending point te of a time range query, the goal of the query phase of Zoom-SVD is to obtain the SVD result from ts to te. A naive approach would reconstruct the time series data from the SVD results of the block matrices ranged between ts and te, and perform SVD on the reconstructed data in the range. However, this approach requires heavy computation



Figure 3: Example of Partial-SVD. Given a time range query [ts, te], we remove the rows of U(S) and U(E) which are out of the query time range. Partial-SVD exploits SVD on the filtered matrices for left singular vector matrices and singular value matrices within the query time range (red-colored boxes). Then Partial-SVD computes right singular vector matrices of the output SVD results by multiplying relevant matrices (blue-colored boxes).

Algorithm 2 Query phase of Zoom-SVD
Input: sets U, S, and V of block SVD results, starting point ts, and ending point te
Output: SVD result U(S:E), Σ(S:E), and V(S:E) in [ts, te]
1: U′(S), Σ′(S), V′(S), U′(E), Σ′(E), and V′(E) ← Partial-SVD(ts, te, U, S, V)
2: ΣV^T ← [Σ′(S)V′^T(S); Σ(S+1)V^T(S+1); ⋯; Σ′(E)V′^T(E)]
3: U_r, Σ_r, V^T_r ← low-rank approximation of ΣV^T using SVD
4: V(S:E) ← V_r and Σ(S:E) ← Σ_r
5: U(S:E) ← [U′(S)U_{r,(S)}; U(S+1)U_{r,(S+1)}; ⋯; U′(E)U_{r,(E)}], where U_{r,(i)} is the block of rows of U_r corresponding to block i
6: return U(S:E), Σ(S:E), and V(S:E)

especially for a long time range query, and thus is not appropriate for serving time range queries quickly.

We propose two sub-modules, Partial-SVD and Stitched-SVD, which are used in the query phase of our proposed method (Algorithm 2) to efficiently process time range queries by avoiding reconstruction of the raw data. Let S be the index of the block matrix including ts, and E be the index of the block matrix including te. Partial-SVD (Algorithm 3) adjusts the time range of the SVD results for A(S) and A(E), as seen in the red-colored boxes of Figure 3 (line 1 of Algorithm 2). Stitched-SVD combines the SVD results of Partial-SVD and those of the block matrices from A(S+1) to A(E−1) (lines 2 to 5 in Algorithm 2). We describe the details of Partial-SVD and Stitched-SVD in Sections 3.3.1 and 3.3.2, respectively.
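The stitching step (lines 2 to 5 of Algorithm 2) can be sketched as follows; this is our own illustration, omitting the low-rank truncation of line 3 and the paper's block-structure optimization:

```python
import numpy as np

def stitched_svd(blocks):
    """Stitch per-block SVD results (U_i, s_i, Vt_i) of vertically
    stacked row blocks into one SVD of the concatenated matrix."""
    # line 2: stack the small matrices Sigma_i @ V_i^T
    stacked = np.vstack([np.diag(s) @ Vt for _, s, Vt in blocks])
    # line 3: SVD of the stacked matrix (truncation omitted here)
    Ur, sr, Vtr = np.linalg.svd(stacked, full_matrices=False)
    # line 5: multiply each U_i with the matching row slice of Ur
    parts, offset = [], 0
    for U, s, _ in blocks:
        parts.append(U @ Ur[offset:offset + len(s), :])
        offset += len(s)
    return np.vstack(parts), sr, Vtr
```

Because only the small stacked matrix is decomposed, the cost depends on the number of retained singular values per block rather than on the raw length of the time range.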

3.3.1 Partial-SVD. This module manipulates the SVD results of the block matrices A(S) and A(E) to return the SVD results in a given time range [ts, te]. As seen in Figure 2, A(S) may contain the time range before ts, and A(E) may include the time range after te. Note that those time ranges are out of the time range of the given query; thus, our goal for this module is to extract SVD results from A(S) and A(E) according to the time range query without reconstructing raw data. Figure 3 depicts the operation of Partial-SVD. For the block matrix A(S) and its SVD U(S)Σ(S)V^T(S), Partial-SVD first eliminates the rows of the left singular vector matrix U(S) which are out of the query time range. After that, Partial-SVD multiplies the

Algorithm 3 Partial-SVD
Input: starting point ts, ending point te, and sets U, S, and V of SVD results
Output: SVD results U′(S), Σ′(S), V′(S), U′(E), Σ′(E), and V′(E) of A(S) and A(E) within the time range
1: S ← ts/b and E ← te/b
2: bS ← b − ts%b and bE ← te%b (%: the modulus operator)
3: construct Xs and Xe based on bS and bE as in Equation (4)
4: Ū(S), Σ̄(S), V̄^T(S) ← low-rank approximation of XsU(S)Σ(S) using SVD
5: U′(S) ← Ū(S), Σ′(S) ← Σ̄(S), and V′^T(S) ← V̄^T(S)V^T(S)
6: Ū(E), Σ̄(E), V̄^T(E) ← low-rank approximation of XeU(E)Σ(E) using SVD
7: U′(E) ← Ū(E), Σ′(E) ← Σ̄(E), and V′^T(E) ← V̄^T(E)V^T(E)
8: return U′(S), Σ′(S), V′(S), U′(E), Σ′(E), and V′(E)

remaining left singular vector matrix XsU(S) with the singular value matrix Σ(S), and performs the SVD Ū(S)Σ̄(S)V̄T(S) ← XsU(S)Σ(S) of the resulting matrix. The resulting singular vector matrix Ū(S) and the singular value matrix Σ̄(S) constitute the output of Partial-SVD. The remaining right singular vector matrix output of Partial-SVD is computed by multiplying the right singular vector matrix V̄T(S) with VT(S). Similar operations are performed for the block matrix A(E) and its SVD U(E)Σ(E)VT(E).

Now, we describe the details of this module (Algorithm 3). We first introduce the elimination matrices which Partial-SVD uses to adjust the time range.

Definition 3 (Elimination matrices). Suppose rS is the number of rows to be eliminated in A(S) according to ts. Then bS = b − rS is the number of remaining rows in A(S). Similarly, let rE be the number of rows to be eliminated in A(E) according to te; then bE = b − rE is the number of remaining rows in A(E). The elimination matrices Xs and Xe for A(S) and A(E) are defined as follows:

    Xs = [ O_{bS×rS}  I_{bS×bS} ],    Xe = [ I_{bE×bE}  O_{bE×rE} ]    (4)
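The elimination matrices of Equation (4) are just selection matrices: left-multiplying by Xs drops the first rS rows of a block, and by Xe drops the last rE rows. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def elimination_matrices(b, r_S, r_E):
    """Build the elimination matrices X_s and X_e of Equation (4).

    b   : block size (rows per block)
    r_S : rows to drop from the front of block A(S) (before t_s)
    r_E : rows to drop from the back of block A(E) (after t_e)
    """
    b_S, b_E = b - r_S, b - r_E
    # X_s = [ O_{b_S x r_S} | I_{b_S x b_S} ] keeps the last b_S rows of A(S)
    X_s = np.hstack([np.zeros((b_S, r_S)), np.eye(b_S)])
    # X_e = [ I_{b_E x b_E} | O_{b_E x r_E} ] keeps the first b_E rows of A(E)
    X_e = np.hstack([np.eye(b_E), np.zeros((b_E, r_E))])
    return X_s, X_e

# X_s @ A drops the first r_S rows; X_e @ A drops the last r_E rows.
A = np.arange(20.0).reshape(5, 4)        # toy block with b = 5 rows
X_s, X_e = elimination_matrices(5, 2, 1)
assert np.allclose(X_s @ A, A[2:])       # last 3 rows survive
assert np.allclose(X_e @ A, A[:4])       # first 4 rows survive
```

In practice one would slice the rows directly rather than materialize Xs and Xe; the explicit matrices are useful here only to mirror the derivation.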

The matrices A(S) and A(E) are multiplied by the elimination matrices, and the time ranges of the resulting matrices XsA(S) and XeA(E) lie within the query time range [ts, te]. Partial-SVD constructs these elimination matrices based on bS and bE (line 3 of Algorithm 3). The filtered block matrix XsA(S) is given by

    XsA(S) ≃ Xs(U(S)Σ(S)VT(S)) = (XsU(S)Σ(S))VT(S)    (5)

where A(S) ≃ U(S)Σ(S)VT(S) was computed in the storage phase. Partial-SVD decomposes XsU(S)Σ(S) into Ū(S)Σ̄(S)V̄T(S) via SVD and low-rank approximation with threshold ξ, since XsU(S) is not a column orthogonal matrix and XsU(S)Σ(S)VT(S) is therefore not in the form of an SVD result; then, Equation (5) is written as follows:

    (XsU(S)Σ(S))VT(S) ≃ Ū(S)Σ̄(S)(V(S)V̄(S))T = U′(S)Σ′(S)V′T(S)    (6)

where U′(S) = Ū(S), Σ′(S) = Σ̄(S), and V′T(S) = V̄T(S)VT(S). In line 4, Partial-SVD performs SVD on XsU(S)Σ(S), and in line 5 it computes V′T(S) = V̄T(S)VT(S). Partial-SVD similarly computes the SVD result of XeA(E) in lines 6–7 of Algorithm 3.
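Lines 4–5 of Algorithm 3, as we read them, can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' Java/JBLAS implementation; the function name and the cumulative-energy rank cut (standing in for the paper's Equation (1)) are our assumptions:

```python
import numpy as np

def partial_svd(U, s, Vt, X, xi=0.98):
    """Re-orthogonalize a block SVD after row elimination (Partial-SVD sketch).

    Given A ≈ U diag(s) Vt and an elimination matrix X, we have
    X A ≈ (X U diag(s)) Vt.  X U is no longer column orthogonal, so we
    take a small SVD of the (#kept rows x k) matrix X U diag(s) and fold
    its right factor into Vt, keeping singular values whose cumulative
    squared energy reaches the threshold xi (Equation (6)).
    """
    B = X @ U @ np.diag(s)                       # small matrix, not b x c
    Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
    energy = np.cumsum(sb**2) / np.sum(sb**2)
    k = int(np.searchsorted(energy, xi)) + 1     # low-rank cut at xi
    # U' = Ub, Σ' = diag(sb), V'^T = Vbt @ Vt
    return Ub[:, :k], sb[:k], Vbt[:k] @ Vt

# With xi = 1.0 (no truncation) the result reproduces X @ A exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = np.hstack([np.zeros((4, 2)), np.eye(4)])     # drop first 2 rows
Up, sp, Vpt = partial_svd(U, s, Vt, X, xi=1.0)
assert np.allclose(X @ A, Up @ np.diag(sp) @ Vpt)
```

Note that the SVD is taken on a matrix with at most k + 1 columns, which is why Partial-SVD stays cheap regardless of the block size.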


3.3.2 Stitched-SVD. This module combines the Partial-SVD results of A(S) and A(E) and the stored SVD results of the block matrices A(S+1), A(S+2), · · · , A(E−1) in the query time range [ts, te] to return the final SVD result corresponding to the query range, as shown in Figure 2. A naive approach is to reconstruct the data blocks using the stored SVD results and perform SVD on the reconstructed data of the given query time range. However, this approach cannot provide fast query speed for a long time range due to the heavy computations induced by the reconstruction and the following SVD. The goal of Stitched-SVD is to efficiently stitch the SVD results in the query time range, avoiding reconstruction and minimizing the numerical computation of matrix multiplication.

Specifically, Stitched-SVD stitches several consecutive block SVD results together to compute the SVD corresponding to the query time range: i.e., it combines the SVD results U(i), Σ(i), and V(i) of the ith block matrix A(i), for i = S..E, to compute the SVD U(S:E), Σ(S:E), and V(S:E). The main idea is 1) to carefully decouple the matrices U(i) from Σ(i)V(i), 2) construct a stacked matrix containing Σ(i)V(i) for i = S..E, 3) perform SVD on the stacked matrix to get the singular value matrix and the right singular vector matrix of the final SVD result, and 4) carefully combine U(i) with the left singular vector matrix of the SVD of the stacked matrix to get the left singular vector matrix of the final SVD result.

Lines 2 to 5 of Algorithm 2 present how the stitched SVD matrices are computed. First, we construct ΣVT based on the block matrix structure, where ΣVT is equal to [Σ′(S)V′T(S); Σ(S+1)VT(S+1); · · · ; Σ′(E)V′T(E)]T. After organizing the block matrix structure of ΣVT, we define the block diagonal matrix diag(S:E) as follows.

Definition 4 (Block diagonal matrix). Suppose U′(S) and U′(E) are the left singular vector matrices produced by Partial-SVD. Let U(S+1), U(S+2), · · · , U(E−1) be the left singular vector matrices in U. The block diagonal matrix diag(S:E) is defined as follows:

    diag(S:E) = ⎡ U′(S)    O       · · ·   O     ⎤
                ⎢ O        U(S+1)  · · ·   O     ⎥
                ⎢ ⋮        ⋮       ⋱       ⋮     ⎥
                ⎣ O        O       · · ·   U′(E) ⎦

Then, the matrix corresponding to the time range query [ts, te] is represented as follows:

    A(S:E) = [ XsA(S) ; A(S+1) ; · · · ; XeA(E) ]
           ≃ [ U′(S)Σ′(S)V′T(S) ; U(S+1)Σ(S+1)VT(S+1) ; · · · ; U′(E)Σ′(E)V′T(E) ]
           = diag(S:E)ΣVT    (7)

where Xs and Xe are the elimination matrices of Partial-SVD, and ΣVT is equal to [Σ′(S)V′T(S); Σ(S+1)VT(S+1); · · · ; Σ′(E)V′T(E)]T. As we apply SVD and low-rank approximation to ΣVT, Equation (7) becomes:

    diag(S:E)ΣVT ≃ diag(S:E)UrΣrVTr = U(S:E)Σ(S:E)VT(S:E)    (8)

where ΣVT ≃ UrΣrVTr is computed by low-rank approximation and SVD, U(S:E) = diag(S:E)Ur, Σ(S:E) = Σr, and V(S:E) = Vr. To avoid matrix multiplication between Ur and the zero sub-matrices of diag(S:E), we split Ur block by block as follows:

    Ur = [ U′r(S) ; Ur(S+1) ; · · · ; U′r(E) ]T

where U′r(S) and U′r(E) correspond to U′(S) and U′(E), respectively, and Ur(i) corresponds to U(i) for S + 1 ≤ i ≤ E − 1. Then U(S:E) = diag(S:E)Ur of Equation (8) is computed as follows:

    U(S:E) = diag(S:E)Ur = [ U′(S)U′r(S) ; U(S+1)Ur(S+1) ; · · · ; U′(E)U′r(E) ]T    (9)

The column orthogonality of U(S:E) = diag(S:E)Ur is established since it is the product of two column orthogonal matrices; also, V(S:E) = Vr is column orthogonal. Note that we perform Partial-SVD to satisfy the column orthogonality condition before performing Stitched-SVD.
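The stitching idea of Equations (7)–(9) can be sketched compactly in numpy. This is our illustrative reading, not the authors' implementation; the function name and the energy-based rank cut (standing in for Equation (1)) are assumptions:

```python
import numpy as np

def stitched_svd(Us, ss, Vts, xi=0.98):
    """Combine per-block SVDs into one SVD over the whole range (sketch).

    Us, ss, Vts: lists of per-block factors with A(i) ≈ U(i) diag(s(i)) Vt(i).
    1) Stack the small matrices diag(s(i)) Vt(i) vertically and SVD the stack.
    2) Σ and V of the stitched result come straight from that SVD.
    3) U comes from multiplying each U(i) with the matching row-block of Ur,
       i.e. U = diag(U(S), ..., U(E)) Ur computed block by block, so the
       zero sub-blocks of diag(S:E) are never touched (Equation (9)).
    """
    stack = np.vstack([np.diag(s) @ Vt for s, Vt in zip(ss, Vts)])
    Ur, sr, Vrt = np.linalg.svd(stack, full_matrices=False)
    energy = np.cumsum(sr**2) / np.sum(sr**2)
    k = int(np.searchsorted(energy, xi)) + 1
    Ur, sr, Vrt = Ur[:, :k], sr[:k], Vrt[:k]
    # Split Ur by block and apply each U(i) to its own row slice.
    out, row = [], 0
    for U, s in zip(Us, ss):
        out.append(U @ Ur[row:row + len(s)])
        row += len(s)
    return np.vstack(out), sr, Vrt
```

The SVD here is taken on a matrix with only Σ k(i) rows rather than te − ts + 1 rows, which is where the speedup over the naive reconstruct-then-decompose approach comes from.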

3.4 Theoretical Analysis

We theoretically analyze our proposed method Zoom-SVD in terms of time and memory cost. Note that a collection of multiple time series data A is a dense matrix, and the time complexity to compute the SVD of A ∈ R^{t×c} is O(min(t²c, tc²)).

Time Complexity. We analyze the time complexities of the storage and the query phases in Theorems 1 and 2, respectively.

Theorem 1. When a vector at ∈ R^{1×c} is given at time t, the computation cost of the storage phase in Zoom-SVD is O(k²(b + c)), where k is the number of singular values.

Proof. Performing SVD of [Σ(i,t−1)VT(i,t−1); at] takes O(min(c²(1 + k_{t−1}), c(1 + k_{t−1})²)), and the multiplication of the block diagonal matrix [U(i,t−1) O; O I] with Ū takes O(b(k_{t−1} + 1)k_t), since the row length of [U(i,t−1) O; O I] is always smaller than or equal to the block size b. Assume k_{t−1} and k_t are equal to k. The total computational cost of the storage phase in Zoom-SVD is then O(min(c²k, ck²) + bk²). We simply express the computational cost of storing the incoming data at each time tick as O(k²(b + c)), since the number c of columns is generally greater than k. ∎

In Theorem 1, storing the incoming data at each time tick takes constant time, since b and c are constants and k is smaller than c.
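The storage-phase update analyzed in Theorem 1 can be sketched as follows. This is our numpy reading of the proof's update rule (SVD of the small stacked matrix, then lifting with a block diagonal matrix), not the authors' Java code; the function name and energy-based truncation are assumptions:

```python
import numpy as np

def append_row(U, s, Vt, a, xi=0.98):
    """One storage-phase step (sketch): fold a new row a (length c) into
    the running SVD of the current block, A_t ≈ U diag(s) Vt.

    SVD the small matrix [diag(s) Vt ; a] of size (k+1) x c, then lift
    its left factor back with blockdiag(U, 1), as in Theorem 1's proof.
    """
    small = np.vstack([np.diag(s) @ Vt, a.reshape(1, -1)])
    Uh, sh, Vht = np.linalg.svd(small, full_matrices=False)
    energy = np.cumsum(sh**2) / np.sum(sh**2)
    k = int(np.searchsorted(energy, xi)) + 1
    lift = np.block([[U, np.zeros((U.shape[0], 1))],
                     [np.zeros((1, U.shape[1])), np.ones((1, 1))]])
    return lift @ Uh[:, :k], sh[:k], Vht[:k]

# Feed rows one by one; with xi = 1.0 the block is reproduced exactly.
rng = np.random.default_rng(2)
rows = rng.standard_normal((4, 3))
n0 = np.linalg.norm(rows[0])
U, s, Vt = np.ones((1, 1)), np.array([n0]), (rows[0] / n0).reshape(1, -1)
for a in rows[1:]:
    U, s, Vt = append_row(U, s, Vt, a, xi=1.0)
assert np.allclose(U @ np.diag(s) @ Vt, rows)
```

The per-step cost is dominated by the SVD of a (k + 1) × c matrix and a b × k product, matching the O(k²(b + c)) bound.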

Theorem 2. Given a time range query [ts, te], the time cost of the query phase (Algorithm 2) is O((te − ts)k(k + c²/b) + k²(b + c)).

Proof. It takes O((b − rS + c)k′²(S) + (b − rE + c)k′²(E)) to compute Partial-SVD, where k′(S) and k′(E) are the numbers of singular values computed by Partial-SVD (line 1 in Algorithm 2). The computational time to perform SVD of [Σ′(S)V′T(S); Σ(S+1)VT(S+1); · · · ; Σ′(E)V′T(E)] in Stitched-SVD is O(min(c²(k′(S) + k′(E) + Σ_{i=S+1}^{E−1} k(i)), c(k′(S) + k′(E) + Σ_{i=S+1}^{E−1} k(i))²)), since the horizontal and vertical lengths of the matrix are c and (k′(S) + k′(E) + Σ_{i=S+1}^{E−1} k(i)), respectively (lines 2–4 in Algorithm 2). Also, the block matrix multiplication for U(S:E) (line 5 in Algorithm 2) takes O(k(S:E)((b − rS)k′(S) + (b − rE)k′(E) + b Σ_{i=S+1}^{E−1} k(i))), where k(S:E) is the number of singular values with respect to the SVD result of the given time range [ts, te]. Let all k(i)'s be k in the query phase, let k′(S) + k′(E) + Σ_{i=S+1}^{E−1} k(i) be larger than c, and replace b − ri with the block size b since b is always greater than b − ri. Then, the computational time of Partial-SVD is O(k²(b + c)) and that of Stitched-SVD is O((te − ts)k(k + c²/b)). We can simply express the computational cost of Zoom-SVD as O((te − ts)k(k + c²/b) + k²(b + c)). ∎

Table 2: Description of real-world multiple time series datasets. Each dataset is a matrix in R^{t×c}, where t corresponds to the total length (time) and c corresponds to the number of time series.

    Dataset    | Total length (t) | Attributes (c)
    Activity¹  | 382,000          | 41
    Gas²       | 4,107,000        | 16
    London³    | 350,000          | 10

Theorem 2 implies that the computational time of Zoom-SVD in the query phase linearly depends on the time range (te − ts).

Space Complexity. We analyze the space complexity of the storage phase in Theorem 3.

Theorem 3. The space complexity of Zoom-SVD for storing data is O(tk(1 + k/b + c/b)), where t is the total time length and k is the number of singular values.

Proof. At time t, we have ⌊t/b⌋ SVD results, where U(i) ∈ R^{b×k(i)}, Σ(i) ∈ R^{k(i)×k(i)}, and V(i) ∈ R^{k(i)×c}. Therefore, the total number of elements in the stored matrices is O(Σ_{i=1}^{⌊t/b⌋} k(i)(b + k(i) + c)). Assuming each k(i) is equal to k, we briefly express the space cost of Zoom-SVD as O(tk(1 + k/b + c/b)). ∎

The space cost depends linearly on t and k, but k is much smaller than the number c of time series. Therefore, Zoom-SVD efficiently compresses incoming data using space linear in the time length, with a low coefficient.
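A back-of-envelope instance of the Theorem 3 space model, counting stored floats versus the raw t × c matrix. The rank k = 4 and the Gas-like setting c = 16 are made-up illustrative values, not measurements from the paper:

```python
def stored_floats(t, b, k, c):
    """Floats kept by the storage phase: t/b blocks, each holding
    U (b x k), the k singular values, and V (k x c) -- i.e. the
    O(t k (1 + k/b + c/b)) bound of Theorem 3."""
    blocks = t // b
    return blocks * (b * k + k + k * c)

# Hypothetical setting: t = 4,000,000 rows, block size b = 1000,
# c = 16 series, and an assumed average rank k = 4 per block.
t, b, c, k = 4_000_000, 1000, 16, 4
raw = t * c
compressed = stored_floats(t, b, k, c)
ratio = raw / compressed          # > 1 whenever k is well below c and b
```

The model makes the main point of Theorem 3 concrete: the compression ratio is driven by how small the per-block rank k is relative to c and b.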

4 EXPERIMENT

We aim to answer the following questions to evaluate the performance of our method Zoom-SVD:
• Q1. Time cost (Section 4.2). How quickly does Zoom-SVD process time range queries compared to other methods?
• Q2. Space cost (Section 4.3). How much memory space does Zoom-SVD require for compressed blocks?
• Q3. Time-Space-Accuracy Trade-off (Section 4.4). What are the trade-offs between query time, space, and accuracy of Zoom-SVD compared to other baselines?
• Q4. Parameter (Section 4.5). How does the block size b in Algorithm 1 affect the performance of Zoom-SVD in terms of time and space?

4.1 Experiment Settings

Methods. We compare Zoom-SVD with Randomized SVD [6], Tall and Skinny SVD [40], and the basic SVD of the JBLAS library. All these methods are implemented in Java, and we use JBLAS, an open-source Java basic linear algebra package, to support matrix operations.

¹https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring
²https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures
³http://www.londonair.org.uk/london/asp/publicdetails.asp

Figure 4: Running time of Zoom-SVD's storage phase on the Activity and Gas datasets ((a) Activity, slope = 0.9721; (b) Gas, slope = 0.9873; log-log axes of total time length vs. running time in msec). The results show that the time to update all data is linear in the total time length t. The reason is that Zoom-SVD incrementally reads a new input vector and computes the SVD of the block matrix in constant time. The pattern on the London dataset is similar to the above results.

Figure 5: Space savings by Zoom-SVD (memory usage in MB of the data compressed by Zoom-SVD vs. the original data). Zoom-SVD requires 7.88×, 15.63×, and 7.19× less space than the original data for Activity, Gas, and London, respectively.

Dataset. We use the multiple time series datasets [4, 29] described in Table 2. The Activity dataset contains timestamped data with 41 measured values, such as heartbeat and inertial measurement unit (IMU) readings from sensors installed on the hands, chest, and ankle. The Gas dataset consists of timestamps and measurements of 16 chemical sensors of 4 different types. The Activity and Gas datasets are obtained from the UCI repository. The London dataset consists of timestamps and attributes related to London air quality, such as nitric oxide, temperature, and so on.

Parameters. For all experiments except those in Section 4.5, we set the block size b to 1,000 in the storage phase; we evaluate the effect of the block size b in Section 4.5. We set the rank k using Equation (1) with ξ = 0.98 in the storage phase.
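The rank selection driven by the threshold ξ can be sketched as follows. We assume Equation (1) keeps the smallest k whose leading singular values carry a ξ fraction of the total squared energy (a common convention; the exact form is in the paper's Section 3), and the function name is ours:

```python
import numpy as np

def rank_for_threshold(singular_values, xi=0.98):
    """Smallest k whose leading singular values carry a xi fraction of
    the total squared energy (our reading of Equation (1))."""
    s = np.asarray(singular_values, dtype=float)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, xi)) + 1

# s^2 = [100, 9, 1, 0.01]: the first value alone carries ~90.9% of the
# energy, the first two carry ~99.1%, so xi = 0.98 selects k = 2.
assert rank_for_threshold([10, 3, 1, 0.1], xi=0.98) == 2
```

Raising ξ keeps more singular values per block, trading space and query time for lower reconstruction error, which is exactly the trade-off studied in Section 4.4.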

4.2 Time Cost

We examine the time costs of the storage and the query phases of our proposed method Zoom-SVD.

Storage phase (Algorithm 1). We measure the running time of the storage phase for multiple time series data varying the time length. As shown in Figure 4, the running time of Zoom-SVD's storage phase is linearly proportional to the time length over all datasets. The reason is that the storage phase processes the time series data row by row, and the time cost of processing a row is


Figure 6: The trade-off between query time, space, and reconstruction error on three real-world datasets (panels: (a) query time vs. error in Activity, (b) in Gas, (c) in London; (d) memory usage vs. error in Activity, (e) in Gas, (f) in London; compared methods: Zoom-SVD, Randomized SVD, and TallSkinny SVD, each with thresholds ξ = 0.95, 0.98, and 0.99). The colors represent the methods, and the shapes distinguish the threshold ξ. The bottom-left region indicates better performance. On all the real-world datasets, Zoom-SVD shows the best performance.

constant, as discussed in Theorem 1; thus, the total time cost mainly depends on the time length.

Query phase (Algorithm 2). We evaluate the performance of the query phase in terms of running time. We compare Zoom-SVD to the other SVD-based methods discussed in Section 2, and measure the query time varying the length of the query range. The starting and ending points of the query are arbitrarily chosen, and we increase the length of the query range from 10⁴ to 3.2 × 10⁵. As shown in Figure 1, Zoom-SVD is up to 9.6× faster than the second best competitor Randomized SVD on the Activity dataset. Also, Zoom-SVD shows up to 15× and 12× faster query speed than Randomized SVD on the Gas and London datasets, respectively.

4.3 Space Cost

We evaluate the compression performance of Zoom-SVD. In the storage phase, we store the SVD results of multiple time series data with block size b = 10³, using low-rank approximation with threshold ξ = 0.98 in Equation (1). We measure the compression ratio between the original data and the block-compressed data (i.e., the SVD results) produced by our method. Figure 5 shows the compression performance for storing time series data: Zoom-SVD requires 7.88×, 15.63×, and 7.19× less space than the original data for Activity, Gas, and London, respectively.

4.4 Trade-off between Accuracy and Efficiency

We evaluate the trade-off of Zoom-SVD between accuracy, time, and space compared to other methods. We measure the time and the space usage by setting the length of the time range query to 1.8 × 10⁵. The accuracy of each method is measured by the reconstruction error ∥X(ts:te) − X̂(ts:te)∥²F / ∥X(ts:te)∥²F, where X̂(ts:te) is the reconstructed data (i.e., X̂(ts:te) = U(S:E)Σ(S:E)VT(S:E)) and X(ts:te) is the original input data in the query time range [ts, te]. We use the values 0.95, 0.98, and 0.99 for the threshold ξ in the low-rank approximation of each method to investigate the effect of ξ on the trade-off. Figure 6 demonstrates the experimental results on the trade-offs of the SVD-based methods including Zoom-SVD. Figures 6(a), 6(b), and 6(c) show that the trade-off of Zoom-SVD between query time and reconstruction error is better than those of the other methods over all datasets. Also, Zoom-SVD shows a better trade-off between space and reconstruction error than its competitors, as shown in Figures 6(d), 6(e), and 6(f). These results indicate that Zoom-SVD handles time range queries more efficiently, with smaller error and space, than other SVD-based methods.
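The relative squared Frobenius error used above is straightforward to compute; a minimal numpy sketch (function name ours):

```python
import numpy as np

def reconstruction_error(X, U, s, Vt):
    """Relative squared Frobenius error of Section 4.4:
    ||X - U diag(s) Vt||_F^2 / ||X||_F^2."""
    X_hat = U @ np.diag(s) @ Vt
    return np.linalg.norm(X - X_hat, 'fro')**2 / np.linalg.norm(X, 'fro')**2

# A full-rank SVD reconstructs X exactly, so the error is ~0; truncating
# to fewer singular values increases it.
rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert reconstruction_error(X, U, s, Vt) < 1e-12
```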

4.5 Parameter Sensitivity

We examine the effect of the block size b in terms of query time and space cost. As the block size b grows from 10 to 10⁵, we measure the query time and the memory usage in the query phase, setting the query time range te − ts + 1 to 10³, 10⁴, and 10⁵ in Algorithm 2. In Figure 7, the query time and memory usage show trade-off characteristics. The query time and the space usage for the Activity dataset decrease until the block size reaches 10²–10³, and then increase once the block size b exceeds 10³. Recall that Zoom-SVD consists of the Partial-SVD and Stitched-SVD modules in the query phase. In Figure 7(a), the query time is dominated by the computation time of Stitched-SVD when the block size b is relatively small. The reason is that the computation time of Partial-SVD decreases as the block size b decreases, while that of Stitched-SVD increases linearly with the number of blocks, which is inversely proportional to the block size. On the other


Figure 7: The effect of the block size b on the query time and the memory usage in the Activity dataset ((a) query time vs. block size, (b) space cost vs. block size, for query ranges 10³, 10⁴, and 10⁵). As the block size b grows from 10 to 10⁵, the query time and the memory usage decrease until the block size reaches 10³, and then increase continuously after the block size exceeds 10³.

hand, the query time highly depends on the computation time of Partial-SVD when the block size is relatively large. In Figure 7(b), the space cost is high when b is small because the number of stored singular vector matrices for the blocks increases. On the other hand, the space cost is also high when b is large because the number k of singular values increases with the block size b, and so does the size of the left singular vector matrices U(i). Therefore, we set the block size b to 10³, which is a near-optimal value according to the experimental results. The sensitivity of Zoom-SVD to the block size b shows similar patterns on the other datasets.

5 CASE STUDY

In this section, we present an analysis of the Gas sensor dataset using Zoom-SVD. For a given time range, Zoom-SVD searches past time ranges for patterns similar to that of the given range.

Data description. The Gas sensor dataset consists of 16 chemical sensors exposed to gas mixtures which contain ethylene and carbon monoxide (CO) with air. There are four different types of sensors: TGS2600, TGS2602, TGS2610, and TGS2620 (four sensors per type). The sensors are placed in a 60 ml measurement chamber. The electrical conductivities of the sensors are measured every 0.01 seconds, and the total time length is 12 hours.

Finding similar time ranges. Our goal is to find past time ranges whose patterns are similar to that of a given time range query, which we call the 'base'. Suppose a time range [ts, te] is given. First, we compute the SVD of the data in the range using Zoom-SVD. Next, we repeatedly compute SVDs of previous time ranges using Zoom-SVD by sliding a window of size te − ts + 1; note that we use a fixed window size for efficiency. We then compare u1 of the base with ū1 of each previous time range, where u1 is the first column of the left singular vector matrix U of the base, and ū1 is the first column of U of a previous time range. The experimental setting is as follows:

• The base time range is [4033294, 4053293], chosen randomly, so the size of the sliding window (time length) is 20,000. The sliding period is 500.
• We compute the cosine similarity (u1 · ū1) / (∥u1∥ ∥ū1∥) to compare the patterns, and then find the two most similar time ranges.
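The sliding-window search can be sketched as follows. This is an illustrative version that computes each window's u1 by a plain SVD (Zoom-SVD would instead answer each window from the stored block SVDs), the function names are ours, and we compare absolute cosines since the sign of a singular vector is arbitrary, an implementation detail the paper's formula does not spell out:

```python
import numpy as np

def top_u1(X):
    """First left singular vector u1 of a window (plain SVD here;
    Zoom-SVD would serve this from its compressed blocks)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0]

def most_similar_windows(X, ts, te, period=500, top=2):
    """Slide a window of the base length over the past and rank windows
    by |cos(u1_base, u1_window)| (sketch of the Section 5 procedure)."""
    w = te - ts + 1
    base = top_u1(X[ts:te + 1])
    scores = []
    for start in range(0, ts - w + 1, period):
        u = top_u1(X[start:start + w])
        cos = abs(base @ u) / (np.linalg.norm(base) * np.linalg.norm(u))
        scores.append((cos, start))
    return sorted(scores, reverse=True)[:top]
```

On data where a pattern repeats, windows aligned with the repetition score a cosine close to 1, which is how Answer1 and Answer2 below are identified.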

Figure 8 shows the first singular vectors of the given time range (denoted by 'Base') and the two previous time ranges with similar patterns that we found (denoted by 'Answer1' and 'Answer2'), together with the sensor data corresponding to each time range. In Figure 8(a), Zoom-SVD successfully finds the similar patterns, and each sensor has the same tendency across the three time ranges. We also discover a potential anomaly: in Figure 8(c), note an abnormal spike of Base at around time 17,960, which deviates significantly from the measurements of the two previous time ranges.

6 RELATED WORK

In this section, we discuss related work on incremental time series analysis and incremental SVD.

It is important to analyze on-line time series data efficiently. Several approaches are based on linear modeling, including Kalman filters (KF), linear dynamical systems (LDS) and their variants [8, 18, 19], time warping [37], and correlation-based methods [16, 20, 32, 41]. Among them, incremental SVD efficiently tracks the SVD of dynamic time series data where new data rows are incrementally attached to the current data. Incremental SVD has been used for updating SVD results when new terms or documents are added to a document-term matrix [3], and has been adopted to build incremental movie recommender systems [2, 33]. Brand [1] developed an incremental SVD method for incomplete data containing missing values. Papadimitriou et al. [23] proposed SPIRIT, which incrementally updates hidden factors in multiple time series, and exploited it to predict future signals and interpolate missing values in sensor streams. Ross et al. [30] proposed an incremental PCA for visual tracking applications. Papadimitriou et al. [24] introduced an incremental method to capture optimal recurring patterns indicating the main trends in time series data.

The methods above have limitations when applied to the time range query problem. Conventional methods such as the Discrete Fourier Transform [20], data compression [5, 28], and time series clustering [25] require the whole time series data, or additional data, to seek hidden patterns for a specific time range, thereby incurring high memory and time complexity. In contrast, our proposed Zoom-SVD efficiently serves time range queries with small memory and low time complexity.

7 CONCLUSIONS

We propose Zoom-SVD, a novel algorithm for finding key patterns in an arbitrary time range from multiple time series data. Zoom-SVD efficiently serves time range queries by compressing multiple time series data and reducing computation costs through carefully stitching compressed SVD results. Consequently, Zoom-SVD provides fast query time and small memory usage. We provide theoretical analysis of the time and space complexities of Zoom-SVD. Experiments show that Zoom-SVD requires up to 15.66× less space and runs up to 15× faster than existing methods. We also present a real-world case study of Zoom-SVD in searching past time ranges for patterns similar to that of a query time range. Future research includes extending the method to multiple distributed streams.


Figure 8: First left singular vectors of the base and of the two time ranges having patterns similar to that of the base, together with the corresponding sensor measurements ((a) first left singular vector u1 over the time length; (b)–(e) Sensors 1–4). Answer1 and Answer2 are the past time ranges we found using Zoom-SVD by sliding a window. The time range of Answer1 is [1589294, 1609293], and that of Answer2 is [2736794, 2756793]. In (a), the three left singular vectors have the same pattern. In (b)–(e), each sensor shows similar measurements for the three ranges in general. Note that Zoom-SVD discovers a potential anomaly: (c) shows an abnormal spike of Base around 17,960, which deviates vastly from the measurements of the past ranges.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2016M3C4A7952587, PF Class Heterogeneous High Performance Computer Development). The Institute of Engineering Research at Seoul National University provided research facilities for this work. The ICT at Seoul National University provided research facilities for this study. U Kang is the corresponding author.

REFERENCES

[1] Matthew Brand. 2002. Incremental singular value decomposition of uncertain data with missing values. ECCV (2002), 707–720.
[2] Matthew Brand. 2003. Fast online SVD revisions for lightweight recommender systems. In SDM. 37–46.
[3] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391.
[4] Jordi Fonollosa, Sadique Sheik, Ramón Huerta, and Santiago Marco. 2015. Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors and Actuators B: Chemical 215 (2015), 618–629.
[5] Sorabh Gandhi, Suman Nath, Subhash Suri, and Jie Liu. 2009. GAMPS: Compressing multi sensor data by grouping and amplitude scaling. In SIGMOD. ACM, 771–784.
[6] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Rev. 53, 2 (2011), 217–288.
[7] M. A. Iwen and B. W. Ong. 2016. A Distributed and Incremental SVD Algorithm for Agglomerative Data Analysis on Large Networks. SIAM J. Matrix Analysis Applications 37, 4 (2016), 1699–1718.
[8] Ankur Jain, Edward Y. Chang, and Yuan-Fang Wang. 2004. Adaptive stream resource management using Kalman filters. In SIGMOD. ACM.
[9] ByungSoo Jeon, Inah Jeon, Lee Sael, and U. Kang. 2016. SCouT: Scalable coupled matrix-tensor factorization - algorithm and discoveries. In ICDE. 811–822.
[10] Inah Jeon, Evangelos E. Papalexakis, Christos Faloutsos, Lee Sael, and U. Kang. 2016. Mining billion-scale tensors: algorithms and discoveries. VLDB J. 25, 4 (2016), 519–544.
[11] Inah Jeon, Evangelos E. Papalexakis, U. Kang, and Christos Faloutsos. 2015. HaTen2: Billion-scale Tensor Decompositions. In ICDE.
[12] Ian Jolliffe. 2002. Principal component analysis. Wiley Online Library.
[13] U. Kang, Brendan Meeder, and Christos Faloutsos. 2011. Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation. In PAKDD (2). 13–25.
[14] U. Kang, B. Meeder, E. Papalexakis, and C. Faloutsos. 2014. HEigen: Spectral Analysis for Billion-Scale Graphs. IEEE Transactions on Knowledge and Data Engineering 26, 2 (February 2014), 350–362. https://doi.org/10.1109/TKDE.2012.244
[15] U. Kang, Hanghang Tong, and Jimeng Sun. 2012. Fast Random Walk Graph Kernel. In SDM. 828–838.
[16] Eamonn J. Keogh, Kaushik Chakrabarti, Sharad Mehrotra, and Michael J. Pazzani. 2001. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In SIGMOD. 151–162.
[17] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[18] Lei Li, James McCann, Nancy S. Pollard, and Christos Faloutsos. 2009. DynaMMo: Mining and summarization of coevolving sequences with missing values. In KDD. ACM, 507–516.
[19] Lei Li, B. Aditya Prakash, and Christos Faloutsos. 2010. Parsimonious linear fingerprinting for time series. PVLDB 3, 1-2 (2010), 385–396.
[20] Abdullah Mueen, Suman Nath, and Jie Liu. 2010. Fast approximate correlation for massive time-series data. In SIGMOD. ACM, 171–182.
[21] Sejoon Oh, Namyong Park, Lee Sael, and U. Kang. 2018. Scalable Tucker Factorization for Sparse Tensors - Algorithms and Discoveries. In ICDE.
[22] Stanisław Osiński, Jerzy Stefanowski, and Dawid Weiss. 2004. Lingo: Search results clustering algorithm based on singular value decomposition. In Intelligent Information Processing and Web Mining. Springer, 359–368.
[23] Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos. 2005. Streaming Pattern Discovery in Multiple Time-Series. In VLDB. 697–708.
[24] Spiros Papadimitriou and Philip S. Yu. 2006. Optimal multi-scale patterns in time series streams. In SIGMOD. 647–658.
[25] John Paparrizos and Luis Gravano. 2015. k-Shape: Efficient and accurate clustering of time series. In SIGMOD. ACM, 1855–1870.
[26] Namyong Park, ByungSoo Jeon, Jungwoo Lee, and U. Kang. 2016. BIGtensor: Mining Billion-Scale Tensor Made Easy. In CIKM. 2457–2460.
[27] KV Ravi Kanth, Divyakant Agrawal, and Ambuj Singh. 1998. Dimensionality reduction for similarity searching in dynamic databases. In SIGMOD, Vol. 27. ACM, 166–176.
[28] Galen Reeves, Jie Liu, Suman Nath, and Feng Zhao. 2009. Managing massive time series streams with multi-scale compressed trickles. PVLDB 2, 1 (2009), 97–108.
[29] Attila Reiss and Didier Stricker. 2012. Introducing a New Benchmarked Dataset for Activity Monitoring. In ISWC. 108–109.
[30] David A. Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. 2008. Incremental learning for robust visual tracking. International Journal of Computer Vision 77, 1 (2008), 125–141.
[31] Lee Sael, Inah Jeon, and U. Kang. 2015. Scalable Tensor Mining. Big Data Research 2, 2 (2015), 82–86. https://doi.org/10.1016/j.bdr.2015.01.004
[32] Yasushi Sakurai, Spiros Papadimitriou, and Christos Faloutsos. 2005. BRAID: Stream mining through group lag correlations. In SIGMOD. ACM, 599–610.
[33] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2002. Incremental singular value decomposition algorithms for highly scalable recommender systems. In ICIS. 27–28.
[34] Krzysztof Simek, Krzysztof Fujarewicz, Andrzej Świerniak, Marek Kimmel, Barbara Jarzab, Małgorzata Wiench, and Joanna Rzeszowska. 2004. Using SVD and SVM methods for selection, classification, clustering and modeling of DNA microarray data. Engineering Applications of Artificial Intelligence (2004).
[35] Stephan Spiegel, Julia Gaebler, Andreas Lommatzsch, Ernesto De Luca, and Sahin Albayrak. 2011. Pattern recognition and classification for multivariate time series. In Sensor-KDD. ACM, 34–42.
[36] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast Random Walk with Restart and Its Applications. In ICDM. 613–622.
[37] Machiko Toyoda, Yasushi Sakurai, and Yoshiharu Ishikawa. 2013. Pattern discovery in data streams under the time warping distance. VLDB J. 22, 3 (2013), 295–318.
[38] Lloyd N. Trefethen and David Bau III. 1997. Numerical Linear Algebra. Vol. 50. SIAM.
[39] Michael E. Wall, Andreas Rechtsteiner, and Luis M. Rocha. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis. Springer, 91–109.
[40] Reza Bosagh Zadeh, Xiangrui Meng, Alexander Ulanov, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan R. Sparks, Aaron Staple, and Matei Zaharia. 2016. Matrix Computations and Optimization in Apache Spark. In KDD. 31–38.
[41] Yunyue Zhu and Dennis Shasha. 2002. StatStream: Statistical monitoring of thousands of data streams in real time. In VLDB. VLDB Endowment.