braid: discovering lag correlations in multiple streams

59
BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos (Carnegie Mellon Univ.)

Upload: abril

Post on 20-Jan-2016

23 views

Category:

Documents


0 download

DESCRIPTION

BRAID: Discovering Lag Correlations in Multiple Streams. Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos (Carnegie Mellon Univ.). Motivation. Data-stream applications Network analysis Sensor monitoring Financial data analysis - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: BRAID: Discovering Lag Correlations in Multiple Streams

BRAID: Discovering Lag Correlations in Multiple Streams

Yasushi Sakurai (NTT Cyber Space Labs)

Spiros Papadimitriou (Carnegie Mellon Univ.)

Christos Faloutsos (Carnegie Mellon Univ.)

Page 2: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 2

Motivation

Data-stream applications Network analysis Sensor monitoring Financial data analysis Moving object tracking

Goal Monitor multiple numerical streams Determine which pairs are correlated with lags Report the value of each such lag (if any)

Page 3: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 3

Lag Correlations Examples

A decrease in interest rates typically precedes an increase in house sales by a few months

Higher amounts of fluoride in the drinking water leads to fewer dental cavities, some years later

High CPU utilization on server 1 precedes high CPU utilization for server 2 by a few minutes

Page 4: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 4

Lag Correlations Example of lag-correlated sequences

These sequences are correlated with lag l=1300 time-ticks

CCF (Cross-Correlation Function)

Page 5: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 5

Lag Correlations

CCF (Cross-Correlation Function)

Example of lag-correlated sequences Fast

(high performance) Nimble

(Low memory

consumption) Accurate

(good approximation)

Page 6: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 6

Problem #1: PAIR of sequences For given two co-evolving sequences X and

Y, determine Whether there is a lag correlation If yes, what is the lag length l

Any time, on semi-infinite streams

? yes;l = 1,300

X

Y

Page 7: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 7

Problem #2: k-way

For given k numerical sequences, X1,…,Xk , report Which pairs (if any) have a lag correlation The corresponding lag for such pairs

again, ‘any time’, streaming fashion

? X1 and X2; l = 1,300...

X1

...

X2

Xk

Page 8: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 8

Our solution, BRAID

characteristics: ‘Any-time’ processing, and fast

Computation time per time tick is constant Nimble

Memory space requirement is sub-linear of sequence length

AccurateApproximation introduces small error

Page 9: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 9

Sequence indexing Agrawal et al. (FODO 1993) Faloutsos et al. (SIGMOD 1994) Keogh et al. (SIGMOD 2001)

Compression (wavelet and random projections) Gilbert et al. (VLDB 2001) Guha et al. (VLDB 2004) Dobra et al.(SIGMOD 2002) Ganguly et al.(SIGMOD 2003)

Related Work

Page 10: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 10

Data Stream Management Abadi et al. (VLDB Journal 2003) Motwani et al. (CIDR 2003) Chandrasekaran et al. (CIDR 2003) Cranor et al. (SIGMOD 2003)

Related Work

Page 11: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 11

Related Work Pattern discovery

Clustering for data streamsGuha et al. (TKDE 2003)

Monitoring multiple streamsZhu et al. (VLDB 2002)

ForecastingYi et al. (ICDE 2000)Papadimitriou et al. (VLDB 2003)

None of previously published methods focuses on the problem

Page 12: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 12

Overview

Introduction / Related work Background Main ideas Theoretical analysis Experimental results

Page 13: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 13

Background

CCF (Cross-Correlation Function)

positively correlated

un-correlated

+

anti-correlated(lower than -)

Lag correlation

Lag

Cor

rela

tion

Page 14: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 14

Background

Definition of ‘score’, the absolute value of R(l)

Lag correlation Given a threshold , A local maximum The earliest such maximum, if more maxima exist

)()( lRlscore

ln

t t

n

lt t

n

lt ltt

yyxx

yyxxlR

1

2

1

2

1

)()(

))(()(

)(lscore

details

Page 15: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 15

Overview

Introduction / Related work Background Main ideas Theoretical analysis Experimental results

Page 16: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 16

Why not ‘naive’? Naive solution:

Compute correlation coefficient for each lag

l = 0, 1, 2, 3, …, n/2 But,

O(n) space O(n2) time

or O(n log n) time w/ FFT

t=nTimeLag

Cor

rela

tion

n/20

Page 17: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 17

Main Idea (1)

Incremental computing: the correlation coefficient of two sequences is

‘algebraic’ -> can be computed incrementally

we need to maintain only 6 ‘sufficient statistics’: Sequence length n Sum of X, Square sum of X Sum of Y, Square sum of Y Inner-product for X and the shifted Y

Page 18: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 18

Main Idea (1) Incremental computing:

Sequence length n Sum of X : Square sum of X : Inner-product for X and the shifted Y :

Compute R(l) incrementally:

Covariance of X and Y:

Variance of X:

n

lt ltt yxlSxy1

)(

n

t txnSx1

),1(

n

t txnSxx1

2),1(

),1(),1(

)()(

lnVynlVx

lClR

ln

lnSynlSxlSxylC

),1(),1(

)()(

ln

nlSxnlSxxnlVx

2)),1((

),1(),1(

details

Page 19: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 19

Main Idea (1)

Complexity

Naive Naive

(incremental)

BRAID

Space O(n) O(n)

Comp. time O(n log n) O(n)

Better, but not good enough!

Page 20: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 20

Main Idea (2)

Lag

Cor

rela

tion

Geometric lag probing

Page 21: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 21

Main Idea (2)

0 1 2 4 8

Lag

Cor

rela

tion

Geometric lag probing ie., compute the correlation coefficient for lag:

l = 0, 1, 2, 4, ... 2h

O(log n) estimations

Page 22: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 22

Main Idea (2)

Geometric lag probing

But, so far, we still need O(n) space because the longest lag is n/2

Naive Naive

(incremental)

BRAID

Space O(n) O(n)

Comp. time O(n log n) O(n) O(log n)

Page 23: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 23

Main Idea (3)

Lag

Cor

rela

tion

Sequence smoothing

t=nTime

Reminder: Naïve:

Page 24: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 24

Main Idea (3)

Lag

Cor

rela

tion

Level

h=0t=nTime

Sequence smoothing Means of windows for each level Sufficient statistics computed from the means CCF computed from the sufficient statistics But, it allows a partial redundancy

Page 25: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 25

Putting it all together:

Lag

Cor

rela

tion

Level

h=0t=nTime

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…}

Page 26: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 26

Putting it all together:

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…}

Lag

Cor

rela

tion

Level

h=0t=nTime

h=0Y

X

l=0

Page 27: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 27

Putting it all together:

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…}

Lag

Cor

rela

tion

Level

h=0t=nTime

h=0Y

X

l=1

Page 28: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 28

Putting it all together:

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…}

Lag

Cor

rela

tion

Level

h=1th=n/2Time

h=1Y

X

l=2

Page 29: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 29

Putting it all together:

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…}

Lag

Cor

rela

tion

Level

h=2Time

h=2Y

Xth=n/4

l=4

Page 30: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 30

Putting it all together:

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…}

Lag

Cor

rela

tion

Level

h=3Time

h=3Y

Xth=n/8

l=8

Page 31: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 31

Putting it all together:

Lag

Cor

rela

tion

Level

h=0t=nTime

Geometric lag probing + smoothing Use colored windows Keep track of only a geometric progression of the

lag values: l={0,1,2,4,8,…,2h,…} Use a cubic spline to interpolate

Page 32: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 32

Thus:

Complexity

Naive Naive

(incremental)

BRAID

Space O(n) O(n) O(log n)

Comp. time O(n log n) O(n) O(1) *

(*) Computation time: O(logn)And actually, amortized time: O(1)

Page 33: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 33

Overview

Introduction / Related work Background Main ideas

enhancing the accuracy Theoretical analysis Experimental results

details

Page 34: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 34

Enhanced Probing Scheme

Q: How to probe more densely than 2h ?

Lag

Cor

rela

tion

Level

h=0t=nTime

Page 35: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 35

Enhanced Probing Scheme

Q: How to probe more densely than 2h ? A: probe in a mixture of geometric and arithmetic

progressions

Lag

Cor

rela

tion

Level

h=0t=nTime

Page 36: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 36

Enhanced Probing Scheme

Basic scheme: b=1 (one number for each level) Enhanced scheme: b>1

Example of b=4 Probing the CCF in a mixture of geometric and arithmetic

progressions: l={0,1,…,7;8,10,12,14;16,20,24,28;32,40,…}

Level

h=0

Time t=n

Cor

rela

tion

Lag

step:1 step: 2 step: 4

Page 37: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 37

Overview

Introduction / Related work Background Main ideas Theoretical analysis Experimental results

Page 38: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 38

Theoretical Analysis - Accuracy Effect of smoothing

Effect of geometric lag probing

For sequences with low frequencies, smoothing introduces only small error

BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)

Page 39: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 39

Effect of geometric lag probing Informally, BRAIDS will provide no error, if lag

probing satisfies the sampling theorem (Nyquist’s) Formally: Theorem 2

fR: the Nyquist frequency of CCF, fR=min(fx, fy)

fx, fy: the Nyquist frequencies of X and Y

Theoretical Analysis - Accuracy

BRAID will find the lag correlations perfectly, if

Rf

bl

20

details

Page 40: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 40

Theoretical Analysis - ComplexityNaive solution

O(n) space

O(n) time per time tick

BRAID O(log n) space

O(1) time for updating sufficient statistics

O(log n) time for interpolating (when output is required)

details

Page 41: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 41

Overview

Introduction / Related work Background Main ideas Theoretical analysis Experimental results

Page 42: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 42

Experimental results

Setup Intel Xeon 2.8GHz, 1GB memory, Linux Datasets:

Synthetic: Sines, SpikeTrains,

Real: Humidity, Light, Temperature, Kursk, Sunspots Enhanced BRAID, b=16

Page 43: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 43

Experimental results

Evaluation Accuracy for CCF Accuracy for the lag estimation Computation time k-way lag correlations

Page 44: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 44

Accuracy for CCF (1) Sines

CCF (Cross-Correlation Function)

BRAID perfectly estimates the correlation coefficients

of the sinusoidal wave

Page 45: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 45

Accuracy for CCF (2) SpikeTrains

CCF (Cross-Correlation Function)

BRAID closely estimates the correlation coefficients

Page 46: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 46

Accuracy for CCF (3) Humidity (Real data)

CCF (Cross-Correlation Function)

BRAID closely estimates the correlation coefficients

Page 47: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 47

Accuracy for CCF (4) Light (Real data)

CCF (Cross-Correlation Function)

BRAID closely estimates the correlation coefficients

Page 48: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 48

Accuracy for CCF (5) Kursk (Real data)

CCF (Cross-Correlation Function)

BRAID closely estimates the correlation coefficients

Page 49: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 49

Accuracy for CCF (6) Sunspots (Real data)

CCF (Cross-Correlation Function)

BRAID closely estimates the correlation coefficients

Page 50: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 50

Experimental results

Evaluation Accuracy for CCF Accuracy for the lag estimation Computation time k-way lag correlations

Page 51: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 51

Estimation Error of Lag Correlations

Largest relative error is about 1%

Datasets

Lag correlation Estimation

error (%)Naive BRAID

Sines 716 716 0.000

SpikeTrains 2841 2830 0.387

Humidity 3842 3855 0.338

Light 567 570 0.529

Kursk 1463 1472 0.615

Sunspots 1156 1168 1.038

Page 52: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 52

Experimental results

Evaluation Accuracy for CCF Accuracy for the lag estimation Computation time k-way lag correlations

Page 53: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 53

Computation time

Reduce computation time dramatically Up to 40,000 times faster

Page 54: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 54

Experimental results

Evaluation Accuracy for CCF Accuracy for the lag estimation Computation time k-way lag correlations

Page 55: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 55

Group Lag Correlations 55 Temperature sequences Two correlated pairs

Estimation of CCF of #16 and #19 Estimation of CCF of #47 and #48

#16 #19 #47 #48

Page 56: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 56

Conclusions Automatic lag correlation detection on data

stream 1. ‘Any-time’ 2. Nimble

O(log n) space, O(1) time to update the statistics

3. Fast Up to 40,000 times faster than the naive

implementation

4. Accurate within 1% relative error or less

Page 57: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 57

Effect of geometric lag probing Informally, BRAIDS will provide no error, if lag

probing satisfies the sampling theorem (Nyquist’s) Formally: Theorem 2

fR: the Nyquist frequency of CCF, fR=min(fx, fy)

fx, fy: the Nyquist frequencies of X and Y

Theoretical Analysis - Accuracy

BRAID will find the lag correlations perfectly, if

Rf

bl

20

details

Page 58: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 58

Effect of Probing

Dataset: Sines Lag correlation with b=1 lR=1024

Page 59: BRAID: Discovering Lag Correlations in Multiple Streams

SIGMOD 2005 Y. Sakurai et al 59

Effect of Probing

Dataset: Light Lag correlation with b=1 lR=630