BRAID: Stream Mining through Group Lag Correlations
Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos
SIGMOD 2005
Introduction
Lag correlations, for example:
  Higher amounts of fluoride in water → fewer dental cavities some years later
Goal: monitor multiple numerical streams, determine which pairs are correlated with a lag, and report the value of that lag
Introduction
Given k numerical sequences $X_1, \ldots, X_k$, report all pairs $X_i$ and $X_j$ such that $X_i$ follows $X_j$ with lag $l$
Introduction
In this paper, we propose BRAID, which handles data streams of semi-infinite length:
  Any-time processing, and fast
  Nimble
  Accurate
  Small resource consumption
Proposed method
Data stream $X$: $\{x_1, \ldots, x_t, \ldots, x_n\}$, where $x_n$ is the most recent value
$R(0)$: $X$ and $Y$ have the same length $n$ and zero lag
$\rho$ coefficient:
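Assuming $\rho$ denotes the standard Pearson correlation coefficient of X and Y at zero lag, it can be written as follows. The symbols $\bar{x}$ and $\bar{y}$ for the sample means are notation introduced here for illustration.

```latex
\rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}
            {\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t,
\quad
\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t
```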
Proposed method
For lag $l$, consider only the common part of $X$ and the shifted $Y$, which spans only $n - l$ time ticks
Proposed method
$R(l)$: correlation coefficient when $X$ is delayed by $l$
Score at lag $l$:
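A minimal sketch of the lag correlation in plain Python, assuming that "X is delayed by l" means pairing $x_t$ with $y_{t-l}$ over the $n - l$ overlapping ticks. The score is assumed here to be the absolute value of $R(l)$. The function names `lag_correlation` and `score` are hypothetical.

```python
import math

def lag_correlation(x, y, lag):
    """Pearson correlation between x[lag:] and y[:n-lag], i.e. X delayed by `lag`."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    if not 0 <= lag < n - 1:
        raise ValueError("lag must leave at least two overlapping ticks")
    xs = x[lag:]        # x_{l+1}, ..., x_n
    ys = y[:n - lag]    # y_1, ..., y_{n-l}
    m = n - lag         # number of overlapping ticks
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0      # a constant segment has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)

def score(x, y, lag):
    """Assumed score: absolute value of the lag correlation."""
    return abs(lag_correlation(x, y, lag))
```

For example, if `x` is a copy of `y` delayed by two ticks, `score(x, y, 2)` is 1 up to rounding.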
Proposed method
For large lags $l \approx n$, the original and shifted time sequences have too few overlapping points for $R(l)$
Restrict the maximum lag $m$ to $n/2$
Proposed method
Naive solution:
  At time $n$, access all values of $X$ and $Y$ and compute $R(l)$ for all lags $l$ (= 0, 1, …)
  Choose the earliest maximum of the score above the threshold $r$, or report that there is no lag
The proposed solution is based on three major steps
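The naive solution can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the score is taken to be $|R(l)|$, "earliest max" is read as the earliest local maximum of the score that exceeds the threshold, and the names `naive_detect_lag` and `threshold` are hypothetical.

```python
import math

def lag_correlation(x, y, lag):
    """Pearson correlation between x[lag:] and y[:n-lag]."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    m = len(xs)
    if m < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def naive_detect_lag(x, y, threshold):
    """Scan every lag 0..n/2 and return the earliest local maximum of the
    score above `threshold`, or None if there is none.
    Each call touches all values, so the cost is O(n) per lag."""
    n = len(x)
    max_lag = n // 2    # lags beyond n/2 overlap too little
    scores = [abs(lag_correlation(x, y, l)) for l in range(max_lag + 1)]
    for l, s in enumerate(scores):
        left_ok = l == 0 or s >= scores[l - 1]
        right_ok = l == max_lag or s >= scores[l + 1]
        if s > threshold and left_ok and right_ok:
            return l
    return None
```

With a single impulse in `y` and the same impulse three ticks later in `x`, the scan reports lag 3; raising the threshold above 1 makes it report no lag.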
Proposed method
Need some sufficient statistics so that $R$ can be computed easily:
  $S_x(1,n) = \sum_{t=1}^{n} x_t$: sum of $X$ over length $n$
  $S_{xx}(1,n) = \sum_{t=1}^{n} x_t^2$: sum of squares of $X$ over length $n$
  $S_{xy}(l) = \sum_{t=l+1}^{n} x_t y_{t-l}$: sum of products of $X$ and $Y$ shifted by $l$
Proposed method
$R(l)$ is obtained:
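One way to obtain $R(l)$ from the sums alone is the usual algebraic expansion of the Pearson coefficient: covariance $= S_{xy}(l) - S_x(l+1,n)\,S_y(1,n-l)/(n-l)$, and each variance $= S_{xx} - S_x^2/(n-l)$ over the matching range. The sketch below assumes this expansion; the function names and argument order are hypothetical.

```python
import math

def stats_for_lag(x, y, lag):
    """Sums over the overlapping part: x_{l+1..n} paired with y_{1..n-l}."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    sx = sum(xs)
    sxx = sum(a * a for a in xs)
    sy = sum(ys)
    syy = sum(b * b for b in ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    return sx, sxx, sy, syy, sxy

def r_from_stats(n, lag, sx, sxx, sy, syy, sxy):
    """Correlation at `lag` computed only from the five sums."""
    m = n - lag                   # number of overlapping ticks
    cov = sxy - sx * sy / m
    var_x = sxx - sx * sx / m
    var_y = syy - sy * sy / m
    if var_x <= 0 or var_y <= 0:
        return 0.0                # constant segment or rounding noise
    return cov / math.sqrt(var_x * var_y)
```

For instance, `r_from_stats(5, 0, *stats_for_lag([1, 2, 3, 4, 5], [2, 4, 6, 8, 10], 0))` evaluates to 1.0, since the two sequences are perfectly linearly related.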
Proposed method
$R(l)$ can be estimated at any point in time; we only need to keep track of five sufficient statistics
It still needs linear time to compute the cross-correlation function between two sequences
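A sketch of how the five statistics for one fixed lag could be maintained incrementally, assuming they are the sums $S_x$, $S_{xx}$, $S_y$, $S_{yy}$ and $S_{xy}$ over the overlapping part. The class name `LagTracker` is hypothetical. Note that the tracker has to buffer the last $l + 1$ values of Y to reach $y_{n-l}$.

```python
import math
from collections import deque

class LagTracker:
    """Incrementally tracks the five sums needed for R(lag)."""

    def __init__(self, lag):
        self.lag = lag
        self.recent_y = deque(maxlen=lag + 1)   # holds y_{n-l}, ..., y_n
        self.count = 0                          # overlapping ticks, n - l
        self.sx = self.sxx = self.sy = self.syy = self.sxy = 0.0

    def update(self, x_t, y_t):
        """Consume one new tick from each stream in O(1) time."""
        self.recent_y.append(y_t)
        if len(self.recent_y) <= self.lag:
            return                              # no overlap yet
        y_old = self.recent_y[0]                # y_{n-l}, paired with x_n
        self.count += 1
        self.sx += x_t
        self.sxx += x_t * x_t
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        """Current estimate of R(lag); 0.0 when undefined."""
        m = self.count
        if m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx ** 2 / m
        var_y = self.syy - self.sy ** 2 / m
        if var_x <= 0 or var_y <= 0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)
```

Tracking every lag up to $n/2$ this way would still need one tracker per lag, which is the linear cost noted above.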
Proposed method
Propose to keep track of only a geometric progression of the lag values: $l = 0, 1, 2, \ldots, 2^i, \ldots$
Only $O(\log n)$ numbers to keep track of, instead of the $O(n)$ that the naive solution requires
However, the space required still grows linearly with the length $n$
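The probed lags can be enumerated with a few lines of Python. This is only an illustration; the function name `geometric_lags` is hypothetical.

```python
def geometric_lags(max_lag):
    """Lags 0, 1, 2, 4, ..., 2^i that do not exceed max_lag."""
    lags = [0]
    l = 1
    while l <= max_lag:
        lags.append(l)
        l *= 2
    return lags
```

For a stream of length $n = 2^{20}$ with maximum lag $m = n/2 = 2^{19}$, `geometric_lags(2**19)` contains 21 lags, compared with the $2^{19} + 1$ lags the naive scan would probe.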
Proposed method
In order to compute $R(l)$ at any time, we keep a sliding window of size $l$; with $m = n/2$ this needs $O(n)$ space
Instead of operating only on the original time sequences, also compute their smoothed versions by computing the means of non-overlapping windows
Proposed method
Window size: powers of $g = 2$
  $X$: original time sequence
  $A_x^h$: smoothed version with windows of length $2^h$
  $A_x^0$: the original sequence; $A_x^1$ consists of $n/2$ ticks; etc.
The sufficient statistics of $A_x^h$ need to be computed only every $2^h$ time ticks
At time $n$, $O(\log n)$ levels are needed, and the sufficient statistics are computed for each level
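The smoothing hierarchy can be illustrated with a batch computation. This is a sketch only: the streaming algorithm would update each level incrementally, whereas the hypothetical `smooth_levels` below recomputes everything from a stored list. Averaging adjacent pairs of level $h - 1$ gives the means of non-overlapping windows of length $2^h$, because all windows at a level have equal size.

```python
def smooth_levels(x):
    """Return [A^0, A^1, ...], where A^h holds the means of
    non-overlapping windows of length 2^h over x."""
    levels = [list(x)]                  # level 0 is the original sequence
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        # Average adjacent pairs; an unpaired trailing value is dropped
        # until its window is complete.
        nxt = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        levels.append(nxt)
    return levels
```

For `x = [1, 3, 5, 7, 2, 4, 6, 8]` this yields four levels with 8, 4, 2 and 1 values, so the number of levels grows as $\log_2 n$.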
Proposed method
In contrast with the small lags, the larger ones are probed sparsely
Use a cubic spline to interpolate the missing correlation coefficients
Proposed method
$A_x^h(t)$: window average at time tick $t$ for level $h$
$A_x^0(t) \equiv x_t$
Proposed method
Sufficient statistics:
Enhanced BRAID
For two sequences of size $\approx 2^{20}$, BRAID requires about $5 \cdot \log_2 2^{20} = 5 \cdot 20 = 100$ floating-point numbers, about 800 bytes
When more memory is available, we propose a solution that probes more lags while still using $O(\log n)$ space
Use a mix of arithmetic and geometric probing
Enhanced BRAID
BRAID uses only one window at each smoothing level
Propose to use $b > 1$ windows instead, for example $b = 4$
The algorithm is the same as before (where $b = 1$), with the exception that the bottom level has $2b$ coefficients
While computing $R(l)$, use a mixture of geometric and arithmetic progressions:
Enhanced BRAID
Example of enhanced BRAID with $b = 4$
With $b = 1$, the enhanced algorithm is equal to the algorithm described before
Conclusion
Proposed BRAID to detect lag correlations on streaming data:
  At any time
  Low resource consumption
  High accuracy
Thank you very much~