CMU SCS
Data Mining Meets Systems:Tools and Case Studies
Christos Faloutsos
SCS CMU
PDL 2008 C. Faloutsos #2
Thanks
Spiros Papadimitriou (CMU->IBM)
Mengzhi Wang (CMU->Google)
Jimeng Sun (CMU -> IBM)
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank
Problem #1:
Goal: given a signal (e.g., #bytes over time)
Find: patterns, periodicities, and/or compress
[Plot: #bytes vs. time; bytes per 30’ (also: packets per day; earthquakes per year)]
Problem #1
• model bursty traffic
• generate realistic traces
• (Poisson does not work)
[Plot: # bytes vs. time for Poisson traffic]
Motivation
• predict queue length distributions (e.g., to give probabilistic guarantees)
• “learn” traffic, for buffering, prefetching, ‘active disks’, web servers
Q: any ‘pattern’?
• Not Poisson
• spike; silence; more spikes; more silence…
• any rules?
[Plot: # bytes vs. time]
solution: self-similarity
[Two plots: # bytes vs. time]
But:
• Q1: How to generate realistic traces; extrapolate?
• Q2: How to estimate the model parameters?
Approach
• Q1: How to generate a sequence that is
– bursty
– self-similar
– and has similar queue length distributions
• A: ‘binomial multifractal’ [Wang+02]
• ~ 80-20 ‘law’:
– 80% of bytes/queries etc. on first half
– repeat recursively
• b: bias factor (e.g., 80%)
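The recursive 80-20 split can be sketched in a few lines of Python. This is a minimal illustration, not the code from [Wang+02]: the function name `b_model`, its parameters, and the optional random choice of which half gets the larger share are all assumptions made for the example.

```python
import random

def b_model(total, levels, b=0.8, randomize=True, seed=0):
    """Spread `total` bytes over 2**levels time ticks with bias factor b.

    At every level each interval is cut in half; one half receives a
    fraction b of the interval's volume and the other half gets 1 - b.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    trace = [float(total)]
    for _ in range(levels):
        next_trace = []
        for volume in trace:
            # Assumption: optionally flip a fair coin to decide which half
            # is the heavy one; with randomize=False it is always the first.
            heavy_first = (not randomize) or rng.random() < 0.5
            first = volume * (b if heavy_first else 1.0 - b)
            next_trace.extend([first, volume - first])
        trace = next_trace
    return trace

if __name__ == "__main__":
    # 8 ticks, heavy half always first: largest tick holds 0.8**3 of the bytes
    print(b_model(1000, 3, b=0.8, randomize=False))
```

With `randomize=False` the output follows the slide literally (80% always on the first half); the randomized variant keeps the same volumes but scatters the bursts over time.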
binary multifractals
[Figure: recursive 20/80 split]
Parameter estimation
• Q2: How to estimate the bias factor b?
• A: MANY ways [Crovella+96]
– Hurst exponent
– variance plot
– even DFT amplitude spectrum! (‘periodogram’)
– More robust: ‘entropy plot’ [Wang+02]
Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002
Entropy plot
• Rationale:
– burstiness: inverse of uniformity
– entropy measures uniformity of a distribution
– find entropy at several granularities, to see whether/how our distribution is close to uniform.
Entropy plot
• Entropy E(n) after n levels of splits
• n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
[Figure: trace split into two halves; p1, p2 = % of bytes in each half]
• n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$
[Figure: trace split into four quarters, with fractions p2,1 p2,2 p2,3 p2,4]
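The definition of E(n) can be turned into code directly. The sketch below is an illustration under assumptions: the function name `entropy_plot` is made up for the example, the trace length is assumed to be a power of two, and empty cells are treated as contributing zero entropy.

```python
import math

def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(n)] for a trace with 2**n ticks.

    E(k) is the entropy (in bits) of the fractions of total volume that
    fall in each of the 2**k equal-width cells after k levels of splits.
    """
    n_ticks = len(trace)
    total = float(sum(trace))
    if n_ticks == 0 or n_ticks & (n_ticks - 1) or total <= 0:
        raise ValueError("need a power-of-two length and a positive total")
    levels = n_ticks.bit_length() - 1
    entropies = []
    for level in range(levels + 1):
        cells = 2 ** level
        width = n_ticks // cells
        entropy = 0.0
        for i in range(cells):
            p = sum(trace[i * width:(i + 1) * width]) / total
            if p > 0:  # empty cells contribute nothing
                entropy -= p * math.log2(p)
        entropies.append(entropy)
    return entropies

if __name__ == "__main__":
    print(entropy_plot([1] * 8))          # uniform trace
    print(entropy_plot([64, 16, 16, 4]))  # two levels of an 80/20 split
```

Plotting the returned values against the level number gives the entropy plot; a uniform trace gains exactly one bit per level.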
Real traffic
• Has linear entropy plot (-> self-similar)
[Plot: entropy E(n) vs. # of levels (n); slope 0.73]
Observation - intuition:
slope = intrinsic dimensionality =~ ‘degrees of freedom’ or info-bits per coordinate-bit
– unif. dataset: slope = 1
– multi-point: slope = 0
[Plot: entropy E(n) vs. # of levels (n); slope 0.73]
Some more entropy plots:
• Poisson vs real
Poisson: slope = ~1 -> uniformly distributed
[Two entropy plots: slope 1 (Poisson) and slope 0.73 (real)]
B-model
• b-model traffic gives perfectly linear plot
• Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$
• Fitting: do entropy plot; get slope; solve for b
[Plot: E(n) vs. n]
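The fitting recipe (entropy plot, slope, solve for b) might look like the sketch below. The slides do not say how the slope is measured or how the lemma is inverted, so the least-squares fit and the bisection are assumptions, as are the function names. Since b and 1-b give the same slope, the solver returns the root with b >= 0.5.

```python
import math

def slope_from_bias(b):
    """Entropy-plot slope of a b-model: -b log2 b - (1-b) log2 (1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)

def fit_slope(entropies):
    """Least-squares slope of E(n) against n = 0, 1, 2, ..."""
    n = len(entropies)
    if n < 2:
        raise ValueError("need at least two levels")
    mean_x = (n - 1) / 2.0
    mean_y = sum(entropies) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(entropies))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def bias_from_slope(slope, tol=1e-12):
    """Invert the lemma by bisection on [0.5, 1], where the slope decreases."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope must lie in [0, 1]")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if slope_from_bias(mid) > slope:
            lo = mid  # slope still too high: b must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    print(bias_from_slope(0.73))  # lands between 0.79 and 0.80
```

A slope of 1 maps back to b = 0.5 (no bias, uniform traffic), and a slope of 0 maps to b = 1 (everything in one tick).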
Experimental setup
• Disk traces (from HP [Wilkes 93])
• web traces from LBL: http://repository.cs.vt.edu/lbl-conn-7.tar.Z
Model validation
• Linear entropy plots
Bias factors b: 0.6-0.8
smallest b / smoothest: nntp traffic
Web traffic - results
• LBL, NCDF of queue lengths (log-log scales)
[Plot: Prob( > l) vs. queue length l]
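A curve like Prob(> l) can be computed from any trace once a queueing model is fixed. The slides do not spell out the model, so the sketch below assumes a single queue with unlimited buffer drained at a constant rate per tick; the function names and the `service_rate` parameter are hypothetical.

```python
def queue_lengths(arrivals, service_rate):
    """Queue length after each tick: q <- max(0, q + arrivals - service)."""
    queue = 0.0
    history = []
    for arrived in arrivals:
        queue = max(0.0, queue + arrived - service_rate)
        history.append(queue)
    return history

def prob_exceeds(values, thresholds):
    """Empirical Prob(value > l) for each threshold l."""
    n = len(values)
    return [sum(1 for v in values if v > l) / n for l in thresholds]

if __name__ == "__main__":
    q = queue_lengths([5, 0, 0, 3, 0], service_rate=2)
    print(q)
    print(prob_exceeds(q, [0, 1, 2]))
```

Running the same two functions on a real trace and on a synthetic trace lets the two curves be compared on log-log axes.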
Conclusions
• Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic
Books
• Fractals: Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
Clusters/data center monitoring
• Monitor correlations of multiple measurements
• Automatically flag anomalous behavior
• InteMon: intelligent monitoring system
– warsteiner.db.cs.cmu.edu/demo/intemon.jsp
Publication
Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006
Under the hood: SVD
• Singular Value Decomposition
• Done incrementally
Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos Streaming Pattern Discovery in Multiple Time-Series VLDB 2005, Trondheim, Norway.
Singular Value Decomposition (SVD)
• SVD (~LSI ~ KL ~ PCA ~ spectral analysis...)
LSI: S. Dumais; M. Berry
KL: e.g., Duda+Hart
PCA: e.g., Jolliffe
Details: [Press+]
[Scatter plot: u of CPU1 vs. u of CPU2, one point per time tick (t=1, t=2, ...)]
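For the two-CPU picture, the first singular direction is the line through the origin that best fits the cloud of points. The sketch below finds it with plain power iteration on a small batch of rows; it is not the incremental streaming method of the VLDB 2005 paper, and the function name and sample data are made up for the example.

```python
import math

def first_singular_direction(rows, iterations=100):
    """Dominant right singular vector and singular value of a T x d matrix.

    `rows` holds one measurement vector per time tick, e.g. (CPU1, CPU2).
    """
    d = len(rows[0])
    # Gram matrix G = A^T A (d x d); its top eigenvector is the direction
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
            for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero data: keep the initial direction
        v = [x / norm for x in w]
    # Singular value = length of the data projected on the direction
    sigma = math.sqrt(sum(sum(r[j] * v[j] for j in range(d)) ** 2
                          for r in rows))
    return v, sigma

if __name__ == "__main__":
    # Hypothetical readings where CPU2 is always twice CPU1
    print(first_singular_direction([(1, 2), (2, 4), (3, 6)]))
```

When the measurements are strongly correlated, a single direction captures almost all of the energy, and points that fall far from it are candidates for anomalies.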
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
BGP updates
With
• Aditya Prakash (CMU)
• Michalis Faloutsos (UC Riverside)
• Nicholas Valler (UC Riverside)
• Dave Andersen (CMU)
Tool #0: Time plot
[Plot: time series of #updates per 600s, Washington Router, 09/2004-09/2006]
• Observation #1: Missing values
• Observation #2: Bursty
Tool #1: Wavelets
Wavelets - DWT
• Short window Fourier transform (SWFT)
• But: how short should the window be?
[Plots: value vs. time; freq vs. time]
Wavelets - DWT
• Answer: multiple window sizes! -> DWT
[Figure: freq vs. time panels for time domain, DFT, SWFT, and DWT]
Haar Wavelets
• subtract sum of left half from right half
• repeat recursively for quarters, eighths, ...
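The two bullets above translate almost literally into code. The sketch below is unnormalized (the usual Haar transform also rescales each level), assumes a power-of-two length, and uses a made-up function name.

```python
def haar_differences(signal):
    """Haar-style detail coefficients, coarsest level first.

    Level 0 is (sum of right half) - (sum of left half) of the whole
    signal; level 1 repeats this inside each half, and so on down to
    pairs of adjacent values.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = []
    block = n
    while block >= 2:
        half = block // 2
        level = []
        for start in range(0, n, block):
            left = sum(signal[start:start + half])
            right = sum(signal[start + half:start + block])
            level.append(right - left)
        levels.append(level)
        block = half
    return levels

if __name__ == "__main__":
    print(haar_differences([2, 2, 0, 2, 3, 5, 4, 4]))
    # -> [[10], [-2, 0], [0, 2, 2, 0]]
```

Each inner list corresponds to one window size, which is the "multiple window sizes" idea: coarse levels see slow trends, fine levels see individual spikes.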
‘Tornado Plot’ for Washington Router: Dark areas correspond to high energy
[Plot axes: time; frequency (low freq. to high freq.)]
Tornado Plot: Wavelet Transform for Washington Router 09/2004-09/2006, All coefficients and Detail levels 1-12
Observations:
1. Obvious Spikes (E1): tornados that “touch down”
2. Prolonged Spikes (E2 and E3): when coarser scales have high values but finer scales do not
3. Intermittent Waves (E4 and E5): High-energy entries at nearby scales correspond to local periodic motion
E2: Prolonged Spike
Sustained period of relatively high activity
[Plot: # updates vs. time; magnification of updates on 28th Aug. 2005]
Tool #2: logarithms
Prominent ‘clothesline’ at ~ 50 updates per 600 secs.
Culprit IP addresses:
192.211.42.0/24
216.109.38.0/24
207.157.115.0/24
All from Alabama (Supercomputing Center)!
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank
Main point
Two-way street:
<- DM can use such infrastructures to find patterns
-> DM can help such systems/networks etc to become self-healing, self-adjusting, ‘self-*’
Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes
Additional resources
• Machine learning classes at SCS/MLD
• Tom Mitchell’s book on Machine Learning
– Classification
– Clustering/Anomaly detection
– Support vector machines
– Graphical models
– Bayesian networks
– <etc etc>
www.cs.cmu.edu/~christos
For code, papers etc
WeH 7107 christos <at> cs