INTEL 04 C. Faloutsos 1
School of Computer Science, Carnegie Mellon
Sensor and Graph Mining
Christos Faloutsos
Carnegie Mellon University & IBM
www.cs.cmu.edu/~christos
Joint work with
• Anthony Brockwell (CMU/Stat)
• Deepayan Chakrabarti (CMU)
• Spiros Papadimitriou (CMU)
• Chenxi Wang (CMU)
• Yang Wang (CMU)
Outline
• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Introduction
• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data
• Embedded devices
  – Network routers
  – Intelligent (active) disks
Introduction
• Limited resources
  – Memory
  – Bandwidth
  – Power
  – CPU
• Remote environments
  – No human intervention
Introduction – problem definition
• Given a semi-infinite stream of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…
Introduction
• E.g.,
[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and ““Noise”??”]
Introduction
[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and ““Noise”??”]
• Can we capture these patterns
  – automatically
  – with limited resources?
Related work
Statistics: Time series forecasting
• Main problem:
  “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]
• Typically:
  – Resource intensive
  – Cannot update online
• AR(I)MA and seasonal variants
• ARFIMA, GARCH, …
Related work
Databases: Continuous Queries
• Typically, different focus:
  – “Compression”
  – Not generative models
• Largely orthogonal problem…
  – Gilbert, Guha, Indyk et al. (STOC 2002)
  – Garofalakis, Gibbons (SIGMOD 2002)
  – Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
  – Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
  – Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
  – Madden+ [SIGMOD02], [SIGMOD03]
Goals
• Adapt and handle arbitrary periodic components
• No human intervention/tuning
Also:
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update
Outline
• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Wavelets: “Straight” signal
[Figure: signal xt plotted against time t, split into eight consecutive time intervals I1, I2, …, I8]
Wavelets: Introduction – Haar
[Figure: Haar decomposition of the signal xt; axes: time, frequency; coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1]
Wavelets
• So?
• Wavelets compress many real signals well…
  – Image compression and processing
  – Vision; Astronomy, seismology, …
• Wavelet coefficients can be updated as new points arrive [Kotidis+]
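To make the Haar decomposition concrete, here is a minimal batch sketch in Python. It is an illustration only: the function name `haar_decompose` and the orthonormal 1/√2 scaling are assumptions, not taken from the slides, and the incremental update of [Kotidis+] is not reproduced here.

```python
import math

def haar_decompose(x):
    """Full Haar decomposition of a list whose length is a power of two.

    Returns (details, smooth): details[l] holds the detail coefficients of
    level l+1 (finest level first); smooth is the single coarsest value.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    smooth = list(x)
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        # Orthonormal Haar step: scaled difference (detail), scaled sum (smooth).
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, smooth[0]

# A piecewise-constant signal: only the coarsest detail coefficient is non-zero.
signal = [2.0, 2.0, 2.0, 2.0, 6.0, 6.0, 6.0, 6.0]
details, v = haar_decompose(signal)
```

Eight input values collapse to two non-zero numbers here, which is the sense in which wavelets "compress" a signal with simple structure.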
Wavelets: Correlations
[Figure: Haar coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1 of the signal xt; axes: time, frequency; an “=” sign marks coefficients that are equal]
Main idea: Correlations
• Wavelets are good…
• …we can do even better
  – One number…
  – …and the fact that they are equal/correlated
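Proposed method
[Figure: coefficients $W_{l,t-2}$, $W_{l,t-1}$, $W_{l,t}$ at level $l$, and $W_{l',t'-2}$, $W_{l',t'-1}$, $W_{l',t'}$ at level $l'$]
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$
Small windows suffice… (k~4)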
More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection (single-pass)
  – Not all correlations are significant
  – Throw away the insignificant ones
  – very important!!
[see paper]
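The incremental update of one linear model by recursive least squares (RLS) can be sketched as follows. This is a generic textbook RLS without a forgetting factor, not the authors' implementation; the class name `RLS`, the initial value `delta`, and the toy sequence are assumptions made for illustration.

```python
class RLS:
    """Recursive least squares for y ≈ beta · x with k coefficients."""

    def __init__(self, k, delta=1e6):
        self.beta = [0.0] * k
        # P approximates the inverse of X^T X; a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(b * xi for b, xi in zip(self.beta, x))

    def update(self, x, y):
        """Fold in one (regressors, target) pair in O(k^2) time."""
        k = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [p / denom for p in Px]
        err = y - self.predict(x)
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # P <- P - (P x)(P x)^T / denom, using the symmetry of P.
        for i in range(k):
            for j in range(k):
                self.P[i][j] -= gain[i] * Px[j]

# Toy sequence obeying w_t = w_{t-1} - w_{t-2} (period 6), standing in for
# the coefficients of one wavelet level.
w = [1.0, 1.0]
for _ in range(58):
    w.append(w[-1] - w[-2])

model = RLS(k=2)
for t in range(2, len(w)):
    model.update([w[t - 1], w[t - 2]], w[t])
```

Each update touches only the k×k matrix `P` and the k coefficients, so its cost does not depend on how many points have been seen. On this toy sequence `model.beta` should approach `[1, -1]`.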
Complexity
• Model update
  Space: $O(\lg N + m k^2) \approx O(\lg N)$
  Time: $O(k^2) \approx O(1)$
Where
  – N: number of points (so far)
  – k: number of regression coefficients; fixed
  – m: number of linear models; $O(\lg N)$
[see paper]
Outline
• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Setup
• First half used for model estimation
• Models applied forward to forecast entire second half
• AR, Seasonal AR (SAR): R
  – Simplest possible estimation – no maximum likelihood estimation (MLE), etc.
• … vs. Python scripts
Results: Synthetic data – Triangle pulse
• Triangle pulse
• AR captures wrong trend (or none)
• Seasonal AR (SAR) estimation fails
Results: Synthetic data – Mix
• Mix (sine + square pulse)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails
Results: Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
[Figure annotation: (filtered)]
Results: Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• AR fails to capture any trend (average)
• Seasonal AR estimation fails
Results: Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• AWSOM spots periodicities, automatically
Results: Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• Generation with identified noise
Results: Real data – Sunspot
• Sunspot intensity – Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
  – Captures immediate wrong downward trend
  – Requires human to determine seasonal component period (fixed)
Results: Real data – Sunspot
• Sunspot intensity – Slightly time-varying “period”
Estimation: 40 minutes (R) vs. 9 seconds (Python)
Variance
• Variance (log-power) vs. scale:
  – “Noise” diagnostic (if decreasing linear…)
  – Can use to estimate noise parameters (~Hurst exponent)
[Figure: variance (log-power) vs. scale; annotation: ~ 1 hour]
Running time
[Figure: running time (t) vs. stream size (N)]
Space requirements
Equal total number of model parameters
Conclusion
Adapt and handle arbitrary periodic components
No human intervention/tuning
Single pass over the data
Limited memory (logarithmic)
Constant-time update
Conclusion
No human:
  Adapt and handle arbitrary periodic components
  No human intervention/tuning
Limited resources:
  Single pass over the data
  Limited memory (logarithmic)
  Constant-time update
Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments
• Conclusions
Introduction
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]
► Graphs are ubiquitous
Introduction
• What can we do with graph analysis?
  – Immunization;
  – Information Dissemination
  – network value of a customer [Domingos+]
[Figure: “Needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” marked]
Problem definition
• Q1: How does a virus spread across an arbitrary network?
• Q2: will it create an epidemic?
• (in a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)
Framework
• Susceptible-Infected-Susceptible (SIS) model
  – Cured nodes immediately become susceptible
[Figure: two-state diagram; “Susceptible/healthy” → “Infected & infectious” (infected by neighbor); “Infected & infectious” → “Susceptible/healthy” (cured internally)]
The model
• (virus) Birth rate β: probability that an infected neighbor attacks
• (virus) Death rate δ: probability that an infected node heals
[Figure: node N with neighbors N1, N2, N3; legend: Infected / Healthy; infection edges labeled “Prob. β”, healing labeled “Prob. δ”]
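One time step of the SIS model with rates β and δ can be simulated as below. This is a sketch under stated assumptions: time is discrete and synchronous, a node that heals in a step cannot be reinfected within that same step, and the names `sis_step` and `star_graph` are hypothetical helpers, not from the paper.

```python
import random

def star_graph(d):
    """Adjacency lists of a star: hub 0 connected to spokes 1..d."""
    adj = {0: list(range(1, d + 1))}
    for i in range(1, d + 1):
        adj[i] = [0]
    return adj

def sis_step(adj, infected, beta, delta, rng):
    """Return the set of infected nodes after one synchronous SIS step."""
    nxt = set()
    for node in adj:
        if node in infected:
            # An infected node heals with probability delta.
            if rng.random() >= delta:
                nxt.add(node)
        else:
            # Each infected neighbour attacks independently with probability beta.
            for nb in adj[node]:
                if nb in infected and rng.random() < beta:
                    nxt.add(node)
                    break
    return nxt

rng = random.Random(0)
star = star_graph(99)
infected = {0}                      # start with only the hub infected
history = []
for _ in range(20):
    infected = sis_step(star, infected, beta=0.5, delta=0.2, rng=rng)
    history.append(len(infected))   # number of infected nodes per step
```

Plotting `history` for different ratios β/δ is one way to reproduce the qualitative "dies out" versus "persists" behaviour discussed on the threshold slides.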
Epidemic threshold
Defined as the value of τ, such that
if β/δ < τ an epidemic cannot happen
Thus,
• given a graph
• compute its epidemic threshold
Epidemic threshold
What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or determinant of the adjacency matrix?
Basic Homogeneous Model
Homogeneous graphs [Kephart-White ’91, ’93]
• Epidemic threshold = 1/<k>
• Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)
Power-law Networks
• Model for Barabási-Albert networks
  – [Pastor-Satorras & Vespignani, ’01, ’02]
  – Epidemic threshold = <k> / <k^2>
  – only for BA-type networks, with γ = 3 (γ = power-law exponent)
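The two degree-based thresholds, 1/<k> and <k>/<k^2>, need only the degree sequence of a graph. The sketch below computes both; the function names are hypothetical, and the example degree sequence is the star with one hub and 99 spokes used later in the experiments.

```python
def homogeneous_threshold(degrees):
    """Threshold 1/<k> of the homogeneous model."""
    mean_k = sum(degrees) / len(degrees)
    return 1.0 / mean_k

def power_law_threshold(degrees):
    """Threshold <k>/<k^2> of the power-law model."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(d * d for d in degrees) / n
    return mean_k / mean_k2

# Star: one hub of degree 99 and 99 spokes of degree 1.
star_degrees = [99] + [1] * 99
tau_homogeneous = homogeneous_threshold(star_degrees)   # 1 / 1.98
tau_power_law = power_law_threshold(star_degrees)       # 1.98 / 99 = 0.02
```

For this star, <k> = 1.98 and <k^2> = 99, so the two formulas give roughly 0.505 and 0.02, which shows how far apart degree-based predictions can be on the same graph.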
Epidemic threshold
• Homogeneous graphs: 1/<k>
• BA (γ = 3): <k> / <k^2>
• more complicated graphs ?
• arbitrary, REAL graphs ?
• how many parameters??
Epidemic threshold
• [Theorem] We have no epidemic, if
β/δ <τ = 1/ λ1,A
Epidemic threshold
• [Theorem] We have no epidemic, if
β/δ < τ = 1/λ1,A
where
  – β: attack prob.
  – δ: recovery prob.
  – τ: epidemic threshold
  – λ1,A: largest eigenvalue of adj. matrix A
Proof: [Wang+03]
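Given a graph, the threshold τ = 1/λ1,A can be estimated numerically. The sketch below uses plain power iteration and assumes an undirected graph stored as adjacency lists. The iteration runs on A + I rather than A (an implementation choice made here, because bipartite graphs such as stars make unshifted power iteration oscillate), and the function names are hypothetical.

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Estimate the largest eigenvalue of the adjacency matrix A.

    adj maps each node to the list of its neighbours (undirected graph).
    """
    nodes = list(adj)
    v = {u: 1.0 for u in nodes}
    lam = 0.0
    for _ in range(iters):
        # w = (A + I) v; the shift by I keeps the top eigenvector of A dominant.
        w = {u: v[u] + sum(v[nb] for nb in adj[u]) for u in nodes}
        norm = math.sqrt(sum(val * val for val in w.values()))
        v = {u: val / norm for u, val in w.items()}
        # Rayleigh quotient v^T A v for the unit vector v.
        lam = sum(v[u] * sum(v[nb] for nb in adj[u]) for u in nodes)
    return lam

def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)

# Star with one hub (node 0) and 99 spokes.
star = {0: list(range(1, 100))}
for i in range(1, 100):
    star[i] = [0]
tau_star = epidemic_threshold(star)

# Ring of 10 nodes: every node has degree 2.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
tau_ring = epidemic_threshold(ring)
```

For large sparse graphs a library eigensolver would be the more practical choice; the loop above is only meant to show that a single graph parameter is needed.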
Epidemic threshold for various networks
• sanity checks / older results:
• Homogeneous networks
  – λ1,A = <k>; τ = 1/<k>
  – where <k> = average degree
  – This is the same result as that of Kephart & White!
Epidemic threshold for various networks
• sanity checks / older results:
• Star networks
  – λ1,A = sqrt(d); τ = 1/sqrt(d)
  – where d = the degree of the central node
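The star result can be checked by hand. A short derivation, assuming by symmetry that the leading eigenvector assigns one value $a$ to the hub and one value $b$ to every spoke (the symbols $a$, $b$ are introduced here for illustration):

```latex
% Star network: hub of degree d, eigenvector entries a (hub) and b (spokes).
A x = \lambda x
\;\Longrightarrow\;
\begin{cases}
d\,b = \lambda a & \text{(hub row)}\\
a = \lambda b & \text{(spoke row)}
\end{cases}
\;\Longrightarrow\;
\lambda^2 = d,
\qquad
\lambda_{1,A} = \sqrt{d},
\qquad
\tau = \frac{1}{\lambda_{1,A}} = \frac{1}{\sqrt{d}}
```

For a hub with 99 spokes this gives τ = 1/sqrt(99) ≈ 0.1005.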
Epidemic threshold for various networks
• sanity checks / older results:
• Infinite, power-law networks
  – λ1,A = ∞; τ = 0: *any* virus has a chance! [Barabasi et al]
• Finite power-law networks
  – τ = 1/λ1,A
Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments
• Conclusions
Experiments
• 2 graphs
  – Star network: one “hub” + 99 “spokes”
  – “Oregon” Internet AS graph:
    • 10,900 nodes, 31,180 edges
    • topology.eecs.umich.edu/data.html
• More in our paper: [SRDS ’03]
Experiments (Star)
[Figure: results on the star network; legend:]
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
β/δ < τ (below threshold)
Experiments (Oregon)
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
β/δ < τ (below threshold)
Our prediction vs. previous prediction
• our predictions are more accurate
[Figure: two panels, “Oregon” and “Star”; y-axis: number of infected nodes; x-axis: β/δ; each panel is marked with “PL3” and “Our”]
Conclusions
We found an epidemic threshold
√ that applies to any network topology
√ and it depends only on one parameter of the graph
Overall conclusions
• Automatic stream mining: AWSOM
• graphs and virus propagation: eigenvalue
Ongoing / related work
• Streams
  – how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]
  – ‘network tomography’ [w/ Airoldi +]
• Graphs
  – graph partitioning [w/ Deepay+]
  – important subgraphs [w/ Tomkins + McCurley]
  – graph generators [RMAT, w/ Deepay]
Thank you!
Contact info:
christos @ cs.cmu.edu
spapadim @ cs.cmu.edu
deepay @ cs.cmu.edu
Main References
• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003.
• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.
Additional References
• Connection Subgraphs, C. Faloutsos, K. McCurley, A. Tomkins, SIAM-DM 2004 workshop on link analysis
• RMAT: A recursive graph generator, D. Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004
• iFilter: Network tomography using particle filters, Edoardo Airoldi, Christos Faloutsos (submitted)