thank you! data mining using fractals and power laws...
TRANSCRIPT
-
1
Waterloo, 2006 C. Faloutsos 1
School of Computer Science
Carnegie Mellon
Data Mining using Fractals and
Power laws
Christos Faloutsos
Carnegie Mellon University
Waterloo, 2006 C. Faloutsos 2
School of Computer Science
Carnegie Mellon
THANK YOU!
Prof. Ed Chan
Debbie Mustin
Waterloo, 2006 C. Faloutsos 3
School of Computer Science
Carnegie Mellon
Thanks to
Deepayan Chakrabarti (CMU/Yahoo)
Michalis Faloutsos (UCR)
George Siganos (UCR)
Waterloo, 2006 C. Faloutsos 4
School of Computer Science
Carnegie Mellon
Overview
Goals/ motivation: find patterns in large
datasets:
(A) Sensor data
(B) network/graph data
Solutions: self-similarity and power laws
Discussion
Waterloo, 2006 C. Faloutsos 5
School of Computer Science
Carnegie Mellon
Applications of sensors/streams
Smart house: monitoring temperature,
humidity etc
Financial, sales, economic series
Waterloo, 2006 C. Faloutsos 6
School of Computer Science
Carnegie Mellon
Applications of sensors/streams
Smart house: monitoring temperature,
humidity etc
Financial, sales, economic series
Tamer; Ihab
-
2
Waterloo, 2006 C. Faloutsos 7
School of Computer Science
Carnegie Mellon
Motivation - Applications
Medical: ECGs +; blood
pressure etc monitoring
Scientific data: seismological;
astronomical; environment /
anti-pollution; meteorologicalSunspot Data
0
50
100
150
200
250
300
Waterloo, 2006 C. Faloutsos 8
School of Computer Science
Carnegie Mellon
Motivation - Applications
(contd)
civil/automobile infrastructure
bridge vibrations [Oppenheim+02]
road conditions / traffic monitoring
Automobile traffic
0200400600800
100012001400160018002000
time
# cars
Waterloo, 2006 C. Faloutsos 9
School of Computer Science
Carnegie Mellon
Motivation - Applications
(contd)
Computer systems
web servers (buffering, prefetching)
network traffic monitoring
...
0
200000
400000
600000
800000
1e+06
1.2e+06
0 2e+06 4e+06 6e+06 8e+06 1e+07
num
ber
of K
b
time (in milliseconds)
http://repository.cs.vt.edu/lbl-conn-7.tar.Z
Waterloo, 2006 C. Faloutsos 10
School of Computer Science
Carnegie Mellon
Web traffic
[Crovella Bestavros, SIGMETRICS96]
Chronological time (slots of 1000 sec)
Byte
s
0 1000 2000 3000
05*1
0^6
10^7
1.5
*10^7
2*1
0^7
2.5
*10^7
3*1
0^7
Figure 2: Trac Bursts over Four Orders of Magnitude; Upper Left: 1000, Upper Right: 100, Lower Left: 10, and Lower Right:1 Second Aggegrations. (Actual Transfers)
this eect is by visually inspecting time series plots of tracdemand.
In Figure 2 we show four time series plots of the WWWtrac induced by our reference traces. The plots are producedby aggregating byte trac into discrete bins of 1, 10, 100, or1000 seconds.
The upper left plot is a complete presentation of the entiretrac time series using 1000 second (16.6 minute) bins. Thediurnal cycle of network demand is clearly evident, and dayto day activity shows noticeable bursts. However, even withinthe active portion of a single day there is signicant burstiness;this is shown in the upper right plot, which uses a 100 secondtimescale and is taken from a typical day in the middle of thedataset. Finally, the lower left plot shows a portion of the 100second plot, expanded to 10 second detail; and the lower rightplot shows a portion of the lower left expanded to 1 seconddetail. These plots show signicant bursts occurring at thesecond-to-second level.
4.2.2 Statistical Analysis
We used the four methods for assessing self-similarity describedin Section 2: the variance-time plot, the rescaled range (orR/S) plot, the periodogram plot, and the Whittle estimator.We concentrated on individual hours from our trac series, soas to provide as nearly a stationary dataset as possible.
To provide an example of these approaches, analysis of asingle hour (4pm to 5pm, Thursday 5 Feb 1995) is shown inFigure 3. The gure shows plots for the three graphical meth-ods: variance-time (upper left), rescaled range (upper right),and periodogram (lower center). The variance-time plot is lin-
ear and shows a slope that is distinctly dierent from -1 (whichis shown for comparison); the slope is estimated using regres-sion as -0.48, yielding an estimate for H of 0.76. The R/S plotshows an asymptotic slope that is dierent from 0.5 and from1.0 (shown for comparision); it is estimated using regressionas 0.75, which is also the corresponding estimate of H. Theperiodogram plot shows a slope of -0.66 (the regression line isshown), yielding an estimate of H as 0.83. Finally, the Whittleestimator for this dataset (not a graphical method) yields anestimate of H = 0:82 with a 95% condence interval of (0.77,0.87).
As discussed in Section 2.1, the Whittle estimator is theonly method that yields condence intervals on H, but short-range dependence in the timeseries can introduce inaccura-cies in its results. These inaccuracies are minimized by m-aggregating the timeseries for successively large values of m,and looking for a value of H around which the Whittle esti-mator stabilizes.
The results of this method for four busy hours are shown inFigure 4. Each hour is shown in one plot, from the busiest hourin the upper left to the least busy hour in the lower right. Inthese gures the solid line is the value of the Whittle estimateof H as a function of the aggregation level m of the dataset.The upper and lower dotted lines are the limits of the 95%condence interval on H. The three level lines represent theestimate of H for the unaggregated dataset as given by thevariance-time, R-S, and periodogram methods.
The gure shows that for each dataset, the estimate of Hstays relatively consistent as the aggregation level is increased,and that the estimates given by the three graphical methods
Waterloo, 2006 C. Faloutsos 11
School of Computer Science
Carnegie Mellon
...
survivable,self-managing storage
infrastructure
...
a storage brick(0.55 TB)~1 PB
self-* = self-managing, self-tuning, self-healing,
Goal: 1 petabyte (PB) for CMU researchers
www.pdl.cmu.edu/SelfStar
Self-* Storage (Ganger+)
Waterloo, 2006 C. Faloutsos 12
School of Computer Science
Carnegie Mellon
...
survivable,self-managing storage
infrastructure
...
a storage brick(0.55 TB)~1 PB
self-* = self-managing, self-tuning, self-healing,
Self-* Storage (Ganger+)
Ashraf, Ihab, Ken
-
3
Waterloo, 2006 C. Faloutsos 13
School of Computer Science
Carnegie Mellon
Problem definition
Given: one or more sequences
x1 , x2 , , xt , ; (y1, y2, , yt, )
Find
patterns; clusters; outliers; forecasts;
Waterloo, 2006 C. Faloutsos 14
School of Computer Science
Carnegie Mellon
Problem #1
Find patterns, in large
datasets
time
# bytes
Waterloo, 2006 C. Faloutsos 15
School of Computer Science
Carnegie Mellon
Problem #1
Find patterns, in large
datasets
time
# bytes
Poisson
indep.,
ident. distr
Waterloo, 2006 C. Faloutsos 16
School of Computer Science
Carnegie Mellon
Problem #1
Find patterns, in large
datasets
time
# bytes
Poisson
indep.,
ident. distr
Waterloo, 2006 C. Faloutsos 17
School of Computer Science
Carnegie Mellon
Problem #1
Find patterns, in large
datasets
time
# bytes
Poisson
indep.,
ident. distr
Q: Then, how to generate
such bursty traffic?
Waterloo, 2006 C. Faloutsos 18
School of Computer Science
Carnegie Mellon
Overview
Goals/ motivation: find patterns in large datasets:
(A) Sensor data
(B) network/graph data
Solutions: self-similarity and power laws
Discussion
-
4
Waterloo, 2006 C. Faloutsos 19
School of Computer Science
Carnegie Mellon
Problem #2 - network and
graph mining
How does the Internet look like?
How does the web look like?
What constitutes a normal social
network?
What is the network value of a
customer?
which gene/species affects the others
the most?
Waterloo, 2006 C. Faloutsos 20
School of Computer Science
Carnegie Mellon
Network and graph mining
Food Web[Martinez 91]
Protein Interactions[genomebiology.com]
Friendship Network[Moody 01]
Graphs are everywhere!
Waterloo, 2006 C. Faloutsos 21
School of Computer Science
Carnegie Mellon
Problem#2
Given a graph:
which node to market-to /
defend / immunize first?
Are there un-natural sub-
graphs? (eg., criminals rings)?
[from Lumeta: ISPs 6/1999]
Waterloo, 2006 C. Faloutsos 22
School of Computer Science
Carnegie Mellon
Solutions
New tools: power laws, self-similarity and
fractals work, where traditional
assumptions fail
Lets see the details:
Waterloo, 2006 C. Faloutsos 23
School of Computer Science
Carnegie Mellon
Overview
Goals/ motivation: find patterns in large
datasets:
(A) Sensor data
(B) network/graph data
Solutions: self-similarity and power laws
Discussion
Waterloo, 2006 C. Faloutsos 24
School of Computer Science
Carnegie Mellon
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...zero area: (3/4)^inf
infinite length!
(4/3)^inf
Q: What is its dimensionality??
-
5
Waterloo, 2006 C. Faloutsos 25
School of Computer Science
Carnegie Mellon
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...zero area: (3/4)^inf
infinite length!
(4/3)^inf
Q: What is its dimensionality??
A: log3 / log2 = 1.58 (!?!)
Waterloo, 2006 C. Faloutsos 26
School of Computer Science
Carnegie Mellon
Intrinsic (fractal) dimension
Q: fractal dimension
of a line?
Q: fd of a plane?
Waterloo, 2006 C. Faloutsos 27
School of Computer Science
Carnegie Mellon
Intrinsic (fractal) dimension
Q: fractal dimensionof a line?
A: nn (
-
6
Waterloo, 2006 C. Faloutsos 31
School of Computer Science
Carnegie Mellon
time
#bytes
Solution #1: traffic
disk traces: self-similar: (also: [Leland+94])
How to generate such traffic?
Waterloo, 2006 C. Faloutsos 32
School of Computer Science
Carnegie Mellon
Solution #1: traffic
disk traces (80-20 law) multifractals
time
#bytes
20% 80%
Waterloo, 2006 C. Faloutsos 33
School of Computer Science
Carnegie Mellon
80-20 / multifractals20 80
0
200000
400000
600000
800000
1e+06
1.2e+06
0 2e+06 4e+06 6e+06 8e+06 1e+07
num
ber
of K
b
time (in milliseconds)
Waterloo, 2006 C. Faloutsos 34
School of Computer Science
Carnegie Mellon
80-20 / multifractals
20 p ; (1-p) in general
yes, there are
dependencies
80
Waterloo, 2006 C. Faloutsos 35
School of Computer Science
Carnegie Mellon
More on 80/20: PQRS
Part of self-* storage project
time
cylinder# Waterloo, 2006 C. Faloutsos 36
School of Computer Science
Carnegie Mellon
More on 80/20: PQRS
Part of self-* storage project
p q
r s
q
r s
-
7
Waterloo, 2006 C. Faloutsos 37
School of Computer Science
Carnegie Mellon
Overview
Goals/ motivation: find patterns in large datasets: (A) Sensor data
(B) network/graph data
Solutions: self-similarity and power laws
sensor/traffic data
network/graph data
Discussion
Waterloo, 2006 C. Faloutsos 38
School of Computer Science
Carnegie Mellon
Problem #2 - topology
How does the Internet look like? Any rules?
Waterloo, 2006 C. Faloutsos 39
School of Computer Science
Carnegie Mellon
Patterns?
avg degree is, say 3.3
pick a node at random
guess its degree,
exactly (-> mode)
degree
count
avg: 3.3Waterloo, 2006 C. Faloutsos 40
School of Computer Science
Carnegie Mellon
Patterns?
avg degree is, say 3.3
pick a node at random
guess its degree,
exactly (-> mode)
A: 1!!
degree
count
avg: 3.3
Waterloo, 2006 C. Faloutsos 41
School of Computer Science
Carnegie Mellon
Patterns?
avg degree is, say 3.3
pick a node at random- what is the degreeyou expect it to have?
A: 1!!
A: very skewed distr.
Corollary: the mean ismeaningless!
(and std -> infinity (!))
degree
count
avg: 3.3Waterloo, 2006 C. Faloutsos 42
School of Computer Science
Carnegie Mellon
Solution#2: Rank exponent R A1: Power law in the degree distribution
[SIGCOMM99]
internet domains
log(rank)
log(degree)
-0.82
att.com
ibm.com
-
8
Waterloo, 2006 C. Faloutsos 43
School of Computer Science
Carnegie Mellon
Solution#2: Eigen Exponent E
A2: power law in the eigenvalues of the adjacencymatrix
E = -0.48
Exponent = slope
Eigenvalue
Rank of decreasing eigenvalue
May 2001
Waterloo, 2006 C. Faloutsos 44
School of Computer Science
Carnegie Mellon
Power laws - discussion
do they hold, over time?
do they hold on other graphs/domains?
Waterloo, 2006 C. Faloutsos 45
School of Computer Science
Carnegie Mellon
Power laws - discussion
do they hold, over time?
Yes! for multiple years [Siganos+]
do they hold on other graphs/domains?
Yes!
web sites and links [Tomkins+], [Barabasi+]
peer-to-peer graphs (gnutella-style)
who-trusts-whom (epinions.com)
Waterloo, 2006 C. Faloutsos 46
School of Computer Science
Carnegie Mellon
Time Evolution: rank R
-1
-0.9
-0.8
-0.7
-0.6
-0.5
0 200 400 600 800
Instances in time: Nov'97 and on
Ra
nk
ex
po
ne
nt
The rank exponent has not changed!
[Siganos+]
Domain
level
log(rank)
log(degree)
-0.82
att.com
ibm.com
Waterloo, 2006 C. Faloutsos 47
School of Computer Science
Carnegie Mellon
The Peer-to-Peer Topology
Number of immediate peers (= degree), follows a
power-law
[Jovanovic+]
degree
count
Waterloo, 2006 C. Faloutsos 48
School of Computer Science
Carnegie Mellon
epinions.com
who-trusts-whom
[Richardson +
Domingos, KDD
2001]
(out) degree
count
-
9
Waterloo, 2006 C. Faloutsos 49
School of Computer Science
Carnegie Mellon
Why care about these patterns?
better graph generators [BRITE, INET]
for simulations
extrapolations
abnormal graph and subgraph detection
Waterloo, 2006 C. Faloutsos 50
School of Computer Science
Carnegie Mellon
Recent discoveries [KDD05]
How do graphs evolve?
degree-exponent seems constant - anything
else?
Waterloo, 2006 C. Faloutsos 51
School of Computer Science
Carnegie Mellon
Evolution of diameter?
Prior analysis, on power-law-like graphs,
hints that
diameter ~ O(log(N)) or
diameter ~ O( log(log(N)))
i.e.., slowly increasing with network size
Q: What is happening, in reality?
Waterloo, 2006 C. Faloutsos 52
School of Computer Science
Carnegie Mellon
Evolution of diameter?
Prior analysis, on power-law-like graphs,
hints that
diameter ~ O(log(N)) or
diameter ~ O( log(log(N)))
i.e.., slowly increasing with network size
Q: What is happening, in reality?
A: It shrinks(!!), towards a constant value
Waterloo, 2006 C. Faloutsos 53
School of Computer Science
Carnegie Mellon
Shrinking diameter
[Leskovec+05a]
Citations among physicspapers
11yrs; @ 2003:
29,555 papers
352,807 citations
For each month M, create agraph of all citations up tomonth M
time
diameter
Waterloo, 2006 C. Faloutsos 54
School of Computer Science
Carnegie Mellon
Shrinking diameter Authors &
publications
1992
318 nodes
272 edges
2002
60,000 nodes
20,000 authors
38,000 papers
133,000 edges
-
10
Waterloo, 2006 C. Faloutsos 55
School of Computer Science
Carnegie Mellon
Shrinking diameter
Patents & citations
1975
334,000 nodes
676,000 edges
1999
2.9 million nodes
16.5 million edges
Each year is a
datapoint
Waterloo, 2006 C. Faloutsos 56
School of Computer Science
Carnegie Mellon
Shrinking diameter
Autonomous
systems
1997
3,000 nodes
10,000 edges
2000
6,000 nodes
26,000 edges
One graph per day N
diameter
Waterloo, 2006 C. Faloutsos 57
School of Computer Science
Carnegie Mellon
Temporal evolution of graphs
N(t) nodes; E(t) edges at time t
suppose that
N(t+1) = 2 * N(t)
Q: what is your guess for
E(t+1) =? 2 * E(t)
Waterloo, 2006 C. Faloutsos 58
School of Computer Science
Carnegie Mellon
Temporal evolution of graphs
N(t) nodes; E(t) edges at time t
suppose that
N(t+1) = 2 * N(t)
Q: what is your guess for
E(t+1) =? 2 * E(t)
A: over-doubled!
Waterloo, 2006 C. Faloutsos 59
School of Computer Science
Carnegie Mellon
Temporal evolution of graphs
A: over-doubled - but obeying:
E(t) ~ N(t)a for all t
where 1
-
11
Waterloo, 2006 C. Faloutsos 61
School of Computer Science
Carnegie Mellon
Densification Power Law
ArXiv: Physics papers
and their citations
1.69
N(t)
E(t)
tree
1
Waterloo, 2006 C. Faloutsos 62
School of Computer Science
Carnegie Mellon
Densification Power Law
ArXiv: Physics papers
and their citations
1.69
N(t)
E(t)
clique
2
Waterloo, 2006 C. Faloutsos 63
School of Computer Science
Carnegie Mellon
Densification Power Law
U.S. Patents, citing each
other
1.66
N(t)
E(t)
Waterloo, 2006 C. Faloutsos 64
School of Computer Science
Carnegie Mellon
Densification Power Law
Autonomous Systems
1.18
N(t)
E(t)
Waterloo, 2006 C. Faloutsos 65
School of Computer Science
Carnegie Mellon
Densification Power Law
ArXiv: authors & papers
1.15
N(t)
E(t)
Waterloo, 2006 C. Faloutsos 66
School of Computer Science
Carnegie Mellon
Outline
problems
Fractals
Solutions
Discussion
what else can they solve?
how frequent are fractals?
-
12
Waterloo, 2006 C. Faloutsos 67
School of Computer Science
Carnegie Mellon
What else can they solve?
separability [KDD02]
forecasting [CIKM02]
dimensionality reduction [SBBD00]
non-linear axis scaling [KDD02]
disk trace modeling [PEVA02]
selectivity of spatial/multimedia queries
[PODS94, VLDB95, ICDE00]
...
spatial d.m.
Ed, Ihab, Tamer
Waterloo, 2006 C. Faloutsos 68
School of Computer Science
Carnegie Mellon
Problem #3 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B.
Nichol)
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
"elliptical.cut.dat""spiral.cut.dat"
- spiral and elliptical
galaxies
- patterns? (not Gaussian; not
uniform)
-attraction/repulsion?
- separability??
Waterloo, 2006 C. Faloutsos 69
School of Computer Science
Carnegie Mellon
Solution#3: spatial d.m.
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
log(r)
log(#pairs within
-
13
Waterloo, 2006 C. Faloutsos 73
School of Computer Science
Carnegie Mellon
What else can they solve?
separability [KDD02]
forecasting [CIKM02]
dimensionality reduction [SBBD00]
non-linear axis scaling [KDD02]
disk trace modeling [PEVA02]
selectivity of spatial/multimedia queries
[PODS94, VLDB95, ICDE00]
...
Waterloo, 2006 C. Faloutsos 74
School of Computer Science
Carnegie Mellon
Problem#4: dim. reduction
given attributes x1, ... xn possibly, non-linearly correlated
drop the useless ones mpg
cc
Waterloo, 2006 C. Faloutsos 75
School of Computer Science
Carnegie Mellon
Problem#4: dim. reduction
given attributes x1, ... xn possibly, non-linearly correlated
drop the useless ones
(Q: why?
A: to avoid the dimensionality curse)
Solution: keep on dropping attributes, untilthe f.d. changes! [w/ Traina+, SBBD00]
mpg
cc
Waterloo, 2006 C. Faloutsos 76
School of Computer Science
Carnegie Mellon
Outline
problems
Fractals
Solutions
Discussion
what else can they solve?
how frequent are fractals?
Waterloo, 2006 C. Faloutsos 77
School of Computer Science
Carnegie Mellon
Fractals & power laws:
appear in numerous settings:
medical
geographical / geological
social
computer-system related
Waterloo, 2006 C. Faloutsos 78
School of Computer Science
Carnegie Mellon
Fractals: Brain scans
brain-scans
7
8
9
10
11
12
13
14
15
16
4 4.5 5 5.5 6 6.5 7
log
2(
octr
ee
le
ave
s)
level k
"oct-count"x*2.63871-2.9122
octree levels
Log(#octants)
2.63 =
fd
-
14
Waterloo, 2006 C. Faloutsos 79
School of Computer Science
Carnegie Mellon
More fractals
periphery of malignant tumors: ~1.5
benign: ~1.3
[Burdet+]
Waterloo, 2006 C. Faloutsos 80
School of Computer Science
Carnegie Mellon
More fractals:
cardiovascular system: 3 (!) lungs: ~2.9
Waterloo, 2006 C. Faloutsos 81
School of Computer Science
Carnegie Mellon
Fractals & power laws:
appear in numerous settings:
medical
geographical / geological
social
computer-system related
Waterloo, 2006 C. Faloutsos 82
School of Computer Science
Carnegie Mellon
More fractals:
Coastlines: 1.2-1.58
1 1.1
1.3
Waterloo, 2006 C. Faloutsos 83
School of Computer Science
Carnegie Mellon
Waterloo, 2006 C. Faloutsos 84
School of Computer Science
Carnegie Mellon
More fractals:
the fractal dimension for the Amazon river
is 1.85 (Nile: 1.4)
[ems.gphys.unc.edu/nonlinear/fractals/examples.html]
-
15
Waterloo, 2006 C. Faloutsos 85
School of Computer Science
Carnegie Mellon
More fractals:
the fractal dimension for the Amazon river
is 1.85 (Nile: 1.4)
[ems.gphys.unc.edu/nonlinear/fractals/examples.html]
Waterloo, 2006 C. Faloutsos 86
School of Computer Science
Carnegie Mellon
Cross-roads of
Montgomery county:
any rules?
GIS points
Waterloo, 2006 C. Faloutsos 87
School of Computer Science
Carnegie Mellon
GIS
A: self-similarity:
intrinsic dim. = 1.51
log( r )
log(#pairs(within = area))
log(area)
-
16
Waterloo, 2006 C. Faloutsos 91
School of Computer Science
Carnegie Mellon
More power laws: Korcak
1
10
100
1000
1 10 100 1000 10000 100000
Num
ber
of is
lands
Area
Area distribution of Japan Archipelago
Japan islands;
area vs cumulative
count (log-log axes) log(area)
log(count( >= area))
Waterloo, 2006 C. Faloutsos 92
School of Computer Science
Carnegie Mellon
More power laws
Energy of earthquakes (Gutenberg-Richter
law) [simscience.org]
log(count)
Magnitude = log(energy)day
Energy
released
Waterloo, 2006 C. Faloutsos 93
School of Computer Science
Carnegie Mellon
Fractals & power laws:
appear in numerous settings:
medical
geographical / geological
social
computer-system related
Waterloo, 2006 C. Faloutsos 94
School of Computer Science
Carnegie Mellon
A famous power law: Zipfs
law
Bible - rank vs.
frequency (log-log)
log(rank)
log(freq)
a
the
Rank/frequency plot
Waterloo, 2006 C. Faloutsos 95
School of Computer Science
Carnegie Mellon
TELCO data
# of service units
count of
customers
best customer
Waterloo, 2006 C. Faloutsos 96
School of Computer Science
Carnegie Mellon
SALES data store#96
# units sold
count of
products
aspirin
-
17
Waterloo, 2006 C. Faloutsos 97
School of Computer Science
Carnegie Mellon
Olympic medals (Sidney00,
Athens04):
log( rank)
log(#medals)
0
0.5
1
1.5
2
2.5
0 0.5 1 1.5 2
athens
sidney
Waterloo, 2006 C. Faloutsos 98
School of Computer Science
Carnegie Mellon
Olympic medals (Sidney00,
Athens04):
log( rank)
log(#medals)
0
0.5
1
1.5
2
2.5
0 0.5 1 1.5 2
athens
sidney
Waterloo, 2006 C. Faloutsos 99
School of Computer Science
Carnegie Mellon
Even more power laws:
Income distribution (Paretos law)
size of firms
publication counts (Lotkas law)
Waterloo, 2006 C. Faloutsos 100
School of Computer Science
Carnegie Mellon
Even more power laws:
library science (Lotkas law of publication
count); and citation counts:
(citeseer.nj.nec.com 6/2001)
1
10
100
100 1000 10000
log c
ount
log # citations
cited.pdf
log(#citations)
log(count)
Ullman
Waterloo, 2006 C. Faloutsos 101
School of Computer Science
Carnegie Mellon
Even more power laws:
web hit counts [w/ A. Montgomery]
Web Site Traffic
log(freq)
log(count)
Zipf
yahoo.com
Waterloo, 2006 C. Faloutsos 102
School of Computer Science
Carnegie Mellon
Fractals & power laws:
appear in numerous settings:
medical
geographical / geological
social
computer-system related
-
18
Waterloo, 2006 C. Faloutsos 103
School of Computer Science
Carnegie Mellon
Power laws, contd
In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
log indegree
- log(freq)
from [Ravi Kumar,
Prabhakar Raghavan,
Sridhar Rajagopalan,
Andrew Tomkins ]
Waterloo, 2006 C. Faloutsos 104
School of Computer Science
Carnegie Mellon
Power laws, contd
In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
log indegree
log(freq)
from [Ravi Kumar,
Prabhakar Raghavan,
Sridhar Rajagopalan,
Andrew Tomkins ]
Waterloo, 2006 C. Faloutsos 105
School of Computer Science
Carnegie Mellon
Power laws, contd
In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
log indegree
log(freq)
Q: how can we use
these power laws?
Waterloo, 2006 C. Faloutsos 106
School of Computer Science
Carnegie Mellon
Foiled by power law
[Broder+, WWW00]
(log) in-degree
(log) count
Waterloo, 2006 C. Faloutsos 107
School of Computer Science
Carnegie Mellon
Foiled by power law
[Broder+, WWW00]
The anomalous bump at 120
on the x-axis
is due a large clique
formed by a single spammer
(log) in-degree
(log) count
Waterloo, 2006 C. Faloutsos 108
School of Computer Science
Carnegie Mellon
Power laws, contd
In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
length of file transfers [Crovella+Bestavros
96]
duration of UNIX jobs
-
19
Waterloo, 2006 C. Faloutsos 109
School of Computer Science
Carnegie Mellon
Additional projects
Find anomalies in traffic matrices [underreview]
Find correlations in sensor/stream data[VLDB05]
Chlorine measurements, with Civ. Eng.
temperature measurements (INTEL/MIT)
Virus propagation (SIS, SIR) [Wang+, 03]
Graph partitioning [Chakrabarti+, KDD04]
Waterloo, 2006 C. Faloutsos 110
School of Computer Science
Carnegie Mellon
Conclusions
Fascinating problems in Data Mining: find
patterns in
sensors/streams
graphs/networks
Waterloo, 2006 C. Faloutsos 111
School of Computer Science
Carnegie Mellon
Conclusions - contd
New tools for Data Mining: self-similarity &
power laws: appear in many cases
Bad news:
lead to skewed distributions
(no Gaussian, Poisson,
uniformity, independence,
mean, variance)
Good news:
correlation integralfor separability
rank/frequency plots
80-20 (multifractals) (Hurst exponent,
strange attractors,
renormalization theory,
++)
Waterloo, 2006 C. Faloutsos 112
School of Computer Science
Carnegie Mellon
Resources
Manfred Schroeder Chaos, Fractals and
Power Laws, 1991
Waterloo, 2006 C. Faloutsos 113
School of Computer Science
Carnegie Mellon
References
[vldb95] Alberto Belussi and Christos Faloutsos,Estimating the Selectivity of Spatial Queries Using the`Correlation' Fractal Dimension Proc. of VLDB, p. 299-310, 1995
[Broder+00] Andrei Broder, Ravi Kumar , FarzinMaghoul1, Prabhakar Raghavan , Sridhar Rajagopalan ,
Raymie Stata, Andrew Tomkins , Janet Wiener, Graphstructure in the web , WWW00
M. Crovella and A. Bestavros, Self similarity in Worldwide web traffic: Evidence and possible causes ,SIGMETRICS 96.
Waterloo, 2006 C. Faloutsos 114
School of Computer Science
Carnegie Mellon
References
J. Considine, F. Li, G. Kollios and J. Byers,
Approximate Aggregation Techniques for Sensor
Databases (ICDE04, best paper award).
[pods94] Christos Faloutsos and Ibrahim Kamel,
Beyond Uniformity and Independence: Analysis of
R-trees Using the Concept of Fractal Dimension,
PODS, Minneapolis, MN, May 24-26, 1994, pp.
4-13
-
20
Waterloo, 2006 C. Faloutsos 115
School of Computer Science
Carnegie Mellon
References
[vldb96] Christos Faloutsos, Yossi Matias and Avi
Silberschatz, Modeling Skewed Distributions
Using Multifractals and the `80-20 Law Conf. on
Very Large Data Bases (VLDB), Bombay, India,
Sept. 1996.
[sigmod2000] Christos Faloutsos, Bernhard
Seeger, Agma J. M. Traina and Caetano Traina Jr.,
Spatial Join Selectivity Using Power Laws,
SIGMOD 2000
Waterloo, 2006 C. Faloutsos 116
School of Computer Science
Carnegie Mellon
References
[vldb96] Christos Faloutsos and Volker Gaede
Analysis of the Z-Ordering Method Using the
Hausdorff Fractal Dimension VLD, Bombay,
India, Sept. 1996
[sigcomm99] Michalis Faloutsos, Petros Faloutsos
and Christos Faloutsos, What does the Internet
look like? Empirical Laws of the Internet
Topology, SIGCOMM 1999
Waterloo, 2006 C. Faloutsos 117
School of Computer Science
Carnegie Mellon
References
[Leskovec 05] Jure Leskovec, Jon M. Kleinberg,
Christos Faloutsos: Graphs over time:
densification laws, shrinking diameters and
possible explanations. KDD 2005: 177-187
Waterloo, 2006 C. Faloutsos 118
School of Computer Science
Carnegie Mellon
References
[ieeeTN94] W. E. Leland, M.S. Taqqu, W.
Willinger, D.V. Wilson, On the Self-Similar
Nature of Ethernet Traffic, IEEE Transactions on
Networking, 2, 1, pp 1-15, Feb. 1994.
[brite] Alberto Medina, Anukool Lakhina, Ibrahim
Matta, and John Byers. BRITE: An Approach to
Universal Topology Generation. MASCOTS '01
Waterloo, 2006 C. Faloutsos 119
School of Computer Science
Carnegie Mellon
References
[icde99] Guido Proietti and Christos Faloutsos,
I/O complexity for range queries on region data
stored using an R-tree (ICDE99)
Stan Sclaroff, Leonid Taycher and Marco La
Cascia , "ImageRover: A content-based image
browser for the world wide web" Proc. IEEE
Workshop on Content-based Access of Image and
Video Libraries, pp 2-9, 1997.
Waterloo, 2006 C. Faloutsos 120
School of Computer Science
Carnegie Mellon
References
[kdd2001] Agma J. M. Traina, Caetano Traina Jr.,
Spiros Papadimitriou and Christos Faloutsos: Tri-
plots: Scalable Tools for Multidimensional Data
Mining, KDD 2001, San Francisco, CA.
-
21
Waterloo, 2006 C. Faloutsos 121
School of Computer Science
Carnegie Mellon
Thank you!
Contact info:
christos cs.cmu.edu
www. cs.cmu.edu /~christos
(w/ papers, datasets, code for fractal dimension
estimation, etc)