thank you! data mining using fractals and power laws...

Download THANK YOU! Data Mining using Fractals and Power laws ...dsg.uwaterloo.ca/seminars/notes/christos.pdf · Data Mining using Fractals and Power laws ... • Prof. Ed Chan • Debbie

If you can't read please download the document

Upload: vuongkien

Post on 06-Feb-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

  • 1

    Waterloo, 2006 C. Faloutsos 1

    School of Computer Science

    Carnegie Mellon

    Data Mining using Fractals and

    Power laws

    Christos Faloutsos

    Carnegie Mellon University

    Waterloo, 2006 C. Faloutsos 2

    School of Computer Science

    Carnegie Mellon

    THANK YOU!

    Prof. Ed Chan

    Debbie Mustin

    Waterloo, 2006 C. Faloutsos 3

    School of Computer Science

    Carnegie Mellon

    Thanks to

    Deepayan Chakrabarti (CMU/Yahoo)

    Michalis Faloutsos (UCR)

    George Siganos (UCR)

    Waterloo, 2006 C. Faloutsos 4

    School of Computer Science

    Carnegie Mellon

    Overview

    Goals/ motivation: find patterns in large

    datasets:

    (A) Sensor data

    (B) network/graph data

    Solutions: self-similarity and power laws

    Discussion

    Waterloo, 2006 C. Faloutsos 5

    School of Computer Science

    Carnegie Mellon

    Applications of sensors/streams

    Smart house: monitoring temperature,

    humidity etc

    Financial, sales, economic series

    Waterloo, 2006 C. Faloutsos 6

    School of Computer Science

    Carnegie Mellon

    Applications of sensors/streams

    Smart house: monitoring temperature,

    humidity etc

    Financial, sales, economic series

    Tamer; Ihab

  • 2

    Waterloo, 2006 C. Faloutsos 7

    School of Computer Science

    Carnegie Mellon

    Motivation - Applications

    Medical: ECGs +; blood

    pressure etc monitoring

    Scientific data: seismological;

    astronomical; environment /

    anti-pollution; meteorologicalSunspot Data

    0

    50

    100

    150

    200

    250

    300

    Waterloo, 2006 C. Faloutsos 8

    School of Computer Science

    Carnegie Mellon

    Motivation - Applications

    (contd)

    civil/automobile infrastructure

    bridge vibrations [Oppenheim+02]

    road conditions / traffic monitoring

    Automobile traffic

    0200400600800

    100012001400160018002000

    time

    # cars

    Waterloo, 2006 C. Faloutsos 9

    School of Computer Science

    Carnegie Mellon

    Motivation - Applications

    (contd)

    Computer systems

    web servers (buffering, prefetching)

    network traffic monitoring

    ...

    0

    200000

    400000

    600000

    800000

    1e+06

    1.2e+06

    0 2e+06 4e+06 6e+06 8e+06 1e+07

    num

    ber

    of K

    b

    time (in milliseconds)

    http://repository.cs.vt.edu/lbl-conn-7.tar.Z

    Waterloo, 2006 C. Faloutsos 10

    School of Computer Science

    Carnegie Mellon

    Web traffic

    [Crovella Bestavros, SIGMETRICS96]

    Chronological time (slots of 1000 sec)

    Byte

    s

    0 1000 2000 3000

    05*1

    0^6

    10^7

    1.5

    *10^7

    2*1

    0^7

    2.5

    *10^7

    3*1

    0^7

    Figure 2: Trac Bursts over Four Orders of Magnitude; Upper Left: 1000, Upper Right: 100, Lower Left: 10, and Lower Right:1 Second Aggegrations. (Actual Transfers)

    this eect is by visually inspecting time series plots of tracdemand.

    In Figure 2 we show four time series plots of the WWWtrac induced by our reference traces. The plots are producedby aggregating byte trac into discrete bins of 1, 10, 100, or1000 seconds.

    The upper left plot is a complete presentation of the entiretrac time series using 1000 second (16.6 minute) bins. Thediurnal cycle of network demand is clearly evident, and dayto day activity shows noticeable bursts. However, even withinthe active portion of a single day there is signicant burstiness;this is shown in the upper right plot, which uses a 100 secondtimescale and is taken from a typical day in the middle of thedataset. Finally, the lower left plot shows a portion of the 100second plot, expanded to 10 second detail; and the lower rightplot shows a portion of the lower left expanded to 1 seconddetail. These plots show signicant bursts occurring at thesecond-to-second level.

    4.2.2 Statistical Analysis

    We used the four methods for assessing self-similarity describedin Section 2: the variance-time plot, the rescaled range (orR/S) plot, the periodogram plot, and the Whittle estimator.We concentrated on individual hours from our trac series, soas to provide as nearly a stationary dataset as possible.

    To provide an example of these approaches, analysis of asingle hour (4pm to 5pm, Thursday 5 Feb 1995) is shown inFigure 3. The gure shows plots for the three graphical meth-ods: variance-time (upper left), rescaled range (upper right),and periodogram (lower center). The variance-time plot is lin-

    ear and shows a slope that is distinctly dierent from -1 (whichis shown for comparison); the slope is estimated using regres-sion as -0.48, yielding an estimate for H of 0.76. The R/S plotshows an asymptotic slope that is dierent from 0.5 and from1.0 (shown for comparision); it is estimated using regressionas 0.75, which is also the corresponding estimate of H. Theperiodogram plot shows a slope of -0.66 (the regression line isshown), yielding an estimate of H as 0.83. Finally, the Whittleestimator for this dataset (not a graphical method) yields anestimate of H = 0:82 with a 95% condence interval of (0.77,0.87).

    As discussed in Section 2.1, the Whittle estimator is theonly method that yields condence intervals on H, but short-range dependence in the timeseries can introduce inaccura-cies in its results. These inaccuracies are minimized by m-aggregating the timeseries for successively large values of m,and looking for a value of H around which the Whittle esti-mator stabilizes.

    The results of this method for four busy hours are shown inFigure 4. Each hour is shown in one plot, from the busiest hourin the upper left to the least busy hour in the lower right. Inthese gures the solid line is the value of the Whittle estimateof H as a function of the aggregation level m of the dataset.The upper and lower dotted lines are the limits of the 95%condence interval on H. The three level lines represent theestimate of H for the unaggregated dataset as given by thevariance-time, R-S, and periodogram methods.

    The gure shows that for each dataset, the estimate of Hstays relatively consistent as the aggregation level is increased,and that the estimates given by the three graphical methods

    Waterloo, 2006 C. Faloutsos 11

    School of Computer Science

    Carnegie Mellon

    ...

    survivable,self-managing storage

    infrastructure

    ...

    a storage brick(0.55 TB)~1 PB

    self-* = self-managing, self-tuning, self-healing,

    Goal: 1 petabyte (PB) for CMU researchers

    www.pdl.cmu.edu/SelfStar

    Self-* Storage (Ganger+)

    Waterloo, 2006 C. Faloutsos 12

    School of Computer Science

    Carnegie Mellon

    ...

    survivable,self-managing storage

    infrastructure

    ...

    a storage brick(0.55 TB)~1 PB

    self-* = self-managing, self-tuning, self-healing,

    Self-* Storage (Ganger+)

    Ashraf, Ihab, Ken

  • 3

    Waterloo, 2006 C. Faloutsos 13

    School of Computer Science

    Carnegie Mellon

    Problem definition

    Given: one or more sequences

    x1 , x2 , , xt , ; (y1, y2, , yt, )

    Find

    patterns; clusters; outliers; forecasts;

    Waterloo, 2006 C. Faloutsos 14

    School of Computer Science

    Carnegie Mellon

    Problem #1

    Find patterns, in large

    datasets

    time

    # bytes

    Waterloo, 2006 C. Faloutsos 15

    School of Computer Science

    Carnegie Mellon

    Problem #1

    Find patterns, in large

    datasets

    time

    # bytes

    Poisson

    indep.,

    ident. distr

    Waterloo, 2006 C. Faloutsos 16

    School of Computer Science

    Carnegie Mellon

    Problem #1

    Find patterns, in large

    datasets

    time

    # bytes

    Poisson

    indep.,

    ident. distr

    Waterloo, 2006 C. Faloutsos 17

    School of Computer Science

    Carnegie Mellon

    Problem #1

    Find patterns, in large

    datasets

    time

    # bytes

    Poisson

    indep.,

    ident. distr

    Q: Then, how to generate

    such bursty traffic?

    Waterloo, 2006 C. Faloutsos 18

    School of Computer Science

    Carnegie Mellon

    Overview

    Goals/ motivation: find patterns in large datasets:

    (A) Sensor data

    (B) network/graph data

    Solutions: self-similarity and power laws

    Discussion

  • 4

    Waterloo, 2006 C. Faloutsos 19

    School of Computer Science

    Carnegie Mellon

    Problem #2 - network and

    graph mining

    How does the Internet look like?

    How does the web look like?

    What constitutes a normal social

    network?

    What is the network value of a

    customer?

    which gene/species affects the others

    the most?

    Waterloo, 2006 C. Faloutsos 20

    School of Computer Science

    Carnegie Mellon

    Network and graph mining

    Food Web[Martinez 91]

    Protein Interactions[genomebiology.com]

    Friendship Network[Moody 01]

    Graphs are everywhere!

    Waterloo, 2006 C. Faloutsos 21

    School of Computer Science

    Carnegie Mellon

    Problem#2

    Given a graph:

    which node to market-to /

    defend / immunize first?

    Are there un-natural sub-

    graphs? (eg., criminals rings)?

    [from Lumeta: ISPs 6/1999]

    Waterloo, 2006 C. Faloutsos 22

    School of Computer Science

    Carnegie Mellon

    Solutions

    New tools: power laws, self-similarity and

    fractals work, where traditional

    assumptions fail

    Lets see the details:

    Waterloo, 2006 C. Faloutsos 23

    School of Computer Science

    Carnegie Mellon

    Overview

    Goals/ motivation: find patterns in large

    datasets:

    (A) Sensor data

    (B) network/graph data

    Solutions: self-similarity and power laws

    Discussion

    Waterloo, 2006 C. Faloutsos 24

    School of Computer Science

    Carnegie Mellon

    What is a fractal?

    = self-similar point set, e.g., Sierpinski triangle:

    ...zero area: (3/4)^inf

    infinite length!

    (4/3)^inf

    Q: What is its dimensionality??

  • 5

    Waterloo, 2006 C. Faloutsos 25

    School of Computer Science

    Carnegie Mellon

    What is a fractal?

    = self-similar point set, e.g., Sierpinski triangle:

    ...zero area: (3/4)^inf

    infinite length!

    (4/3)^inf

    Q: What is its dimensionality??

    A: log3 / log2 = 1.58 (!?!)

    Waterloo, 2006 C. Faloutsos 26

    School of Computer Science

    Carnegie Mellon

    Intrinsic (fractal) dimension

    Q: fractal dimension

    of a line?

    Q: fd of a plane?

    Waterloo, 2006 C. Faloutsos 27

    School of Computer Science

    Carnegie Mellon

    Intrinsic (fractal) dimension

    Q: fractal dimensionof a line?

    A: nn (

  • 6

    Waterloo, 2006 C. Faloutsos 31

    School of Computer Science

    Carnegie Mellon

    time

    #bytes

    Solution #1: traffic

    disk traces: self-similar: (also: [Leland+94])

    How to generate such traffic?

    Waterloo, 2006 C. Faloutsos 32

    School of Computer Science

    Carnegie Mellon

    Solution #1: traffic

    disk traces (80-20 law) multifractals

    time

    #bytes

    20% 80%

    Waterloo, 2006 C. Faloutsos 33

    School of Computer Science

    Carnegie Mellon

    80-20 / multifractals20 80

    0

    200000

    400000

    600000

    800000

    1e+06

    1.2e+06

    0 2e+06 4e+06 6e+06 8e+06 1e+07

    num

    ber

    of K

    b

    time (in milliseconds)

    Waterloo, 2006 C. Faloutsos 34

    School of Computer Science

    Carnegie Mellon

    80-20 / multifractals

    20 p ; (1-p) in general

    yes, there are

    dependencies

    80

    Waterloo, 2006 C. Faloutsos 35

    School of Computer Science

    Carnegie Mellon

    More on 80/20: PQRS

    Part of self-* storage project

    time

    cylinder# Waterloo, 2006 C. Faloutsos 36

    School of Computer Science

    Carnegie Mellon

    More on 80/20: PQRS

    Part of self-* storage project

    p q

    r s

    q

    r s

  • 7

    Waterloo, 2006 C. Faloutsos 37

    School of Computer Science

    Carnegie Mellon

    Overview

    Goals/ motivation: find patterns in large datasets: (A) Sensor data

    (B) network/graph data

    Solutions: self-similarity and power laws

    sensor/traffic data

    network/graph data

    Discussion

    Waterloo, 2006 C. Faloutsos 38

    School of Computer Science

    Carnegie Mellon

    Problem #2 - topology

    How does the Internet look like? Any rules?

    Waterloo, 2006 C. Faloutsos 39

    School of Computer Science

    Carnegie Mellon

    Patterns?

    avg degree is, say 3.3

    pick a node at random

    guess its degree,

    exactly (-> mode)

    degree

    count

    avg: 3.3Waterloo, 2006 C. Faloutsos 40

    School of Computer Science

    Carnegie Mellon

    Patterns?

    avg degree is, say 3.3

    pick a node at random

    guess its degree,

    exactly (-> mode)

    A: 1!!

    degree

    count

    avg: 3.3

    Waterloo, 2006 C. Faloutsos 41

    School of Computer Science

    Carnegie Mellon

    Patterns?

    avg degree is, say 3.3

    pick a node at random- what is the degreeyou expect it to have?

    A: 1!!

    A: very skewed distr.

    Corollary: the mean ismeaningless!

    (and std -> infinity (!))

    degree

    count

    avg: 3.3Waterloo, 2006 C. Faloutsos 42

    School of Computer Science

    Carnegie Mellon

    Solution#2: Rank exponent R A1: Power law in the degree distribution

    [SIGCOMM99]

    internet domains

    log(rank)

    log(degree)

    -0.82

    att.com

    ibm.com

  • 8

    Waterloo, 2006 C. Faloutsos 43

    School of Computer Science

    Carnegie Mellon

    Solution#2: Eigen Exponent E

    A2: power law in the eigenvalues of the adjacencymatrix

    E = -0.48

    Exponent = slope

    Eigenvalue

    Rank of decreasing eigenvalue

    May 2001

    Waterloo, 2006 C. Faloutsos 44

    School of Computer Science

    Carnegie Mellon

    Power laws - discussion

    do they hold, over time?

    do they hold on other graphs/domains?

    Waterloo, 2006 C. Faloutsos 45

    School of Computer Science

    Carnegie Mellon

    Power laws - discussion

    do they hold, over time?

    Yes! for multiple years [Siganos+]

    do they hold on other graphs/domains?

    Yes!

    web sites and links [Tomkins+], [Barabasi+]

    peer-to-peer graphs (gnutella-style)

    who-trusts-whom (epinions.com)

    Waterloo, 2006 C. Faloutsos 46

    School of Computer Science

    Carnegie Mellon

    Time Evolution: rank R

    -1

    -0.9

    -0.8

    -0.7

    -0.6

    -0.5

    0 200 400 600 800

    Instances in time: Nov'97 and on

    Ra

    nk

    ex

    po

    ne

    nt

    The rank exponent has not changed!

    [Siganos+]

    Domain

    level

    log(rank)

    log(degree)

    -0.82

    att.com

    ibm.com

    Waterloo, 2006 C. Faloutsos 47

    School of Computer Science

    Carnegie Mellon

    The Peer-to-Peer Topology

    Number of immediate peers (= degree), follows a

    power-law

    [Jovanovic+]

    degree

    count

    Waterloo, 2006 C. Faloutsos 48

    School of Computer Science

    Carnegie Mellon

    epinions.com

    who-trusts-whom

    [Richardson +

    Domingos, KDD

    2001]

    (out) degree

    count

  • 9

    Waterloo, 2006 C. Faloutsos 49

    School of Computer Science

    Carnegie Mellon

    Why care about these patterns?

    better graph generators [BRITE, INET]

    for simulations

    extrapolations

    abnormal graph and subgraph detection

    Waterloo, 2006 C. Faloutsos 50

    School of Computer Science

    Carnegie Mellon

    Recent discoveries [KDD05]

    How do graphs evolve?

    degree-exponent seems constant - anything

    else?

    Waterloo, 2006 C. Faloutsos 51

    School of Computer Science

    Carnegie Mellon

    Evolution of diameter?

    Prior analysis, on power-law-like graphs,

    hints that

    diameter ~ O(log(N)) or

    diameter ~ O( log(log(N)))

    i.e.., slowly increasing with network size

    Q: What is happening, in reality?

    Waterloo, 2006 C. Faloutsos 52

    School of Computer Science

    Carnegie Mellon

    Evolution of diameter?

    Prior analysis, on power-law-like graphs,

    hints that

    diameter ~ O(log(N)) or

    diameter ~ O( log(log(N)))

    i.e.., slowly increasing with network size

    Q: What is happening, in reality?

    A: It shrinks(!!), towards a constant value

    Waterloo, 2006 C. Faloutsos 53

    School of Computer Science

    Carnegie Mellon

    Shrinking diameter

    [Leskovec+05a]

    Citations among physicspapers

    11yrs; @ 2003:

    29,555 papers

    352,807 citations

    For each month M, create agraph of all citations up tomonth M

    time

    diameter

    Waterloo, 2006 C. Faloutsos 54

    School of Computer Science

    Carnegie Mellon

    Shrinking diameter Authors &

    publications

    1992

    318 nodes

    272 edges

    2002

    60,000 nodes

    20,000 authors

    38,000 papers

    133,000 edges

  • 10

    Waterloo, 2006 C. Faloutsos 55

    School of Computer Science

    Carnegie Mellon

    Shrinking diameter

    Patents & citations

    1975

    334,000 nodes

    676,000 edges

    1999

    2.9 million nodes

    16.5 million edges

    Each year is a

    datapoint

    Waterloo, 2006 C. Faloutsos 56

    School of Computer Science

    Carnegie Mellon

    Shrinking diameter

    Autonomous

    systems

    1997

    3,000 nodes

    10,000 edges

    2000

    6,000 nodes

    26,000 edges

    One graph per day N

    diameter

    Waterloo, 2006 C. Faloutsos 57

    School of Computer Science

    Carnegie Mellon

    Temporal evolution of graphs

    N(t) nodes; E(t) edges at time t

    suppose that

    N(t+1) = 2 * N(t)

    Q: what is your guess for

    E(t+1) =? 2 * E(t)

    Waterloo, 2006 C. Faloutsos 58

    School of Computer Science

    Carnegie Mellon

    Temporal evolution of graphs

    N(t) nodes; E(t) edges at time t

    suppose that

    N(t+1) = 2 * N(t)

    Q: what is your guess for

    E(t+1) =? 2 * E(t)

    A: over-doubled!

    Waterloo, 2006 C. Faloutsos 59

    School of Computer Science

    Carnegie Mellon

    Temporal evolution of graphs

    A: over-doubled - but obeying:

    E(t) ~ N(t)a for all t

    where 1

  • 11

    Waterloo, 2006 C. Faloutsos 61

    School of Computer Science

    Carnegie Mellon

    Densification Power Law

    ArXiv: Physics papers

    and their citations

    1.69

    N(t)

    E(t)

    tree

    1

    Waterloo, 2006 C. Faloutsos 62

    School of Computer Science

    Carnegie Mellon

    Densification Power Law

    ArXiv: Physics papers

    and their citations

    1.69

    N(t)

    E(t)

    clique

    2

    Waterloo, 2006 C. Faloutsos 63

    School of Computer Science

    Carnegie Mellon

    Densification Power Law

    U.S. Patents, citing each

    other

    1.66

    N(t)

    E(t)

    Waterloo, 2006 C. Faloutsos 64

    School of Computer Science

    Carnegie Mellon

    Densification Power Law

    Autonomous Systems

    1.18

    N(t)

    E(t)

    Waterloo, 2006 C. Faloutsos 65

    School of Computer Science

    Carnegie Mellon

    Densification Power Law

    ArXiv: authors & papers

    1.15

    N(t)

    E(t)

    Waterloo, 2006 C. Faloutsos 66

    School of Computer Science

    Carnegie Mellon

    Outline

    problems

    Fractals

    Solutions

    Discussion

    what else can they solve?

    how frequent are fractals?

  • 12

    Waterloo, 2006 C. Faloutsos 67

    School of Computer Science

    Carnegie Mellon

    What else can they solve?

    separability [KDD02]

    forecasting [CIKM02]

    dimensionality reduction [SBBD00]

    non-linear axis scaling [KDD02]

    disk trace modeling [PEVA02]

    selectivity of spatial/multimedia queries

    [PODS94, VLDB95, ICDE00]

    ...

    spatial d.m.

    Ed, Ihab, Tamer

    Waterloo, 2006 C. Faloutsos 68

    School of Computer Science

    Carnegie Mellon

    Problem #3 - spatial d.m.

    Galaxies (Sloan Digital Sky Survey w/ B.

    Nichol)

    -1

    -0.8

    -0.6

    -0.4

    -0.2

    0

    0.2

    0.4

    0.6

    0.8

    1

    -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

    "elliptical.cut.dat""spiral.cut.dat"

    - spiral and elliptical

    galaxies

    - patterns? (not Gaussian; not

    uniform)

    -attraction/repulsion?

    - separability??

    Waterloo, 2006 C. Faloutsos 69

    School of Computer Science

    Carnegie Mellon

    Solution#3: spatial d.m.

    1

    10

    100

    1000

    10000

    100000

    1e+06

    1e+07

    1e+08

    1e+09

    1e+10

    1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100

    "ell-ell.points.ns""spi-spi.points.ns"

    "spi.dat-ell.dat.points"

    log(r)

    log(#pairs within

  • 13

    Waterloo, 2006 C. Faloutsos 73

    School of Computer Science

    Carnegie Mellon

    What else can they solve?

    separability [KDD02]

    forecasting [CIKM02]

    dimensionality reduction [SBBD00]

    non-linear axis scaling [KDD02]

    disk trace modeling [PEVA02]

    selectivity of spatial/multimedia queries

    [PODS94, VLDB95, ICDE00]

    ...

    Waterloo, 2006 C. Faloutsos 74

    School of Computer Science

    Carnegie Mellon

    Problem#4: dim. reduction

    given attributes x1, ... xn possibly, non-linearly correlated

    drop the useless ones mpg

    cc

    Waterloo, 2006 C. Faloutsos 75

    School of Computer Science

    Carnegie Mellon

    Problem#4: dim. reduction

    given attributes x1, ... xn possibly, non-linearly correlated

    drop the useless ones

    (Q: why?

    A: to avoid the dimensionality curse)

    Solution: keep on dropping attributes, untilthe f.d. changes! [w/ Traina+, SBBD00]

    mpg

    cc

    Waterloo, 2006 C. Faloutsos 76

    School of Computer Science

    Carnegie Mellon

    Outline

    problems

    Fractals

    Solutions

    Discussion

    what else can they solve?

    how frequent are fractals?

    Waterloo, 2006 C. Faloutsos 77

    School of Computer Science

    Carnegie Mellon

    Fractals & power laws:

    appear in numerous settings:

    medical

    geographical / geological

    social

    computer-system related

    Waterloo, 2006 C. Faloutsos 78

    School of Computer Science

    Carnegie Mellon

    Fractals: Brain scans

    brain-scans

    7

    8

    9

    10

    11

    12

    13

    14

    15

    16

    4 4.5 5 5.5 6 6.5 7

    log

    2(

    octr

    ee

    le

    ave

    s)

    level k

    "oct-count"x*2.63871-2.9122

    octree levels

    Log(#octants)

    2.63 =

    fd

  • 14

    Waterloo, 2006 C. Faloutsos 79

    School of Computer Science

    Carnegie Mellon

    More fractals

    periphery of malignant tumors: ~1.5

    benign: ~1.3

    [Burdet+]

    Waterloo, 2006 C. Faloutsos 80

    School of Computer Science

    Carnegie Mellon

    More fractals:

    cardiovascular system: 3 (!) lungs: ~2.9

    Waterloo, 2006 C. Faloutsos 81

    School of Computer Science

    Carnegie Mellon

    Fractals & power laws:

    appear in numerous settings:

    medical

    geographical / geological

    social

    computer-system related

    Waterloo, 2006 C. Faloutsos 82

    School of Computer Science

    Carnegie Mellon

    More fractals:

    Coastlines: 1.2-1.58

    1 1.1

    1.3

    Waterloo, 2006 C. Faloutsos 83

    School of Computer Science

    Carnegie Mellon

    Waterloo, 2006 C. Faloutsos 84

    School of Computer Science

    Carnegie Mellon

    More fractals:

    the fractal dimension for the Amazon river

    is 1.85 (Nile: 1.4)

    [ems.gphys.unc.edu/nonlinear/fractals/examples.html]

  • 15

    Waterloo, 2006 C. Faloutsos 85

    School of Computer Science

    Carnegie Mellon

    More fractals:

    the fractal dimension for the Amazon river

    is 1.85 (Nile: 1.4)

    [ems.gphys.unc.edu/nonlinear/fractals/examples.html]

    Waterloo, 2006 C. Faloutsos 86

    School of Computer Science

    Carnegie Mellon

    Cross-roads of

    Montgomery county:

    any rules?

    GIS points

    Waterloo, 2006 C. Faloutsos 87

    School of Computer Science

    Carnegie Mellon

    GIS

    A: self-similarity:

    intrinsic dim. = 1.51

    log( r )

    log(#pairs(within = area))

    log(area)

  • 16

    Waterloo, 2006 C. Faloutsos 91

    School of Computer Science

    Carnegie Mellon

    More power laws: Korcak

    1

    10

    100

    1000

    1 10 100 1000 10000 100000

    Num

    ber

    of is

    lands

    Area

    Area distribution of Japan Archipelago

    Japan islands;

    area vs cumulative

    count (log-log axes) log(area)

    log(count( >= area))

    Waterloo, 2006 C. Faloutsos 92

    School of Computer Science

    Carnegie Mellon

    More power laws

    Energy of earthquakes (Gutenberg-Richter

    law) [simscience.org]

    log(count)

    Magnitude = log(energy)day

    Energy

    released

    Waterloo, 2006 C. Faloutsos 93

    School of Computer Science

    Carnegie Mellon

    Fractals & power laws:

    appear in numerous settings:

    medical

    geographical / geological

    social

    computer-system related

    Waterloo, 2006 C. Faloutsos 94

    School of Computer Science

    Carnegie Mellon

    A famous power law: Zipfs

    law

    Bible - rank vs.

    frequency (log-log)

    log(rank)

    log(freq)

    a

    the

    Rank/frequency plot

    Waterloo, 2006 C. Faloutsos 95

    School of Computer Science

    Carnegie Mellon

    TELCO data

    # of service units

    count of

    customers

    best customer

    Waterloo, 2006 C. Faloutsos 96

    School of Computer Science

    Carnegie Mellon

    SALES data store#96

    # units sold

    count of

    products

    aspirin

  • 17

    Waterloo, 2006 C. Faloutsos 97

    School of Computer Science

    Carnegie Mellon

    Olympic medals (Sidney00,

    Athens04):

    log( rank)

    log(#medals)

    0

    0.5

    1

    1.5

    2

    2.5

    0 0.5 1 1.5 2

    athens

    sidney

    Waterloo, 2006 C. Faloutsos 98

    School of Computer Science

    Carnegie Mellon

    Olympic medals (Sidney00,

    Athens04):

    log( rank)

    log(#medals)

    0

    0.5

    1

    1.5

    2

    2.5

    0 0.5 1 1.5 2

    athens

    sidney

    Waterloo, 2006 C. Faloutsos 99

    School of Computer Science

    Carnegie Mellon

    Even more power laws:

    Income distribution (Paretos law)

    size of firms

    publication counts (Lotkas law)

    Waterloo, 2006 C. Faloutsos 100

    School of Computer Science

    Carnegie Mellon

    Even more power laws:

    library science (Lotkas law of publication

    count); and citation counts:

    (citeseer.nj.nec.com 6/2001)

    1

    10

    100

    100 1000 10000

    log c

    ount

    log # citations

    cited.pdf

    log(#citations)

    log(count)

    Ullman

    Waterloo, 2006 C. Faloutsos 101

    School of Computer Science

    Carnegie Mellon

    Even more power laws:

    web hit counts [w/ A. Montgomery]

    Web Site Traffic

    log(freq)

    log(count)

    Zipf

    yahoo.com

    Waterloo, 2006 C. Faloutsos 102

    School of Computer Science

    Carnegie Mellon

    Fractals & power laws:

    appear in numerous settings:

    medical

    geographical / geological

    social

    computer-system related

  • 18

    Waterloo, 2006 C. Faloutsos 103

    School of Computer Science

    Carnegie Mellon

    Power laws, contd

    In- and out-degree distribution of web sites

    [Barabasi], [IBM-CLEVER]

    log indegree

    - log(freq)

    from [Ravi Kumar,

    Prabhakar Raghavan,

    Sridhar Rajagopalan,

    Andrew Tomkins ]

    Waterloo, 2006 C. Faloutsos 104

    School of Computer Science

    Carnegie Mellon

    Power laws, contd

    In- and out-degree distribution of web sites

    [Barabasi], [IBM-CLEVER]

    log indegree

    log(freq)

    from [Ravi Kumar,

    Prabhakar Raghavan,

    Sridhar Rajagopalan,

    Andrew Tomkins ]

    Waterloo, 2006 C. Faloutsos 105

    School of Computer Science

    Carnegie Mellon

    Power laws, contd

    In- and out-degree distribution of web sites

    [Barabasi], [IBM-CLEVER]

    log indegree

    log(freq)

    Q: how can we use

    these power laws?

    Waterloo, 2006 C. Faloutsos 106

    School of Computer Science

    Carnegie Mellon

    Foiled by power law

    [Broder+, WWW00]

    (log) in-degree

    (log) count

    Waterloo, 2006 C. Faloutsos 107

    School of Computer Science

    Carnegie Mellon

    Foiled by power law

    [Broder+, WWW00]

    The anomalous bump at 120

    on the x-axis

    is due a large clique

    formed by a single spammer

    (log) in-degree

    (log) count

    Waterloo, 2006 C. Faloutsos 108

    School of Computer Science

    Carnegie Mellon

    Power laws, contd

    In- and out-degree distribution of web sites

    [Barabasi], [IBM-CLEVER]

    length of file transfers [Crovella+Bestavros

    96]

    duration of UNIX jobs

  • 19

    Waterloo, 2006 C. Faloutsos 109

    School of Computer Science

    Carnegie Mellon

    Additional projects

    Find anomalies in traffic matrices [underreview]

    Find correlations in sensor/stream data[VLDB05]

    Chlorine measurements, with Civ. Eng.

    temperature measurements (INTEL/MIT)

    Virus propagation (SIS, SIR) [Wang+, 03]

    Graph partitioning [Chakrabarti+, KDD04]

    Waterloo, 2006 C. Faloutsos 110

    School of Computer Science

    Carnegie Mellon

    Conclusions

    Fascinating problems in Data Mining: find

    patterns in

    sensors/streams

    graphs/networks

    Waterloo, 2006 C. Faloutsos 111

    School of Computer Science

    Carnegie Mellon

    Conclusions - contd

    New tools for Data Mining: self-similarity &

    power laws: appear in many cases

    Bad news:

    lead to skewed distributions

    (no Gaussian, Poisson,

    uniformity, independence,

    mean, variance)

    Good news:

    correlation integralfor separability

    rank/frequency plots

    80-20 (multifractals) (Hurst exponent,

    strange attractors,

    renormalization theory,

    ++)

    Waterloo, 2006 C. Faloutsos 112

    School of Computer Science

    Carnegie Mellon

    Resources

    Manfred Schroeder Chaos, Fractals and

    Power Laws, 1991

    Waterloo, 2006 C. Faloutsos 113

    School of Computer Science

    Carnegie Mellon

    References

    [vldb95] Alberto Belussi and Christos Faloutsos,Estimating the Selectivity of Spatial Queries Using the`Correlation' Fractal Dimension Proc. of VLDB, p. 299-310, 1995

    [Broder+00] Andrei Broder, Ravi Kumar , FarzinMaghoul1, Prabhakar Raghavan , Sridhar Rajagopalan ,

    Raymie Stata, Andrew Tomkins , Janet Wiener, Graphstructure in the web , WWW00

    M. Crovella and A. Bestavros, Self similarity in Worldwide web traffic: Evidence and possible causes ,SIGMETRICS 96.

    Waterloo, 2006 C. Faloutsos 114

    School of Computer Science

    Carnegie Mellon

    References

    J. Considine, F. Li, G. Kollios and J. Byers,

    Approximate Aggregation Techniques for Sensor

    Databases (ICDE04, best paper award).

    [pods94] Christos Faloutsos and Ibrahim Kamel,

    Beyond Uniformity and Independence: Analysis of

    R-trees Using the Concept of Fractal Dimension,

    PODS, Minneapolis, MN, May 24-26, 1994, pp.

    4-13

  • 20

    Waterloo, 2006 C. Faloutsos 115

    School of Computer Science

    Carnegie Mellon

    References

    [vldb96] Christos Faloutsos, Yossi Matias and Avi

    Silberschatz, Modeling Skewed Distributions

    Using Multifractals and the `80-20 Law Conf. on

    Very Large Data Bases (VLDB), Bombay, India,

    Sept. 1996.

    [sigmod2000] Christos Faloutsos, Bernhard

    Seeger, Agma J. M. Traina and Caetano Traina Jr.,

    Spatial Join Selectivity Using Power Laws,

    SIGMOD 2000

    Waterloo, 2006 C. Faloutsos 116

    School of Computer Science

    Carnegie Mellon

    References

    [vldb96] Christos Faloutsos and Volker Gaede

    Analysis of the Z-Ordering Method Using the

    Hausdorff Fractal Dimension VLD, Bombay,

    India, Sept. 1996

    [sigcomm99] Michalis Faloutsos, Petros Faloutsos

    and Christos Faloutsos, What does the Internet

    look like? Empirical Laws of the Internet

    Topology, SIGCOMM 1999

    Waterloo, 2006 C. Faloutsos 117

    School of Computer Science

    Carnegie Mellon

    References

    [Leskovec 05] Jure Leskovec, Jon M. Kleinberg,

    Christos Faloutsos: Graphs over time:

    densification laws, shrinking diameters and

    possible explanations. KDD 2005: 177-187

    Waterloo, 2006 C. Faloutsos 118

    School of Computer Science

    Carnegie Mellon

    References

    [ieeeTN94] W. E. Leland, M.S. Taqqu, W.

    Willinger, D.V. Wilson, On the Self-Similar

    Nature of Ethernet Traffic, IEEE Transactions on

    Networking, 2, 1, pp 1-15, Feb. 1994.

    [brite] Alberto Medina, Anukool Lakhina, Ibrahim

    Matta, and John Byers. BRITE: An Approach to

    Universal Topology Generation. MASCOTS '01

    Waterloo, 2006 C. Faloutsos 119

    School of Computer Science

    Carnegie Mellon

    References

    [icde99] Guido Proietti and Christos Faloutsos,

    I/O complexity for range queries on region data

    stored using an R-tree (ICDE99)

    Stan Sclaroff, Leonid Taycher and Marco La

    Cascia , "ImageRover: A content-based image

    browser for the world wide web" Proc. IEEE

    Workshop on Content-based Access of Image and

    Video Libraries, pp 2-9, 1997.

    Waterloo, 2006 C. Faloutsos 120

    School of Computer Science

    Carnegie Mellon

    References

    [kdd2001] Agma J. M. Traina, Caetano Traina Jr.,

    Spiros Papadimitriou and Christos Faloutsos: Tri-

    plots: Scalable Tools for Multidimensional Data

    Mining, KDD 2001, San Francisco, CA.

  • 21

    Waterloo, 2006 C. Faloutsos 121

    School of Computer Science

    Carnegie Mellon

    Thank you!

    Contact info:

    christos cs.cmu.edu

    www. cs.cmu.edu /~christos

    (w/ papers, datasets, code for fractal dimension

    estimation, etc)