LLNL, Feb. '07 C. Faloutsos 1
School of Computer Science, Carnegie Mellon
Mining static and time-evolving graphs
Christos Faloutsos
Carnegie Mellon University
Overview
• Mining static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
  – Tensors + intrusion detection
  – Sparse graphs
• Other topics
  – Graph sampling
  – Graph generators
CePS
• w/ Hanghang Tong, KDD 2006
Center-Piece Subgraph (CePS)
• Given Q query nodes
• Find the center-piece
• Input of CePS
  – Q query nodes
  – Budget b
  – k: softAND coefficient
• App.
  – Social networks
  – Law enforcement, …
[Figure: three example graphs with query nodes A, B, C; budget b]
Challenges in CePS
• Q1: How to measure the importance?
• Q2: How to extract connection subgraph?
• Q3: How to do it efficiently?
An Illustrating Example
[Figure: example graph with 13 nodes, numbered 1 to 13]
• Start from node 1
• Move randomly to a neighbor
• With some probability p, return to node 1
• Score of node j = Prob(the random walk will finally stay at j)
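The walk described above can be simulated directly. The sketch below is only an illustration: the 4-node graph, the restart probability 0.3, the step count, and the function name `simulate_rwr` are all assumptions, not taken from the slides. It estimates each node's score as the fraction of steps a single long walk spends there.

```python
import random

def simulate_rwr(adj, start, restart_prob, steps, seed=0):
    """Estimate RWR scores as visit frequencies of one long simulated walk."""
    rng = random.Random(seed)
    visits = {node: 0 for node in adj}
    current = start
    for _ in range(steps):
        if rng.random() < restart_prob:
            current = start                      # return to the start node
        else:
            current = rng.choice(adj[current])   # move to a random neighbor
        visits[current] += 1
    return {node: count / steps for node, count in visits.items()}

# Hypothetical 4-node undirected graph (NOT the 13-node graph on the slide).
toy_graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
scores = simulate_rwr(toy_graph, start=1, restart_prob=0.3, steps=100_000)
```

On this toy graph the start node collects the largest share of visits, and the node farthest from it (node 4) the smallest, which is the "nearby nodes score higher" behavior the slides rely on.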
Individual Score Calculation
Individual score matrix (one row per node, one column per query node):
            Q1      Q2      Q3
Node 1    0.5767  0.0088  0.0088
Node 2    0.1235  0.0076  0.0076
Node 3    0.0283  0.0283  0.0283
Node 4    0.0076  0.1235  0.0076
Node 5    0.0088  0.5767  0.0088
Node 6    0.0076  0.0076  0.1235
Node 7    0.0088  0.0088  0.5767
Node 8    0.0333  0.0024  0.1260
Node 9    0.1260  0.0024  0.0333
Node 10   0.1260  0.0333  0.0024
Node 11   0.0333  0.1260  0.0024
Node 12   0.0024  0.1260  0.0333
Node 13   0.0024  0.0333  0.1260
[Figure: the example graph, each node labeled with its score for Q1]
AND: Combining Scores
• Q: How to combine scores?
• A: Multiply
• … = prob. that 3 random particles coincide on node j
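Written out, the multiplication rule looks as follows. The notation is an assumption made here for illustration: r(i, j) is the individual score of node j for query node i, and the Q particles are treated as independent.

```latex
r(\mathcal{Q}, j) \;=\; \prod_{i=1}^{Q} r(i, j)
```

For example, node 3 in the individual score matrix has 0.0283 for each of the three query nodes, so its AND score is 0.0283^3 ≈ 2.27 × 10^{-5}.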
K_SoftAnd: Combining Scores
Generalization: softAND
We want nodes close to k of the Q query nodes (k < Q).
Q: How to do that?
A: Prob(at least k out of the Q random particles meet each other at j)
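The "at least k out of Q" probability can be computed with a small dynamic program over the individual scores. This is a sketch under the assumption that the Q particles are independent; the function name `soft_and` is made up for this example, and the input values are the three scores of node 3 from the individual score matrix.

```python
def soft_and(probs, k):
    """Probability that at least k of the independent events in `probs` occur."""
    # dist[m] = probability that exactly m of the events seen so far occurred
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, mass in enumerate(dist):
            nxt[m] += mass * (1.0 - p)   # this particle is not at node j
            nxt[m + 1] += mass * p       # this particle is at node j
        dist = nxt
    return sum(dist[k:])

node3 = [0.0283, 0.0283, 0.0283]       # scores of node 3 for Q1, Q2, Q3
and_score = soft_and(node3, 3)         # k = Q: plain AND (product)
two_of_three = soft_and(node3, 2)      # 2_SoftAnd
or_score = soft_and(node3, 1)          # k = 1: OR
```

With k = Q this reduces to the product used for the AND query, and with k = 1 it is the usual OR, 1 - prod(1 - p). For node 3 the 2_SoftAnd value comes out at roughly 0.0024.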
AND query vs. K_SoftAnd query
[Figure: the example graph scored twice. Panels: AND query (node scores shown x 1e-4) and 2_SoftAnd query]
[Figure: the example graph with node scores for the 1_SoftAnd query]
1_SoftAnd query = OR query
Challenges in CePS
• Q1: How to measure the importance?
• Q2: How to extract connection subgraph?
• Q3: How to do it efficiently?
“Extract” Alg.
• Goal
  – Maximize total scores, and
  – ‘Appropriate’ connections
• How to… “Extract” Alg.
  – Dynamic programming
  – Greedy alg.
    • Pick up a promising node
    • Find the ‘best’ path
[Figure: two example graphs, with nodes numbered 1 to 16 and 1 to 13]
Challenges in CePS
• Q1: How to measure the importance?
• Q2: How to extract connection subgraph?
• Q3: How to do it efficiently?
Graph Partition: Efficiency Issue
• Straightforward way
  – Q linear systems
  – Cost linear in the # of edges
• Observation
  – Skewed dist.
• How to…
  – Graph partition
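One plausible reading of the graph-partition idea is: partition the graph once, and at query time work only on the partitions that contain a query node. The sketch below illustrates just that restriction step. The partition labels, the toy graph, and the function name `restrict_to_query_partitions` are hypothetical, and the slides do not say which partitioning method is used.

```python
def restrict_to_query_partitions(adj, part_of, query_nodes):
    """Keep only partitions containing a query node; drop edges that leave them."""
    wanted = {part_of[q] for q in query_nodes}
    keep = {n for n in adj if part_of[n] in wanted}
    return {n: [m for m in adj[n] if m in keep] for n in keep}

# Hypothetical graph: two triangles joined by the single edge 3-4.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
part_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}   # assumed pre-computed
small = restrict_to_query_partitions(adj, part_of, query_nodes=[1, 2])
```

Here both query nodes fall in partition "a", so the scores would be computed on a 3-node graph instead of the full 6-node one; the price is that the bridge edge 3-4 is ignored.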
Even better:
• We can correct for the deleted edges (Tong+, ICDM’06, best paper award)
• But let’s omit the details
Experimental Setup
• Dataset
  – DBLP/authorship
  – Author-Paper
  – 315k nodes
  – 1,800k edges
Experimental Setup
• We want to check
  – Do the goodness criteria make sense?
  – Does the “extract” alg. capture most of the important nodes/edges?
  – Efficiency
Case Study: AND query
[Figure: resulting subgraph for the AND query, with numeric edge weights. Names shown: R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Heikki Mannila, Christos Faloutsos, Padhraic Smyth, Corinna Cortes, Daryl Pregibon]
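2_SoftAnd query
[Figure: resulting subgraph for the 2_SoftAnd query, with numeric edge weights and two groups labeled “Statistic” and “database”. Names shown: R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Umeshwar Dayal, Bernhard Scholkopf, Peter L. Bartlett, Alex J. Smola]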
Running Time vs. Quality for Fast Ceps
[Figure: quality vs. running time]
• ~90% quality
• 6:1 speedup
Conclusion
• Q1: How to measure the importance?
• A1: RWR + K_SoftAnd
• Q2: How to find the connection subgraph?
• A2: “Extract” Alg.
• Q3: How to do it efficiently?
• A3: Graph Partition (Fast CePS)
  – ~90% quality
  – 6:1 speedup
Overview
• Mining static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
• Other topics
Random walk with restart
Ranking vector r_4 (query node: Node 4):
Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02
[Figure: the 12-node example graph, each node colored and labeled by its score]
• More red, more relevant
• Nearby nodes, higher scores
Computing RWR
[Figure: the 12-node example graph]
$$\mathbf{r}^{(t+1)} = 0.9\,\mathbf{W}\,\mathbf{r}^{(t)} + 0.1\,\mathbf{e}$$
• r: ranking vector (n x 1); the slide shows (0.13, 0.10, 0.13, 0.22, 0.13, 0.05, 0.05, 0.08, 0.04, 0.03, 0.04, 0.02) at both t and t+1
• W: adjacency matrix (n x n), each column normalized to sum to 1
• e: starting vector (n x 1), all 0 except a 1 for node 4
W =
   0   1/3  1/3  1/3   0    0    0    0    0    0    0    0
  1/3   0   1/3   0    0    0    0   1/4   0    0    0    0
  1/3  1/3   0   1/3   0    0    0    0    0    0    0    0
  1/3   0   1/3   0   1/4   0    0    0    0    0    0    0
   0    0    0   1/3   0   1/2  1/2  1/4   0    0    0    0
   0    0    0    0   1/4   0   1/2   0    0    0    0    0
   0    0    0    0   1/4  1/2   0    0    0    0    0    0
   0   1/3   0    0   1/4   0    0    0   1/2   0   1/3   0
   0    0    0    0    0    0    0   1/4   0   1/3   0    0
   0    0    0    0    0    0    0    0   1/2   0   1/3  1/2
   0    0    0    0    0    0    0   1/4   0   1/3   0   1/2
   0    0    0    0    0    0    0    0    0   1/3  1/3   0
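The iteration on this slide can be sketched in a few lines. The edge list below is read off the nonzero entries of the slide's matrix; the function name `rwr_power_iteration` and the fixed iteration count are assumptions of this example, and the code works on adjacency lists rather than an explicit matrix.

```python
def rwr_power_iteration(adj, start, c=0.9, iters=200):
    """Iterate r <- c * W r + (1 - c) * e, with W the column-normalized adjacency matrix."""
    r = {n: (1.0 if n == start else 0.0) for n in adj}
    for _ in range(iters):
        # the restart term (1 - c) * e puts mass only on the start node
        nxt = {n: ((1.0 - c) if n == start else 0.0) for n in adj}
        for src, nbrs in adj.items():
            share = c * r[src] / len(nbrs)   # column `src` of W spreads r[src] evenly
            for dst in nbrs:
                nxt[dst] += share
        r = nxt
    return r

# Undirected edges of the 12-node example, taken from the nonzeros of W.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 8), (3, 4), (4, 5), (5, 6), (5, 7),
         (5, 8), (6, 7), (8, 9), (8, 11), (9, 10), (10, 11), (10, 12), (11, 12)]
adj = {n: [] for n in range(1, 13)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

r = rwr_power_iteration(adj, start=4)
```

Each pass costs time proportional to the number of edges, which is why answering many queries this way ("on-the-fly") is slow on large graphs.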
Alternatives
• On-the-fly: precompute nothing -> slow
• Precompute a little, and adjust on-the-fly
• Precompute everything -> O(N*N) space
Computing RWR
[Figure: the 12-node example graph]
Break into ‘communities’
FastRWR
• Instead of ONE BIG (and dense) inverted matrix
• Several, smaller matrices, plus info about the ‘bridges’
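To see where the "one big inverted matrix" comes from, the fixed point of the RWR iteration can be solved in closed form. The derivation below is a sketch with assumed notation: c is the walk probability (0.9 on the earlier slide), W_1 … W_k are the within-community blocks of W, and the second line simply drops the bridge edges. How FastRWR accounts for the bridges is not spelled out on these slides.

```latex
\mathbf{r} = c\,\mathbf{W}\,\mathbf{r} + (1-c)\,\mathbf{e}
\;\;\Longrightarrow\;\;
\mathbf{r} = (1-c)\,(\mathbf{I} - c\,\mathbf{W})^{-1}\,\mathbf{e}

\mathbf{W} \approx
\begin{pmatrix} \mathbf{W}_1 & & \\ & \ddots & \\ & & \mathbf{W}_k \end{pmatrix}
\;\;\Longrightarrow\;\;
(\mathbf{I} - c\,\mathbf{W})^{-1} \approx
\begin{pmatrix} (\mathbf{I} - c\,\mathbf{W}_1)^{-1} & & \\ & \ddots & \\ & & (\mathbf{I} - c\,\mathbf{W}_k)^{-1} \end{pmatrix}
```

The full inverse is generally dense and needs O(N*N) space, while the k small inverses together need only the sum of the squared community sizes.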
Query Time vs. Pre-Compute Time
[Figure: log query time vs. log pre-compute time]
• Quality: 90%+
• On-line:
  – Up to 150x speedup
• Pre-computation:
  – Two orders of magnitude saving
Query Time vs. Pre-Storage
[Figure: log query time vs. log storage]
• Quality: 90%+
• On-line:
  – Up to 150x speedup
• Pre-storage:
  – Three orders of magnitude saving
Conclusion
• FastRWR
  – Good accuracy (90%+)
  – 150x speed-up: query time
  – Orders of magnitude saving: pre-compute & storage
Overview
• Mining static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
• Other topics
Best-effort Sub-Graph Matching, on Attributed Graphs
‘Best-effort’: problem dfn.
• Nodes have one (categorical) attribute
• Query: e.g., loop -> ‘money laundering’
[Figure: synthetic data]
‘Best-effort’: problem dfn.
[Figure: two panels: Loop-Query and Results]
DBLP dataset
• Authorship graph
  – Nodes: authors
  – Edges: # of co-authored papers
  – Attributes: conference and year
• ~300k nodes, ~1m edges
Line Query: Results
[Figure: line query over four People nodes with attributes STOC05, SIGMOD96, ICML93, ISBMS, followed by three matching results. Names shown: Moses Charikar, Surajit Chaudhuri, Usama M. Fayyad, Sidney Fels, Pietro Perona, Max Welling, Geoffrey E. Hinton, Gagan Aggarwal, Hector Garcia Molina, Sebastian Thrun, Gábor Székely, Wolfram Burgard, Hans Burkhardt, Haymo Kurz, James Davis]
Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.
Star-Query: Results
[Figure: star query over People nodes with attributes IAT, PODS, ISBMS, followed by matching results. Names shown: Li Yan Yuan, Xiaobo Li, Xianchang Wang, Huowang Chen, Lei Xu, Jia-Huai You, Haixun Wang, Reinhard Männer, Bing Liu, Zhong-Fei (Mark) Zhang, Philip S. Yu]
Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.
Loop-Query: Results
[Figure: loop query over People nodes with attributes ICML93, RECOMB00, INFOCOM00, KDD96, followed by a matching result. Names shown: Stan Matwin, Richard M. Karp, Scott Shenker, Haym Hirsh, Amir Ben Dor, Nathalie Japkowicz, Yonatan Aumann, Amy P. Felty, Andrew W Appel, Kai Lei]
P.I.T. Terrorist Relations
• Nodes: terrorist relationships
  – Attributes: Family, Contact, Colleague, Congregate
• Edges: two relationships share a common person
• ~1k nodes and ~8k edges
Star-Query: Results
[Figure: star query over the attributes Contact, Family, Colleague, Congregate, and matching results shown as numeric node ids]
Overview
• Mining static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
  – Tensors + intrusion detection
  – Other tools (MDL)
• Other topics
Tensors for time evolving graphs
• [Jimeng Sun+, KDD’06]
• [Jimeng Sun+, SDM’07]
• [CF, Kolda, Sun, SDM’07 tutorial]
Social network analysis
• Static: find community structures
• Dynamic: monitor community structure evolution; spot abnormal individuals; abnormal time-stamps
[Figure: Authors x Keywords data from 1990 to 2004, with DB and DM communities]
Network Forensics
• Directional network flows
• A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
• Task: identify abnormal traffic patterns and find out the cause
[Figure: source x destination traffic matrices, panels: normal traffic and abnormal traffic]
Tensors - outline
• Motivation
• Main ideas
• Experiments
Static case
• For a timestamp, data can be modeled using a tensor (matrix == 2-mode tensor)
[Figure: Location x Type matrix at Time = 0; types shown: temperature, light]
Dynamic case: Tensor streams
[Figure: Location x Type matrix at Time = 0; types shown: temperature, light]
Dynamic Data model: Tensor streams
• Streams come with structure
  – (time, location, sensor-modality)
  – (time, host-id, measurement-type)
[Figure: tensor stream with modes time, Location, Type; example cell: (Jimeng’s Desk, light)]
What is the factor?
• Factor is a set of 1D summaries
• Multi-linear approximation on all aspects
[Figure: 1st factor, one 1D summary per mode: Location (close to window vs. away from window), Type, Time (day vs. night)]
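A single factor, read as "one 1D summary per mode", approximates the data by the outer product of those summaries times a scaling factor. The sketch below builds such a rank-one tensor with plain nested lists; all numbers and the function name `rank_one_tensor` are made up for illustration and are not the values from the sensor experiment.

```python
def rank_one_tensor(scale, location, sensor_type, time):
    """Build scale * (location outer sensor_type outer time) as nested lists."""
    return [[[scale * a * b * c for c in time] for b in sensor_type] for a in location]

# Hypothetical 1D summaries, one per mode.
location = [1.0, 0.5]       # e.g. close to window, away from window
sensor_type = [1.0, 0.0]    # e.g. light, some other sensor type
time = [1.0, 0.0, 1.0]      # e.g. day, night, day
approx = rank_one_tensor(2.0, location, sensor_type, time)
```

Entry `approx[i][j][t]` depends on each mode only through its own summary, which is what makes the approximation multi-linear; a fuller decomposition would sum several such factors.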
Tensors - outline
• Motivation
• Main ideas
• Experiments
WTA on real sensor data
• 1st factor consists of the main trends:
  – Daily periodicity on time
  – Uniform on all locations
  – Temp, Light and Volt are positively correlated with each other, and negatively correlated with Humid
[Figure: 1st factor (scaling factor 250), panels: type, location, time]
WTA on real sensor data (cont.)
• 2nd factor captures an atypical trend:
  – Uniform across all time
  – Concentrated on 3 locations
  – Mainly due to voltage
• Interpretation: two sensors have low battery, and the other one has high battery.
[Figure: 2nd factor (scaling factor 154), panels: type, location, time]
Application 1: Multiway latent semantic indexing (LSI)
• Projection matrices specify the clusters
• Core tensors give cluster activation level
[Figure: authors x keyword data from 1990 to 2004, projected with U_authors and U_keyword onto DB and DM clusters; labels shown: Michael Stonebraker, Philip Yu, Query, Pattern]
Bibliographic data (DBLP)
• Papers from VLDB and KDD conferences
• Construct 2nd order tensors with yearly windows with <author, keywords>
• Each tensor: 4584 x 3741
• 11 timestamps (years)
Multiway LSI
Authors | Keywords | Year
michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
surajit chaudhuri, mitch cherniack, michael stonebraker, ugur çetintemel | distribut, systems, view, storage, servic, process, cache | 2004
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004
• Two groups are correctly identified: Databases (DB) and Data mining (DM)
• People and concepts are drifting over time
Application 2:Network Anomaly Detection
• Anomaly detection
  – Reconstruction error driven
  – Multiple resolution
• Data
  – TCP flows collected at CMU backbone
  – Raw data 500GB with compression
  – Construct 3rd order tensors with hourly windows with <source, destination, port #>
  – 1200 timestamps (hours)
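"Reconstruction error driven" can be illustrated on a much smaller scale. The sketch below is a deliberate simplification and not the method from the talk: each hourly window is a short vector instead of a 3rd order tensor, the learned model is a single unit-norm direction, and the threshold 0.5 is an arbitrary assumption. The function names are hypothetical.

```python
import math

def reconstruction_error(x, basis):
    """Relative error ||x - (x.u) u|| / ||x|| for a unit-norm direction u = basis."""
    coeff = sum(a * b for a, b in zip(x, basis))
    residual = [a - coeff * b for a, b in zip(x, basis)]
    norm_x = math.sqrt(sum(a * a for a in x))
    return math.sqrt(sum(d * d for d in residual)) / norm_x

def flag_anomalies(windows, basis, threshold):
    """Return the indices of windows the model reconstructs poorly."""
    return [t for t, x in enumerate(windows)
            if reconstruction_error(x, basis) > threshold]

basis = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]   # assumed "normal traffic" pattern
windows = [[10.0, 10.0, 0.0],    # hour 0: matches the pattern
           [9.0, 11.0, 0.5],     # hour 1: close to the pattern
           [1.0, 1.0, 30.0]]     # hour 2: mass where the model expects none
abnormal_hours = flag_anomalies(windows, basis, threshold=0.5)
```

Only hour 2 is flagged: its traffic lies almost entirely outside the subspace that summarizes normal behavior, so the model cannot reconstruct it.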
Network anomaly detection
• Identify when and where anomalies occurred.
• The prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin).
[Figure: error vs. time (hour); source x destination matrices for an abnormal and a normal window, with scanners marked]
Computational cost
[Figure: two panels: 3rd order network tensor, 2nd order DBLP tensor]
• OTA is the offline tensor analysis
• Performance metric: CPU time (sec)
• Observations:
  – DTA and STA are orders of magnitude faster than OTA
  – The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)
Accuracy comparison
• Performance metric: the ratio of reconstruction error between DTA/STA and OTA; fixing the error of OTA to 20%
• Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes.
[Figure: two panels: 3rd order network tensor, 2nd order DBLP tensor]
InteMon: intelligent monitoring system on large clusters
[VLDB06 demo]
[Operating System Review 06]
Case 1: Environmental Monitoring
• Abnormal dehumidification and reheating cycle is identified
[Figure: two plots: Temperature and Humidity]
Overview
• Mining static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
  – Tensors + intrusion detection
  – Other tools (MDL)
• Other topics
Parameter-free mining
• Using MDL, to
  – Find ‘natural’ communities
  – Find ‘natural’ cut-points
• (under submission)
MDL mining on time-evolving graph (Enron emails)
Overview
• Mining Static graphs
• Mining time-evolving graphs
• Other topics
  – Graph sampling
  – Graph generators
Realistic graph generation
• Kronecker graphs [Leskovec+, PKDD’05]
• [Leskovec+, under review]
Why fit graph models?
• Parameters tell us about the structure of a graph
• Extrapolation: given a graph today, how will it look in a year?
• Sampling: can I get a smaller graph with similar properties?
• Anonymization: instead of releasing the real graph (e.g., email network), we can release a synthetic version of it
Experiments on real AS graph
[Figure: four panels: degree distribution, hop plot, network value, adjacency matrix eigenvalues]
Problem Definition
• Given a growing graph with count of nodes N1, N2, …
• Generate a realistic sequence of graphs that will obey all the patterns
• Idea: Self-similarity
  – Leads to power laws
  – Communities within communities
  – …
Recursive Graph Generation
• There are many obvious (but wrong) ways
  – Does not obey Densification Power Law
  – Has increasing diameter
• Kronecker Product is exactly what we need
[Figure: initial graph and its recursive expansion]
Kronecker Product – a Graph
[Figure: a graph and its adjacency matrix; intermediate stage; resulting adjacency matrix]
Kronecker Product – a Graph
• Continuing to multiply with G1, we obtain G4 and so on …
[Figure: G4 adjacency matrix]
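The repeated multiplication can be sketched with plain nested lists. The 3-node initiator `g1` below is a hypothetical example, not necessarily the one drawn on the slides, and the function names are made up for this illustration.

```python
def kronecker(a, b):
    """Kronecker product of two square adjacency matrices given as nested lists."""
    n, m = len(a), len(b)
    # entry (i, j) is a[i // m][j // m] * b[i % m][j % m]
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def kronecker_power(g1, k):
    """G_k = G_1 (x) G_1 (x) ... (x) G_1, with k factors."""
    g = g1
    for _ in range(k - 1):
        g = kronecker(g, g1)
    return g

g1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]   # hypothetical initiator: 3 nodes, 7 nonzero entries
g2 = kronecker_power(g1, 2)   # 9 x 9
g4 = kronecker_power(g1, 4)   # 81 x 81
```

With this initiator, G_k has 3^k nodes and 7^k nonzero entries, so the number of edges grows as a power of the number of nodes.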
Conclusions
• Static graphs: Random Walks, ‘CePS’, best-effort sub-graph matching
• Dynamic graphs: Tensors (intrusion/change detection)
• Graph generation: Kronecker
References
• Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong.
• Hanghang Tong and Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.
• Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL ("Best Research Paper" award).
• Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. ECML/PKDD 2005, Porto, Portugal.
• Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond Streams and Graphs: Dynamic Tensor Analysis. KDD 2006, Philadelphia, PA.
• Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM 2007, Minneapolis, MN.