jure leskovec, cornell/stanford university - cs.cmu.edujure/pub/ncp-cornell-oct08.pdfthe network...
Jure Leskovec, Cornell/Stanford University
Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research
Network: an interaction graph:
• Nodes represent “entities”
• Edges represent “interaction” between pairs of entities
Are there natural clusters, communities, partitions, etc.?
Concept-based clusters, link-based clusters, density-based clusters, …
Bid, click and impression information for “keyword x advertiser” pair
Mine information at query-time to provide new ads
Maximize CTR, RPS, advertiser ROI
Find micro-markets by partitioning the “query x advertiser” graph:
[Figure: “query x advertiser” graph; axes: advertiser, query]
• Linear (low-rank) methods: If Gaussian, then low-rank space is good
• Kernel (non-linear) methods: If low-dimensional manifold, then kernels are good
• Hierarchical methods: Top-down and bottom-up – common in social sciences
• Graph partitioning methods: Define “edge counting” metric – conductance, expansion, modularity, etc. – and optimize!
“It is a matter of common experience that communities exist in networks ... Although not precisely defined, communities are usually thought of as sets of nodes with better connections amongst its members than with the rest of the world.”
Communities:
Sets of nodes with lots of connections inside and few to outside (the rest of the network)
Assumption:
Networks are (hierarchically) composed of communities
Communities, clusters, groups, modules
Question: Are large networks really like this?
Hierarchical community structure
Let A be the adjacency matrix of G=(V,E).
The conductance of a set S of nodes is:
The Network Community Profile (NCP) plot of the graph is:
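The two formulas on this slide did not survive extraction. A reconstruction, assuming the standard definition of conductance (cut weight over the smaller of the two volumes) and writing A(S) for the volume of S, would read as follows. The later slides use the simplified score “# edges cut / # edges inside”, so the exact form shown in the talk may differ.

```latex
\Phi(S) = \frac{\sum_{i \in S} \sum_{j \notin S} A_{ij}}{\min\{A(S),\, A(\bar{S})\}},
\qquad A(S) = \sum_{i \in S} \sum_{j \in V} A_{ij}

\Phi(k) = \min_{S \subset V,\; |S| = k} \Phi(S)
```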
How community-like is a set of nodes?
[Figure: node sets S and S’]
Score: Φ(S) = # edges cut / # edges inside
What is the “best” community of 5 nodes?
Bad community
Φ=5/6 = 0.83
Bad community
Φ=5/7 ≈ 0.71
Better community
Φ=2/5 = 0.4
Best community
Φ=2/8 = 0.25
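A minimal sketch of how the score on these slides can be computed. The toy graph (two triangles joined by one edge) and all function names are hypothetical, not taken from the talk. Both the slide's simplified score and the standard conductance (cut over the smaller volume) are shown, since the two can rank sets differently.

```python
def cut_and_inside(edges, S):
    """Count edges leaving S and edges with both endpoints in S."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    return cut, inside


def slide_score(edges, S):
    """Score as written on the slide: # edges cut / # edges inside."""
    cut, inside = cut_and_inside(edges, S)
    return cut / inside if inside else float("inf")


def conductance(edges, S):
    """Standard conductance: cut / min(vol(S), vol(rest)), vol = sum of degrees."""
    cut, inside = cut_and_inside(edges, S)
    vol_S = 2 * inside + cut              # each inside edge adds 2, each cut edge adds 1
    vol_rest = 2 * len(edges) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else float("inf")


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

for S in ({0, 1, 2}, {0, 1}, {2, 3}):
    print(sorted(S), slide_score(EDGES, S), conductance(EDGES, S))
```

Lower is better under either score: the triangle {0, 1, 2} has one cut edge and three inside edges, so it beats the other two sets.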
Network community profile (NCP) plot
Plot the score of best community of size k
[Figure: NCP plot; x-axis: community size, log k; y-axis: log Φ(k); marked points: Φ(5)=0.25 at k=5, Φ(7)=0.18 at k=7]
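The definition “score of best community of size k” can be made concrete with a brute-force sketch. It assumes the standard conductance as the score and uses a hypothetical toy graph. Enumerating all subsets is exponential, so this only works for tiny graphs and is not how large networks are profiled.

```python
from itertools import combinations


def conductance(adj, S):
    """cut(S) / min(vol(S), vol(V \\ S)) for an undirected graph given as adjacency sets."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    denom = min(vol_S, vol_total - vol_S)
    return cut / denom if denom else float("inf")


def ncp(adj, max_size):
    """Brute-force NCP: for each size k, the minimum conductance over all k-node sets."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, set(c)) for c in combinations(nodes, k))
        for k in range(1, max_size + 1)
    }


# Hypothetical toy graph: two triangles joined by one edge.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

profile = ncp(ADJ, max_size=3)
for k, phi in profile.items():
    print(k, phi)
```

On this graph the profile drops as k grows to 3, because a whole triangle can be cut off with a single edge.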
Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.
• Spectral (quadratic approx): confuses “long paths” with “deep cuts”
• Multi-commodity flow (log(n) approx): difficulty with expanders
• SDP (sqrt(log(n)) approx): best in theory
• Metis (multi-resolution heuristic): common in practice
• X+MQI: post-processing step on, e.g., MQI of Metis
Local Spectral - connected and tighter sets (empirically)
Metis+MQI - best conductance (empirically)
We are not interested in partitions per se, but in probing network structure
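To illustrate the spectral family of probes, here is a minimal sketch of a global spectral sweep cut in pure Python. It is not the Local Spectral or Metis+MQI code used in the talk. The power-iteration scheme, the function name `sweep_cut`, and the toy graph are assumptions made for illustration, and the graph is assumed connected with no self-loops.

```python
def sweep_cut(adj, iterations=200):
    """Approximate the second eigenvector of the lazy random walk, then sweep over it."""
    nodes = sorted(adj)
    deg = {u: len(adj[u]) for u in nodes}
    total_vol = sum(deg.values())
    x = {u: float(i) for i, u in enumerate(nodes)}    # arbitrary non-constant start
    for _ in range(iterations):
        # One step of the lazy walk: stay with prob. 1/2, else average over neighbors.
        x = {u: 0.5 * x[u] + 0.5 * sum(x[v] for v in adj[u]) / deg[u] for u in nodes}
        # Remove the trivial (constant) eigenvector using the degree-weighted mean.
        mean = sum(deg[u] * x[u] for u in nodes) / total_vol
        x = {u: x[u] - mean for u in nodes}
        scale = max(abs(val) for val in x.values())
        x = {u: val / scale for u, val in x.items()}
    # Sweep: order nodes by their embedding value and score every prefix set.
    order = sorted(nodes, key=lambda u: x[u])
    best_set, best_phi = None, float("inf")
    S, vol_S, cut = set(), 0, 0
    for u in order[:-1]:
        S.add(u)
        vol_S += deg[u]
        already_inside = sum(1 for v in adj[u] if v in S)
        cut += deg[u] - 2 * already_inside            # edges to S stop being cut
        phi = cut / min(vol_S, total_vol - vol_S)
        if phi < best_phi:
            best_set, best_phi = set(S), phi
    return best_set, best_phi


# Hypothetical toy graph: two triangles joined by one edge.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

best_set, best_phi = sweep_cut(ADJ)
print(sorted(best_set), best_phi)
```

The sweep only considers sets that are prefixes of a one-dimensional ordering, which is one way to see why spectral cuts can be "good" without being optimal.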
d-dimensional meshes
California road network
Zachary’s university karate club social network
During the study, the club split into two
The split (squares vs. circles) corresponds to cut B
Collaborations between scientists in Networks [Newman, 2005]
[Ravasz & Barabasi, 2003]
[Clauset, Moore & Newman, 2008]
Previously researchers examined community structure of small networks (~100 nodes)
We examined more than 100 different large networks
Large networks look very different!
Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)
[Figure: NCP plot; x-axis: k (community size); y-axis: Φ(k) (conductance)]
Better and better communities
Communities get worse and worse
Best community has ~100 nodes
Definition: A whisker is a maximal set of nodes connected to the rest of the network by a single edge
Whiskers are responsible for downward slope of NCP plot
[Figure: NCP plot with the largest whisker marked]
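The whisker definition can be turned into a short procedure. The sketch below assumes the core is the largest piece left after deleting every bridge (single edge whose removal disconnects the graph), so whiskers are whatever hangs off that core. The helper names and the toy graph are hypothetical. The graph is assumed to be simple, undirected and connected.

```python
from collections import defaultdict


def find_bridges(adj):
    """Return all bridge edges (as frozensets) using a DFS low-link search."""
    disc, low, bridges = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:          # nothing below v reaches u or above
                    bridges.add(frozenset((u, v)))

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return bridges


def components(adj, allowed, skip_edges=frozenset()):
    """Connected components among `allowed` nodes, ignoring edges in `skip_edges`."""
    seen, comps = set(), []
    for start in allowed:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in allowed and v not in seen and frozenset((u, v)) not in skip_edges:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def core_and_whiskers(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    # Assumption: the core is the largest 2-edge-connected piece.
    core = max(components(adj, nodes, find_bridges(adj)), key=len)
    # Each remaining component attaches to the core by exactly one edge.
    return core, components(adj, nodes - core)


# Hypothetical toy graph: a 4-cycle core, a path whisker {4,5}, a triangle whisker {6,7,8}.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4), (4, 5), (0, 6), (6, 7), (7, 8), (8, 6)]
core, whisker_sets = core_and_whiskers(EDGES)
print(sorted(core), sorted(sorted(w) for w in whisker_sets))
```

The triangle whisker in the toy graph is deliberately not a tree: a whisker only needs a single edge to the rest of the network, and may have any internal structure.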
Network structure: Core-periphery (jellyfish, octopus)
Whiskers are responsible for good communities
Denser and denser core of the network
Core contains ~60% of the nodes and ~80% of the edges
Each new edge inside the community costs more
Each node has twice as many children
NCP plot
Φ=1/3 ≈ 0.33
Φ=2/4 = 0.5
Φ=8/6 ≈ 1.3
Φ=64/14 ≈ 4.6
Whiskers:
Edge to cut
Whiskers in real networks are non-trivial (richer than trees)
Whiskers
Whiskers in real networks are larger than expected based on density and degree sequence
Nothing happens! Now we have 2-edge-connected whiskers to deal with. This indicates the recursiveness of our core-periphery structure: as we remove the periphery, the core itself breaks into a core and a periphery
What if we allow cuts that give disconnected communities?
• Cut all whiskers
• Compose communities out of whiskers
• How good a “community” do we get?
[Figure: NCP plot for LiveJournal; curves: rewired network, Local Spectral, bag-of-whiskers, Metis+MQI]
Regularization properties: spectral embeddings stretch along directions in which the random walk mixes slowly
Resulting hyperplane cuts have “good” conductance, but may not be the optimal cuts
[Figure panels: spectral embedding; flow-based embedding]
Metis+MQI (red) gives sets with better conductance.
Local Spectral (blue) gives tighter and more well-rounded sets.
[Figure: axis label: ext/int; dots are connected clusters]
Two ca. 500-node communities from Local Spectral:
Two ca. 500-node communities from Metis+MQI:
... can be computed from:
Spectral embedding (independent of balance)
SDP-based methods (for volume-balanced partitions)
What is a good model that explains such network structure?
None of the existing models work
• Pref. attachment: Flat
• Small World: Down and Flat
• Geometric Pref. Attachment: Flat and Down
Note: Sparsity is the issue, not heavy tails per se. (Power laws with an exponent between 2 and 3 give us the appropriate sparsity)
Forest Fire [LKF05]: connections spread like a fire
• New node joins the network
• Selects a seed node
• Connects to some of its neighbors
• Continue recursively
Notes:
• Preferential attachment flavor: second neighbor is not uniform at random.
• Copying flavor: since we burn the seed’s neighbors.
• Hierarchical flavor: seed is parent.
• “Local” flavor: burn “near” (in a diffusion sense) the seed vertex.
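The Forest Fire steps above can be sketched as a small simulation. This is a simplified, undirected variant written for illustration only: the single `burn_prob` parameter, the rule that the new node links to every burned node, and the function name are assumptions, not the exact model of [LKF05].

```python
import random


def forest_fire_graph(n, burn_prob=0.35, seed=0):
    """Grow an undirected graph on n nodes with a simplified forest-fire process."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        ambassador = rng.randrange(new)           # seed node, chosen uniformly
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            u = frontier.pop()
            candidates = sorted(v for v in adj[u] if v not in burned)
            rng.shuffle(candidates)
            # Burn a geometrically distributed number of unburned neighbors.
            k = 0
            while k < len(candidates) and rng.random() < burn_prob:
                k += 1
            for v in candidates[:k]:
                burned.add(v)
                frontier.append(v)                # the fire continues recursively
        adj[new] = set()
        for v in burned:                          # link the new node to everything burned
            adj[new].add(v)
            adj[v].add(new)
    return adj


graph = forest_fire_graph(50, burn_prob=0.35, seed=1)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), "nodes,", num_edges, "edges")
```

Because the fire spreads along existing edges, a new node tends to attach near its seed, which matches the “local” and “copying” flavors listed in the notes.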
As a community grows, it blends into the core of the network
[Figure: NCP plot; curves: rewired network, bag of whiskers]
Whiskers:
Largest whisker has ~100 nodes
Whisker size is independent of network size
Core:
60% of the nodes, 80% of the edges
Core has little structure (hard to cut)
Still more structure than the random network
The Dunbar number
• 150 individuals is the maximum community size
• On-line communities have 60 members and break down at around 80
• Military, churches, divisions, etc.: all close to Dunbar's 150
Common bond vs. common identity theory
• Common bond communities (people are attached to individual community members) are smaller and more cohesive
• Common identity communities (people are attached to the group as a whole) are focused around a common interest and tend to be larger and more diverse
What edges “mean” and community identification
• Social networks: reasons an individual adds a link to a friend can vary enormously
• Citation networks or web graphs: links are more “expensive” and are more semantically uniform
Networks with “ground truth” communities:
• LiveJournal12: users create and explicitly join on-line groups
• DBLP co-authorships: publication venues can be viewed as communities
• Amazon product co-purchasing: each item belongs to one or more hierarchically organized categories, as defined by Amazon
• IMDB collaboration: countries of production and languages may be viewed as communities
[Figure: NCP plots for LiveJournal, DBLP, Amazon, IMDB; curves: rewired network, ground truth]
NCP plot is a way to analyze network community structure
Our results agree with previous work on small networks (people did not hit Dunbar’s limit)
But large networks are different:
Whiskers + Core (core-periphery) structure
Small, well-isolated communities blend into the core of the network as they grow
Assume a recursive Kronecker model. Fit it to G. We get K =
0.9 0.5
0.5 0.1
What does this tell about the network structure?
Core: 0.9 edges
Periphery: 0.1 edges
Between core and periphery: 0.5 edges
Core-periphery
No communities
No good cuts
As opposed to:
0.9 0.1
0.1 0.9
which gives a hierarchy
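To see why the fitted initiator reads as core-periphery, one can take its Kronecker power and compare expected degrees. A minimal sketch, assuming the usual stochastic Kronecker reading in which entry (i, j) of the k-th Kronecker power is the probability of edge (i, j). The helper names are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def kron_power(K, k):
    """k-th Kronecker power of the initiator matrix K (k >= 1)."""
    P = K
    for _ in range(k - 1):
        P = kron(P, K)
    return P


K_CORE_PERIPHERY = [[0.9, 0.5], [0.5, 0.1]]   # initiator from the slide
K_HIERARCHY = [[0.9, 0.1], [0.1, 0.9]]        # the contrasting initiator

P_cp = kron_power(K_CORE_PERIPHERY, 2)        # 4 x 4 edge-probability matrix
P_h = kron_power(K_HIERARCHY, 2)

# Expected degree of each node is the corresponding row sum.
deg_cp = [sum(row) for row in P_cp]
deg_h = [sum(row) for row in P_h]
print("core-periphery:", [round(d, 2) for d in deg_cp])
print("hierarchy:     ", [round(d, 2) for d in deg_h])
```

With the core-periphery initiator the expected degrees fall off from the first node (inside the dense 0.9 block at every level) to the last (inside the 0.1 block at every level). With the hierarchical initiator every node has the same expected degree, and the structure shows up in which pairs are linked instead.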