stanford university · pdf fileslope of the ncp corresponds to the

40
CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu

Upload: vanphuc

Post on 20-Mar-2018

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Stanford University  · PDF fileSlope of the NCP corresponds to the

CS224W: Social and Information Network AnalysisJure Leskovec  Stanford UniversityJure Leskovec, Stanford University

http://cs224w.stanford.edu

Page 2: Stanford University  · PDF fileSlope of the NCP corresponds to the

Non overlapping vs overlapping communities Non‐overlapping vs. overlapping  communities

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

Page 3: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Palla et al., ‘05]

A node belongs to many social circles A node belongs to many social circles

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3

Page 4: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Palla et al., ‘05]

Two nodes belong to the same community if theyTwo nodes belong to the same community if they can be connected through adjacent k‐cliques: k‐clique: Fully connected graph on k nodes

Adjacent k‐cliques: 4-cliqueAdjacent k cliques: overlap in k-1 nodes

k‐clique community Set of nodes that can be reached through a sequence of adjacent

adjacent3-cliques

sequence of adjacent k‐cliques

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4

Page 5: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Palla et al., ‘05]

Clique Percolation Method:Clique Percolation Method: Find maximal‐cliques (not k‐cliques!) Clique overlap matrix:q p Each clique is a node Connect two cliques if they overlap in at least k-1 nodesoverlap in at least k 1 nodes

Communities: Connected components of th li l t ithe clique overlap matrix

How to set k? Set k so that we get the “richest” (most widely distributed cluster sizes) community structure

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5

Page 6: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Palla et al., ‘05]

Start with graph g pand find maximal cliques

Create clique Create clique overlap matrix

Threshold the (1) Graph (2) Clique overlap 

matrix

matrix at value k‐1 If aij<k-1 set 0

Communities are Communities are the connected components of the thresholded matrix (3) Thresholdedthresholded matrix

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6

(3) Thresholdedmatrix at k=4

(4) Communities(connected components)

Page 7: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Palla et al., ‘07]

Communities in a “tiny” part of a phone 

ll   t k  f   calls network of 4 million users [Barabasi‐Palla, 2007]

11/10/2010 7Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 8: Stanford University  · PDF fileSlope of the NCP corresponds to the

Each node is a community Each node is a community Nodes are weighted for community sizecommunity size

Links are weighted for overlap sizeoverlap size

DIP “core” data base of protein interactions (S. cerevisiase, yeast)

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8

( y )

Page 9: Stanford University  · PDF fileSlope of the NCP corresponds to the

No nice way NP hard combinatorial problem No nice way, NP‐hard combinatorial problem Simple Algorithm: Start with max clique size s Start with max‐clique size s Choose node u, extract cliques of size s nodecliques of size s node u is member of Delete u and its edgesDelete u and its edges When graph is empty, s=s-1, restart on original graphrestart on original graph

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9

Page 10: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Palla et al., ‘05]

Finding cliques around u of size s: Finding cliques around u of size s: 2 sets A and B: Each node in B links to all nodes in A Each node in B links to all nodes in A Set A grows by moving nodes from B to it

Start with A={u} B={v: (u v)E}Start with A {u}, B {v: (u,v)E} Recursively move each possible 

v B to A and prune Bv B to A and prune B If B runs out of nodes before A reaches size s,  backtrack the recursion and try a different v

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10

Page 11: Stanford University  · PDF fileSlope of the NCP corresponds to the

Let’s rethink what weLet s rethink what weare doing… Given a network Want to find clusters!

Need to: Formalize the notion of a cluster Need to design an algorithmNeed to design an algorithm that will find sets of nodes that are “good” clusters

More generally: How to think about clusters in large networks?

1111/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 12: Stanford University  · PDF fileSlope of the NCP corresponds to the

What is a good cluster? SWhat is a good cluster? Many edges internally Few pointing outsideFew pointing outside

Formally, conductance:S’

Where: A(S)….volume

Small Φ(S) corresponds to good clusters

12Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 13: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘08]

Define:Network community profile (NCP) plotPlot the score of best community of size k

k=5 k=7

log Φ(k)Φ(5)=0.25

Φ(7)=0.18

13

Community size, log k

(7)

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 14: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘08]

Meshes grids dense random graphs: Meshes, grids, dense random graphs:

d-dimensional meshes California road network

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14

d dimensional meshes

11/10/2010

Page 15: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘08]

Collaborations between scientists in networks [Newman, 2005]

 log Φ(k)

ductan

ce, 

Cond

15

Community size, log k

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 16: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Natural hypothesis about NCP:Natural hypothesis about NCP: NCP of real networks slopes downward

Slope of the NCP corresponds to the dimensionality of the network

What about large What about large networks?

16Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Examine more than 100 large networks

11/10/2010

Page 17: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Typical example: General Relativity collaborationsTypical example: General Relativity collaborations(n=4,158, m=13,422)

17Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 18: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 1811/10/2010

Page 19: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

B tt   d b tt  

nce)

Better and better communities

nduc

tan

Communities get worse and worse

k), (co

nΦ(

Best community has ~100 nodes

k, (cluster size)11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19

Page 20: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Each successive edge inside the Each successive edge inside the community costsmore cut‐edges

NCP plotΦ /3   0 33

Φ=2/4 = 0 5

Φ=1/3 = 0.33

Φ=2/4 = 0.5

Φ=8/6 = 1.3

Φ=64/14 = 4.5

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20

Each node has twice as many children

11/10/2010

Page 21: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Empirically we note that best clusters (call them Empirically we note that best clusters (call them whiskers) are barely connected  to the network

If we remove whiskers..

How does NCP look like?

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21

Core‐periphery structure11/10/2010

Page 22: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Nothing happens!Nestedness of the 

22Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Nestedness of the core‐periphery structure

11/10/2010

Page 23: Stanford University  · PDF fileSlope of the NCP corresponds to the

Denser and denser Denser and denser network core

Small good communities

23Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Nested core‐periphery

Page 24: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Practically Practically constant!

Each dot is a different network24Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 25: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

LiveJournal DBLPLiveJournal DBLP

RewiredNetwork

Amazon IMDB

Ground truth

2511/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 26: Stanford University  · PDF fileSlope of the NCP corresponds to the

Some issues with community detection:Some issues with community detection: Many different formalizations of clustering objective functions  Objectives are NP‐hard to optimize exactly Methods can find clusters that are systematically “biased”biased Methods can perform well/poorly on some kinds of graphs

Questions: How well do algorithms optimize objectives? What clusters do different methods find?

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26

Page 27: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Many algorithms to extract clusters: Heuristics: Girvan‐Newman, Modularity optimization: popular heuristicsheuristics Metis (multi‐resolution heuristic): common in practice [Karypis‐Kumar ‘98]yp

Theoretical approximation algorithms: Spectral partitioning: most practical but confuses “long Spectral partitioning: most practical but confuses  long paths” with “deep cuts” Local Spectral [Andersen‐Chung ‘07]

Leighton‐Rao: based on multi‐commodity flow

27Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 28: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Spectral embeddings stretch along directions in Spectral embeddings stretch along directions in which the random‐walk mixes slowly Resulting hyperplane cuts have "good" conductance Resulting hyperplane cuts have  good  conductance cuts, but may not yield the optimal cuts

spectral embedding flow based embedding

2811/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 29: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Rewired networkRewired network

Local spectral

iMetis

LiveJournalLiveJournal

2911/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 30: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

500 node communities from Local Spectral: 500 node communities from Local Spectral: 

500 node communities from Metis: 

3011/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 31: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Conductance of  bounding cut

L l S t lLocal Spectral

Disconnected Metis

Connected Metis

Metis (red) gives sets withMetis (red) gives sets with better conductance

Local Spectral (blue) gives tighter and more well‐rounded sets

3111/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 32: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

LeightonRao: based on Spectral

LRao conn

multi‐commodity flow Disconnected clusters vs. 

Lrao disconn

Connected clusters

Graclus prefers larger

Metis

Graclus prefers larger clusters

Newman’s modularityGraclus

Newman s modularity optimization similar to Local Spectral

Newman

Local Spectral

3211/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 33: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Single‐criterion:d l ( )

S Modularity: m-E(m) Modularity Ratio: m-E(m) Volume: u d(u)=2m+cu ( ) Edges cut: c

Multi‐criterion: Conductance: c/(2m+c)

n: nodes in Sm: edges in S  d   i ti    Conductance: c/(2m+c)

Expansion: c/n Density: 1-m/n2

C tR ti / (N )

c: edges pointing   outside S

CutRatio: c/n(N-n) Normalized Cut: c/(2m+c) + c/2(M-m)+c Max ODF: max frac. of edges of a node pointing outside S Average‐ODF: avg. frac. of edges of a node pointing outside Flake‐ODF: frac. of nodes with mode than ½ edges inside

3311/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 34: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Qualitatively similar Observations: Observations: Conductance, Expansion, Norm‐cut Cut‐ratio andcut, Cut ratio and Avg‐ODF are similar

Max‐ODF prefers smaller clusterssmaller clusters

Flake‐ODF prefers larger clusters

Internal density is Internal density is bad

Cut‐ratio has high variance

34

variance

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 35: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Observations:Observations: All measures are monotonicmonotonic

Modularity  prefers large prefers large clusters Ignores small Ignores small clusters

3511/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 36: Stanford University  · PDF fileSlope of the NCP corresponds to the

[WWW ‘09]

Lower bounds on conductance can be computed fromcan be computed from: Spectral embedding (independent of balance)

SDP‐based methods (for volume‐balanced partitions)

Algorithms find clusters close toAlgorithms find clusters close to theoretical lower bounds

3611/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Page 37: Stanford University  · PDF fileSlope of the NCP corresponds to the

Denser and denser Denser and denser network core

So, what’s a Small good communities

good model?

37Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Nested core‐periphery

Page 38: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Flat Down and FlatFlat and Down

Pref. attachment Small World Geometric Pref. Attachment

None of the common models works!

38Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 39: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

Forest  Fire: Model ingredients:

connections spread like a fire New node joins the network Selects a seed node

• Preferential attachment: second neighbor is not uniform at random Selects a seed node

Connects to some of its neighbors  Continue recursively 

• Copying flavor: ‐ since we burn seed’s neighbors

• Hierarchical flavor:‐ burn Hierarchical flavor: burn around the seed node

• “Local” flavor:‐ burn “near” the seed node

u

the seed node

As cluster grows it blends into the core

39

blends into the coreof the network

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010

Page 40: Stanford University  · PDF fileSlope of the NCP corresponds to the

[Internet Mathematics ‘09]

rewired

networknetwork

40Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu11/10/2010