Higher-order organization of complex networks


Upload: david-gleich

Post on 15-Apr-2017


Page 1: Higher-order organization of complex networks

[Figure: example graph with nodes labeled 0 to 11, and labeled neurons CEPDR, CEPVR, IL2R, OLLR, RIAL, RIAR, RIVL, RIVR, RMDDR, RMDL, RMDR, RMDVL, RMFL, SMDDL, SMDDR, SMDVR, URBR]

Higher-order organization of complex networks

David F. Gleich
Purdue University

Joint work with Austin Benson and Jure Leskovec, Stanford
Supported by NSF CAREER CCF-1149756, IIS-1422918, DARPA SIMPLEX

PCMI2016 David Gleich · Purdue 1

Code & Data
snap.stanford.edu/higher-order
github.com/arbenson/higher-order-organization-julia

Page 2: Higher-order organization of complex networks

Network analysis has made two important observations about real-world networks

Real-world networks have modular organization.

Edge-based clustering and community detection sometimes expose this structure.

Control widgets are over-expressed in complex networks.

We can expose this with motif or graphlet analysis.


Milo et al., Science, 2002. Co-author network

Page 3: Higher-order organization of complex networks

Nodes and edges are not the fundamental units of these networks.

Why should we look for structure in terms of them?

Page 4: Higher-order organization of complex networks

Idea Find clusters


Page 5: Higher-order organization of complex networks

Idea Find clusters of motifs


Page 6: Higher-order organization of complex networks

In practice, motifs organize real-world networks amazingly well and recover aquatic layers in food webs

[Figure: food web clusters labeled Micronutrient sources, Benthic fishes, Benthic macroinvertebrates, Pelagic fishes and benthic prey]

http://marinebio.org/oceans/marine-zones/

We don’t know how to find this structure based on edge partitioning.


Page 7: Higher-order organization of complex networks

Aside: How did we get to this idea and to looking at this problem?

•  Research is a journey.


Page 8: Higher-order organization of complex networks

We can do motif-based clustering by generalizing spectral clustering

Spectral clustering is a classic technique to partition graphs by looking at eigenvectors.

M. Fiedler, 1973, Algebraic connectivity of graphs

[Figure panels: Graph, Laplacian, Eigenvector]

Page 9: Higher-order organization of complex networks

Spectral clustering works based on conductance

There are many ways to measure the quality of a set of nodes of a graph to gauge how they partition the graph.

$\mathrm{cut}(S) = 7$, $\mathrm{cut}(\bar S) = 7$

$|S| = 15$, $|\bar S| = 20$

$\mathrm{vol}(S) = 85$, $\mathrm{vol}(\bar S) = 151$

$\mathrm{ncut}(S) = 7/85 + 7/151 = 0.1287$

$\text{cut sparsity}(S) = 7/15 = 0.4667$

$\phi(S) = \mathrm{cond}(S) = 7/85 = 0.0824$

$\phi(S) = \mathrm{cut}(S) / \min(\mathrm{vol}(S), \mathrm{vol}(\bar S))$
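The conductance formula above can be checked by hand on a small graph. The following is a minimal sketch, assuming an undirected simple graph given as an edge list; the function name `conductance` and the toy graph (two triangles joined by one edge) are illustrative and not from the slides.

```python
def conductance(edges, S):
    """Return (cut, vol(S), vol(complement), phi) for node set S.

    Assumes an undirected simple graph given as a list of (u, v) pairs,
    and that both S and its complement touch at least one edge.
    """
    S = set(S)
    cut = 0        # edges with exactly one endpoint in S
    vol_S = 0      # edge end points inside S
    vol_rest = 0   # edge end points outside S
    for u, v in edges:
        in_u, in_v = u in S, v in S
        if in_u != in_v:
            cut += 1
        vol_S += in_u + in_v
        vol_rest += (not in_u) + (not in_v)
    return cut, vol_S, vol_rest, cut / min(vol_S, vol_rest)


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
cut, vol_S, vol_rest, phi = conductance(edges, {0, 1, 2})
print(cut, vol_S, vol_rest, phi)  # one edge cut, volume 7 on each side, phi = 1/7
```

Taking one triangle as S cuts a single edge, which is why this set scores well under the "small conductance means good set" reading.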


Page 10: Higher-order organization of complex networks

Conductance sets in graphs


Conductance is one of the most important quality scores [Schaeffer07]. It is used in Markov chain theory, bioinformatics, vision, etc.

PCMI: Nelson showed how you can use this to get heavy hitters in turnstile algorithms!

The conductance of a set of vertices is the ratio of edges leaving the set to total edges in the set:

$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar S))} = \frac{\text{(edges leaving the set)}}{\text{(total edges in the set)}}$

Equivalently, it's the probability that a random edge leaves the set.

Small conductance ⇔ good set

Example: $\mathrm{cut}(S) = 7$, $\mathrm{vol}(S) = 33$, $\mathrm{vol}(\bar S) = 11$, so $\phi(S) = 7/11$.

Page 11: Higher-order organization of complex networks

Spectral clustering has theoretical guarantees

Cheeger Inequality

Finding the best conductance set is NP-hard.

•  Cheeger realized the eigenvalues of the Laplacian provided a bound in manifolds
•  Alon and Milman independently realized the same thing for a graph!

J. Cheeger, 1970, A lower bound on the smallest eigenvalue of the Laplacian

N. Alon, V. Milman, 1985, λ1, isoperimetric inequalities for graphs, and superconcentrators

Eigenvalues of the Laplacian: $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2$

$\phi_*^2 / 2 \le \lambda_2 \le 2\phi_*$

$\phi_*$ = conductance of the set of smallest conductance

Page 12: Higher-order organization of complex networks

The sweep cut algorithm realizes the guarantee

We can find a set S that achieves the Cheeger bound.

1.  Compute the eigenvector associated with λ2.
2.  Sort the vertices by their values in the eigenvector: σ1, σ2, …, σn
3.  Let Sk = {σ1, …, σk} and compute the conductance of each Sk: φk = φ(Sk)
4.  Pick the minimum φm of φk.

$\phi_m \le 4\sqrt{\phi_*}$

M. Mihail, 1989, Conductance and convergence of Markov chains

F. C. Graham, 1992, Spectral Graph Theory.

Page 13: Higher-order organization of complex networks

The sweep cut visualized

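[Plot: conductance φi of each sweep set Si; x-axis Si from 0 to 40, y-axis φi from 0 to 1]

$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar S))}$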
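The sweep procedure (sort the vertices by eigenvector value, then score every prefix set) can be sketched in a few lines. This is a minimal illustration, assuming an undirected simple graph with no isolated vertices and nodes numbered 0 to n-1; the function name `sweep_cut` and the hand-picked `scores` (standing in for a real eigenvector) are hypothetical.

```python
def sweep_cut(edges, scores):
    """Sweep over prefixes of the vertices sorted by `scores`.

    Returns (best_set, best_conductance, conductances). The cut and the
    volume are updated incrementally as each vertex joins the set.
    """
    n = len(scores)
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total_vol = 2 * len(edges)
    order = sorted(range(n), key=lambda v: scores[v])
    in_S = [False] * n
    cut = vol = 0
    conductances = []
    for v in order[:-1]:          # skip the full vertex set (empty complement)
        in_S[v] = True
        vol += len(adj[v])
        for w in adj[v]:
            # an edge to a node already in S stops being cut; any other edge becomes cut
            cut += -1 if in_S[w] else 1
        conductances.append(cut / min(vol, total_vol - vol))
    best = min(range(len(conductances)), key=conductances.__getitem__)
    return set(order[:best + 1]), conductances[best], conductances


# Hypothetical toy graph: two triangles joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
scores = [-0.5, -0.5, -0.3, 0.3, 0.5, 0.5]   # made-up stand-in for an eigenvector
best_set, best_phi, all_phi = sweep_cut(edges, scores)
print(best_set, best_phi)
```

The list `all_phi` is the curve that a sweep-cut plot shows: one conductance value per prefix set, with the minimum marking the chosen cluster.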


Page 14: Higher-order organization of complex networks

Demo…


Page 15: Higher-order organization of complex networks

That’s spectral clustering

40+ years of ideas and successful applications

•  Fast algorithms that avoid eigenvectors (Graclus from Dhillon et al. 2007)
•  Local algorithms for seeded detection (Spielman & Teng 2004; Andersen, Chung, Lang 2006)
   PCMI: Kimon gave a talk about this yesterday!
•  Overlapping algorithms
•  Embeddings
•  And more!

Page 16: Higher-order organization of complex networks

But current problems are much richer than when spectral clustering was designed

Spectral clustering is theoretically justified for undirected, simple graphs.

Many datasets are directed, weighted, signed, colored, layered,

[Figure: signed regulatory motifs (R. Milo, 2002, Science). X → Y (+): X causes Y to be expressed. Z → Y (-): Z represses Y.]

Page 17: Higher-order organization of complex networks

Our contributions

1.  A generalized conductance metric for motifs
2.  A new spectral clustering algorithm to minimize the generalized conductance.
3.  AND an associated Cheeger inequality.
4.  Aquatic layers in food webs
5.  Control structures in neural networks
6.  Hub structure in transportation networks
7.  Anomaly detection in Twitter

Benson, Gleich, Leskovec, Science 2016.

Page 18: Higher-order organization of complex networks

Motif-based conductance generalizes edge-based conductance

Need notions of cut and volume!

Edges cut:

$\phi(S) = \frac{\#(\text{edges cut})}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar S))}$, where $\mathrm{vol}(S) = \#(\text{edge end points in } S)$

Triangles cut:

$\phi_M(S) = \frac{\#(\text{triangles cut})}{\min(\mathrm{vol}_M(S), \mathrm{vol}_M(\bar S))}$, where $\mathrm{vol}_M(S) = \#(\text{triangle end points in } S)$
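To make the triangle version of cut and volume concrete, here is a minimal sketch that evaluates motif conductance by brute force, assuming the motif is the undirected triangle and the graph is small enough to enumerate all node triples. The names `triangles`, `motif_conductance`, and the 7-node toy graph are illustrative assumptions, not from the slides.

```python
from itertools import combinations


def triangles(edges):
    """List all triangles (i, j, k) with i < j < k in an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return [t for t in combinations(sorted(adj), 3)
            if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]]


def motif_conductance(edges, S):
    """Return (triangles cut, vol_M(S), vol_M(complement), phi_M(S))."""
    S = set(S)
    tris = triangles(edges)
    # a triangle is cut when it has nodes on both sides of the partition
    cut = sum(1 for t in tris if 0 < sum(v in S for v in t) < 3)
    # motif volume counts triangle end points that fall in S
    vol_S = sum(sum(v in S for v in t) for t in tris)
    vol_rest = 3 * len(tris) - vol_S
    return cut, vol_S, vol_rest, cut / min(vol_S, vol_rest)


# Hypothetical toy graph with triangles {0,1,2}, {1,2,3}, {3,4,5}, {4,5,6}.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (4, 5), (4, 6), (5, 6)]
cut, vol_S, vol_rest, phi_M = motif_conductance(edges, {0, 1, 2})
print(cut, vol_S, vol_rest, phi_M)  # 1 triangle cut, volumes 5 and 7, phi_M = 0.2
```

Only the triangle {1, 2, 3} straddles the partition, so the motif cut is 1 even though two edges are cut.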


Page 19: Higher-order organization of complex networks

An example of motif-conductance

[Figure: example graph with nodes labeled 0 to 11, the motif, and a highlighted set S]

$\phi_M(S) = \frac{\text{motifs cut}}{\text{motif volume}} = \frac{1}{10}$

Page 20: Higher-order organization of complex networks

Going from motifs back to a matrix for spectral clustering

[Figure: the example graph A with nodes labeled 0 to 11, and the weighted graph $W^{(M)}$ on the same nodes with edge weights 1, 2, and 3]

$W^{(M)}_{ij}$ = counts co-occurrences of the motif pattern between i and j
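A small sketch of how such a matrix could be built, assuming the motif is the undirected triangle and using brute-force enumeration over node triples. The function name `triangle_motif_adjacency` and the toy graph are hypothetical; other motifs, and directed motifs in particular, need their own counting rules.

```python
from itertools import combinations


def triangle_motif_adjacency(n, edges):
    """W[i][j] = number of triangles that contain both node i and node j."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    W = [[0] * n for _ in range(n)]
    for i, j, k in combinations(range(n), 3):
        if j in adj[i] and k in adj[i] and k in adj[j]:
            # every pair inside the triangle co-occurs once in this motif instance
            for a, b in ((i, j), (i, k), (j, k)):
                W[a][b] += 1
                W[b][a] += 1
    return W


# Hypothetical toy graph with triangles {0,1,2}, {1,2,3}, {3,4,5}, {4,5,6}.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (4, 5), (4, 6), (5, 6)]
W = triangle_motif_adjacency(7, edges)
for row in W:
    print(row)
```

Edges shared by two triangles, such as (1, 2), get weight 2, and a pair that never shares a triangle keeps weight 0 even if other paths connect the two nodes.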


Page 21: Higher-order organization of complex networks

Going from motifs back to a matrix for spectral clustering

[Figure: the weighted graph $W^{(M)}$ on nodes labeled 0 to 11, with edge weights 1, 2, and 3]

$W^{(M)}_{ij}$ = counts co-occurrences of the motif pattern between i and j

KEY INSIGHT: Spectral clustering on $W^{(M)}$ yields results on the new motif notion of conductance

$\phi_M(S) = \frac{\text{motifs cut}}{\text{motif volume}} = \frac{1}{10}$

Page 22: Higher-order organization of complex networks

A motif-based clustering algorithm

1.  Form weighted graph $W^{(M)}$
2.  Compute the Fiedler vector associated with λ2 of the motif-normalized Laplacian
3.  Run a (motif-cond) sweep cut on f

[Figure: the weighted graph $W^{(M)}$ on nodes labeled 0 to 11, with edge weights 1, 2, and 3]

$D = \mathrm{diag}(W^{(M)} e)$

$L^{(M)} = D^{-1/2}(D - W^{(M)})D^{-1/2}$

$L^{(M)} z = \lambda_2 z$

$f^{(M)} = D^{-1/2} z$
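Step 2 can be illustrated without a linear algebra library. The sketch below is not the eigensolver used in the talk; it swaps in plain power iteration on $2I - L^{(M)}$ with the trivial eigenvector $D^{1/2}e$ projected out, which is only reasonable for tiny examples. It assumes a connected weighted graph with positive degrees and a simple λ2; the function name `fiedler_vector` and the 7-node matrix `W` are hypothetical.

```python
import math


def fiedler_vector(W, iters=500):
    """Approximate f = D^{-1/2} z, where z is the eigenvector for lambda_2 of
    L = D^{-1/2} (D - W) D^{-1/2}, using deflated power iteration on 2I - L."""
    n = len(W)
    d = [sum(row) for row in W]                 # D = diag(W e)
    s = [math.sqrt(di) for di in d]             # D^{1/2}; assumes every degree > 0
    norm1 = math.sqrt(sum(d))
    v1 = [si / norm1 for si in s]               # unit eigenvector of L for eigenvalue 0
    x = [float(i + 1) for i in range(n)]        # arbitrary deterministic start vector
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(x, v1))
        x = [a - dot * b for a, b in zip(x, v1)]          # project out v1
        t = [xi / si for xi, si in zip(x, s)]
        # y = (2I - L) x = x + D^{-1/2} W D^{-1/2} x
        y = [x[i] + sum(W[i][j] * t[j] for j in range(n)) / s[i] for i in range(n)]
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / nrm for v in y]
    return [xi / si for xi, si in zip(x, s)]    # f = D^{-1/2} z


# Hypothetical motif adjacency matrix: two triangle-rich groups {0,1,2} and {4,5,6}
# that meet at node 3.
W = [[0, 1, 1, 0, 0, 0, 0],
     [1, 0, 2, 1, 0, 0, 0],
     [1, 2, 0, 1, 0, 0, 0],
     [0, 1, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 0, 2, 1],
     [0, 0, 0, 1, 2, 0, 1],
     [0, 0, 0, 0, 1, 1, 0]]
f = fiedler_vector(W)
print([round(v, 3) for v in f])
```

Sorting the nodes by the entries of `f` gives the ordering for the sweep cut in step 3. For real graphs a sparse eigensolver is the practical choice.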


Page 23: Higher-order organization of complex networks

The sweep cut results

[Plot: motif conductance of each sweep set; x-axis 2 to 10 (order from the Fiedler vector), y-axis 0 to 1. Marked sets: best higher-order cluster and 2nd best higher-order cluster, shown on the weighted example graph with nodes labeled 0 to 11]

Page 24: Higher-order organization of complex networks

The motif-based Cheeger inequality

THEOREM: If the motif has three nodes, then the sweep procedure on the weighted graph finds a set S of nodes for which

$\phi_M(S) \le 4\sqrt{\phi_M^*}$

THEOREM: For motifs with four or more nodes, we use a slightly altered conductance.

Key Proof Step:

$\mathrm{cut}_M(S, G) = \sum_{\{i,j,k\} \in M(G)} \mathrm{Indicator}[x_i, x_j, x_k \text{ not the same}]$ = quadratic in x

$M(G)$ = {instances of M in G}
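One way to see why the indicator is quadratic is to write it out explicitly. This is a sketch under the assumption (not stated on the slide) that the set S is encoded by $x_i = +1$ for $i \in S$ and $x_i = -1$ otherwise, and that $W^{(M)}_{ij}$ counts the triangles containing both i and j:

```latex
\[
\mathrm{Indicator}[x_i, x_j, x_k \text{ not the same}]
  = \tfrac{1}{4}\left(x_i^2 + x_j^2 + x_k^2 - x_i x_j - x_j x_k - x_i x_k\right)
  = \tfrac{1}{8}\left[(x_i - x_j)^2 + (x_j - x_k)^2 + (x_i - x_k)^2\right]
\]
\[
\mathrm{cut}_M(S, G)
  = \tfrac{1}{8} \sum_{i < j} W^{(M)}_{ij}\,(x_i - x_j)^2
  = \tfrac{1}{8}\, x^{T} \left(D - W^{(M)}\right) x
\]
```

If all three values agree, every squared difference is 0; if exactly one differs, two of the squared differences equal 4, giving 8/8 = 1. Under the same encoding the ordinary weighted edge cut in $W^{(M)}$ is $\tfrac{1}{4}\sum_{i<j} W^{(M)}_{ij}(x_i - x_j)^2$, so each cut triangle contributes exactly two cut edges and the motif cut is half the weighted cut.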


Page 25: Higher-order organization of complex networks

Awesome advantages

We inherit 40+ years of research!

•  Fast algorithms (ARPACK, etc.)
•  Local methods
•  Overlapping
•  Easy to implement (20 lines of Matlab/Julia)
•  Scalable (1.4B edge graphs are not a problem)


function [S, conductances] = MotifClusterM36(A)
% Motif-based spectral clustering of the directed graph A for motif M_3^6.
n = size(A, 1);                                      % number of nodes
B = spones(A & A');                                  % bidirectional links
U = A - B;                                           % unidirectional links
W = (B * U') .* U' + (U * B) .* U + (U' * U) .* B;   % Motif M_3^6
D = diag(sum(W));
Ln = speye(size(W, 1)) - sqrt(D)^(-1) * W * sqrt(D)^(-1);   % normalized Laplacian
[Z, ~] = eigs(Ln, 2, 'sm');
[~, order] = sort(sqrt(D)^(-1) * Z(:, 2));           % order nodes by the scaled eigenvector
conductances = zeros(n, 1);
x = zeros(n, 1);
for i = 1:n                                          % sweep cut over prefix sets
    x(order(i)) = 1;
    xn = ~x + 0;                                     % indicator of the complement
    conductances(i) = x' * (D - W) * x / min(x' * D * x, xn' * D * xn);
end
[~, split] = min(conductances);
S = order(1:split);


Page 26: Higher-order organization of complex networks

Case studies

An intro note!

1.  Aquatic layers in food webs.
    Signed patterns in regulatory networks
2.  Control structures in neural networks
3.  Hub structure in transportation networks.
4.  Scaling and large data

Page 27: Higher-order organization of complex networks

NOTE: The partition depends on the motif

[Figure: the 12-node example graph, nodes labeled 1 to 12, shown twice]

Page 28: Higher-order organization of complex networks

Case study 1: Motifs partition the food webs

Food webs model energy exchange in species of an ecosystem.

i -> j means i’s energy goes to j (or j eats i)

Via Cheeger, motif conductance is better than edge conductance.

Page 29: Higher-order organization of complex networks

Demo


Page 30: Higher-order organization of complex networks

Case study 1: Motifs partition the food webs

[Figure: food web clusters labeled Micronutrient sources, Benthic fishes, Benthic macroinvertebrates, Pelagic fishes and benthic prey]

Motif M6 reveals aquatic layers.

Figure 1: Higher-order network structures and the higher-order network clustering framework. A: Higher-order structures are captured by network motifs. For example, all 13 connected three-node directed motifs are shown here. B: Clustering of a network based on motif M7. For a given motif M, our framework aims to find a set of nodes S that minimizes motif conductance, $\phi_M(S)$, which we define as the ratio of the number of motifs cut (filled triangles cut) to the minimum number of nodes in instances of the motif in either $S$ or $\bar S$ (11). In this case, there is one motif cut. C: The higher-order network clustering framework. Given a graph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix ($W_M$) by counting the number of times two nodes co-occur in an instance of the motif. An eigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. The ordering $\sigma$ of the nodes provided by the components of the eigenvector (13) produces nested sets $S_r = \{\sigma_1, \dots, \sigma_r\}$ of increasing size r. We prove that the set $S_r$ with the smallest motif-based conductance, $\phi_M(S_r)$, is a near-optimal higher-order cluster (11).

84% accuracy vs. 69% for other methods

Page 31: Higher-order organization of complex networks

Case study 2: Nictation control in neural network

Nictation: standing on a tail and waving

[Figure, panels A to C; panel (d) from "Nictation, a dispersal behavior of the nematode Caenorhabditis elegans, is regulated by IL2 neurons", Lee et al., Nature Neuroscience.]

We find the control mechanism that explains this based on the bi-fan motif (Milo et al. found it over-expressed)

Page 32: Higher-order organization of complex networks

Case study 3: Rich structure beyond clusters

North American air transport network

Nodes are airports

Edges reflect reachability, and are unweighted. (Based on Frey et al., 2007)

Page 33: Higher-order organization of complex networks

We can use complex motifs with non-anchored nodes

Figure 4: Higher-order spectral analysis of a network of airports in Canada and the United States (22). A: The three higher-order structures used in our analysis. Each motif is “anchored” by the blue nodes i and j, which means our framework only seeks to cluster together the blue nodes. Specifically, the motif adjacency matrix adds weight to the (i, j) edge based on the number of third intermediary nodes (green squares). The first two motifs correspond to highly-connected cities and the motif on the right connects non-hubs to non-hubs. B: The top 50 most populous cities in the United States which correspond to nodes in the network. The edge thickness is proportional to the weight in the motif adjacency matrix $W_M$. The thick, dark lines indicate that large weights correspond to popular mainline routes. C: Embedding of nodes provided by their corresponding components of the first two non-trivial eigenvectors of the normalized Laplacian for $W_M$. The marked cities are eight large U.S. hubs (green), three West coast non-hubs (red), and three East coast non-hubs (purple). The primary spectral coordinate (left to right) reveals how much of a hub the city is, and the second spectral coordinate (top to bottom) captures West-East geography (11). D: Embedding of nodes provided by their corresponding components in the first two non-trivial eigenvectors of the standard, edge-based (non-higher-order) normalized Laplacian. This method does not capture the hub and geography found by the higher-order method. For example, Atlanta, the largest hub, is in the center of the embedding, next to Salina, a non-hub.

Counts length-two walks
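As a rough illustration of what counting length-two walks means for the weight between two anchor nodes, the sketch below counts common neighbors in an undirected simple graph, which equals the off-diagonal entries of the squared adjacency matrix. This is a simplification: the anchored motifs in the figure also distinguish edge directions, which the sketch ignores. The function name and the toy "two hubs" graph are hypothetical.

```python
def length_two_walk_counts(n, edges):
    """W[i][j] = number of intermediary nodes k with edges i-k and k-j (i != j)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return [[len(adj[i] & adj[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical toy graph: nodes 0 and 4 act as hubs serving nodes 1, 2, 3.
edges = [(0, 1), (0, 2), (0, 3), (4, 1), (4, 2), (4, 3)]
W = length_two_walk_counts(5, edges)
for row in W:
    print(row)
```

The two hubs receive the largest weight (three shared intermediaries) even though no edge joins them directly, which is the kind of hub-to-hub emphasis this weighting produces.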


Page 34: Higher-order organization of complex networks

The weighting alone reveals hub-like structure


Page 35: Higher-order organization of complex networks

The motif embedding shows this structure and splits into east-west

[Figure: MOTIF SPECTRAL EMBEDDING vs. EDGE SPECTRAL EMBEDDING. Legend: Top 10 U.S. hubs, East coast non-hubs, West coast non-hubs. Axis: primary spectral coordinate. In the edge spectral embedding, Atlanta, the top hub, is next to Salina, a non-hub.]

Page 36: Higher-order organization of complex networks

Case study 4: Large scale stuff

The up-linked triangle finds an anomalous cluster in Twitter.

Anomalous cluster in the 1.4B edge Twitter graph. All nodes are holding accounts for a company, and the orange nodes have incomplete profiles.


Page 37: Higher-order organization of complex networks

Related work.

§  The Laplacian we propose was originally proposed by Rodríguez [2004] and again by Zhou et al. [2006]. Our new theory (motif Cheeger inequality) explains why these were good ideas.

§  Falls under the general strategy of encoding a hypergraph partitioning problem as a graph clustering problem [Agarwal+ 06]

§  Serrour, Arenas, and Gómez, Detecting communities of triangles in complex networks using spectral optimization, 2011.

§  Arenas et al., Motif-based communities in complex networks, 2008.


Page 38: Higher-order organization of complex networks

Paper: Benson, Gleich, Leskovec, Science, 2016

1.  A generalized conductance metric for motifs
2.  A new spectral clustering algorithm to minimize the generalized conductance.
3.  AND an associated Cheeger inequality.
4.  Aquatic layers in food webs
5.  Control structures in neural networks
6.  Hub structure in transportation networks
7.  Anomaly detection in Twitter
8.  Lots of cool stuff on signed networks.

Thank you!

Joint work with Austin Benson and Jure Leskovec, Stanford
Supported by NSF CAREER CCF-1149756, IIS-1422918 IIS- DARPA SIMPLEX

[Figure: example graph with nodes labeled 0 to 11]