Higher-order organization of complex networks

David F. Gleich, Purdue University

Joint work with Austin Benson and Jure Leskovec, Stanford. Supported by NSF CAREER CCF-1149756, IIS-1422918, DARPA SIMPLEX

PCMI2016 David Gleich · Purdue 1

Code & Data: snap.stanford.edu/higher-order, github.com/arbenson/higher-order-organization-julia

Network analysis has two important observations about real-world networks

Real-world networks have modular organization.

Edge-based clustering and community detection sometimes expose this structure.

Control widgets are over-expressed in complex networks.

We can expose this with motif or graphlet analysis.


Milo et al., Science, 2002. Co-author network

Nodes and edges are not the fundamental units of these networks.

Why should we look for structure in terms of them?


Idea: Find clusters of motifs


In practice, motifs organize real-world networks amazingly well and recover aquatic layers in food webs

Micronutrient sources

Benthic fishes

Benthic macroinvertebrates

Pelagic fishes and benthic prey

http://marinebio.org/oceans/marine-zones/

We don’t know how to find this structure based on edge partitioning.


Aside: How did we get to this idea and to looking at this problem?

•  Research is a journey.


We can do motif-based clustering by generalizing spectral clustering

Spectral clustering is a classic technique to partition graphs by looking at eigenvectors.

M. Fiedler, 1973, Algebraic connectivity of graphs

[Figure panels: Graph, Laplacian, Eigenvector]

Spectral clustering works based on conductance

There are many ways to measure the quality of a set of nodes of a graph to gauge how they partition the graph.

$\mathrm{cut}(S) = 7$, $\mathrm{cut}(\bar{S}) = 7$

$|S| = 15$, $|\bar{S}| = 20$

$\mathrm{vol}(S) = 85$, $\mathrm{vol}(\bar{S}) = 151$

$\mathrm{ncut}(S) = 7/85 + 7/151 = 0.1287$

$\text{cut sparsity}(S) = 7/15 = 0.4667$

$\phi(S) = \mathrm{cond}(S) = 7/85 = 0.0824$

$$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}$$


Conductance sets in graphs


Conductance is one of the most important quality scores [Schaeffer07], used in Markov chain theory, bioinformatics, vision, etc.

PCMI: Nelson showed how you can use this to get heavy hitters in turnstile algorithms!

The conductance of a set of vertices is the ratio of edges leaving to total edges:

$$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))} \qquad \frac{\text{(edges leaving the set)}}{\text{(total edges in the set)}}$$

Equivalently, it’s the probability that a random edge leaves the set.

Small conductance ⇔ Good set

Example: $\mathrm{cut}(S) = 7$, $\mathrm{vol}(S) = 33$, $\mathrm{vol}(\bar{S}) = 11$, so $\phi(S) = 7/11$.
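The definition above can be checked by hand on a tiny graph. The following is a minimal Python sketch, assuming an undirected, unweighted graph stored as a dictionary of neighbor sets; the helper names `make_graph` and `conductance` and the toy graph are illustrative and are not taken from the talk.

```python
def make_graph(edges):
    """Build an undirected adjacency structure: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conductance(adj, S):
    """Return (cut, vol(S), vol(S-bar), phi(S)) for a node set S."""
    S = set(S)
    # Edges with exactly one endpoint in S.
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    # Volume is the sum of degrees, i.e. the number of edge end points.
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut, vol_S, vol_rest, cut / min(vol_S, vol_rest)


# Toy graph (an assumption, not the graph on the slide):
# two triangles joined by the single edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = make_graph(EDGES)
print(conductance(adj, {0, 1, 2}))  # one cut edge, volume 7 on each side
```

Here one triangle forms a set with a single edge leaving it, so its conductance is 1/7, while a single node such as {0} has conductance 1 because every edge incident to it leaves the set.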

Spectral clustering has theoretical guarantees

Cheeger Inequality

Finding the best conductance set is NP-hard.
•  Cheeger realized the eigenvalues of the Laplacian provided a bound in manifolds
•  Alon and Milman independently realized the same thing for a graph!

J. Cheeger, 1970, A lower bound on the smallest eigenvalue of the Laplacian

N. Alon, V. Milman, 1985, λ1, isoperimetric inequalities for graphs, and superconcentrators

$$(\phi^*)^2 / 2 \le \lambda_2 \le 2 \phi^*$$

$$0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2 \quad \text{(eigenvalues of the Laplacian)}$$

$\phi^*$ = conductance of the set of smallest conductance


The sweep cut algorithm realizes the guarantee

We can find a set S that achieves the Cheeger bound.
1.  Compute the eigenvector associated with λ2.
2.  Sort the vertices by their values in the eigenvector: σ1, σ2, … σn
3.  Let Sk = {σ1, …, σk} and compute the conductance of each Sk: φk = φ(Sk)
4.  Pick the minimum φm of φk.

M. Mihail, 1989, Conductance and convergence of Markov chains

F. C. Graham, 1992, Spectral Graph Theory.

$$\phi_m \le 4 \sqrt{\phi^*}$$
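Steps 2 to 4 of the sweep can be sketched in a few lines of Python. This sketch assumes the scores (for example, the entries of the eigenvector for λ2) are already available as a dictionary, and that the graph has no self-loops or isolated nodes; the function name `sweep_cut` and the toy scores are illustrative only.

```python
def sweep_cut(adj, scores):
    """Sweep over prefixes of the nodes sorted by score.

    adj: dict node -> set of neighbors (undirected, unweighted, no self-loops).
    scores: dict node -> float, e.g. eigenvector entries (assumed given).
    Returns (best prefix set as a list, its conductance, all prefix conductances).
    """
    order = sorted(adj, key=lambda u: scores[u])
    total_vol = sum(len(adj[u]) for u in adj)
    in_S, cut, vol = set(), 0, 0
    best_phi, best_k, phis = float("inf"), 0, []
    for k, u in enumerate(order[:-1], start=1):  # skip the full vertex set
        inside = sum(1 for v in adj[u] if v in in_S)
        # Edges from u into S stop being cut; u's other edges become cut.
        cut += len(adj[u]) - 2 * inside
        vol += len(adj[u])
        in_S.add(u)
        phi = cut / min(vol, total_vol - vol)
        phis.append(phi)
        if phi < best_phi:
            best_phi, best_k = phi, k
    return order[:best_k], best_phi, phis


# Toy input (assumed): two triangles joined by the edge (2, 3), with
# hand-picked scores that increase from one triangle to the other.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
scores = {0: -1.0, 1: -0.9, 2: -0.4, 3: 0.4, 4: 0.9, 5: 1.0}
best_set, best_phi, phis = sweep_cut(adj, scores)
print(best_set, best_phi)
```

Updating the cut and volume incrementally keeps the whole sweep linear in the number of edges after the sort, which is one reason the procedure is cheap once the eigenvector is known.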

The sweep cut visualized

[Figure: sweep profile, plotting the conductance φi of each prefix set Si against i]

$$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}$$


Demo…


That’s spectral clustering: 40+ years of ideas and successful applications
•  Fast algorithms that avoid eigenvectors (Graclus from Dhillon et al. 2007)
•  Local algorithms for seeded detection (Spielman & Teng 2004; Andersen, Chung, Lang 2006). PCMI: Kimon gave a talk about this yesterday!
•  Overlapping algorithms
•  Embeddings
•  And more!


But current problems are much richer than when spectral clustering was designed

Spectral clustering is theoretically justified for undirected, simple graphs.
Many datasets are directed, weighted, signed, colored, layered, ...

[Figure: signed regulatory patterns. "+": X causes Y to be expressed. "-": Z represses Y.]

R. Milo, 2002, Science

Our contributions
1.  A generalized conductance metric for motifs
2.  A new spectral clustering algorithm to minimize the generalized conductance.
3.  AND an associated Cheeger inequality.
4.  Aquatic layers in food webs
5.  Control structures in neural networks
6.  Hub structure in transportation networks
7.  Anomaly detection in Twitter

Benson, Gleich, Leskovec, Science 2016.


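Motif-based conductance generalizes edge-based conductance

Need notions of cut and volume!

[Figure: a set S and its complement S̄, once with the cut edges marked and once with the cut triangles marked]

Edges cut:

$$\phi(S) = \frac{\#(\text{edges cut})}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}, \qquad \mathrm{vol}(S) = \#(\text{edge end points in } S)$$

Triangles cut:

$$\phi_M(S) = \frac{\#(\text{triangles cut})}{\min(\mathrm{vol}_M(S), \mathrm{vol}_M(\bar{S}))}, \qquad \mathrm{vol}_M(S) = \#(\text{triangle end points in } S)$$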
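To make the triangle version concrete, here is a small Python sketch that counts cut triangles and triangle end points by brute force. It assumes an undirected graph that contains at least one triangle on each side of the split; the function names and the toy graph are illustrative, and the cubic-time enumeration is only meant for small examples.

```python
from itertools import combinations


def triangles(adj):
    """List every triangle once, as a sorted tuple (i, j, k)."""
    return [
        (i, j, k)
        for i, j, k in combinations(sorted(adj), 3)
        if j in adj[i] and k in adj[i] and k in adj[j]
    ]


def motif_conductance(adj, S):
    """Return (triangles cut, vol_M(S), vol_M(S-bar), phi_M(S))."""
    S = set(S)
    tris = triangles(adj)
    # A triangle is cut when it has nodes on both sides of the split.
    cut = sum(1 for t in tris if 0 < sum(v in S for v in t) < 3)
    # Motif volume counts triangle end points that fall inside the set.
    vol_S = sum(sum(v in S for v in t) for t in tris)
    vol_rest = 3 * len(tris) - vol_S
    return cut, vol_S, vol_rest, cut / min(vol_S, vol_rest)


# Toy graph (assumed): a strip of four triangles
# (0,1,2), (1,2,3), (2,3,4), (3,4,5).
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
print(motif_conductance(adj, {0, 1, 2}))
```

For S = {0, 1, 2}, two of the four triangles straddle the split and each side holds six triangle end points, so the motif conductance is 2/6.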


An example of motif-conductance

[Figure: the example graph split into S and S̄, with the motif shown alongside]

$$\phi_M(S) = \frac{\text{motifs cut}}{\text{motif volume}} = \frac{1}{10}$$


Going from motifs back to a matrix for spectral clustering

[Figure: the example graph A and the weighted graph $W^{(M)}$ built from it]

$W^{(M)}_{ij}$ = counts co-occurrences of motif pattern between i, j
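As an illustration of this construction, the sketch below builds the motif adjacency weights for the simplest case, the undirected triangle. The function name `triangle_motif_adjacency`, the nested-dictionary layout, and the toy graph are assumptions made for this example; the talk's own code handles directed motifs with sparse matrix products.

```python
from itertools import combinations


def triangle_motif_adjacency(adj):
    """W[i][j] = number of triangles that contain both i and j."""
    nodes = sorted(adj)
    W = {u: {} for u in nodes}
    for i, j, k in combinations(nodes, 3):
        if j in adj[i] and k in adj[i] and k in adj[j]:
            # Each triangle adds one co-occurrence to each of its three pairs.
            for a, b in ((i, j), (i, k), (j, k)):
                W[a][b] = W[a].get(b, 0) + 1
                W[b][a] = W[b].get(a, 0) + 1
    return W


# Toy graph (assumed): a strip of four triangles plus a pendant edge (5, 6)
# that is part of no triangle.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4),
         (3, 4), (3, 5), (4, 5), (5, 6)]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
W = triangle_motif_adjacency(adj)
print(W[1][2], W[0][1], W[6])
```

Edges shared by two triangles get weight 2, edges in one triangle get weight 1, and the pendant edge (5, 6) disappears entirely, so node 6 is disconnected in the weighted graph.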



KEY INSIGHT: Spectral clustering on $W^{(M)}$ yields results on the new motif notion of conductance

$$\phi_M(S) = \frac{\text{motifs cut}}{\text{motif volume}} = \frac{1}{10}$$


A motif-based clustering algorithm
1.  Form weighted graph $W^{(M)}$
2.  Compute the Fiedler vector associated with λ2 of the motif-normalized Laplacian
3.  Run a (motif-cond) sweep cut on f

[Figure: the example graph and its weighted graph $W^{(M)}$]

$$D = \mathrm{diag}(W^{(M)} e)$$

$$L^{(M)} = D^{-1/2} (D - W^{(M)}) D^{-1/2}$$

$$L^{(M)} z = \lambda_2 z$$

$$f^{(M)} = D^{-1/2} z$$
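The three steps can be put together in a short NumPy sketch for the undirected triangle motif. This is an illustration under stated assumptions, not the authors' implementation: it uses a dense matrix and `numpy.linalg.eigh` rather than a sparse eigensolver, it drops nodes that take part in no triangle, and the function name and test graph are invented for the example.

```python
import numpy as np


def motif_spectral_cluster(A):
    """Triangle-motif spectral clustering of a symmetric 0/1 adjacency matrix."""
    A = np.asarray(A, dtype=float)
    W = (A @ A) * A                      # W[i, j] = triangles through edge (i, j)
    keep = np.flatnonzero(W.sum(axis=1) > 0)   # nodes in at least one triangle
    W = W[np.ix_(keep, keep)]
    d = W.sum(axis=1)                    # D = diag(W e)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    f = d_inv_sqrt * vecs[:, 1]          # f = D^{-1/2} z for lambda_2
    order = np.argsort(f)
    x = np.zeros(len(d))
    best_phi, best_k = np.inf, 0
    for k, i in enumerate(order[:-1], start=1):   # sweep over prefix sets
        x[i] = 1.0
        cut = x @ (np.diag(d) - W) @ x
        vol = x @ d
        phi = cut / min(vol, d.sum() - vol)
        if phi < best_phi:
            best_phi, best_k = phi, k
    return sorted(keep[order[:best_k]].tolist()), best_phi


def clique_edges(nodes):
    return [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]


# Toy graph (assumed): two 4-cliques joined through the triangle (3, 4, 5).
edges = clique_edges([0, 1, 2, 3]) + clique_edges([4, 5, 6, 7]) + [(3, 4), (3, 5)]
A = np.zeros((8, 8))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
S, phi = motif_spectral_cluster(A)
print(S, phi)
```

On this graph the sweep should return one of the two cliques (which one depends on the sign of the computed eigenvector), with a single triangle cut against a motif volume of 13.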


The sweep cut results

[Figure: sweep profile of motif conductance against prefix size, with the nodes taken in the order from the Fiedler vector. The best higher-order cluster and the 2nd best higher-order cluster are marked on the example graph.]


The motif-based Cheeger inequality

THEOREM: If the motif has three nodes, then the sweep procedure on the weighted graph finds a set S of nodes for which

$$\phi_M(S) \le 4 \sqrt{\phi_M^*}$$

THEOREM: For more than 4 nodes, we use a slightly altered conductance.

Key Proof Step:

$$\mathrm{cut}_M(S, G) = \sum_{\{i,j,k\} \in M(G)} \mathrm{Indicator}[x_i, x_j, x_k \text{ not the same}] = \text{quadratic in } x$$

$M(G)$ = {instances of M in G}
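The "quadratic in x" claim can be written out explicitly for a three-node motif. The derivation below is a sketch that assumes the encoding $x_i = +1$ for $i \in S$ and $x_i = -1$ otherwise (a choice made here for illustration), with $W^{(M)}$ the motif co-occurrence matrix.

```latex
\begin{align*}
\mathrm{Indicator}[x_i, x_j, x_k \text{ not the same}]
  &= \tfrac{1}{4}\left(3 - x_i x_j - x_j x_k - x_i x_k\right) \\
  &= \tfrac{1}{8}\left[(x_i - x_j)^2 + (x_j - x_k)^2 + (x_i - x_k)^2\right], \\
\mathrm{cut}_M(S, G)
  &= \sum_{\{i,j,k\} \in M(G)}
     \tfrac{1}{8}\left[(x_i - x_j)^2 + (x_j - x_k)^2 + (x_i - x_k)^2\right] \\
  &= \tfrac{1}{8} \sum_{i < j} W^{(M)}_{ij} (x_i - x_j)^2
   = \tfrac{1}{2}\, \mathrm{cut}_{W^{(M)}}(S).
\end{align*}
```

Under the same assumptions each motif instance containing node i adds 2 to the weighted degree of i, so $\mathrm{vol}_M(S) = \tfrac{1}{2} \mathrm{vol}_{W^{(M)}}(S)$ as well. The two factors of one half cancel, which is why the ordinary conductance in the weighted graph equals the motif conductance for three-node motifs.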


Awesome advantages

We inherit 40+ years of research!
•  Fast algorithms (ARPACK, etc.)
•  Local methods
•  Overlapping
•  Easy to implement (20 lines of Matlab/Julia)
•  Scalable (1.4B-edge graphs are not a problem)


function [S, conductances] = MotifClusterM36(A)
% Spectral clustering for motif M_3^6 on a directed adjacency matrix A.
n = size(A, 1);                 % number of nodes (was undefined in the original listing)
B = spones(A & A');             % bidirectional links
U = A - B;                      % unidirectional links
W = (B * U') .* U' + (U * B) .* U + (U' * U) .* B;  % Motif M_3^6
D = diag(sum(W));
Ln = speye(size(W, 1)) - sqrt(D)^(-1) * W * sqrt(D)^(-1);
[Z, ~] = eigs(Ln, 2, 'sm');
[~, order] = sort(sqrt(D)^(-1) * Z(:, 2));
% Sweep cut: add nodes in eigenvector order and track the motif conductance.
conductances = zeros(n, 1);
x = zeros(n, 1);
for i = 1:n
    x(order(i)) = 1;
    xn = ~x + 0;
    conductances(i) = x' * (D - W) * x / min(x' * D * x, xn' * D * xn);
end
[~, split] = min(conductances);
S = order(1:split);

Case studies

An intro note!
1.  Aquatic layers in food webs. Signed patterns in regulatory networks
2.  Control structures in neural networks
3.  Hub structure in transportation networks.
4.  Scaling and large data


NOTE: The partition depends on the motif

[Figure: two copies of the example graph, illustrating the partitions obtained with different motifs]


Case study 1: Motifs partition the food webs

Food webs model energy exchange in species of an ecosystem.

i -> j means i’s energy goes to j (or j eats i).

Via Cheeger, motif conductance is better than edge conductance.


Demo


Case study 1: Motifs partition the food webs

Micronutrient sources

Benthic fishes

Benthic macroinvertebrates

Pelagic fishes and benthic prey

Motif M6 reveals aquatic layers.

Figure 1: Higher-order network structures and the higher-order network clustering framework. A: Higher-order structures are captured by network motifs. For example, all 13 connected three-node directed motifs are shown here. B: Clustering of a network based on motif M7. For a given motif M, our framework aims to find a set of nodes S that minimizes motif conductance, φM(S), which we define as the ratio of the number of motifs cut (filled triangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (11). In this case, there is one motif cut. C: The higher-order network clustering framework. Given a graph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix (WM) by counting the number of times two nodes co-occur in an instance of the motif. An eigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. The ordering σ of the nodes provided by the components of the eigenvector (13) produces nested sets Sr = {σ1, . . . , σr} of increasing size r. We prove that the set Sr with the smallest motif-based conductance, φM(Sr), is a near-optimal higher-order cluster (11).

84% accuracy vs. 69% for other methods


Case study 2: Nictation control in neural network

Panel (d) from "Nictation, a dispersal behavior of the nematode Caenorhabditis elegans, is regulated by IL2 neurons", Lee et al., Nature Neuroscience.

We find the control mechanism that explains this based on the bi-fan motif (Milo et al. found it over-expressed).

Nictation: standing on a tail and waving

[Figure panels A, B, C]


Case study 3: Rich structure beyond clusters

North American air transport network. Nodes are airports. Edges reflect reachability, and are unweighted. (Based on Frey et al. 2007)


We can use complex motifs with non-anchored nodes

Figure 4: Higher-order spectral analysis of a network of airports in Canada and the United States (22). A: The three higher-order structures used in our analysis. Each motif is “anchored” by the blue nodes i and j, which means our framework only seeks to cluster together the blue nodes. Specifically, the motif adjacency matrix adds weight to the (i, j) edge based on the number of third intermediary nodes (green squares). The first two motifs correspond to highly-connected cities and the motif on the right connects non-hubs to non-hubs. B: The top 50 most populous cities in the United States which correspond to nodes in the network. The edge thickness is proportional to the weight in the motif adjacency matrix WM. The thick, dark lines indicate that large weights correspond to popular mainline routes. C: Embedding of nodes provided by their corresponding components of the first two non-trivial eigenvectors of the normalized Laplacian for WM. The marked cities are eight large U.S. hubs (green), three West coast non-hubs (red), and three East coast non-hubs (purple). The primary spectral coordinate (left to right) reveals how much of a hub the city is, and the second spectral coordinate (top to bottom) captures West-East geography (11). D: Embedding of nodes provided by their corresponding components in the first two non-trivial eigenvectors of the standard, edge-based (non-higher-order) normalized Laplacian. This method does not capture the hub and geography found by the higher-order method. For example, Atlanta, the largest hub, is in the center of the embedding, next to Salina, a non-hub.

Counts length-two walks
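The anchored weighting, where the weight on (i, j) is the number of intermediary nodes, can be sketched as a count of length-two walks. This Python sketch is a simplification: it treats the network as undirected and uses an invented hub-and-spoke toy graph, whereas the motifs in the figure are directed; the function name is illustrative.

```python
def length_two_walk_counts(adj):
    """counts[i][j] = number of intermediaries k with i - k - j, for i != j."""
    counts = {u: {} for u in adj}
    for k in adj:                       # k plays the role of the intermediary
        for i in adj[k]:
            for j in adj[k]:
                if i != j:
                    counts[i][j] = counts[i].get(j, 0) + 1
    return counts


# Toy network (assumed): two hubs connected to each other and to three spokes.
EDGES = [("H1", "H2"), ("H1", "a"), ("H2", "a"),
         ("H1", "b"), ("H2", "b"), ("H1", "c"), ("H2", "c")]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
counts = length_two_walk_counts(adj)
print(counts["H1"]["H2"], counts["a"]["b"], counts["H1"]["a"])
```

The hub pair receives the largest weight because every spoke is an intermediary between the hubs, which matches the observation that heavy weights pick out hub-to-hub routes.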


The weighting alone reveals hub-like structure


The motif embedding shows this structure and splits into east-west

Top 10 U.S. hubs

East coast non-hubs

West coast non-hubs

Primary spectral coordinate

Atlanta, the top hub, is next to Salina, a non-hub.

MOTIF SPECTRAL EMBEDDING

EDGE SPECTRAL EMBEDDING


Case study 4: Large scale stuff

The up-linked triangle finds an anomalous cluster in Twitter.

Anomalous cluster in the 1.4B edge Twitter graph. All nodes are holding accounts for a company, and the orange nodes have incomplete profiles.


Related work.

•  Laplacian we propose was originally proposed by Rodríguez [2004] and again by Zhou et al. [2006]. Our new theory (motif Cheeger inequality) explains why these were good ideas.

•  Falls under general strategy of encoding hypergraph partitioning problem as graph clustering problem [Agarwal+ 06]

•  Serrour, Arenas, and Gómez, Detecting communities of triangles in complex networks using spectral optimization, 2011.

•  Arenas et al., Motif-based communities in complex networks, 2008.


Paper: Benson, Gleich, Leskovec, Science, 2016
1.  A generalized conductance metric for motifs
2.  A new spectral clustering algorithm to minimize the generalized conductance.
3.  AND an associated Cheeger inequality.
4.  Aquatic layers in food webs
5.  Control structures in neural networks
6.  Hub structure in transportation networks
7.  Anomaly detection in Twitter
8.  Lots of cool stuff on signed networks.

Thank you!

Joint work with Austin Benson and Jure Leskovec, Stanford. Supported by NSF CAREER CCF-1149756, IIS-1422918, IIS-, DARPA SIMPLEX
