modularity and community structure in networks*


Modularity and Community Structure in Networks*
* M. E. J. Newman, PNAS, 2006

Networks
A network is represented by a graph $G(V,E)$: $V$ = nodes, $E$ = edges (links between node pairs).
Examples of real-life networks:
- social networks ($V$ = people)
- the World Wide Web ($V$ = web pages)
- protein-protein interaction networks ($V$ = proteins)

Protein-protein Interaction Networks
Nodes: proteins (6K). Edges: interactions (15K). The network reflects the cell's machinery and signaling pathways.

Communities (clusters) in a network
A community (cluster) is a densely connected group of vertices, with only sparser connections to other groups.

Searching for communities in a network
There are numerous algorithms with different "target functions":
- "Homogeneity": densely connected clusters
- "Separation": graph partitioning, min-cut approach
Clustering is important for:
- understanding the structure of the network
- providing an overview of the network

Distilling Modules from Networks
Motivation: identifying protein complexes responsible for certain functions in the cell.

Modularity (Newman)

Modularity of a division (Q)

Q = #(edges within groups) - E(#(edges within groups in a RANDOM graph with the same node degrees))
Trivial division: all vertices in one group ==> Q(trivial division) = 0

(Figure label: edges within groups)
$k_i$ = degree of node $i$
$M = \sum_i k_i = 2|E|$
$A_{ij} = 1$ if $(i,j) \in E$, 0 otherwise
$E_{ij}$ = expected number of edges between $i$ and $j$ in a random graph with the same node degrees.
Lemma: $E_{ij} \approx k_i k_j / M$
$Q = \sum (A_{ij} - k_i k_j / M \mid i, j \text{ in the same group})$

Modularity

Are the two definitions of modularity equivalent?

Methods to Optimize Q
Fast modularity:
- Greedy iterative agglomeration of small communities
- At each step, choose the join that results in the greatest increase (or smallest decrease) in Q
- Can be generalized to weighted networks
Extreme methods: simulated annealing, genetic algorithms (GA)
Heuristic algorithm
Spectral partitioning

Important features of Newman's clustering algorithm
- The number and size of the clusters are determined by the algorithm
- It attempts to find a division that maximizes a modularity score Q (a heuristic algorithm)
- It reports when the network is non-modular

Algorithm 1: Division into two groups (1)
Suppose we have n vertices {1,...,n}.
s = a {±1} vector of size n that represents a 2-division:
$s_i = s_j$ iff $i$ and $j$ are in the same group
$\tfrac{1}{2}(s_i s_j + 1) = 1$ if $s_i = s_j$, 0 otherwise

==> $Q = \sum (A_{ij} - k_i k_j / M \mid i, j \text{ in the same group}) = \tfrac{1}{2} \sum_{i,j} (A_{ij} - k_i k_j / M)(s_i s_j + 1)$
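The definition of Q restated above can be evaluated directly from an adjacency matrix and a group assignment. The following is a minimal sketch, not the project's implementation: it assumes a dense 0/1 adjacency matrix in row-major order (the project itself works with sparse matrices), and the function name `modularity` is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Q = sum over ordered pairs (i,j) in the same group of (A_ij - k_i*k_j/M).
 * adj:   dense n*n 0/1 adjacency matrix, row-major (assumption for brevity)
 * group: group id of every node */
double modularity(const int *adj, const int *group, int n)
{
    assert(adj != NULL && group != NULL && n > 0);
    double *k = calloc((size_t)n, sizeof *k);   /* node degrees */
    assert(k != NULL);
    double M = 0.0;                             /* M = sum of degrees = 2|E| */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            k[i] += adj[i * n + j];
        M += k[i];
    }
    double Q = 0.0;
    if (M > 0.0) {                              /* a graph without edges has Q = 0 */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (group[i] == group[j])
                    Q += adj[i * n + j] - k[i] * k[j] / M;
    }
    free(k);
    return Q;
}
```

For two triangles joined by a single edge, the division into the two triangles gives Q = 12 - 7 = 5 under this definition, and the trivial division gives Q = 0, as stated on the earlier slide.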

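Algorithm 1: Division into two groups (2)

Since $\sum_{i,j} (A_{ij} - k_i k_j / M) = 0$, the constant term drops out:
$Q = \tfrac{1}{2} \sum_{i,j} B_{ij} s_i s_j = \tfrac{1}{2} s^T B s$

where $B_{ij} = A_{ij} - k_i k_j / M$.
B = the modularity matrix:
- symmetric
- every row sums to 0
- hence 0 is an eigenvalue of B

Modularity matrix: example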

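Algorithm 1: Division into two groups (3)
Which vector s maximizes Q? A vector s parallel to $u_1$ (the eigenvector of the largest eigenvalue) maximizes $s^T B s$, but $u_1$ may not be a {±1} vector.
Greedy heuristic: choose s as close to $u_1$ as possible: $s_i = +1$ if $(u_1)_i > 0$, $s_i = -1$ otherwise.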

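$\beta_1 \ge \beta_2 \ge \dots \ge \beta_n$: B's eigenvalues
$u_1, \dots, u_n$: B's corresponding eigenvectors, $B u_i = \beta_i u_i$
B is symmetric ==> B is diagonalizable (with real eigenvalues)
Writing $s = \sum_i a_i u_i$ with orthonormal $u_i$ gives $s^T B s = \sum_i a_i^2 \beta_i$ and $n = \|s\|^2 = \sum_i a_i^2$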
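As an illustration of the two ingredients used so far, the sketch below builds a dense modularity matrix and turns an eigenvector into a {±1} division by sign. It is a sketch under assumptions: the dense row-major layout and the names `modularity_matrix` and `split_by_sign` are hypothetical, and the `IS_POSITIVE` tolerance is the one the slides introduce later for robust sign checks.

```c
#include <assert.h>
#include <stdlib.h>

#define IS_POSITIVE(X) ((X) > 0.00001)

/* Fill B (n*n, row-major) with B_ij = A_ij - k_i*k_j/M.
 * adj is a dense 0/1 adjacency matrix (assumption for brevity). */
void modularity_matrix(const int *adj, int n, double *B)
{
    assert(adj != NULL && B != NULL && n > 0);
    double *k = calloc((size_t)n, sizeof *k);   /* node degrees */
    assert(k != NULL);
    double M = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            k[i] += adj[i * n + j];
        M += k[i];
    }
    assert(M > 0.0);                            /* needs at least one edge */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            B[i * n + j] = adj[i * n + j] - k[i] * k[j] / M;
    free(k);
}

/* Greedy choice of s: +1 where the eigenvector entry is positive, -1 otherwise. */
void split_by_sign(const double *u, int n, int *s)
{
    assert(u != NULL && s != NULL);
    for (int i = 0; i < n; i++)
        s[i] = IS_POSITIVE(u[i]) ? 1 : -1;
}
```

Every row of the resulting matrix sums to 0 (up to rounding), which is a cheap sanity check when debugging.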

Example: a 2-division of a social network

A network showing relationships between people in a karate club, which eventually split into two. The division algorithm predicts exactly the two groups after the split.
(Figure labels: known group leaders)
Colors match the entries of the eigenvector $u_1$: light = positive entry ($s_i = 1$), dark = negative entry ($s_i = -1$).

Dividing into more than 2 (1)
How do we divide into more than two groups?
Idea: apply the algorithm recursively to every group.
Splitting a group ==> update Q
{i,j}: the pairs that need to be updated in Q

Each term is $B_{ij}$ times a 0|1 indicator that equals 1 iff $i$ and $j$ are in the same group, 0 otherwise.

Dividing into more than 2 (2)
g = a group of $n_g$ vertices
s = a {±1} vector of size $n_g$
Compute ΔQ for a 2-division of g

$\Delta Q = \tfrac{1}{2} \sum_{i,j \in g} B_{ij}(s_i s_j + 1) - \sum_{i,j \in g} B_{ij}$ (New - Old)
New: the elements of g are split into two subgroups (corresponding to s)
Old: all the elements of g are within one group (g)

Dividing into more than 2 (3)

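$\Delta Q = \tfrac{1}{2} s^T \hat{B}[g] s$, where $\hat{B}[g]_{ij} = B[g]_{ij} - \delta_{ij} f_i(g)$
B[g] = the submatrix of B defined by g
$f_i(g)$ = sum of the i-th row of B[g]
$f_i(\{1,\dots,n\}) = 0$
$\hat{B}[g]$ = the generalized modularity matrix

Generalized modularity matrix: example
g = {1, 4, 5} (1 is the minimal index)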

What is $\hat{B}[\{1,\dots,5\}]$?

A "generalized" 2-division algorithm (divides a group in a network)

Further techniques for modularity maximization
(Combined with Newman's "generalized" 2-division algorithm)

A heuristic for 2-division
1. {g1, g2} = an initial 2-division of g
2. While there is an unmoved node:
   1. Let v be an unmoved node whose move between g1 and g2 maximizes ΔQ
   2. Move v between g1 and g2
3. From the $n_g$ 2-divisions generated in the previous step, let {g1, g2} be the one with maximum Q
4. If ΔQ > 0 ==> go to 1
Note: the last iteration of step 2 produces a 2-division that equals the initial 2-division.
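The heuristic above can be sketched as follows. This is an unoptimized illustration under stated assumptions, not the project's Algorithm 4: it takes a dense (generalized) modularity matrix, scores a division as $\tfrac{1}{2} s^T B s$, recomputes every gain from scratch (so it does not use the fast score updates discussed later), and the names `flip_gain` and `improve_division` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define IS_POSITIVE(X) ((X) > 0.00001)

/* Change in (1/2) s^T B s caused by flipping the sign of s[v]. */
static double flip_gain(const double *B, const int *s, int n, int v)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        if (j != v)
            sum += B[v * n + j] * s[j];
    return -2.0 * s[v] * sum;
}

/* Improves the {+1,-1} vector s in place; returns the total gain in Q. */
double improve_division(const double *B, int *s, int n)
{
    assert(B != NULL && s != NULL && n > 0);
    int *order = malloc((size_t)n * sizeof *order);  /* order of the moves */
    char *moved = malloc((size_t)n);
    assert(order != NULL && moved != NULL);
    double total = 0.0, best;
    do {
        double running = 0.0;        /* gain accumulated in this pass */
        int best_step = -1;          /* -1: the starting division is best */
        best = 0.0;
        for (int v = 0; v < n; v++)
            moved[v] = 0;
        for (int step = 0; step < n; step++) {
            int pick = -1;
            double pick_gain = 0.0;
            for (int v = 0; v < n; v++) {            /* best unmoved node */
                if (moved[v])
                    continue;
                double gain = flip_gain(B, s, n, v);
                if (pick < 0 || gain > pick_gain) {
                    pick = v;
                    pick_gain = gain;
                }
            }
            s[pick] = -s[pick];                      /* move it */
            moved[pick] = 1;
            order[step] = pick;
            running += pick_gain;
            if (IS_POSITIVE(running - best)) {
                best = running;
                best_step = step;
            }
        }
        /* Undo the moves made after the best intermediate division. */
        for (int step = n - 1; step > best_step; step--)
            s[order[step]] = -s[order[step]];
        total += best;
    } while (IS_POSITIVE(best));
    free(order);
    free(moved);
    return total;
}
```

Each pass costs O(n³) in this naive form, which is why the later slides on fast score computations matter for large networks.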

2. While there is an unmoved node:
   1. Let v be an unmoved node whose move between g1 and g2 maximizes ΔQ
   2. Move v between g1 and g2
(Code annotations: computing ΔQ for each node; choosing j' with maximum ΔQ; moving j' and storing its ΔQ)

Algorithm 4 (cont.)

3. From the $n_g$ 2-divisions generated in the previous step, let {g1, g2} be the one with maximum Q
4. If ΔQ > 0 ==> go to 1

Finding the leading eigenpair
The power method

The Power Method (1)
A = a diagonalizable matrix
Let $(\lambda_1, V_1), \dots, (\lambda_n, V_n)$ be n eigenpairs of A, where $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$
The power method finds the dominant eigenpair of A, i.e. $(\lambda_1, V_1)$
(Note that $\lambda_1$ is not necessarily the leading eigenvalue)
$X_0$ = any vector. $X_0 = c_1 V_1 + \dots + c_n V_n$, where $c_i = X_0 \cdot V_i$

The Power Method (2)
$X_1 = A X_0 = A(c_1 V_1 + \dots + c_n V_n) = c_1 A V_1 + \dots + c_n A V_n = c_1 \lambda_1 V_1 + \dots + c_n \lambda_n V_n$
$X_2 = A^2 X_0 = A X_1 = A(c_1 \lambda_1 V_1 + \dots + c_n \lambda_n V_n) = c_1 \lambda_1^2 V_1 + \dots + c_n \lambda_n^2 V_n$
...
$X_m = A^m X_0 = A X_{m-1} = A(c_1 \lambda_1^{m-1} V_1 + \dots + c_n \lambda_n^{m-1} V_n) = c_1 \lambda_1^m V_1 + \dots + c_n \lambda_n^m V_n \approx c_1 \lambda_1^m V_1$
if m is large enough

Power Method (3)
Suppose $V_1 \cdot Y \ne 0$. For m large enough:

$X_m = A X_{m-1} = A^m X_0$

For simplicity, $Y = X_m$

Power method: example

Example:

We perform only matrix-vector multiplications!

Convergence usually occurs within O(n) iterations.

Power method: convergence condition

To avoid numerical problems due to large numbers, normalize $X_i$ before computing $X_{i+1} = A X_i$:
$X_0 = X / \|X\|$
$X_1 = A X_0 / \|A X_0\|$
$X_2 = A X_1 / \|A X_1\|$
...
The iteration stops once the desired precision is reached.

Finding the leading eigenpair using matrix shifting
Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ be the eigenvalues of A, and $U_1, \dots, U_n$ their corresponding eigenvectors.
$\|A\|_1 \ge \max_i |\lambda_i|$ (exercise)
Q: What is the dominant eigenpair of $A + \|A\|_1 I$?
A: $(\lambda_1 + \|A\|_1, U_1)$
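The normalization and matrix-shifting ideas above combine into a short routine. The sketch below is one possible arrangement, not the project's reference code: it assumes a dense symmetric matrix, normalizes by the largest absolute entry rather than the Euclidean norm (to stay free of `sqrt`), stops when no entry of the iterate changes by more than `eps`, and uses the hypothetical names `norm1` and `leading_eigenpair`.

```c
#include <assert.h>
#include <stdlib.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* ||A||_1: the maximum absolute column sum. */
static double norm1(const double *A, int n)
{
    double best = 0.0;
    for (int j = 0; j < n; j++) {
        double col = 0.0;
        for (int i = 0; i < n; i++)
            col += abs_d(A[i * n + j]);
        if (col > best)
            best = col;
    }
    return best;
}

/* Power iteration on A + ||A||_1 * I for a dense symmetric A (row-major).
 * x holds the start vector on entry and the eigenvector estimate on exit.
 * Returns the estimate of the leading (largest) eigenvalue of A. */
double leading_eigenpair(const double *A, int n, double *x,
                         double eps, int max_iter)
{
    assert(A != NULL && x != NULL && n > 0 && eps > 0.0);
    double shift = norm1(A, n);
    double *y = malloc((size_t)n * sizeof *y);
    assert(y != NULL);
    double lambda = 0.0;
    for (int it = 0; it < max_iter; it++) {
        double xy = 0.0, xx = 0.0, ymax = 0.0;
        for (int i = 0; i < n; i++) {            /* y = (A + shift*I) x */
            double acc = shift * x[i];
            for (int j = 0; j < n; j++)
                acc += A[i * n + j] * x[j];
            y[i] = acc;
            xy += x[i] * acc;
            xx += x[i] * x[i];
            if (abs_d(acc) > ymax)
                ymax = abs_d(acc);
        }
        if (xx == 0.0 || ymax == 0.0)
            break;                               /* degenerate start vector */
        lambda = xy / xx - shift;                /* Rayleigh quotient, unshifted */
        double change = 0.0;
        for (int i = 0; i < n; i++) {            /* normalize and compare */
            double xn = y[i] / ymax;
            if (abs_d(xn - x[i]) > change)
                change = abs_d(xn - x[i]);
            x[i] = xn;
        }
        if (change < eps)
            break;
    }
    free(y);
    return lambda;
}
```

For a modularity matrix the all-ones vector is an eigenvector with eigenvalue 0, so it is a poor start vector; a random or otherwise non-constant $X_0$ is the safer choice.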

Implementation
Robustness and Efficiency

Checking "positiveness"
#define IS_POSITIVE(X) ((X) > 0.00001)
Instead of "x > 0" ==> use IS_POSITIVE(X)

Efficient multiplications in the (extended) modularity matrix: O(n) instead of O(n²)

$((\hat{B}[g] + \|\hat{B}[g]\|_1 I)\,x)_i = (A[g]x)_i - k_i \, (k[g]^T x)/M - f_i(g)\,x_i + \|\hat{B}[g]\|_1 x_i$
Terms, in order: multiplication in a sparse matrix; inner product; $f(g)_i x_i$; "matrix shifting"

sparse_matrix_arr
typedef struct {
    int n;        /* matrix size */
    elem* values; /* the non-zero elements, ordered by rows */
    int* colind;  /* column indices */
    int* rowptr;  /* pointers to where rows begin in the values array */
} sparse_matrix_arr;

Fast score computations

Computing ΔQ for each node naively ==> O(n²)

Computing ΔQ for each node in O(n), before moving the first node

Updating the score AFTER a move of a node k (s is already updated)
(Algorithm 4)

Project specifications

Programs
sparse_mlpl < matrix_vec.in
modularity_mat spectral_div improve_div < adj_matrix>
cluster
(Slide annotations: for the power method; computing a 2-division; the complete clustering algorithm, including the improvement)

Implementation process
- Read and understand the document
- Design ALL programs:
  - data structures
  - functions used by more than one program
- Check your code:
  - "toy" examples on the website (easy to debug)
  - your own LARGE examples
- Run your code on the yeast/fly networks

Analyzing clusters in yeast and fly protein-protein interaction networks
Input: the true PPI network + 2 random networks
Task 1: infer which network is the true one
Solution: the true network is more modular
Task 2: compute associated functions (using Cytoscape + BiNGO)

Saccharomyces cerevisiae, Drosophila melanogaster

Cytoscape, BiNGO
www.cytoscape.com (version 2.5.1)
- A framework for analyzing networks
- Provides visualization of networks and clusters
http://www.psb.ugent.be/cbd/papers/BiNGO/
- Finds functions associated with a gene cluster
- Runs from Cytoscape
- Version 2.3 is not suitable for our project (due to a bug) ==> use version 2.4 (when available) or version 2.0 (available under ~ozery/public/cytoscape-v2.5.1/plugins/BiNGO.jar).

BiNGO output (GO = Gene Ontology)

Visualization with Cytoscape

How is the project checked?
Most checks (points): "BLACK BOX"
- The common kind of check in the "real world"
- Running with fixed input files, comparing to fixed output files
- Score = #(successful checks) / #(total checks)
"WHITE BOX" checks: code review (10 points maximum)
- code simplicity / efficiency

A simple data structure for maintaining a division
typedef struct Division_ {
    int n;          /* #nodes in the network */
    int* group_ids; /* for each node: its group id (initially 0: all nodes within one group) */
    int numGroups;
    double Q;
} Division;
Complexity:
- Finding all the elements of a group: O(n)
- Splitting a group into 2: O(n)

Maintaining the generalized modularity matrix
Should we maintain the modularity matrix?
No:
1) We do not use it explicitly
2) It is a dense matrix that consumes a large amount of memory

Yes:
1) Despite its large size, it can be kept in memory
2) It can simplify code (e.g. deriving B[g] from B, computing the L1-norm)
3) It can be used to validate the correctness of optimized multiplications (debug mode only!)

Suggestion for modules
Sparse matrices:
- Data structure: sparse_matrix_lst
- Reading a sparse matrix (file / stdin)
- Multiplication by a vector
- Computing A[g]
- Methods hiding the inner structure (allows a simple replacement of sparse_matrix_lst with another data structure for holding sparse matrices)
Division
Group
The spectral algorithm:
- 2-division
- full division
The improvement algorithm
The generalized modularity matrix:
- Data structure: A[g], k[g], M, f[g], L1-norm
- Multiplication by a vector
- Computing ΔQ
- Printing the modularity matrix

Good luck! (and have fun...)
