
Source: liacs.leidenuniv.nl/~takesfw/pdf/twd2014.pdf (2014-11-04)

Algorithms for Analyzing and Mining Real-World Graphs

Frank Takes

LIACS, Leiden University, The Netherlands

November 4, 2014

This Week’s Discoveries — November 4, 2014 1 / 26


Context

Data

Unstructured data vs. structured data

Graph (network) consisting of vertices (nodes) and edges (links)

Directed or undirected

Weighted or unweighted

Possibly large: millions of nodes, billions of edges

Interest from: mathematics, computer science, physics, biology, public administration, social sciences, ...


Corporate Social Network


Protein Interaction Network


Scientific Collaboration Network


Real-World Graphs

In real-world graphs, the nodes can be actual people, devices, organisms or organizations, whereas the edges model social, technological, biological or economic relationships

Examples: online social networks, citation networks, webgraphs, communication networks, trade networks, ...

Totally different data, very similar structure:

Sparse networks with a low density

Fat-tailed power-law degree distribution

Giant component containing majority of the nodes

Low average node-to-node distance

High average clustering coefficient
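
Two of the listed properties, density and the average clustering coefficient, can be computed directly from an adjacency structure. The following is a minimal Python sketch, assuming an undirected graph stored as a dict that maps each node to a set of neighbors. The function names `density` and `average_clustering` and the toy graph are illustrative and do not come from the talk.

```python
from itertools import combinations


def density(adj):
    """Fraction of all possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        # Count the edges that exist among the neighbors of this node.
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


# Toy graph (an assumption): a triangle 0-1-2 with a pendant node 3 on node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(density(toy), average_clustering(toy))
```

On a real-world graph one would expect the first value to be very small and the second to be comparatively high, which is the combination the slide describes.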


Degree Distribution

Figure: Degree distribution of an online social network with 8 million nodes and 1 billion links. Tail runs up to 280 000. (Plot: frequency on a logarithmic axis from $10^0$ to $10^7$ against degree from 0 to 2500.)
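
A degree distribution like the one in the figure is a histogram of node degrees. As a rough sketch, assuming the graph is stored as a dict mapping each node to a set of neighbors (the name `degree_distribution` and the toy star graph are illustrative, not from the talk), it can be tallied as follows.

```python
from collections import Counter


def degree_distribution(adj):
    """Map each degree value to the number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


# Toy star graph (an assumption): node 0 is a hub connected to four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
distribution = degree_distribution(star)
for degree in sorted(distribution):
    print(degree, distribution[degree])
```

For a fat-tailed distribution the counts span many orders of magnitude, which is why the frequency axis in the figure is logarithmic.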


Distance Distribution

Figure: Distance distribution of an online social network with 8 million nodes and average distance 4.75, sampled over 100 000 node pairs. (Plot: frequency on a logarithmic axis from $10^0$ to $10^5$ against distance from 1 to 15.)
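
The caption mentions sampling node pairs rather than computing all pairwise distances. One way this could be done is sketched below, assuming a connected undirected graph stored as a dict of neighbor lists. The sampling scheme (uniform random pairs of distinct nodes) and all function names are assumptions, since the talk does not describe the procedure.

```python
import random
from collections import Counter, deque


def bfs_distance(adj, source, target):
    """Shortest-path length between source and target in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    raise ValueError("graph is not connected")


def sampled_distance_distribution(adj, samples, seed=0):
    """Histogram of distances over randomly sampled pairs of distinct nodes."""
    rng = random.Random(seed)
    nodes = list(adj)
    histogram = Counter()
    for _ in range(samples):
        v, w = rng.sample(nodes, 2)
        histogram[bfs_distance(adj, v, w)] += 1
    return histogram


def average_distance(histogram):
    """Mean of the sampled distances."""
    return sum(d * c for d, c in histogram.items()) / sum(histogram.values())
```

Each sample costs at most one BFS, so the estimate stays affordable even when the exact distribution over all pairs is out of reach.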


Challenges

Graph data storage (e.g., compression), retrieval and processing

Knowledge discovery: finding patterns (data mining), link prediction and community detection

Computation: many efficient algorithms proposed to handle general graphs (Dijkstra, Floyd-Warshall, Prim, etc.), but are they also efficient on real-world graphs given the special structure of these graphs?

Today: Computing the diameter of a real-world graph


Graph Diameter

Consider a connected undirected graph $G = (V, E)$ with $n = |V|$ nodes and $m = |E|$ edges

Distance $d(v, w)$: length of shortest path between nodes $v, w \in V$

Diameter $D(G)$: maximal distance (longest shortest path length) over all node pairs: $\max_{v, w \in V} d(v, w)$

Eccentricity $e(v)$: length of a longest shortest path from $v$: $e(v) = \max_{w \in V} d(v, w)$

Diameter $D(G)$ (alternative definition): maximal eccentricity over all nodes: $\max_{v \in V} e(v)$
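
To see that the two definitions of the diameter agree, it may help to work through a tiny instance. The path graph $a - b - c - d$ used below is an illustrative assumption, not an example from the talk.

```latex
% Path graph a - b - c - d (illustrative example).
\begin{align*}
d(a,b) &= d(b,c) = d(c,d) = 1, \qquad d(a,c) = d(b,d) = 2, \qquad d(a,d) = 3 \\
e(a) &= \max_{w \in V} d(a,w) = 3, \qquad e(b) = 2, \qquad e(c) = 2, \qquad e(d) = 3 \\
D(G) &= \max_{v,w \in V} d(v,w) = d(a,d) = 3 = \max_{v \in V} e(v)
\end{align*}
```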


Diameter Example

Figure: Graph with diameter $D(G) = 6$. Numbers denote eccentricity values.


Diameter Applications

Router networks: what is the worst-case response time between any two machines?

Social networks: in how many steps does a message released by a single user reach everyone in the network?

Biological interaction networks: which proteins are likely to not influence each other at all?

Information networks (e.g., Wikipedia): how do I change the conversation topic to a maximally different subject? ;-)

Eccentricity has been suggested as a worst-case measure of node centrality: the relative importance of a node based on the graph’s structure


Naive Algorithm

Diameter is equal to the largest value returned by an All Pairs Shortest Path (APSP) algorithm

Brute-force: for each of the $n$ nodes, execute a Breadth First Search (BFS) run in $O(m)$ time to find the eccentricity, and return the largest value found

Time complexity $O(mn)$

Problematic if $n$ = 8 million and $m$ = 1 billion. Then one BFS takes 6 seconds on a 3.4 GHz machine. That results in 1.5 years to compute the diameter ...
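
The brute-force approach and the running-time estimate above can be sketched in a few lines of Python. This assumes a connected, unweighted, undirected graph stored as a dict of neighbor lists. The names `eccentricity`, `naive_diameter` and `naive_runtime_years` are illustrative, and the estimate simply multiplies the slide's figures (one BFS per node at 6 seconds each).

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS distance from source; one O(m) traversal."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    return max(dist.values())


def naive_diameter(adj):
    """Run one BFS per node and return the largest eccentricity: O(mn)."""
    return max(eccentricity(adj, v) for v in adj)


def naive_runtime_years(n, seconds_per_bfs):
    """Total time for n BFS runs, expressed in years of 365 days."""
    return n * seconds_per_bfs / (365 * 24 * 3600)


# Toy path graph 0 - 1 - 2 - 3 - 4 (an assumption for illustration).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(naive_diameter(path))
print(round(naive_runtime_years(8_000_000, 6), 2))
```

With 8 million BFS runs of 6 seconds each, the total is 48 million seconds, which is roughly 556 days and consistent with the 1.5 years quoted above.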


BoundingDiameters Algorithm

Input: Graph G
Output: Diameter of G

W ← V;  D_ℓ ← −∞;  D_u ← +∞
for w ∈ W do
    e_ℓ[w] ← −∞
    e_u[w] ← +∞
end for

while D_ℓ ≠ D_u and W ≠ ∅ do
    v ← SelectFrom(W)
    e[v] ← Eccentricity(v)
    D_ℓ ← max(D_ℓ, e[v])
    D_u ← min(D_u, 2 · e[v])
    for w ∈ W do
        e_ℓ[w] ← max(e_ℓ[w], max(e[v] − d(v,w), d(v,w)))
        e_u[w] ← min(e_u[w], e[v] + d(v,w))
        if (e_u[w] ≤ D_ℓ and e_ℓ[w] ≥ D_u/2) or (e_ℓ[w] = e_u[w]) then
            W ← W − {w}
        end if
    end for
    D_u ← min(D_u, max_{w∈V}(e_u[w]))
end while

return D_ℓ
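
The pseudocode above translates fairly directly into Python. The following is a sketch rather than the author's implementation: it assumes a connected, unweighted, undirected graph stored as a dict of neighbor lists, and the default `select_from` (pick the highest-degree candidate) is a placeholder assumption standing in for `SelectFrom`. The count of BFS runs is returned only to make the savings visible.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Distances from source to every node of a connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    return dist


def bounding_diameters(adj, select_from=None):
    """Return (diameter, number of BFS runs) following the pseudocode."""
    if select_from is None:
        # Placeholder strategy (an assumption): highest-degree candidate.
        select_from = lambda W, lo, hi: max(W, key=lambda w: len(adj[w]))
    ecc_lo = dict.fromkeys(adj, -math.inf)
    ecc_hi = dict.fromkeys(adj, math.inf)
    candidates = set(adj)
    d_lo, d_hi = -math.inf, math.inf
    bfs_runs = 0
    while d_lo != d_hi and candidates:
        v = select_from(candidates, ecc_lo, ecc_hi)
        dist = bfs_distances(adj, v)
        ecc_v = max(dist.values())
        bfs_runs += 1
        d_lo = max(d_lo, ecc_v)
        d_hi = min(d_hi, 2 * ecc_v)
        for w in list(candidates):
            d = dist[w]
            ecc_lo[w] = max(ecc_lo[w], ecc_v - d, d)
            ecc_hi[w] = min(ecc_hi[w], ecc_v + d)
            # Drop w if it cannot change either diameter bound,
            # or if its eccentricity is now known exactly.
            cannot_help = ecc_hi[w] <= d_lo and ecc_lo[w] >= d_hi / 2
            if cannot_help or ecc_lo[w] == ecc_hi[w]:
                candidates.discard(w)
        d_hi = min(d_hi, max(ecc_hi.values()))
    return d_lo, bfs_runs


# Toy star graph (an assumption): hub 0 with five leaves, diameter 2.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(bounding_diameters(star))
```

The selected node always ends up with matching lower and upper bounds, so it leaves the candidate set and the loop terminates after at most n BFS runs.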


Social Network Example (1)

If I am connected to everyone in at most 6 steps, then

My direct friend is connected to everyone in at most 7 steps (he reaches everyone through me)

My direct friend is connected to everyone in at least 5 steps (I reach everyone through him)

If I can reach everyone in the network in 6 steps, then

There is nobody who can reach everyone in less than 3 steps (or I could have utilized him)

There is nobody who needs more than 12 steps to reach everyone (or he could have utilized me)


Social Network Example (2)

If a node $v$ has eccentricity $e(v)$, then

A node $w$ at distance $d(v, w)$ needs at most $e(v) + d(v, w)$ steps ($w$ reaches every node via $v$)

A node $w$ at distance $d(v, w)$ needs at least $e(v) - d(v, w)$ steps ($v$ reaches every node via $w$)

We call these the eccentricity bounds

If a node $v$ can reach every other node in $e(v)$ steps, then

There is no node that can reach everyone in fewer than $\lceil e(v)/2 \rceil$ steps (or $v$ could have used that node)

There is no node that needs more than $2 \cdot e(v)$ steps to reach all other nodes (or that node could have used $v$)

We call these the diameter bounds
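
Both sets of bounds follow from the triangle inequality on shortest-path distances. The derivation below is a sketch added for clarity; it uses only the definitions of $d$, $e$ and $D(G)$ given earlier, plus the fact that $d(v,w) \le e(v)$ and $d(v,w) \le e(w)$.

```latex
% Triangle inequality: d(w,u) <= d(w,v) + d(v,u) for all u, v, w in V.
\begin{align*}
e(w) &= \max_{u \in V} d(w,u) \le d(v,w) + \max_{u \in V} d(v,u) = e(v) + d(v,w) \\
e(v) &\le e(w) + d(v,w) \;\Longrightarrow\; e(w) \ge e(v) - d(v,w),
  \qquad e(w) \ge d(v,w) \\
\max\bigl(e(v) - d(v,w),\; d(v,w)\bigr) &\le e(w) \le e(v) + d(v,w)
\end{align*}
% Substituting d(v,w) <= e(v) in the upper bound and d(v,w) <= e(w) in the lower bound:
\begin{align*}
\lceil e(v)/2 \rceil \le e(w) \le 2\, e(v) \quad \text{for all } w \in V,
  \qquad e(v) \le D(G) \le 2\, e(v)
\end{align*}
```

With $e(v) = 6$ and $d(v,w) = 1$ this gives $5 \le e(w) \le 7$ for a direct neighbor and $3 \le e(w) \le 12$ for any node, matching the numbers in the social network example.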


BoundingDiameters Algorithm

Initialize candidate set $W$ to $V$

While $D_L(G) \neq D_U(G)$:

1. Select a node $v$ from $W$ according to some selection strategy
2. Compute $v$’s eccentricity, and update $e_L(w)$ and $e_U(w)$ for every node $w \in W$ according to the eccentricity bounds
3. Update the diameter bounds $D_L(G)$ and $D_U(G)$
4. Remove nodes $w$ that can no longer contribute to refining the diameter bounds

Selection strategy: alternate between smallest eccentricity lower bound and largest upper bound, break ties by taking the node with the highest degree
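
The alternating selection strategy could be written as a small stateful selector. This is a sketch under assumptions: the bounds are kept in dicts keyed by node, the selector starts with the largest upper bound (the talk does not say which side goes first), and the name `make_alternating_selector` is hypothetical.

```python
def make_alternating_selector(degree):
    """Build a selector that alternates between the two bound criteria.

    degree maps each node to its degree and is used only to break ties.
    """
    state = {"pick_upper": True}  # assumption: start with the upper bound

    def select(candidates, ecc_lo, ecc_hi):
        if state["pick_upper"]:
            # Largest eccentricity upper bound, ties to the highest degree.
            v = max(candidates, key=lambda w: (ecc_hi[w], degree[w]))
        else:
            # Smallest eccentricity lower bound, ties to the highest degree.
            v = min(candidates, key=lambda w: (ecc_lo[w], -degree[w]))
        state["pick_upper"] = not state["pick_upper"]
        return v

    return select


# Hypothetical bounds for three candidate nodes.
degree = {"a": 1, "b": 3, "c": 2}
lower = {"a": 2, "b": 2, "c": 3}
upper = {"a": 6, "b": 5, "c": 6}
select = make_alternating_selector(degree)
first = select({"a", "b", "c"}, lower, upper)
second = select({"a", "b", "c"}, lower, upper)
print(first, second)
```

Nodes with a large upper bound are likely peripheral and tend to raise the diameter lower bound, while nodes with a small lower bound are likely central and tend to tighten the upper bounds of many other nodes.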


Example run (0)

Figure: example graph with 19 nodes, labeled A to T (the letter O is not used).

What is the diameter of this graph?

$D_L = -\infty$ and $D_U = \infty$


Example run (1)

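Figure: the example graph, each node annotated with its eccentricity bounds (upper/lower):

A 7/3, B 7/3, C 6/4, D 6/4, E 6/4, F 5/5, G 6/4, H 6/4, I 7/3, J 6/4, K 7/3, L 7/3, M 8/3, N 8/3, P 8/3, Q 9/4, R 9/4, S 10/5, T 10/5

Iteration 1: after computing the eccentricity of node F

$D_L = 5$ and $D_U = 10$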


Example run (2)

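Figure: the example graph, each node annotated with its eccentricity bounds (upper/lower):

A 7/7, B 7/7, C 6/6, D 6/6, E 6/6, F 5/5, G 6/5, H 6/6, I 7/6, J 6/4, K 7/7, L 7/4, M 8/4, N 8/4, P 8/5, Q 9/4, R 8/6, S 10/5, T 7/7

Iteration 2: after computing the eccentricity of node T

$D_L = 7$ and $D_U = 10$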


Example run (3)

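Figure: the example graph, each node annotated with its eccentricity bounds (upper/lower):

A 7/7, B 7/7, C 6/6, D 6/6, E 6/6, F 5/5, G 6/5, H 6/6, I 7/6, J 5/4, K 7/7, L 4/4, M 5/4, N 5/4, P 5/5, Q 6/4, R 7/6, S 7/5, T 7/7

Iteration 3: after computing the eccentricity of node L

$D_L = 7$ and $D_U = 7$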


Results

1. Eccentricity bounds difference
2. Alternate between smallest eccentricity lower bound and largest upper bound
3. Repeatedly select a node furthest away from the previous node

Dataset      Nodes      D(G)  Strat. 1  Strat. 2  Strat. 3  Pruned
AstroPhys    17,903     14    18        9         63        185
Enron        33,696     13    12        11        61        8,715
Web          855,802    24    20        4         28        91,965
YouTube      1,134,890  24    2         2         2         399,553
Flickr       1,624,992  24    10        3         7         553,242
Skitter      1,696,415  31    10        4         19        114,803
Wikipedia    2,213,236  18    21        3         583       947,582
Orkut        3,072,441  10    357       106       389       27,429
LiveJournal  5,189,809  23    6         3         14        318,378
Hyves        8,057,981  25    40        21        44        446,258

Table: Comparison of three node selection strategies


Discussion

Main result: in real-world graphs BoundingDiameters is much faster than the naive algorithm (a handful vs. $n$ BFSes)

Why does it work? There is always diversity in the eccentricity values of nodes, allowing central nodes to influence the eccentricity of peripheral nodes, and vice versa

When does it not work so well? In graphs with little diversity in the eccentricity values, e.g., circle-shaped graphs

Side result: efficiently computing derived measures such as the radius, center, periphery and even the exact eccentricity distribution is also possible (after some modifications)
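
The remark about circle-shaped graphs can be checked on a small example: in a cycle every node has the same eccentricity, whereas a path shows a wide spread. The sketch below assumes unweighted graphs stored as dicts of neighbor lists; the helper names and the graph sizes are illustrative.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS distance from source in a connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    return max(dist.values())


def cycle_graph(n):
    """Nodes 0..n-1 arranged in a ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def path_graph(n):
    """Nodes 0..n-1 arranged in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def distinct_eccentricities(adj):
    """Sorted list of the eccentricity values that occur in the graph."""
    return sorted({eccentricity(adj, v) for v in adj})


print(distinct_eccentricities(cycle_graph(10)))  # [5]
print(distinct_eccentricities(path_graph(10)))   # [5, 6, 7, 8, 9]
```

When all eccentricities coincide, one BFS leaves the bounds of the remaining nodes loose, so few candidates can be pruned and the method needs many more BFS runs.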


Conclusion

Real-world graphs have a surprisingly similar structure

Traditional algorithms for general graphs can be improved by exploiting this non-random structure

The diameter and other extreme distance measures can be computed efficiently using the BoundingDiameters algorithm

Graphs are everywhere! :-)


LCN2

Leiden Complex Networks Network — LCN2

Collaboration platform for network researchers in Leiden

Participants:

Alexandru Babeanu (LION)

Arjen Doelman (MI)

Diego Garlaschelli (LION)

Frank den Hollander (MI)

Joke Meijer (LUMC)

Aske Plaat (LIACS)

Jos Rohling (LUMC)

Frank Takes (LIACS)

Interested? Contact us!


Thank You!

Questions?
