algorithms for analyzing and mining real-world...

Post on 26-Jun-2020

6 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Algorithms for Analyzing and MiningReal-World Graphs

Frank Takes

LIACS, Leiden University, The Netherlands

November 4, 2014

This Week’s Discoveries — November 4, 2014 1 / 26

Context

Data

Unstructured data vs. structured data

Graph (network) consisting of vertices (nodes) and edges (links)

Directed or undirectedWeighted or unweighted

Possibly large: millions of nodes, billions of edges

Interest from: mathematics, computer science, physics, biology,public administration, social sciences, . . .

This Week’s Discoveries — November 4, 2014 2 / 26

Corporate Social Network

This Week’s Discoveries — November 4, 2014 3 / 26

Protein Interaction Network

This Week’s Discoveries — November 4, 2014 4 / 26

Scientific Collaboration Network

This Week’s Discoveries — November 4, 2014 5 / 26

Real-World Graphs

In real-world graphs, the nodes can be actual people, devices,organisms or organizations, whereas the edges model social,technological, biological or economic relationships

Examples: online social networks, citation networks, webgraphs,communication networks, trade networks, . . .

Totally different data, very similar structure:

Sparse networks with a low densityFat-tailed power-law degree distributionGiant component containing majority of the nodesLow average node-to-node distanceHigh average clustering coefficient

This Week’s Discoveries — November 4, 2014 6 / 26

Real-World Graphs

In real-world graphs, the nodes can be actual people, devices,organisms or organizations, whereas the edges model social,technological, biological or economic relationships

Examples: online social networks, citation networks, webgraphs,communication networks, trade networks, . . .

Totally different data, very similar structure:

Sparse networks with a low densityFat-tailed power-law degree distributionGiant component containing majority of the nodesLow average node-to-node distanceHigh average clustering coefficient

This Week’s Discoveries — November 4, 2014 6 / 26

Degree Distribution

100

101

102

103

104

105

106

107

0 500 1000 1500 2000 2500

fre

qu

en

cy

degree

Figure: Degree distribution of an online social network with 8 millionnodes and 1 billion links. Tail runs up to 280 000.

This Week’s Discoveries — November 4, 2014 7 / 26

Distance Distribution

100

101

102

103

104

105

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Fre

qu

en

cy

Distance

Figure: Distance distribution of an online social network with 8 millionnodes and average distance 4.75, sampled over 100 000 node pairs.

This Week’s Discoveries — November 4, 2014 8 / 26

Challenges

Graph data storage (e.g., compression), retrieval and processing

Knowledge discovery: finding patterns (data mining), linkprediction and community detection

Computation: many efficient algorithms proposed to handlegeneral graphs (Dijkstra, Floyd-Warshall, Prim, etc.), but arethey also efficient on real-world graphs given the specialstructure of these graphs?

Today: Computing the diameter of a real-world graph

This Week’s Discoveries — November 4, 2014 9 / 26

Graph Diameter

Consider a connected undirected graph G = (V ,E ) withn = |V | nodes and m = |E | edges

Distance d(v ,w): length of shortest path between nodesv ,w ∈ V

Diameter D(G ): maximal distance (longest shortest pathlength) over all node pairs: maxv ,w∈V d(v ,w)

Eccentricity e(v): length of a longest shortest path from v :e(v) = maxw∈V d(v ,w)

Diameter D(G ) (alternative definition): maximal eccentricityover all nodes: maxv∈V e(v)

This Week’s Discoveries — November 4, 2014 10 / 26

Graph Diameter

Consider a connected undirected graph G = (V ,E ) withn = |V | nodes and m = |E | edges

Distance d(v ,w): length of shortest path between nodesv ,w ∈ V

Diameter D(G ): maximal distance (longest shortest pathlength) over all node pairs: maxv ,w∈V d(v ,w)

Eccentricity e(v): length of a longest shortest path from v :e(v) = maxw∈V d(v ,w)

Diameter D(G ) (alternative definition): maximal eccentricityover all nodes: maxv∈V e(v)

This Week’s Discoveries — November 4, 2014 10 / 26

Graph Diameter

Consider a connected undirected graph G = (V ,E ) withn = |V | nodes and m = |E | edges

Distance d(v ,w): length of shortest path between nodesv ,w ∈ V

Diameter D(G ): maximal distance (longest shortest pathlength) over all node pairs: maxv ,w∈V d(v ,w)

Eccentricity e(v): length of a longest shortest path from v :e(v) = maxw∈V d(v ,w)

Diameter D(G ) (alternative definition): maximal eccentricityover all nodes: maxv∈V e(v)

This Week’s Discoveries — November 4, 2014 10 / 26

Graph Diameter

Consider a connected undirected graph G = (V ,E ) withn = |V | nodes and m = |E | edges

Distance d(v ,w): length of shortest path between nodesv ,w ∈ V

Diameter D(G ): maximal distance (longest shortest pathlength) over all node pairs: maxv ,w∈V d(v ,w)

Eccentricity e(v): length of a longest shortest path from v :e(v) = maxw∈V d(v ,w)

Diameter D(G ) (alternative definition): maximal eccentricityover all nodes: maxv∈V e(v)

This Week’s Discoveries — November 4, 2014 10 / 26

Diameter Example

Figure: Graph with diameter D(G ) = 6. Numbers denote eccentricity values

This Week’s Discoveries — November 4, 2014 11 / 26

Diameter Applications

Router networks: what is the worst-case response time betweenany two machines?

Social networks: in how many steps does a message released bya single user reach everyone in the network?

Biological interaction networks: which proteins are likely to notinfluence each other at all?

Information networks (i.e., Wikipedia): how do I change theconversation topic to a maximally different subject? ;-)

Eccentricity has been suggested as a worst-case measure ofnode centrality: the relative importance of a node based on thegraph’s structure

This Week’s Discoveries — November 4, 2014 12 / 26

Naive Algorithm

Diameter is equal to the largest value returned by an All PairsShortest Path (APSP) algorithm

Brute-force: for each of the n nodes, execute a Breadth FirstSearch (BFS) run in O(m) time to find the eccentricity, andreturn the largest value found

Time complexity O(mn)

Problematic if n = 8 million and m = 1 billion.Then one BFS takes 6 seconds on a 3.4GHz machine.That results in 1.5 years to compute the diameter . . .

This Week’s Discoveries — November 4, 2014 13 / 26

BoundingDiameters Algorithm

Input: Graph GOutput: Diameter of G

W ← V D` ← −∞ Du ← +∞for w ∈ W do

e`[w ]← −∞eu [w ]← +∞

end for

while D` 6= Du and W 6= ∅ dov ← SelectFrom(W )e[v ]← Eccentricity(v)

D` ← max(D`, e[v ])Du ← min(Du , 2 · e[v ])

for w ∈ W doe`[w ] = max(e`[w ],max(e[v ]− d(v,w), d(v,w)))eu [w ] = min(eu [w ], e[v ] + d(v,w))if (eu [w ] ≤ D` and e`[w ] ≥ Du/2) or

(e`[w ] = eu [w ]) thenW ← W − {w}

end ifend for

Du ← min(Du ,maxw∈V (eu [w ])

)end while

return D`;

This Week’s Discoveries — November 4, 2014 14 / 26

Social Network Example (1)

If I am connected to everyone in at most 6 steps, then

My direct friend is connected to everyone in at most 7 steps(he reaches everyone through me)My direct friend is connected to everyone in at least 5 steps(I reach everyone through him)

If I can reach everyone in the network in 6 steps, then

There is nobody who can reach everyone in less than 3 steps(or I could have utilized him)There is nobody who needs more than 12 steps to reach everyone(or he could have utilized me)

This Week’s Discoveries — November 4, 2014 15 / 26

Social Network Example (2)

If a node v has eccentricity e(v), then

Nodes w at distance d(v ,w) needs at most e(v) + d(v ,w) steps(w reaches every node via v)Nodes w at distance d(v ,w) needs at least e(v)− d(v ,w) steps(v reaches every node via w)

We call this the Eccentricity bounds

If a node v can reach every other node in e(v) steps, then

There is no node that can reach everyone in less than de(v)/2esteps (or v could have used that node)There is no node that needs more than e(v) · 2 steps to reach allother nodes (or that node could have used v)

We call this the Diameter bounds

This Week’s Discoveries — November 4, 2014 16 / 26

BoundingDiameters Algorithm

Initialize candidate set W to VWhile DL(G ) 6= DU(G ):

1 Select a node v from W cf. some Selection strategy2 Compute v ’s eccentricity, and update eL(v) and eU(v) for every

node v ∈W according to the Eccentricity bounds3 Update the diameter bounds DL(G ) and DU(G )4 Remove nodes w that can no longer contribute to refining the

Diameter bounds

Selection strategy: alternate between smallest eccentricitylower bound and largest upper bound, break ties by taking thenode with the highest degree

This Week’s Discoveries — November 4, 2014 17 / 26

Example run (0)

C

B

A

F

E

D

G

J

H

I

L

M

N

P

K

Q

R

S

T

What is the diameter of this graph?DL = −∞ and DU =∞

This Week’s Discoveries — November 4, 2014 18 / 26

Example run (1)

C6

4

B7

3

A7

3

F5

5

E6

4

D6

4

G6

4

J6

4

H6

4

I7

3

L7

3

M8

3

N8

3

P8

3

K7

3

Q9

4

R9

4

S10

5

T10

5

Iteration 1: after computing the eccentricity of node FDL = 5 and DU = 10

This Week’s Discoveries — November 4, 2014 19 / 26

Example run (2)

C6

6

B7

7

A7

7

F5

5

E6

6

D6

6

G6

5

J6

4

H6

6

I7

6

K7

7

L7

4

M8

4

N8

4

P8

5

Q9

4

R8

6

S10

5

T7

7

Iteration 2: after computing the eccentricity of node TDL = 7 and DU = 10

This Week’s Discoveries — November 4, 2014 20 / 26

Example run (3)

C6

6

B7

7

A7

7

F5

5

E6

6

D6

6

G6

5

J5

4

H6

6

I7

6

K7

7

L4

4

M5

4

N5

4

P5

5

Q6

4

R7

6

S7

5

T7

7

Iteration 3: after computing the eccentricity of node LDL = 7 and DU = 7

This Week’s Discoveries — November 4, 2014 21 / 26

Results

1 Eccentricity bounds difference

2 Alternate between smallest eccentricity lower bound andlargest upper bound

3 Repeatedly select a node furthest away from the previous node

Dataset Nodes D(G) Strat. 1 Strat. 2 Strat. 3 PrunedAstroPhys 17,903 14 18 9 63 185

Enron 33,696 13 12 11 61 8,715Web 855,802 24 20 4 28 91,965

YouTube 1,134,890 24 2 2 2 399,553Flickr 1,624,992 24 10 3 7 553,242Skitter 1,696,415 31 10 4 19 114,803

Wikipedia 2,213,236 18 21 3 583 947,582Orkut 3,072,441 10 357 106 389 27,429

LiveJournal 5,189,809 23 6 3 14 318,378Hyves 8,057,981 25 40 21 44 446,258

Table: Comparison of three node selection strategies

This Week’s Discoveries — November 4, 2014 22 / 26

Discussion

Main result: in real-world graphs BoundingDiameters ismuch faster than the naive algorithm (a handful vs. n BFSes)

Why does it work? There is always diversity in the eccentricityvalues of nodes, allowing central nodes to influence theeccentricity of peripheral nodes, and vice versa

When does it not work so well? In graphs with little diversityin the eccentricity values, e.g., circle-shaped graphs

Side result: efficiently computing derived measures such as theradius, center, periphery and even the exact eccentricitydistribution is also possible (after some modifications)

This Week’s Discoveries — November 4, 2014 23 / 26

Conclusion

Real-world graphs have a surprisingly similar structure

Traditional algorithms for general graphs can be improved byexploiting this non-random structure

The diameter and other extreme distance measures can becomputed efficiently using the BoundingDiameters algorithm

Graphs are everywhere! :-)

This Week’s Discoveries — November 4, 2014 24 / 26

LCN2

Leiden Complex Networks Network — LCN2

Collaboration platform for network researchers in Leiden

Participants:

Alexandru Babeanu (LION) babeanu@lorentz.leidenuniv.nl

Arjen Doelman (MI) doelman@math.leidenuniv.nl

Diego Garlaschelli (LION) garlaschelli@lorentz.leidenuniv.nl

Frank den Hollander (MI) denholla@math.leidenuniv.nl

Joke Meijer (LUMC) j.h.meijer@lumc.nl

Aske Plaat (LIACS) a.plaat@liacs.leidenuniv.nl

Jos Rohling (LUMC) j.h.t.rohling@lumc.nl

Frank Takes (LIACS) f.w.takes@liacs.leidenuniv.nl

Interested? Contact us!

This Week’s Discoveries — November 4, 2014 25 / 26

Thank You!

Questions?

This Week’s Discoveries — November 4, 2014 26 / 26

top related