Algorithms for Analyzing and Mining Real-World Graphs
Frank Takes
LIACS, Leiden University, The Netherlands
November 4, 2014
This Week’s Discoveries — November 4, 2014 1 / 26
Context
Data
Unstructured data vs. structured data
Graph (network) consisting of vertices (nodes) and edges (links)
Directed or undirected
Weighted or unweighted
Possibly large: millions of nodes, billions of edges
Interest from: mathematics, computer science, physics, biology, public administration, social sciences, . . .
Corporate Social Network
Protein Interaction Network
Scientific Collaboration Network
Real-World Graphs
In real-world graphs, the nodes can be actual people, devices, organisms or organizations, whereas the edges model social, technological, biological or economic relationships
Examples: online social networks, citation networks, webgraphs, communication networks, trade networks, . . .
Totally different data, very similar structure:
Sparse networks with a low density
Fat-tailed power-law degree distribution
Giant component containing the majority of the nodes
Low average node-to-node distance
High average clustering coefficient
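To make "sparse" concrete, the sketch below assumes the usual definition of density for an undirected graph, 2m / (n(n - 1)), which the slides do not spell out. The node and link counts are those of the online social network used later in the talk; the function names are illustrative only.

```python
def density(n: int, m: int) -> float:
    """Fraction of all possible undirected edges that are present (assumed definition)."""
    return 2 * m / (n * (n - 1))


def average_degree(n: int, m: int) -> float:
    """Each undirected edge adds one to the degree of two nodes."""
    return 2 * m / n


n, m = 8_000_000, 1_000_000_000  # counts quoted in the talk
print(f"density        = {density(n, m):.2e}")
print(f"average degree = {average_degree(n, m):.0f}")
```

Under this definition only about 3 in every 100,000 possible links exist, even though the average node has 250 neighbours.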
Degree Distribution
Figure: Degree distribution of an online social network with 8 million nodes and 1 billion links. Horizontal axis: degree (0 to 2500 shown); vertical axis: frequency on a logarithmic scale (10^0 to 10^7). Tail runs up to 280 000.
Distance Distribution
Figure: Distance distribution of an online social network with 8 million nodes and average distance 4.75, sampled over 100 000 node pairs. Horizontal axis: distance (1 to 15); vertical axis: frequency on a logarithmic scale (10^0 to 10^5).
Challenges
Graph data storage (e.g., compression), retrieval and processing
Knowledge discovery: finding patterns (data mining), link prediction and community detection
Computation: many efficient algorithms proposed to handle general graphs (Dijkstra, Floyd-Warshall, Prim, etc.), but are they also efficient on real-world graphs given the special structure of these graphs?
Today: Computing the diameter of a real-world graph
Graph Diameter
Consider a connected undirected graph G = (V, E) with n = |V| nodes and m = |E| edges
Distance d(v,w): length of a shortest path between nodes v, w ∈ V
Diameter D(G): maximal distance (longest shortest path length) over all node pairs: D(G) = max_{v,w ∈ V} d(v,w)
Eccentricity e(v): length of a longest shortest path from v: e(v) = max_{w ∈ V} d(v,w)
Diameter D(G) (alternative definition): maximal eccentricity over all nodes: D(G) = max_{v ∈ V} e(v)
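The definitions above can be checked on a small example. The following is a minimal sketch, assuming an unweighted graph stored as an adjacency dictionary; the toy graph and the function names are made up for illustration and do not come from the talk.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return a dict mapping every node reachable from source to its hop distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def eccentricity(adj, v):
    """e(v) = max over w of d(v, w); assumes the graph is connected."""
    return max(bfs_distances(adj, v).values())


# Hypothetical toy graph: a path a - b - c - d with an extra leaf e attached to b.
adj = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}

ecc = {v: eccentricity(adj, v) for v in adj}
# First definition: maximum distance over all node pairs.
diameter_via_pairs = max(d for v in adj for d in bfs_distances(adj, v).values())
# Alternative definition: maximum eccentricity over all nodes.
diameter_via_ecc = max(ecc.values())
print(ecc, diameter_via_pairs, diameter_via_ecc)
```

Both definitions give the same diameter, since taking the maximum over all pairs is the same as taking the maximum over v of the maximum over w.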
Diameter Example
Figure: Graph with diameter D(G) = 6. Numbers denote eccentricity values
Diameter Applications
Router networks: what is the worst-case response time between any two machines?
Social networks: in how many steps does a message released by a single user reach everyone in the network?
Biological interaction networks: which proteins are likely to not influence each other at all?
Information networks (e.g., Wikipedia): how do I change the conversation topic to a maximally different subject? ;-)
Eccentricity has been suggested as a worst-case measure of node centrality: the relative importance of a node based on the graph’s structure
Naive Algorithm
Diameter is equal to the largest value returned by an All Pairs Shortest Path (APSP) algorithm
Brute-force: for each of the n nodes, execute a Breadth First Search (BFS) run in O(m) time to find the eccentricity, and return the largest value found
Time complexity O(mn)
Problematic if n = 8 million and m = 1 billion.
Then one BFS takes 6 seconds on a 3.4 GHz machine.
That results in 1.5 years to compute the diameter . . .
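The 1.5-year figure follows directly from the numbers on this slide. A quick back-of-the-envelope check, assuming one BFS per node run sequentially and a 365-day year (variable names are illustrative):

```python
n = 8_000_000         # number of nodes, from the slide
seconds_per_bfs = 6   # time for one BFS, from the slide

total_seconds = n * seconds_per_bfs
years = total_seconds / (365 * 24 * 3600)
print(f"{total_seconds:,} s = {years:.2f} years")
```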
BoundingDiameters Algorithm
Input: Graph G
Output: Diameter of G

W ← V;  D_L ← −∞;  D_U ← +∞
for w ∈ W do
    e_L[w] ← −∞
    e_U[w] ← +∞
end for
while D_L ≠ D_U and W ≠ ∅ do
    v ← SelectFrom(W)
    e[v] ← Eccentricity(v)
    D_L ← max(D_L, e[v])
    D_U ← min(D_U, 2 · e[v])
    for w ∈ W do
        e_L[w] ← max(e_L[w], max(e[v] − d(v,w), d(v,w)))
        e_U[w] ← min(e_U[w], e[v] + d(v,w))
        if (e_U[w] ≤ D_L and e_L[w] ≥ D_U / 2) or (e_L[w] = e_U[w]) then
            W ← W − {w}
        end if
    end for
    D_U ← min(D_U, max_{w ∈ V} e_U[w])
end while
return D_L
Social Network Example (1)
If I am connected to everyone in at most 6 steps, then
My direct friend is connected to everyone in at most 7 steps (he reaches everyone through me)
My direct friend needs at least 5 steps to reach everyone (otherwise I could reach everyone in fewer than 6 steps through him)
If I can reach everyone in the network in 6 steps, then
There is nobody who can reach everyone in fewer than 3 steps (or I could have utilized him)
There is nobody who needs more than 12 steps to reach everyone (or he could have utilized me)
Social Network Example (2)
If a node v has eccentricity e(v), then
A node w at distance d(v,w) needs at most e(v) + d(v,w) steps (w reaches every node via v)
A node w at distance d(v,w) needs at least e(v) − d(v,w) steps (v reaches every node via w)
We call these the Eccentricity bounds
If a node v can reach every other node in e(v) steps, then
There is no node that can reach everyone in fewer than ⌈e(v)/2⌉ steps (or v could have used that node)
There is no node that needs more than 2 · e(v) steps to reach all other nodes (or that node could have used v)
We call these the Diameter bounds
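Both families of bounds can be verified by brute force on a small graph. The sketch below is not part of the algorithm itself; it assumes an unweighted, connected graph given as an adjacency dictionary, and the toy graph and function names are hypothetical.

```python
from collections import deque
from math import ceil


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def check_bounds(adj):
    """Return True if the bounds hold for every node v and every pair (v, w)."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    ecc = {v: max(dist[v].values()) for v in adj}
    diameter = max(ecc.values())
    for v in adj:
        # Diameter bounds from a single eccentricity: e(v) <= D <= 2 * e(v).
        if not ecc[v] <= diameter <= 2 * ecc[v]:
            return False
        for w in adj:
            d = dist[v][w]
            # Eccentricity bounds; the extra lower bound d is the one used in the pseudocode.
            if not max(ecc[v] - d, d) <= ecc[w] <= ecc[v] + d:
                return False
            # No node does better than ceil(e(v) / 2) or worse than 2 * e(v).
            if not ceil(ecc[v] / 2) <= ecc[w] <= 2 * ecc[v]:
                return False
    return True


# Hypothetical toy graph: a hub 0 with a tail 4-5-6-7 and a short branch 3-8.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (5, 6), (6, 7), (3, 8)]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

print(check_bounds(adj))
```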
BoundingDiameters Algorithm
Initialize candidate set W to V
While D_L(G) ≠ D_U(G):
1 Select a node v from W according to some Selection strategy
2 Compute v’s eccentricity, and update e_L(w) and e_U(w) for every node w ∈ W according to the Eccentricity bounds
3 Update the Diameter bounds D_L(G) and D_U(G)
4 Remove nodes w that can no longer contribute to refining the Diameter bounds
Selection strategy: alternate between the smallest eccentricity lower bound and the largest upper bound, breaking ties by taking the node with the highest degree
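The steps and the selection strategy above can be sketched in Python as follows. This is an illustrative reading of the slides, not the author's implementation: it assumes an unweighted, connected graph stored as an adjacency dictionary, starts the alternation with the smallest lower bound, and uses made-up names (`bounding_diameters`, `bfs_distances`) and a made-up toy graph.

```python
from collections import deque
import math


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def bounding_diameters(adj):
    """Return (diameter, number of BFS runs) for a connected undirected graph."""
    candidates = set(adj)                      # W
    e_lo = dict.fromkeys(adj, -math.inf)       # eccentricity lower bounds
    e_hi = dict.fromkeys(adj, math.inf)        # eccentricity upper bounds
    d_lo, d_hi = -math.inf, math.inf           # diameter bounds
    pick_low, bfs_runs = True, 0
    while d_lo != d_hi and candidates:
        # Step 1: alternate between smallest lower and largest upper bound,
        # breaking ties in favour of the highest degree.
        if pick_low:
            v = min(candidates, key=lambda w: (e_lo[w], -len(adj[w])))
        else:
            v = max(candidates, key=lambda w: (e_hi[w], len(adj[w])))
        pick_low = not pick_low
        # Step 2: one BFS gives e(v) and d(v, w) for all w.
        dist = bfs_distances(adj, v)
        bfs_runs += 1
        ecc_v = max(dist.values())
        # Step 3: diameter bounds from e(v).
        d_lo = max(d_lo, ecc_v)
        d_hi = min(d_hi, 2 * ecc_v)
        for w in list(candidates):
            d = dist[w]
            e_lo[w] = max(e_lo[w], ecc_v - d, d)
            e_hi[w] = min(e_hi[w], ecc_v + d)
            # Step 4: drop w if it cannot tighten either diameter bound,
            # or if its eccentricity is now known exactly.
            if (e_hi[w] <= d_lo and e_lo[w] >= d_hi / 2) or e_lo[w] == e_hi[w]:
                candidates.discard(w)
        # Upper bounds of all nodes, pruned or not, also bound the diameter.
        d_hi = min(d_hi, max(e_hi.values()))
    return d_lo, bfs_runs


# Hypothetical toy graph: a hub 0 with a tail 4-5-6-7 and a short branch 3-8.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (5, 6), (6, 7), (3, 8)]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

diameter, runs = bounding_diameters(adj)
print(diameter, runs)
```

Every iteration removes at least the selected node from the candidate set, so the loop never needs more BFS runs than the naive algorithm; the gain comes from the other candidates that are pruned along the way.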
Example run (0)
Figure: example graph on 19 nodes, labelled A to N and P to T.
What is the diameter of this graph?
D_L = −∞ and D_U = +∞
Example run (1)
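Figure: example graph annotated with eccentricity bounds, listed here per node as [lower, upper]:
A [3, 7]  B [3, 7]  C [4, 6]  D [4, 6]  E [4, 6]
F [5, 5]  G [4, 6]  H [4, 6]  I [3, 7]  J [4, 6]
K [3, 7]  L [3, 7]  M [3, 8]  N [3, 8]  P [3, 8]
Q [4, 9]  R [4, 9]  S [5, 10]  T [5, 10]
Iteration 1: after computing the eccentricity of node F
D_L = 5 and D_U = 10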
Example run (2)
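Figure: example graph annotated with eccentricity bounds, listed here per node as [lower, upper]:
A [7, 7]  B [7, 7]  C [6, 6]  D [6, 6]  E [6, 6]
F [5, 5]  G [5, 6]  H [6, 6]  I [6, 7]  J [4, 6]
K [7, 7]  L [4, 7]  M [4, 8]  N [4, 8]  P [5, 8]
Q [4, 9]  R [6, 8]  S [5, 10]  T [7, 7]
Iteration 2: after computing the eccentricity of node T
D_L = 7 and D_U = 10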
Example run (3)
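Figure: example graph annotated with eccentricity bounds, listed here per node as [lower, upper]:
A [7, 7]  B [7, 7]  C [6, 6]  D [6, 6]  E [6, 6]
F [5, 5]  G [5, 6]  H [6, 6]  I [6, 7]  J [4, 5]
K [7, 7]  L [4, 4]  M [4, 5]  N [4, 5]  P [5, 5]
Q [4, 6]  R [6, 7]  S [5, 7]  T [7, 7]
Iteration 3: after computing the eccentricity of node L
D_L = 7 and D_U = 7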
Results
1 Eccentricity bounds difference
2 Alternate between smallest eccentricity lower bound and largest upper bound
3 Repeatedly select a node furthest away from the previous node

Dataset         Nodes  D(G)  Strat. 1  Strat. 2  Strat. 3   Pruned
AstroPhys      17,903    14        18         9        63      185
Enron          33,696    13        12        11        61    8,715
Web           855,802    24        20         4        28   91,965
YouTube     1,134,890    24         2         2         2  399,553
Flickr      1,624,992    24        10         3         7  553,242
Skitter     1,696,415    31        10         4        19  114,803
Wikipedia   2,213,236    18        21         3       583  947,582
Orkut       3,072,441    10       357       106       389   27,429
LiveJournal 5,189,809    23         6         3        14  318,378
Hyves       8,057,981    25        40        21        44  446,258

Table: Comparison of three node selection strategies
Discussion
Main result: in real-world graphs BoundingDiameters is much faster than the naive algorithm (a handful vs. n BFSes)
Why does it work? There is always diversity in the eccentricity values of nodes, allowing central nodes to influence the eccentricity of peripheral nodes, and vice versa
When does it not work so well? In graphs with little diversity in the eccentricity values, e.g., circle-shaped graphs
Side result: efficiently computing derived measures such as the radius, center, periphery and even the exact eccentricity distribution is also possible (after some modifications)
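The circle-shaped worst case is easy to see in a small experiment. The sketch below builds a ring of 10 nodes (an arbitrary size chosen for illustration, with made-up helper names) and computes every eccentricity by BFS.

```python
from collections import deque


def eccentricity(adj, v):
    """Largest BFS distance from v (graph assumed connected)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def cycle_graph(n):
    """Nodes 0..n-1 arranged in a ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


ring = cycle_graph(10)
ecc = {v: eccentricity(ring, v) for v in ring}
print(sorted(set(ecc.values())))
```

Every node has eccentricity 5, so a BFS from v leaves a node at distance d with lower bound max(5 − d, d) and upper bound 5 + d. These coincide only for d = 0, which means that after the first BFS only v itself is pruned.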
Conclusion
Real-world graphs have a surprisingly similar structure
Traditional algorithms for general graphs can be improved by exploiting this non-random structure
The diameter and other extreme distance measures can be computed efficiently using the BoundingDiameters algorithm
Graphs are everywhere! :-)
LCN2
Leiden Complex Networks Network — LCN2
Collaboration platform for network researchers in Leiden
Participants:
Alexandru Babeanu (LION)
Arjen Doelman (MI)
Diego Garlaschelli (LION)
Frank den Hollander (MI)
Joke Meijer (LUMC)
Aske Plaat (LIACS)
Jos Rohling (LUMC)
Frank Takes (LIACS)
Interested? Contact us!
Thank You!
Questions?