giraph++: from "think like a vertex" to "think like a graph"

11
From “Think Like a Vertex” to “Think Like a Graph” Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson

Upload: yuanyuan-tian

Post on 18-Jan-2017

20 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

From “Think Like a Vertex” to “Think Like a Graph”Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson

Page 2: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Large Scale Graph Processing

Graph data is everywhere and growing rapidly! 126 million blogs 50 billion Web pages 90 trillion emails

Analyzing graph data is increasingly important. Retail: customer segmentation and recommendation Consumer Insights: key influencer analysis Telco: churn prediction Biology & Health Care: disease propagation analysis Government: Terrorists detection

What is the right programming model for large-scale graph processing?

Page 3: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Existing Graph Processing Systems

Divide input graphs into partitions

Employ vertex-centric programming model Programmers “think like a vertex”

Operate on a vertex and its edges Communication to other vertices

Message passing (e.g Pregel/Giraph) Scheduling of updates (e.g GraphLab)

Page 4: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

“Think like a vertex” “Think like a graph”

“Think like a vertex” “Think like a graph”

Partition: A collection of vertices A proper subgraph Computation: A vertex and its edges A subgraph

Communication: 1-hop at a timee.g ABD

Multiple-hops at a timee.g. AD

Page 5: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Graph-Centric Programming Model

Expose subgraphs to programmers Internal vertices vs boundary vertices

Information exchange between internal vertices of a partition is immediate

Messages are only sent from boundary vertices of a partition to internal vertices of a different partition

internal vertex

external vertex

(primary copy)

(local copy)

message

Page 6: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Advantages of graph-centric model

Any algorithm expressed in the vertex-centric model can be expressed in the graph-centric model

Allow lower-level access for algorithm-specific optimizations Use of existing off-the-shell graph algorithms on subgraphs Local asynchronous computation Natural translation of existing graph-centric parallel algorithms

Provide sufficiently high-level abstraction for easy of use

Page 7: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Example: Weakly Connected Component (WCC)

compute() on each subgraph

If 0th superstep Sequential WCC on subgraph Send labels of boundary vertices

to their corresponding internal vertices

Else Use the received messages to

update the labels of vertices in the subgraph

Merge connected components For boundary vertices with label

change, send labels to their corresponding internal vertices

A B C FD E

A B B C C D D E E F

A B B C

A B

A B

A B

B C

C D D E

C D

B C

A

A

A

A

A

Superstep

0:

1:

2:

3:

4:

5:

6:

A B C D E F

A A C D E

A A A B C D

A A A A B C

A A A A A B

A A A A A A

A A A A A A

Partition P1 Partition P2

B

A A A C C C

A A A A A A

A A A A A A

A C

A

Superstep

0:

1:

2:

Subgraph G1 Subgraph G2

A B C D C D E F

A

A

A

C

A

A

vert

ex-c

entr

ic m

odel

grap

h-ce

ntric

mod

el

Page 8: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Hybrid Execution Model

Can we keep the simple vertex-centric programming model but still improve performance? Key: Differentiate internal messages and external

messages What messages can be used in local computation?

Vertex-centric model Only messages (external and internal) from previous superstep

Hybrid model External messages from previous superstep Internal messages from previous + current superstep (local

asynchronous computation) Only apply to a limited set of graph algorithms

Page 9: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Giraph++: A New Graph Processing System

Built on top of Apache Giraph Support vertex-centric model (VM), graph-centric model

(GM) & hybrid model (HM) in the same Giraph++ system Contributed to Apache Giraph Project

Planed in future Giraph release

Page 10: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

A Peek of Performance Results

10-node cluster Per node: 32GB RAM, 8 cores, 7 workers Network: 1Gbit

Dataset: 4 web graphs # nodes: 19million ~ 428million # edges: 298million ~ 1.0billion

Significant improvement in execution time and network messages, especially with better graph partitioning strategy

Up to 62X speedup in execution time! Up to 200X reduction in network messages!

Data Set

Tim

e (s

ec)

Net

wor

k m

sgs

VM: vertex-centric modelGM: graph-centric modelHM: hybrid modelHP: hash partitioningGP: graph partitioning

WCC Algorithm

Page 11: Giraph++: From "Think Like a Vertex" to "Think Like a Graph"

Conclusion

“Think like a vertex” “Think like a graph” Take advantage of local graph structure in a partition Enable complex and flexible graph algorithms Can be exploited to various graph applications Bring significant performance improvement, especially with

better graph partitioning strategy A valuable complement to existing vertex-centric model