Large Scale Graph Processing

Pregel, GraphLab and GraphX

Amir H. Payberah

KTH Royal Institute of Technology

Amir H. Payberah (KTH) Large Scale Graph Processing 2016/10/03 1 / 76

Large Graph

Can we use platforms like MapReduce or Spark, which are based on the data-parallel model, for large-scale graph processing?

Graph Algorithms Characteristics

- Difficult to extract parallelism based on partitioning of the data.
- Difficult to express parallelism based on partitioning of the computation.

Proposed Solution

Graph-Parallel Processing

- Computation on a vertex typically depends on its neighbors.

Graph-Parallel Processing

- Expose specialized APIs to simplify graph programming.
- Restrict the types of computation.
- Use new techniques to partition and distribute graphs.
- Exploit the graph structure.

Data-Parallel vs. Graph-Parallel Computation

Pregel

- Large-scale graph-parallel processing platform developed at Google.
- Inspired by the bulk synchronous parallel (BSP) model.

Bulk Synchronous Parallel

All vertices update in parallel (at the same time).

Programming Model

- Vertex-centric programming: think like a vertex.
- Each vertex computes its value individually, in parallel with the others.
- Each vertex can see its local context and updates its value accordingly.
- Input data: a directed graph that stores the program state, e.g., the current value.

Execution Model (1/2)

- Applications run as a sequence of iterations, called supersteps.
  • In each superstep, the method Compute() is invoked in parallel on all vertices.
- A vertex in superstep S can:
  • read messages sent to it in superstep S-1.
  • send messages to other vertices, which receive them in superstep S+1.
  • modify its state.
- Vertices communicate directly with one another by sending messages.

Execution Model (2/2)

- Superstep 0: all vertices are in the active state.
- A vertex deactivates itself by voting to halt when it has no further work to do.
- A halted vertex becomes active again if it receives a message.
- The whole algorithm terminates when:
  • All vertices are simultaneously inactive.
  • There are no messages in transit.

Example: Max Value

i_val := val
for each message m
    if m > val then val := m
// in superstep 0 no messages have arrived yet, so every vertex must send once
if superstep > 0 and i_val == val then
    vote_to_halt
else
    for each neighbor v
        send_message(v, val)
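The max-value pseudocode can be tried out on a toy superstep loop. The following is a minimal single-process sketch, not Pregel's actual API: the function name `max_value_pregel`, the dictionary-based graph encoding, and the choice to model "vote to halt" by simply not rescheduling a vertex are all assumptions made for illustration.

```python
from collections import defaultdict

def max_value_pregel(values, neighbors):
    """Propagate the maximum value through a graph, superstep by superstep.

    values:    dict vertex -> initial value
    neighbors: dict vertex -> list of out-neighbors
    Returns the final values and the number of supersteps executed.
    """
    val = dict(values)
    inbox = defaultdict(list)
    active = set(val)              # superstep 0: every vertex is active
    superstep = 0
    while active:                  # stop when all are halted and no messages remain
        outbox = defaultdict(list)
        for v in active:           # conceptually executed in parallel
            old = val[v]
            for m in inbox[v]:
                if m > val[v]:
                    val[v] = m
            # Superstep 0: everyone announces its value. Later supersteps:
            # only vertices whose value changed keep sending.
            if superstep == 0 or val[v] != old:
                for u in neighbors.get(v, []):
                    outbox[u].append(val[v])
            # Otherwise the vertex votes to halt (it is not rescheduled).
        inbox = outbox             # barrier: messages arrive in the next superstep
        active = set(outbox)       # a message reactivates a halted vertex
        superstep += 1
    return val, superstep

graph = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
final, steps = max_value_pregel({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
print(final, steps)
```

On this four-vertex graph every vertex ends with the value 6 after supersteps 0 through 3.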

Example: PageRank

- Update ranks in parallel.
- Iterate until convergence.

$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$$

Pregel_PageRank(i, messages):
    // receive all the messages
    total = 0
    foreach (msg in messages):
        total = total + msg
    // update the rank of this vertex
    R[i] = 0.15 + total
    // send new messages to neighbors
    foreach (j in out_neighbors[i]):
        sendmsg(R[i] * w_ij) to vertex j

$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$$
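As a rough illustration of how the message-passing PageRank behaves, here is a single-process Python sketch. It is not the Pregel API: `pregel_pagerank`, the nested-dictionary encoding of edge weights, the initial rank of 1.0, and the fixed number of supersteps (instead of a convergence test) are assumptions of this sketch. Every vertex is assumed to appear as a key of `out_edges`.

```python
def pregel_pagerank(out_edges, supersteps=20):
    """out_edges: dict i -> {j: w_ij}, the weight of edge i -> j."""
    rank = {i: 1.0 for i in out_edges}
    # Superstep 0: every vertex sends its weighted initial rank.
    inbox = {i: [] for i in out_edges}
    for i, nbrs in out_edges.items():
        for j, w in nbrs.items():
            inbox[j].append(rank[i] * w)
    for _ in range(supersteps):
        outbox = {i: [] for i in out_edges}
        for i in out_edges:                  # conceptually in parallel
            # receive all messages and update the rank of this vertex
            rank[i] = 0.15 + sum(inbox[i])
            # send new messages to the out-neighbors
            for j, w in out_edges[i].items():
                outbox[j].append(rank[i] * w)
        inbox = outbox                       # barrier between supersteps
    return rank

ranks = pregel_pagerank({"a": {"b": 0.5}, "b": {"a": 0.5}}, supersteps=60)
print(ranks)
```

For the two-vertex cycle with weights 0.5, the fixed point of R = 0.15 + 0.5 R is 0.3, which the iteration approaches from the initial value.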

Implementation

System Model (1/2)

- Master-worker model.
- The master is responsible for:
  • Coordinating workers.
  • Determining the number of partitions.
- Each worker is responsible for:
  • Executing the user's Compute() method on its vertices.
  • Maintaining the state of its partitions.
  • Managing messages to and from other workers.

System Model (2/2)

- The master assigns one or more partitions to each worker.
- The master assigns a portion of the user input to each worker.
  • The input is a set of records, each containing a number of vertices and edges.
  • If a worker loads a vertex that belongs to one of its own partitions, the appropriate data structures are updated immediately.
  • Otherwise, the worker enqueues a message to the remote peer that owns the vertex.
- After loading the graph, the master instructs each worker to perform a superstep.

Graph Partitioning

- The Pregel library divides a graph into a number of partitions.
- Each partition consists of a set of vertices and all of those vertices' outgoing edges.
- Vertices are assigned to partitions based on their vertex ID (e.g., hash(ID)).
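The partitioning rule above can be sketched in a few lines of Python. This is an illustration only: `partition_graph` is a hypothetical helper, vertex IDs are assumed to be non-negative integers, and `ID mod N` stands in for the hash function.

```python
def partition_graph(edges, num_partitions):
    """Assign each vertex, together with its outgoing edges, to a partition.

    edges: iterable of (src, dst) pairs with integer vertex IDs.
    Returns a list of dicts, one per partition: vertex -> list of out-neighbors.
    """
    partitions = [dict() for _ in range(num_partitions)]
    for src, dst in edges:
        # A vertex and all of its outgoing edges live in the same partition.
        partitions[src % num_partitions].setdefault(src, []).append(dst)
        # Make sure the destination vertex exists in its own partition too.
        partitions[dst % num_partitions].setdefault(dst, [])
    return partitions

parts = partition_graph([(0, 1), (1, 2), (2, 0), (3, 0)], num_partitions=2)
print(parts)
```

Because the owner of a vertex is a pure function of its ID, any worker can work out where to send a message without a lookup table.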

Fault Tolerance (1/2)

- Fault tolerance is achieved through checkpointing.
  • Checkpoints are saved to persistent storage.
- At the start of a superstep in which a checkpoint is taken, the master tells the workers to save their state:
  • Vertex values, edge values, and incoming messages.
- The master saves aggregator values (if any).
- Checkpointing is not necessarily done at every superstep, because it is costly.

Fault Tolerance (2/2)

- When the master detects one or more worker failures:
  • All workers revert to the last checkpoint.
  • Computation continues from there.
  • This repeats a lot of work.
  • It is still better than redoing the whole job.
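The cost of rollback recovery can be made concrete with a small simulation. The sketch below is a toy model rather than Pregel's implementation: `run_with_checkpoints`, the `interval` and `fail_at` parameters, and the use of an in-memory deep copy in place of persistent storage are all assumptions.

```python
import copy

def run_with_checkpoints(state, step, num_supersteps, interval, fail_at=None):
    """Run `step` for num_supersteps, checkpointing every `interval` supersteps.

    step(state, s) returns the new state after superstep s.
    fail_at is the superstep at which a single simulated failure occurs.
    Returns the final state and how many supersteps were actually executed.
    """
    checkpoint = (0, copy.deepcopy(state))
    s = 0
    executed = 0
    while s < num_supersteps:
        if s % interval == 0:
            # Stand-in for saving vertex values, edge values and messages.
            checkpoint = (s, copy.deepcopy(state))
        if s == fail_at:
            fail_at = None                   # the failure happens only once
            # All workers revert to the last checkpoint and continue from there.
            s, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        state = step(state, s)
        executed += 1
        s += 1
    return state, executed

log_step = lambda st, s: st + [s]            # toy superstep: record its number
print(run_with_checkpoints([], log_step, 5, interval=3))
print(run_with_checkpoints([], log_step, 5, interval=3, fail_at=4))
```

With a checkpoint every 3 supersteps and a failure at superstep 4, superstep 3 is executed twice, so 6 supersteps run instead of 5. A larger interval makes checkpointing cheaper but increases the amount of repeated work.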

Pregel Limitations

- Inefficient if different regions of the graph converge at different speeds.
- Can suffer if one task is more expensive than the others.
- The runtime of each phase is determined by the slowest machine.

GraphLab

- GraphLab allows asynchronous iterative computation.
- Vertex scope of vertex v: the data stored in v and in all adjacent vertices and edges.

Programming Model

- Vertex-centric programming.
- A vertex can read and modify any of the data in its scope.
  • This is done by calling the Update function, similar to Compute in Pregel.
- Input data: a directed graph that stores the program state.

Execution Model

- Each task in the set of tasks T is a tuple (f, v) consisting of an update function f and a vertex v.
- After executing an update function (f, g, ...), the modified scope data in S_v is written back to the data graph.

Example: PageRank (GraphLab)

GraphLab_PageRank(i):
    // compute sum over neighbors
    total = 0
    foreach (j in in_neighbors(i)):
        total = total + R[j] * w_ji
    // update the PageRank
    R[i] = 0.15 + total
    // trigger neighbors to run again
    foreach (j in out_neighbors(i)):
        signal vertex-program on j

$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$$
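The pull-based, asynchronous style can be mimicked with a task queue. The sketch below is a sequential approximation and not GraphLab's scheduler: `graphlab_pagerank`, the FIFO queue, and the tolerance `tol` are assumptions. It also deviates from the pseudocode in one respect: it signals out-neighbors only when the rank actually changed, since unconditional signalling would never drain the queue. Every vertex is assumed to appear as a key of `in_edges`.

```python
from collections import deque

def graphlab_pagerank(in_edges, tol=1e-10):
    """in_edges: dict i -> {j: w_ji}, the weight of edge j -> i."""
    out_nbrs = {i: [] for i in in_edges}
    for i, nbrs in in_edges.items():
        for j in nbrs:
            out_nbrs[j].append(i)
    rank = {i: 1.0 for i in in_edges}
    queue = deque(in_edges)            # task set T: initially every vertex
    queued = set(in_edges)
    while queue:
        i = queue.popleft()
        queued.discard(i)
        # Read neighbor data directly from the scope (no messages).
        new = 0.15 + sum(w * rank[j] for j, w in in_edges[i].items())
        changed = abs(new - rank[i]) > tol
        rank[i] = new                  # the write is visible to neighbors at once
        if changed:
            for j in out_nbrs[i]:      # trigger neighbors to run again
                if j not in queued:
                    queued.add(j)
                    queue.append(j)
    return rank

print(graphlab_pagerank({"a": {}, "b": {"a": 0.5}, "c": {"b": 0.5}}))
```

On the chain a -> b -> c a single pass over the queue suffices, because each vertex already sees the updated rank of its predecessor.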

Consistency (1/4)

- Overlapping scopes: race conditions arise when two update functions execute simultaneously.
- Full consistency: during the execution of f(v), no other function reads or modifies data within the scope of v.

Consistency (2/4)

- Edge consistency: during the execution of f(v), no other function reads or modifies any of the data on v or on any of the edges adjacent to v.

Consistency (3/4)

- Vertex consistency: during the execution of f(v), no other function is applied to v.

Consistency (4/4)

Consistency vs. Parallelism

[Low, Y., GraphLab: A Distributed Abstraction for Large Scale Machine Learning (Doctoral dissertation, University of California), 2013.]

Implementation

Consistency Implementation

- Distributed locking: a readers-writer lock is associated with each vertex.
- Vertex consistency
  • Central vertex (write-lock)
- Edge consistency
  • Central vertex (write-lock), adjacent vertices (read-locks)
- Full consistency
  • Central vertex (write-lock), adjacent vertices (write-locks)
- Deadlocks are avoided by acquiring locks sequentially, following a canonical order.
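The lock sets listed above, together with the canonical ordering, can be expressed as a small planning function. This is only a sketch: `lock_plan` is a hypothetical helper, the model names are plain strings, and ascending vertex ID is assumed as the canonical order. It computes which locks to take and in which order; it does not acquire real locks.

```python
def lock_plan(v, neighbors, model):
    """Return [(vertex, mode)] in the order the locks should be acquired.

    model is one of "vertex", "edge" or "full".
    """
    modes = {v: "write"}                     # the central vertex is always written
    if model == "edge":
        for u in neighbors:
            modes.setdefault(u, "read")      # adjacent vertices: read-locks
    elif model == "full":
        for u in neighbors:
            modes[u] = "write"               # adjacent vertices: write-locks
    elif model != "vertex":
        raise ValueError(f"unknown consistency model: {model}")
    # Canonical order: ascending vertex ID, identical for every update function,
    # so two overlapping scopes can never wait on each other in a cycle.
    return sorted(modes.items())

print(lock_plan(5, [9, 2], "vertex"))
print(lock_plan(5, [9, 2], "edge"))
print(lock_plan(5, [9, 2], "full"))
```

The weaker the model, the fewer and the less exclusive the locks, which is the consistency versus parallelism trade-off in practice.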

Graph Partitioning (1/3)

- Two-phase partitioning.
- First, the data graph is partitioned into k parts, called atoms.
- Meta-graph: the graph of atoms (one vertex for each atom).
- Atom weight: the amount of data the atom stores.
- Edge weight: the number of edges crossing between two atoms.
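To make the meta-graph definition concrete, the following sketch derives atom weights and meta-edge weights from a given assignment of vertices to atoms. `build_meta_graph` is a hypothetical helper, and measuring "amount of data" as the number of vertices plus interior edges is an assumption of this sketch.

```python
from collections import Counter

def build_meta_graph(edges, atom_of):
    """edges: list of (u, v) pairs; atom_of: dict vertex -> atom id.

    Returns (atom_weight, meta_edges), where meta_edges maps a pair of
    atom ids to the number of graph edges crossing between them.
    """
    atom_weight = Counter()
    for v, a in atom_of.items():
        atom_weight[a] += 1                  # one unit of data per vertex
    meta_edges = Counter()
    for u, v in edges:
        a, b = atom_of[u], atom_of[v]
        if a == b:
            atom_weight[a] += 1              # interior edge stored in the atom
        else:
            meta_edges[tuple(sorted((a, b)))] += 1   # edge crossing two atoms
    return dict(atom_weight), dict(meta_edges)

edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(build_meta_graph(edges, {1: 0, 2: 0, 3: 1, 4: 1}))
```

The second phase then only has to balance this much smaller weighted graph across the machines.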

Graph Partitioning (2/3)

- The meta-graph is very small.
- A fast balanced partitioning of the meta-graph over the physical machines is computed.
- Graph atoms are assigned to machines accordingly.

Graph Partitioning (3/3)

- Each atom file stores the interior and the ghosts of the partition.
  • The ghosts are the set of vertices and edges adjacent to the partition boundary.
- Each atom is stored as a separate file on HDFS.

Fault Tolerance (1/2)

- Synchronous fault tolerance.
- The system periodically signals all computation activity to halt.
- It then synchronizes all caches (ghosts) and saves to disk all data that has been modified since the last snapshot.
- Simple, but eliminates the system's advantage of asynchronous computation.

Fault Tolerance (2/2)

- Asynchronous fault tolerance: based on the Chandy-Lamport algorithm.
- The snapshot function is implemented as an update function in vertices.
  • It takes priority over all other update functions.

PowerGraph (GraphLab2)

PowerGraph

- Factorizes the update function into the Gather, Apply and Scatter phases.
- Vertex-cut partitioning.

Programming Model

- Gather-Apply-Scatter (GAS)
- Gather: accumulate information about the neighborhood through a generalized sum.
- Apply: apply the accumulated value to the center vertex.
- Scatter: update adjacent edges and vertices.

Execution Model (1/2)

- Initially all vertices are active.
- PowerGraph executes the vertex-program on the active vertices until none remain.
  • Once a vertex-program completes the scatter phase, it becomes inactive until it is reactivated.
  • Vertices can activate themselves and neighboring vertices.
- PowerGraph can execute both synchronously and asynchronously.

Scheduling (2/2)

- Synchronous scheduling, like Pregel.
  • Executes the gather, apply, and scatter phases in order.
  • Changes made to the vertex/edge data are committed at the end of each step.
- Asynchronous scheduling, like GraphLab.
  • Changes made to the vertex/edge data during the apply and scatter functions are immediately committed to the graph.
  • These changes are visible to subsequent computation on neighboring vertices.

Example: PageRank (Pregel)

Pregel_PageRank(i, messages):

// receive all the messages

total = 0

foreach(msg in messages):

total = total + msg

// update the rank of this vertex

R[i] = 0.15 + total

// send new messages to neighbors

foreach(j in out_neighbors[i]):

sendmsg(R[i] * wij) to vertex j

R[i] = 0.15 +∑

j∈Nbrs(i)

wjiR[j]

Amir H. Payberah (KTH) Large Scale Graph Processing 2016/10/03 50 / 76

Page 63: Large Scale Graph Processing Pregel, GraphLab and GraphXI The pregel library divides a graph into a number ofpartitions. I Each consisting of a set ofverticesand all of those vertices’outgoing

Example: PageRank (GraphLab)

GraphLab_PageRank(i):
    // compute sum over neighbors
    total = 0
    foreach(j in in_neighbors(i)):
        total = total + R[j] * wji

    // update the PageRank
    R[i] = 0.15 + total

    // trigger neighbors to run again
    foreach(j in out_neighbors(i)):
        signal vertex-program on j

$$R[i] = 0.15 + \sum_{j \in Nbrs(i)} w_{ji} R[j]$$
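
A single-machine sketch of the GraphLab-style pseudocode above, in which a vertex reads its in-neighbors' ranks directly from shared state and then signals its out-neighbors. The FIFO scheduler, the tolerance `tol` used to stop signaling (the slide signals unconditionally), the initial rank, and the weights 0.85 / out-degree are all assumptions added so that the example terminates.

```python
from collections import deque


def graphlab_pagerank(in_neighbors, out_neighbors, weights, tol=1e-10):
    rank = {v: 1.0 for v in out_neighbors}  # assumed initial rank
    queue = deque(out_neighbors)            # assumed FIFO scheduler
    scheduled = set(queue)

    while queue:
        i = queue.popleft()
        scheduled.discard(i)

        # compute sum over in-neighbors, reading their current ranks directly
        total = sum(rank[j] * weights[(j, i)] for j in in_neighbors[i])

        # update the PageRank; the write is visible immediately
        new_rank = 0.15 + total
        changed = abs(new_rank - rank[i]) > tol
        rank[i] = new_rank

        # trigger out-neighbors to run again (only if the rank moved)
        if changed:
            for j in out_neighbors[i]:
                if j not in scheduled:
                    scheduled.add(j)
                    queue.append(j)
    return rank


# Hypothetical three-vertex graph; weights assumed to be 0.85 / out-degree.
out_neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_neighbors = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
weights = {(i, j): 0.85 / len(nbrs)
           for i, nbrs in out_neighbors.items() for j in nbrs}
ranks = graphlab_pagerank(in_neighbors, out_neighbors, weights)
print(ranks)
```

Unlike the message-passing version, there is no barrier: a vertex always sees the latest committed ranks of its neighbors.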


Example: PageRank (PowerGraph)

PowerGraph_PageRank(i):
    Gather(j -> i):
        return wji * R[j]

    sum(a, b):
        return a + b

    // total: Gather and sum
    Apply(i, total):
        R[i] = 0.15 + total

    Scatter(i -> j):
        if R[i] changed then activate(j)

$$R[i] = 0.15 + \sum_{j \in Nbrs(i)} w_{ji} R[j]$$
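
The gather, sum, apply, and scatter functions above can be wired into a small synchronous engine as follows. This is a sketch under assumptions rather than the PowerGraph implementation: the engine loop, the tolerance that defines "changed", the initial rank, the weights 0.85 / out-degree, and all names are hypothetical.

```python
from functools import reduce


def gather(rank, weights, j, i):
    return weights[(j, i)] * rank[j]


def combine(a, b):  # the commutative, associative "sum"
    return a + b


def apply(total):
    return 0.15 + total


def scatter(old_rank, new_rank, tol):
    # True means: activate the out-neighbors of this vertex.
    return abs(new_rank - old_rank) > tol


def powergraph_pagerank(in_neighbors, out_neighbors, weights,
                        tol=1e-10, max_steps=10_000):
    rank = {v: 1.0 for v in out_neighbors}  # assumed initial rank
    active = set(out_neighbors)
    for _ in range(max_steps):
        if not active:
            break
        # Gather phase: all active vertices read the ranks of the previous step.
        totals = {
            i: reduce(combine,
                      (gather(rank, weights, j, i) for j in in_neighbors[i]),
                      0.0)
            for i in active
        }
        # Apply and scatter phases: commit new ranks, activate neighbors.
        next_active = set()
        for i, total in totals.items():
            new_rank = apply(total)
            if scatter(rank[i], new_rank, tol):
                next_active.update(out_neighbors[i])
            rank[i] = new_rank
        active = next_active
    return rank


# Hypothetical three-vertex graph; weights assumed to be 0.85 / out-degree.
out_neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_neighbors = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
weights = {(i, j): 0.85 / len(nbrs)
           for i, nbrs in out_neighbors.items() for j in nbrs}
ranks = powergraph_pagerank(in_neighbors, out_neighbors, weights)
print(ranks)
```

Because gather is split from sum, the per-edge work for a high-degree vertex could be spread across machines and combined afterwards, which is the point of the decomposition.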


Implementation


Graph Partitioning (1/4)

- Natural graphs: skewed power-law degree distribution.
- Edge-cut algorithms perform poorly on power-law graphs.


Graph Partitioning (2/4)

Vertex-Cut partitioning


Graph Partitioning (3/4)

- Random vertex-cuts
- Randomly assign edges to machines.
- Completely parallel and easy to distribute.
- High replication factor.
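
A minimal sketch of random edge placement and of the replication factor it produces, taken here as the average number of machines a vertex spans. The function names, the seeded random generator, and the star-shaped test graph are assumptions for illustration.

```python
import random
from collections import defaultdict


def random_vertex_cut(edges, num_machines, seed=0):
    """Assign every edge to a uniformly random machine."""
    rng = random.Random(seed)
    placement = {}            # edge -> machine
    spans = defaultdict(set)  # vertex -> set of machines it is replicated on
    for u, v in edges:
        machine = rng.randrange(num_machines)
        placement[(u, v)] = machine
        spans[u].add(machine)
        spans[v].add(machine)
    return placement, spans


def replication_factor(spans):
    """Average number of machines spanned per vertex."""
    return sum(len(machines) for machines in spans.values()) / len(spans)


# Star graph: vertex 0 is a high-degree hub connected to 100 leaves.
edges = [(0, leaf) for leaf in range(1, 101)]
placement, spans = random_vertex_cut(edges, num_machines=4, seed=42)
print(len(spans[0]), replication_factor(spans))
```

Each edge is placed independently, so no coordination between machines is needed, but a high-degree vertex such as the hub tends to end up replicated on most machines.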


Graph Partitioning (4/4)

- Greedy vertex-cuts
- A(v): set of machines that vertex v spans.
- Case 1: If A(u) ∩ A(v) ≠ ∅, then the edge should be assigned to a machine in the intersection.
- Case 2: If A(u) ∩ A(v) = ∅ and neither set is empty, then the edge should be assigned to one of the machines from the vertex with the most unassigned edges.
- Case 3: If only one of the two vertices has been assigned, then choose a machine from the assigned vertex.
- Case 4: If A(u) = A(v) = ∅, then assign the edge to the least loaded machine.
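
The four cases can be sketched as a sequential placement loop. This is an illustration under assumptions: edges are processed one at a time on a single machine, ties inside a candidate set are broken by picking the least loaded machine (the slide only says "a machine"), and all names are hypothetical.

```python
from collections import Counter, defaultdict


def greedy_vertex_cut(edges, num_machines):
    load = [0] * num_machines   # number of edges per machine
    spans = defaultdict(set)    # A(v): machines that vertex v spans
    remaining = Counter()       # unassigned edges per vertex
    for u, v in edges:
        remaining[u] += 1
        remaining[v] += 1

    def least_loaded(candidates):
        # Assumed tie-break: lowest load, then lowest machine id.
        return min(candidates, key=lambda m: (load[m], m))

    placement = {}
    for u, v in edges:
        a_u, a_v = spans[u], spans[v]
        common = a_u & a_v
        if common:                      # Case 1: shared machine exists
            machine = least_loaded(common)
        elif a_u and a_v:               # Case 2: both placed, no overlap
            pick = u if remaining[u] >= remaining[v] else v
            machine = least_loaded(spans[pick])
        elif a_u or a_v:                # Case 3: only one vertex placed
            machine = least_loaded(a_u or a_v)
        else:                           # Case 4: neither vertex placed
            machine = least_loaded(range(num_machines))

        placement[(u, v)] = machine
        load[machine] += 1
        spans[u].add(machine)
        spans[v].add(machine)
        remaining[u] -= 1
        remaining[v] -= 1
    return placement, spans


triangle = [(1, 2), (2, 3), (1, 3)]
placement, spans = greedy_vertex_cut(triangle, num_machines=4)
print(placement)
```

In the triangle, the first edge falls under Case 4, the second under Case 3, and the third under Case 1, so all three edges land on the same machine and no vertex is replicated.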


GraphX


Data-Parallel vs. Graph-Parallel Computation

- Graph-parallel computation: restricting the types of computation to achieve performance.
- But the same restrictions make it difficult and inefficient to express many stages in a typical graph-analytics pipeline.


GraphX

- Unifies data-parallel and graph-parallel systems.
- Tables and Graphs are composable views of the same physical data.
- Implemented on top of Spark.


GraphX vs. Data-Parallel/Graph-Parallel Systems


Programming Model

- Gather-Apply-Scatter (GAS)
- Input data (Property Graph): represented using two Spark RDDs:
  • Vertex collection: VertexRDD
  • Edge collection: EdgeRDD

// VD: the type of the vertex attribute
// ED: the type of the edge attribute
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}


Execution Model

- GAS decomposition
- Gather: the groupBy stage gathers messages destined to the same vertex.
- Apply: an intervening map operation applies the message sum to update the vertex property.
- Scatter: the join stage scatters the new vertex property to all adjacent vertices.
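
The decomposition can be mimicked with plain Python collections to show how one PageRank iteration maps onto join, groupBy, and map stages. This is only a sketch of the dataflow idea, not GraphX code: the function names, the dictionary and list representations, and the edge weights are assumptions.

```python
from collections import defaultdict


def scatter_join(vertices, edges):
    """Join stage: combine each edge with its source property to form a message."""
    return [(dst, vertices[src] * weight) for (src, dst, weight) in edges]


def gather_group_by(messages):
    """groupBy stage: collect the messages destined to the same vertex."""
    grouped = defaultdict(list)
    for dst, value in messages:
        grouped[dst].append(value)
    return grouped


def apply_map(vertices, grouped):
    """Map stage: apply the message sum to update each vertex property."""
    return {v: 0.15 + sum(grouped.get(v, [])) for v in vertices}


def gas_iteration(vertices, edges):
    return apply_map(vertices, gather_group_by(scatter_join(vertices, edges)))


# Hypothetical graph: (source, destination, weight) with weight = 0.85 / out-degree.
vertices = {"a": 1.0, "b": 1.0, "c": 1.0}
edges = [("a", "b", 0.425), ("a", "c", 0.425), ("b", "c", 0.85), ("c", "a", 0.85)]
updated = gas_iteration(vertices, edges)
print(updated)
```

Expressing each stage as a collection operation is what lets a data-parallel engine run a graph-parallel step.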


GraphX Operators

class Graph[V, E] {
  // Constructor
  def Graph(v: Collection[(Id, V)], e: Collection[(Id, Id, E)])

  // Collection views
  def vertices: Collection[(Id, V)]
  def edges: Collection[(Id, Id, E)]
  def triplets: Collection[Triplet]

  // Graph-parallel computation
  def mrTriplets(f: (Triplet) => M, sum: (M, M) => M): Collection[(Id, M)]

  // Convenience functions
  def mapV(f: (Id, V) => V): Graph[V, E]
  def mapE(f: (Id, Id, E) => E): Graph[V, E]
  def leftJoinV(v: Collection[(Id, V)], f: (Id, V, V) => V): Graph[V, E]
  def leftJoinE(e: Collection[(Id, Id, E)], f: (Id, Id, E, E) => E): Graph[V, E]
  def subgraph(vPred: (Id, V) => Boolean, ePred: (Triplet) => Boolean): Graph[V, E]
  def reverse: Graph[V, E]
}
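
To make the triplets view and the mrTriplets operator more concrete, here is a toy single-machine analogue in Python. It is a sketch under assumptions, not the GraphX API: `Triplet`, `triplets`, and `mr_triplets` are hypothetical names, and the choice to aggregate messages at the destination vertex is an assumption of this example.

```python
from collections import namedtuple
from operator import add

Triplet = namedtuple("Triplet", ["src_id", "dst_id", "src_attr", "attr", "dst_attr"])


def triplets(vertices, edges):
    """Join every edge with the attributes of its two endpoints."""
    return [Triplet(src, dst, vertices[src], attr, vertices[dst])
            for (src, dst, attr) in edges]


def mr_triplets(vertices, edges, map_fn, reduce_fn):
    """Map over triplets, then reduce the messages per destination vertex."""
    result = {}
    for t in triplets(vertices, edges):
        message = map_fn(t)
        if t.dst_id in result:
            result[t.dst_id] = reduce_fn(result[t.dst_id], message)
        else:
            result[t.dst_id] = message
    return result


users = {3: ("rxin", "student"), 7: ("jgonzal", "postdoc"),
         5: ("franklin", "prof"), 2: ("istoica", "prof")}
relationships = [(3, 7, "collab"), (5, 3, "advisor"),
                 (2, 5, "colleague"), (5, 7, "pi")]

# In-degree: every triplet sends 1 to its destination, messages are summed.
in_degree = mr_triplets(users, relationships, lambda t: 1, add)
# Names of the source users pointing at each destination.
sources = mr_triplets(users, relationships, lambda t: [t.src_attr[0]], add)
print(in_degree, sources)
```

Vertices that receive no message do not appear in the result, which is why a join back onto the vertex collection is usually the next step.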


Example (1/3)


Example (2/3)

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assume the SparkContext has already been constructed
val sc: SparkContext

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] = sc.parallelize(
  Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for edges
val relationships: RDD[Edge[String]] = sc.parallelize(
  Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Define a default user in case a relationship refers to a missing user
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val userGraph: Graph[(String, String), String] =
  Graph(users, relationships, defaultUser)


Example (3/3)

// Constructed from above
val userGraph: Graph[(String, String), String]

// Count all users who are postdocs
userGraph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count

// Count all the edges where src > dst
userGraph.edges.filter(e => e.srcId > e.dstId).count

// Use the triplets view to create an RDD of facts
val facts: RDD[String] = userGraph.triplets.map(triplet =>
  triplet.srcAttr._1 + " is the " +
  triplet.attr + " of " + triplet.dstAttr._1)
facts.collect.foreach(println(_))

// Remove missing vertices as well as the edges connected to them
val validGraph = userGraph.subgraph(vpred = (id, attr) => attr._2 != "Missing")



Implementation

- GraphX is implemented on top of Spark
- In-memory caching
- Lineage-based fault tolerance


Graph Representation

- Vertex-cut partitioning
- Representing graphs using two RDDs: edge-collection and vertex-collection
- Routing table: a logical map from a vertex id to the set of edge partitions that contain adjacent edges.
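
A small sketch of how such a routing table could be derived from the edge partitions and then used to send each vertex attribute only to the partitions that need it. The data layout (lists of `(src, dst, attr)` tuples per partition) and the function names are assumptions for illustration, not the GraphX internals.

```python
from collections import defaultdict


def build_routing_table(edge_partitions):
    """Map each vertex id to the set of edge partitions holding adjacent edges."""
    routing = defaultdict(set)
    for partition_id, edges in enumerate(edge_partitions):
        for src, dst, _attr in edges:
            routing[src].add(partition_id)
            routing[dst].add(partition_id)
    return dict(routing)


def ship_vertex_attrs(vertices, routing, num_partitions):
    """Send every vertex attribute to exactly the partitions that reference it."""
    shipped = [dict() for _ in range(num_partitions)]
    for vertex_id, attr in vertices.items():
        for partition_id in routing.get(vertex_id, ()):
            shipped[partition_id][vertex_id] = attr
    return shipped


# Hypothetical vertex-cut: four edges split over two edge partitions.
edge_partitions = [
    [(3, 7, "collab"), (5, 3, "advisor")],
    [(2, 5, "colleague"), (5, 7, "pi")],
]
vertices = {3: "rxin", 7: "jgonzal", 5: "franklin", 2: "istoica"}

routing = build_routing_table(edge_partitions)
shipped = ship_vertex_attrs(vertices, routing, num_partitions=2)
print(routing)
```

Vertices 5 and 7 are cut by the partitioning, so their attributes are replicated to both edge partitions, while vertices 2 and 3 are shipped to one partition each.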


Summary


Pregel Summary

- Bulk synchronous parallel model
- Vertex-centric
- Supersteps: computation proceeds as a sequence of iterations


GraphLab Summary

- Asynchronous model
- Vertex-centric
- Three consistency levels: full, edge-level, and vertex-level
- Partitioning: two-phase partitioning
- Consistency: chromatic engine (graph coloring), distributed lock engine (reader-writer lock)
- Fault tolerance: synchronous, asynchronous (Chandy-Lamport)


PowerGraph Summary

- Gather-Apply-Scatter programming model
- Synchronous and asynchronous models
- Vertex-cut graph partitioning


GraphX Summary

- Unifies graph-parallel and data-parallel models
- Gather-Apply-Scatter programming model
- Vertex-cut graph partitioning
- On top of Spark


Questions?
