Efficient Graph Processing with Distributed Immutable View
Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
Institute of Parallel and Distributed Systems +
Department of Computer Science *
Shanghai Jiao Tong University
HPDC 2014
Big Data Everywhere
□ 100 hours of video uploaded every minute
□ 1.11 billion users
□ 6 billion photos
□ 400 million tweets per day
How do we understand and use Big Data?
Big Data → Big Learning
□ Machine Learning and Data Mining (e.g., NLP)
It's about the graphs ...
Example: PageRank
A centrality analysis algorithm that measures the relative rank of each element in a linked set.
Characteristics
□ Linked set → data dependence
□ Rank of who links it → local accesses
□ Convergence → iterative computation

$R_i = \alpha + (1-\alpha) \sum_{(j,i)\in E} \omega_{ij} R_j$
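The update rule above can be exercised with a small self-contained sketch (Python for brevity; the system in the talk is Java-based). The toy graph, the uniform weights ω_ij = 1/outdeg(j), and the iteration count are illustrative assumptions, not details from the talk.

```python
# Minimal PageRank sketch matching the slide's formula:
#   R_i = alpha + (1 - alpha) * sum_{(j,i) in E} w_ij * R_j,  with w_ij = 1/outdeg(j)
def pagerank(edges, num_vertices, alpha=0.15, iters=30):
    # edges: list of (src, dst) pairs; rank flows src -> dst
    out_deg = [0] * num_vertices
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0] * num_vertices
    for _ in range(iters):
        acc = [0.0] * num_vertices
        for s, d in edges:
            acc[d] += rank[s] / out_deg[s]   # w_ij * R_j
        rank = [alpha + (1 - alpha) * a for a in acc]
    return rank

# Tiny 3-vertex cycle: every vertex settles at the same rank.
r = pagerank([(0, 1), (1, 2), (2, 0)], 3)
```

On a symmetric cycle the fixed point is R = α + (1-α)·R, i.e., R = 1, which matches the iteration's result.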
Existing Graph-parallel Systems
"Think as a vertex" philosophy
1. aggregate the values of neighbors
2. update its own value
3. activate neighbors

compute(v)  // PageRank
  double sum = 0
  double value, last = v.get()
  foreach (n in v.in_nbrs)
    sum += n.value / n.nedges
  value = 0.15 + 0.85 * sum
  v.set(value)
  activate(v.out_nbrs)
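The compute(v) pseudocode above can be made runnable with a toy synchronous driver. The Vertex class and the driver loop are our own scaffolding, not an actual system API; values are updated in place for brevity.

```python
# "Think as a vertex": each active vertex aggregates its in-neighbors,
# updates itself, and activates its out-neighbors for the next superstep.
class Vertex:
    def __init__(self, vid):
        self.vid, self.value = vid, 1.0
        self.in_nbrs, self.out_nbrs = [], []

def compute(v, activate):
    # 1. aggregate values of in-neighbors (n.nedges on the slide = out-degree)
    total = sum(n.value / len(n.out_nbrs) for n in v.in_nbrs)
    # 2. update own value
    v.value = 0.15 + 0.85 * total
    # 3. activate out-neighbors
    for n in v.out_nbrs:
        activate(n)

def run_bsp(vertices, supersteps):
    active = set(vertices)
    for _ in range(supersteps):
        nxt = set()
        for v in list(active):
            compute(v, nxt.add)
        active = nxt
    return vertices

# Usage: a 3-vertex cycle settles at rank 1.0 for every vertex.
vs = [Vertex(i) for i in range(3)]
for i in range(3):
    vs[i].out_nbrs.append(vs[(i + 1) % 3])
    vs[(i + 1) % 3].in_nbrs.append(vs[i])
run_bsp(vs, 5)
```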
Existing Graph-parallel Systems
"Think as a vertex" philosophy
Execution engine
□ sync: BSP-like model
□ async: distributed scheduling queues
Communication
□ message passing: push value
□ distributed shared memory: sync & pull value

[Figure: comp./comm. timelines contrasting push, pull, and sync with a barrier]
Issues of Existing Systems
Pregel [SIGMOD'10]
→ Sync engine, Edge-cut + Message Passing
→ Issues: w/o dynamic comp., high contention
GraphLab [VLDB'12]
→ Async engine, Edge-cut + DSM (replicas)
→ Issues: high contention, hard to program, duplicated edges, heavy comm. cost
PowerGraph [OSDI'12]
→ (A)Sync engine, Vertex-cut + GAS (replicas)
→ Issues: high contention, heavy comm. cost

[Figure: per-system illustration of masters, replicas, and messages, showing duplicated edges and multiplied message cost per replica]
Contributions
Distributed Immutable View
□ Easy to program/debug
□ Support dynamic computation
□ Minimized communication cost (only 1 msg per replica)
□ Contention (comp. & comm.) immunity
Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improved parallelism and locality
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation
Observation
General idea: for most graph algorithms, a vertex only aggregates its neighbors' data in one direction and activates neighbors in the other direction
□ e.g., PageRank, SSSP, Community Detection, ...
Approach: local aggregation/update & distributed activation
□ Partitioning: avoid duplicated edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages
Graph Organization
Partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Create edges in only one direction (e.g., in-edges) → avoid duplicated edges
□ Create read-only replicas for edges spanning machines
[Figure: a sample graph partitioned across machines M1–M3, with master vertices and read-only replicas]
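The three bullets above can be sketched as one pass over the edge list. This assumes a hash-based edge-cut keyed on the destination vertex; the data layout (per-machine dicts of in-edge lists and replica sets) is illustrative, not the Cyclops representation.

```python
# Cyclops-style graph organization: hash-partition vertices, keep only
# in-edges locally, and create a read-only replica for each source vertex
# that lives on another machine (no duplicated edges).
def build_subgraphs(edges, num_machines):
    part = lambda v: v % num_machines          # hash-based edge-cut
    subs = [{"in_edges": [], "replicas": set()} for _ in range(num_machines)]
    for src, dst in edges:
        m = part(dst)                          # an in-edge lives with its target
        subs[m]["in_edges"].append((src, dst)) # only one direction: in-edges
        if part(src) != m:                     # source is remote:
            subs[m]["replicas"].add(src)       #   read-only replica of it
    return subs

# Usage: 4 edges over 2 machines; edge (0,1) spans machines, so machine 1
# gets a read-only replica of vertex 0.
subs = build_subgraphs([(0, 1), (1, 2), (2, 0), (3, 1)], 2)
```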
Vertex Computation
Local aggregation/update
□ Supports dynamic computation → one-way local semantics
□ Immutable view: read-only access to neighbors → eliminates contention on vertices

[Figure: local computation on M1–M3; replicas are accessed read-only]
Communication
Sync. & distributed activation
□ Merge update & activate messages:
  1. Update the value of replicas
  2. Invite replicas to activate their neighbors

[Figure: a master on M2 with per-vertex state (rlist: W1, l-act, value, msg) sends a merged message "v|m|s", e.g., "8 4 0", to its replicas on M1/M3]
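The merged message can be sketched as follows, reading the figure's "v|m|s" fields as value, aggregated message, and an activation flag; these field meanings and the helper names are our assumptions from the figure, not a documented format.

```python
# One merged message both updates a replica's value and tells it whether
# to activate its local neighbors, so no second activation message is needed.
def make_msg(value, agg_msg, activate):
    return (value, agg_msg, 1 if activate else 0)   # v | m | s

def apply_msg(replica, msg, local_out_nbrs, active_set):
    v, m, s = msg
    replica["value"] = v          # 1. update the replica's value
    replica["msg"] = m
    if s:                         # 2. replica activates its local neighbors
        active_set.update(local_out_nbrs)

# Usage: message "8 4 1" updates the replica and activates neighbors 2 and 3.
replica, active = {}, set()
apply_msg(replica, make_msg(8, 4, True), local_out_nbrs=[2, 3], active_set=active)
```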
Communication
Distributed activation
□ Unidirectional message passing (always master → replicas)
→ A replica will never be activated → contention immunity

[Figure: per-machine in-queues and out-queues; messages flow in one direction only]
Change of Execution Flow
Original execution flow (e.g., Pregel): receiving → parsing → computation → sending
→ High overhead of message parsing and high contention on shared message queues

[Figure: threads on M1 buffer and parse per-vertex messages arriving from M2/M3 before computation]
Change of Execution Flow
Execution flow on distributed immutable view: receiving is lock-free and there is no separate parsing phase, since messages update masters/replicas directly
→ Low overhead, no contention

[Figure: threads on M1 write received values straight into master/replica slots, then compute and send]
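Why receiving can be lock-free: each replica has exactly one writer (its master), so an incoming value can be written straight into a per-vertex slot with no shared in-queue and no parsing. A minimal threaded sketch with illustrative data, not the Cyclops implementation:

```python
import threading

# Each message targets a distinct replica slot owned by a single master,
# so concurrent receiver threads never write the same slot: no lock needed.
def receiver(replicas, inbox):
    for vid, value in inbox:
        replicas[vid] = value     # direct slot write, lock-free

replicas = [0.0, 0.0, 0.0]        # preallocated replica slots
inboxes = [[(0, 8.0), (1, 6.0)], [(2, 3.0)]]   # one inbox per receiver thread
threads = [threading.Thread(target=receiver, args=(replicas, ib)) for ib in inboxes]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The contrast with the original flow is that nothing here buffers or parses: the message payload lands directly in its final location.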
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation
Multicore Support
Two challenges
1. Two-level hierarchical organization → preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize → high contention to buffer and parse messages, and poor locality in message parsing
Hierarchical Model
Design principle
□ Three levels: iteration → worker → thread
□ Only the last-level participants perform actual tasks
□ Parents (i.e., higher-level participants) just wait until all their children finish their tasks

[Figure: Level-0 iteration loop with a global barrier; Level-1 workers and Level-2 threads synchronized by local barriers]
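The three-level hierarchy above can be sketched with nested barriers: threads meet a per-worker local barrier, then one representative per worker meets the global barrier. Worker/thread counts and the task callback are illustrative, not the Cyclops implementation.

```python
import threading

def run_iteration(num_workers, threads_per_worker, task, global_barrier_hits):
    global_barrier = threading.Barrier(num_workers)
    def worker(wid):
        local_barrier = threading.Barrier(threads_per_worker)
        def thread_body(tid):
            task(wid, tid)            # only leaf threads do real work
            local_barrier.wait()      # local barrier: wait for sibling threads
        ts = [threading.Thread(target=thread_body, args=(t,))
              for t in range(threads_per_worker)]
        for t in ts:
            t.start()
        for t in ts:
            t.join()
        global_barrier.wait()         # global barrier: workers sync per iteration
        global_barrier_hits.append(wid)
    ws = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for w in ws:
        w.start()
    for w in ws:
        w.join()

# Usage: 2 workers x 3 threads run 6 leaf tasks, then both workers cross
# the global barrier exactly once.
done, hits = [], []
run_iteration(2, 3, lambda w, t: done.append((w, t)), hits)
```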
Parallelism Improvement
The original BSP-like model is hard to parallelize
→ High contention when buffering and parsing messages in shared queues
→ Poor locality in message parsing, even with private out-queues

[Figure: receiving → parsing → computation → sending over shared in-queues and (private) out-queues on M1–M3]
Parallelism Improvement
Distributed immutable view opens an opportunity
→ Lock-free receiving: messages update masters/replicas in place, with no parsing phase
→ Private out-queues: no interference in sending
→ Sorted messages: good locality in receiving

[Figure: computation → sending → receiving pipeline across M1–M3 with per-thread private out-queues and a sorted message layout]
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Implementation & Experiment
Implementation
Cyclops(MT)
□ Based on Apache Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provides a mostly compatible user interface
□ Graph ingress and partitioning
→ Compatible I/O interface
→ An additional phase to build replicas
□ Fault tolerance
→ Incremental checkpoint
→ Replication-based FT [DSN'14]
Experiment Settings
Platform
□ 6 x 12-core AMD Opteron (64GB RAM, 1GigE NIC)
Graph algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single-Source Shortest Path (SSSP)
Workload
□ 6 real-world datasets from SNAP¹ and 1 synthetic dataset from GraphLab² (7 in total)

Dataset     |V|     |E|
Amazon      0.4M    3.4M
GWeb        0.9M    5.1M
LJournal    4.8M    69M
Wiki        5.7M    130M
SYN-GL      0.1M    2.7M
DBLP        0.3M    1.0M
RoadCA      1.9M    5.5M

¹ http://snap.stanford.edu/data/
² http://graphlab.org
Overall Performance Improvement

[Figure: normalized speedup over Hama of Cyclops and CyclopsMT for PageRank, ALS, CD, and SSSP (push mode) on all datasets; annotated speedups of 8.69X and 2.06X; 48 workers, and 6 workers with 8 threads each for CyclopsMT]
Performance Scalability

[Figure: normalized speedup of Hama, Cyclops, and CyclopsMT at 6/12/24/48 workers (threads for CyclopsMT) on each dataset; peak annotated speedup 50.2]
Performance Breakdown

[Figure: ratio of execution time spent in PARSE, SEND, COMP, and SYNC for PageRank, ALS, CD, and SSSP on each dataset; plus per-iteration message count (K) and active-vertex count (K) for Hama, Cyclops, and CyclopsMT]
Comparison with PowerGraph¹

[Figure: execution time (sec) and message count (M) of CyclopsMT vs. PowerGraph on Amazon, GWeb, LJournal, and Wiki]

Dataset     COMP%
Amazon      11%
GWeb        15%
LJournal    25%
Wiki        39%

Cyclops-like engine on the GraphLab¹ platform (C++ & Boost RPC lib.)
Preliminary results
[Figure: execution time (sec) on regular vs. natural graphs²]

¹ http://graphlab.org
² synthetic 10-million-vertex regular (even-degree) and power-law (α = 2.0) graphs
Conclusion
Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserves the synchronous and deterministic computation nature (easy to program/debug)
□ Provides efficient vertex computation with significantly fewer messages and contention immunity via the distributed immutable view
□ Further supports multicore-based clusters with a hierarchical processing model and high parallelism
Source code: http://ipads.se.sjtu.edu.cn/projects/cyclops

Questions?
Thanks
Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html
IPADS: Institute of Parallel and Distributed Systems
What's Next?
PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperforms PowerGraph by up to 3.26X for natural graphs
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

[Figure: preliminary results; execution time (sec) of PowerLyra, PowerGraph, and Cyclops on regular (R) and natural (N) graphs]

Power-law: "most vertices have relatively few neighbors while a few have many neighbors"
Generality
Algorithms that aggregate/activate all neighbors
□ e.g., Community Detection (CD)
□ Transform to an undirected graph by duplicating edges
□ Still aggregate in one direction (e.g., in-edges) and activate in the other direction (e.g., out-edges)
□ Preserves all benefits of Cyclops → 1 msg per replica, contention immunity, and good locality

[Figure: the undirected graph and its sub-graphs on M1–M3, with masters and replicas]
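The transformation in the bullets above can be sketched as explicit edge duplication; the edge-set representation is an assumption for illustration.

```python
# For algorithms that touch all neighbors, make the graph undirected by
# duplicating each edge in the reverse direction; the engine then still
# aggregates along in-edges only and activates along out-edges.
def to_undirected(edges):
    und = set()
    for s, d in edges:
        und.add((s, d))
        und.add((d, s))   # duplicated reverse edge
    return sorted(und)

# Usage: a 2-edge path becomes 4 directed edges.
und = to_undirected([(0, 1), (1, 2)])
```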
Generality
Differences between Cyclops and GraphLab
1. How to construct the local sub-graph
2. How to aggregate/activate neighbors

[Figure: local sub-graph construction in Cyclops vs. GraphLab on M1–M3]
Improvement of CyclopsMT

[Figure: execution time (sec), split into SEND, COMP, and SYNC, for Cyclops and CyclopsMT under configurations MxWxT/R, where M = #machines, W = #workers, T = #threads, R = #receivers; e.g., 6x1x1/1 through 6x1x8/8]
Communication Efficiency

[Figure: SEND and PARSE execution time (sec, log scale) of Hama vs. Cyclops for 5M, 25M, and 50M messages; annotations include speedups of 25.6X, 16.2X, and 12.6X and parse-time shares of 55.6%, 25.0%, and 31.5%]

message: (id, data)
Hama: send + buffer + parse (contention), Hadoop RPC lib (Java)
PowerGraph: send + update (contention), Boost RPC lib (C++)
Cyclops: send + update, Hadoop RPC lib (Java)
Using Heuristic Edge-cut (i.e., Metis)

[Figure: normalized speedup over Hama of Cyclops and CyclopsMT for PageRank, ALS, CD, and SSSP on all datasets; annotated speedups of 23.04X and 5.95X; 48 workers, and 6 workers with 8 threads each]
Memory Consumption¹ per Worker (PageRank with the Wiki dataset)

Configuration    Max Cap (GB)   Max Usage (GB)   Young GC² (#)   Full GC² (#)
Hama/48          1.7            1.5              132             69
Cyclops/48       4.0            3.0              45              15
CyclopsMT/6x8    12.6/8         11.0/8           268/8           32/8

¹ measured with jStat
² GC: Concurrent Mark-Sweep

Ingress Time (H = Hama, C = Cyclops)

Dataset     LD (H/C)      REP (H/C)    INIT (H/C)    TOT (H/C)
Amazon      6.2 / 5.9     0.0 / 2.5    1.7 / 1.5     7.9 / 9.9
GWeb        7.1 / 6.8     0.0 / 2.8    2.6 / 1.9     9.7 / 11.4
LJournal    27.1 / 31.0   0.0 / 44.7   17.9 / 9.2    45.0 / 84.9
Wiki        46.7 / 46.7   0.0 / 62.2   33.4 / 20.4   80.0 / 129.3
SYN-GL      4.2 / 4.0     0.0 / 2.6    2.4 / 1.8     6.6 / 8.4
DBLP        4.1 / 4.1     0.0 / 1.5    1.3 / 0.9     5.4 / 6.5
RoadCA      6.4 / 6.2     0.0 / 3.9    0.9 / 0.6     7.3 / 10.7
Selective Activation
Sync. & distributed activation
□ Merge update & activate messages:
  1. Update the value of replicas
  2. Invite replicas to activate their neighbors
□ Selective activation (e.g., ALS): the message carries an activation list (msg: v|m|s|l; option: Activation_List)

[Figure: a master on M2 sends merged messages "v|m|s" (e.g., "8 4 0"), extended to "v|m|s|l" under selective activation, to its replicas on M1/M3]
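The extended message can be sketched as follows, reading "v|m|s|l" as value, aggregated message, activation flag, and activation list; this field layout is our reading of the slide, not a documented format.

```python
# Selective activation: instead of activating all local neighbors, the
# replica activates only those in the message's activation list.
def apply_selective(replica, msg, active_set):
    v, m, s, act_list = msg            # v | m | s | l
    replica["value"] = v
    replica["msg"] = m
    if s:
        active_set.update(act_list)    # only the listed neighbors wake up

# Usage: activate only neighbor 2, even if the replica has more neighbors.
replica, active = {}, set()
apply_selective(replica, (8, 4, 1, [2]), active)
```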
Parallelism Improvement
Distributed immutable view opens an opportunity
→ Lock-free receiving with sorted messages → good locality
→ Computation threads vs. communication threads can be configured separately

[Figure: computation → sending → receiving pipeline with per-thread private out-queues]

Cyclops vs. existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph)
→ Existing systems: w/o dynamic comp., high contention, hard to program, duplicated edges, heavy comm. cost
→ Cyclops(MT) with the distributed immutable view: w/ dynamic comp., no contention, easy to program, no duplicated edges, low comm. cost
What's Next?
BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partitioning algorithms designed for bipartite graphs and their applications
□ Partitions graphs in a differentiated way and loads data according to data affinity
□ Outperforms PowerGraph with its default partitioning by up to 17.75X, and saves up to 96% of network traffic
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
Multicore Support
Two challenges
1. Two-level hierarchical organization → preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize → high contention to buffer and parse messages, poor locality in message parsing, and an asymmetric degree of parallelism between CPU and NIC