TRANSCRIPT
Efficient Graph Processing
with Distributed Immutable View
Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
Institute of Parallel and Distributed Systems +
Department of Computer Science *
Shanghai Jiao Tong University
HPDC 2014
Big Data Everywhere
□ 1.11 billion users
□ 6 billion photos
□ 400 million tweets/day
□ 100 hours of video every minute
How do we understand and use Big Data?
Big Data → Big Learning
□ Machine Learning and Data Mining (e.g., NLP)
It’s about the graphs ...
Example: PageRank
A centrality analysis algorithm to measure the relative rank of each element of a linked set.
Characteristics
□ Linked set → data dependence
□ Rank of whoever links to it → local accesses
□ Convergence → iterative computation

    R_i = α + (1 − α) · Σ_{(j,i)∈E} ω_ij · R_j
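As a sketch, the whole-graph iteration implied by this update rule can be written as follows, assuming ω_ij = 1/out-degree(j) and α = 0.15 (matching the compute() example later in the talk); the dict-of-out-edge-lists representation is illustrative only:

```python
# Power iteration for the slide's PageRank variant:
#   R_i = alpha + (1 - alpha) * sum_{(j,i) in E} R_j / outdeg(j)
def pagerank(out_edges, alpha=0.15, iters=50):
    ranks = {v: 1.0 for v in out_edges}
    outdeg = {v: len(nbrs) for v, nbrs in out_edges.items()}
    for _ in range(iters):
        acc = {v: 0.0 for v in out_edges}
        # each vertex j pushes omega_ij * R_j = R_j / outdeg(j) along out-edges
        for j, nbrs in out_edges.items():
            for i in nbrs:
                acc[i] += ranks[j] / outdeg[j]
        ranks = {i: alpha + (1 - alpha) * s for i, s in acc.items()}
    return ranks

g = {1: [2, 3], 2: [3], 3: [1]}   # toy 3-vertex graph
r = pagerank(g)
```

Vertex 3, which is linked by both 1 and 2, ends up with the highest rank.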
Existing Graph-parallel Systems
“Think as a vertex” philosophy:
1. aggregate values of neighbors
2. update its own value
3. activate neighbors

    compute(v):  // PageRank
        double sum = 0
        foreach (n in v.in_nbrs)
            sum += n.value / n.nedges
        value = 0.15 + 0.85 * sum
        v.set(value)
        activate(v.out_nbrs)
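The vertex program above can also be rendered as runnable Python; this is a sketch over a toy in-memory graph, and the dict-based vertex layout (value, in_nbrs, out_nbrs, nedges) is an assumption, not any system's actual API:

```python
# "Think as a vertex": aggregate in-neighbors, update self, activate
# out-neighbors. Vertex representation is illustrative only.
def compute(vid, vertices, next_active):
    v = vertices[vid]
    acc = 0.0
    for n in v["in_nbrs"]:                  # 1. aggregate neighbors' values
        acc += vertices[n]["value"] / vertices[n]["nedges"]
    v["value"] = 0.15 + 0.85 * acc          # 2. update its own value
    next_active.update(v["out_nbrs"])       # 3. activate neighbors
```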
Existing Graph-parallel Systems
Execution Engine
□ sync: BSP-like model
□ async: distributed sched_queues
Communication
□ message passing: push value
□ dist. shared memory: sync & pull
Issues of Existing Systems
Pregel [SIGMOD’09] → sync engine, edge-cut + message passing
□ w/o dynamic computation
□ high contention
GraphLab [VLDB’12] → async engine, edge-cut + DSM (replicas)
□ high contention
□ hard to program
□ duplicated edges
□ heavy communication cost
PowerGraph [OSDI’12] → (a)sync engine, vertex-cut + GAS (replicas)
□ high contention
□ heavy communication cost
Contributions
Distributed Immutable View
□ Easy to program/debug
□ Supports dynamic computation
□ Minimized communication cost (x1 per replica)
□ Contention (computation & communication) immunity
Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improved parallelism and locality
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation
General Idea
Observation: for most graph algorithms, a vertex only aggregates neighbors’ data in one direction and activates neighbors in the other direction
□ e.g., PageRank, SSSP, Community Detection, ...
Local aggregation/update & distributed activation
□ Partitioning: avoid duplicated edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages
Graph Organization
Partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Only create edges in one direction (e.g., in-edges) → avoid duplicated edges
□ Create read-only replicas for edges spanning machines
[Figure: example graph partitioned across machines M1–M3, with master vertices and read-only replicas]
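A hypothetical sketch of this ingress step, assuming a hash-based edge-cut; the data layout (dicts of masters, replicas, and in-edge lists) is illustrative, not Cyclops's actual structures:

```python
# Hash-partition vertices, keep only in-edges on each machine, and
# create read-only replicas for remote sources.
def partition(vertices, edges, num_machines):
    owner = lambda v: v % num_machines      # randomized (hash-based) edge-cut
    machines = [{"masters": set(), "replicas": set(), "in_edges": []}
                for _ in range(num_machines)]
    for v in vertices:
        machines[owner(v)]["masters"].add(v)
    for src, dst in edges:
        m = machines[owner(dst)]            # an in-edge lives with its destination
        m["in_edges"].append((src, dst))    # only one direction: no duplicated edges
        if owner(src) != owner(dst):
            m["replicas"].add(src)          # read-only replica of the remote source
    return machines
```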
Vertex Computation
Local aggregation/update
□ Supports dynamic computation → one-way local semantics
□ Immutable view: read-only access to neighbors → eliminates contention on vertices
Communication
Sync. & distributed activation
□ Merge update & activate messages
  1. Update the value of replicas
  2. Invite replicas to activate their neighbors
[Figure: distributed activation example on M1–M3; master state such as “rlist: W1, l-act: 1, value: 8, msg: 4”; merged message format msg: v|m|s, e.g., 8 4 0]
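A hedged sketch of the merged message: one message carries both the new value and the activation request. Field names loosely follow the slide's v|m|s notation; the exact wire format is an assumption:

```python
# One merged message per replica instead of separate "update value"
# and "activate neighbors" messages. Field names are assumptions.
def master_send(value, activate):
    return {"v": value, "s": 1 if activate else 0}

def replica_receive(replica, msg, next_active):
    replica["value"] = msg["v"]          # 1. update the replica's value
    if msg["s"]:                         # 2. invited to activate local neighbors
        next_active.update(replica["local_out_nbrs"])
```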
Communication
Distributed activation
□ Unidirectional message passing
  → Replicas are never activated
  → Messages always flow master → replica → contention immunity
Change of Execution Flow
Original execution flow (e.g., Pregel): receiving → parsing → computation → sending
□ high overhead (buffering and parsing messages)
□ high contention (shared queues)
Execution flow on Distributed Immutable View: computation → sending → receiving
□ lock-free sending via out-queues
□ low overhead, no contention: messages update replicas directly, with no parsing phase
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation
Multicore Support
Two challenges
1. Two-level hierarchical organization
   → Preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize
   → High contention when buffering and parsing messages
   → Poor locality in message parsing
Hierarchical Model
Design principle
□ Three levels: iteration → worker → thread
□ Only the last-level participants perform actual tasks
□ Parents (i.e., higher-level participants) just wait until all children finish their tasks
[Figure: Level-0 iteration loop, Level-1 workers meeting at a global barrier, Level-2 threads meeting at a local barrier]
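A minimal sketch of this hierarchy, using Python threads as the last-level participants: each worker's threads meet at a local barrier, then all threads meet at the global barrier that closes the iteration. The mapping onto `threading` primitives is illustrative:

```python
import threading

# Leaves (threads) do the actual work; parents just wait at barriers.
def run_iteration(num_workers, threads_per_worker, task, results):
    local = [threading.Barrier(threads_per_worker) for _ in range(num_workers)]
    global_barrier = threading.Barrier(num_workers * threads_per_worker)
    def leaf(w, t):
        results.append(task(w, t))   # only last-level participants perform tasks
        local[w].wait()              # local barrier: worker waits for its threads
        global_barrier.wait()        # global barrier: iteration waits for workers
    threads = [threading.Thread(target=leaf, args=(w, t))
               for w in range(num_workers) for t in range(threads_per_worker)]
    for th in threads: th.start()
    for th in threads: th.join()
```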
Parallelism Improvement
The original BSP-like model is hard to parallelize
□ Threads share the in-queues and out-queues → high contention when buffering and parsing messages
□ Poor locality when parsing messages
Distributed immutable view opens an opportunity
□ Private per-thread out-queues → lock-free sending
□ Messages update replicas directly → no parsing phase, no interference between threads
□ Messages sorted by destination vertex → good locality
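The send/receive path above might be sketched as follows: each computation thread appends to its own private out-queue (one per destination machine), so sending needs no locks, and the receiver applies updates in sorted vertex order for locality. Names and layout are illustrative, not the actual implementation:

```python
# Private out-queues per (thread, machine): lock-free sending.
def send_phase(per_thread_msgs, num_machines, owner):
    out_queues = [[[] for _ in range(num_machines)] for _ in per_thread_msgs]
    for tid, msgs in enumerate(per_thread_msgs):
        for vid, val in msgs:
            out_queues[tid][owner(vid)].append((vid, val))  # no shared queue
    return out_queues

# Receiver updates replicas directly -- no parsing phase, no contention.
def receive_phase(out_queues, machine_id, replicas):
    batch = [m for q in out_queues for m in q[machine_id]]
    for vid, val in sorted(batch):      # sorted by vertex id: good locality
        replicas[vid] = val
    return replicas
```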
Outline
Distributed Immutable View
Multicore-based Cluster Support
Implementation & Experiment
Implementation
Cyclops(MT)
□ Based on Apache Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provides a mostly compatible user interface
□ Graph ingress and partitioning
  → Compatible I/O interface
  → Additional phase to build replicas
□ Fault tolerance
  → Incremental checkpoint
  → Replication-based FT [DSN’14]
Experiment Settings
Platform
□ 6 × 12-core AMD Opteron machines (64GB RAM, 1GigE NIC)
Graph Algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single-Source Shortest Path (SSSP)
Workload
□ 7 real-world datasets from SNAP¹
□ 1 synthetic dataset from GraphLab²

Dataset    |V|    |E|
Amazon     0.4M   3.4M
GWeb       0.9M   5.1M
LJournal   4.8M   69M
Wiki       5.7M   130M
SYN-GL     0.1M   2.7M
DBLP       0.3M   1.0M
RoadCA     1.9M   5.5M

¹ http://snap.stanford.edu/data/   ² http://graphlab.org
Overall Performance Improvement
[Figure: normalized speedup over Hama of Cyclops and CyclopsMT (push-mode) for PageRank, ALS, CD, and SSSP on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA; highlighted speedups of 8.69X and 2.06X; Hama/Cyclops use 48 workers, CyclopsMT uses 6 workers(8)]
Performance Scalability
[Figure: normalized speedup of Hama, Cyclops, and CyclopsMT with 6, 12, 24, and 48 workers/threads on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA; peak 50.2X]
Performance Breakdown
[Figure: ratio of execution time spent in PARSE, SEND, COMP, and SYNC for PageRank, ALS, CD, and SSSP on each dataset; plus #messages (K) and #vertices (K) per iteration for Hama, Cyclops, and CyclopsMT]
Comparison with PowerGraph¹
[Figure: execution time (sec) and #messages (M) of CyclopsMT vs. PowerGraph on Amazon, GWeb, LJournal, and Wiki]

Dataset    COMP%
Amazon     11%
GWeb       15%
LJournal   25%
Wiki       39%
Preliminary Results: Cyclops-like engine on the GraphLab¹ platform (C++ & Boost RPC lib.)
[Figure: execution time (sec) on regular and natural graphs²]
¹ http://graphlab.org
² Synthetic 10-million-vertex regular (even edge) and power-law (α = 2.0) graphs
Conclusion
Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserves the synchronous and deterministic computation nature (easy to program/debug)
□ Provides efficient vertex computation with significantly fewer messages and contention immunity via the distributed immutable view
□ Further supports multicore-based clusters with a hierarchical processing model and high parallelism
Source Code: http://ipads.se.sjtu.edu.cn/projects/cyclops
Questions?
Thanks
Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html
IPADS: Institute of Parallel and Distributed Systems
What’s Next?
PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperforms PowerGraph by up to 3.26X for natural graphs
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
Power-law: “most vertices have relatively few neighbors while a few have many neighbors”
[Figure: preliminary results — execution time (sec) of PL, PG, and Cyclops on regular (R) and natural (N) graphs]
Generality
Algorithms that aggregate/activate all neighbors
□ e.g., Community Detection (CD)
□ Transform to an undirected graph by duplicating edges
□ Still aggregate in one direction (e.g., in-edges) and activate in the other direction (e.g., out-edges)
□ Preserve all the benefits of Cyclops → x1 per replica, contention immunity & good locality
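The undirected transformation mentioned above can be sketched as a trivial helper (the name is illustrative): every edge is duplicated in both directions, so aggregation can still read only in-edges while activation still follows out-edges.

```python
# Duplicate every edge in both directions to emulate an undirected graph.
def to_undirected(edges):
    dup = set()
    for u, v in edges:
        dup.add((u, v))
        dup.add((v, u))
    return sorted(dup)
```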
Generality
Differences between Cyclops and GraphLab
1. How the local sub-graph is constructed
2. How neighbors are aggregated/activated
Improvement of CyclopsMT
[Figure: execution time (sec) broken into SEND, COMP, and SYNC under configurations M×W×T/R — #[M]achines × #[W]orkers × #[T]hreads / #[R]eceivers — ranging from 6x1x1/1 to 6x1x8/8, comparing Cyclops and CyclopsMT]
Communication Efficiency
[Figure: SEND and PARSE execution time (sec, log scale) of Hama vs. Cyclops for 50M, 25M, and 5M messages; Cyclops is up to 25.6X, 16.2X, and 12.6X faster, with PARSE shares of 55.6%, 31.5%, and 25.0%]
Message: (id, data), sent among workers W0–W5
□ Hama: send + buffer + parse (contention) — Hadoop RPC lib (Java)
□ PowerGraph: send + update (contention) — Boost RPC lib (C++)
□ Cyclops: send + update — Hadoop RPC lib (Java)
Using Heuristic Edge-cut (i.e., Metis)
[Figure: normalized speedup of Hama, Cyclops, and CyclopsMT for PageRank, ALS, CD, and SSSP on all datasets; up to 23.04X and 5.95X; 48 workers vs. 6 workers(8)]
Memory Consumption
Memory behavior¹ per worker (PageRank with the Wiki dataset)

Configuration     Max Cap (GB)  Max Usage (GB)  Young GC² (#)  Full GC² (#)
Hama/48           1.7           1.5             132            69
Cyclops/48        4.0           3.0             45             15
CyclopsMT/6x8     12.6/8        11.0/8          268/8          32/8

¹ measured with jStat   ² GC: Concurrent Mark-Sweep
Ingress Time (H = Hama, C = Cyclops)

Dataset    LD (H/C)    REP (H/C)   INIT (H/C)   TOT (H/C)
Amazon     6.2/5.9     0.0/2.5     1.7/1.5      7.9/9.9
GWeb       7.1/6.8     0.0/2.8     2.6/1.9      9.7/11.4
LJournal   27.1/31.0   0.0/44.7    17.9/9.2     45.0/84.9
Wiki       46.7/46.7   0.0/62.2    33.4/20.4    80.0/129.3
SYN-GL     4.2/4.0     0.0/2.6     2.4/1.8      6.6/8.4
DBLP       4.1/4.1     0.0/1.5     1.3/0.9      5.4/6.5
RoadCA     6.4/6.2     0.0/3.9     0.9/0.6      7.3/10.7
Selective Activation
Sync. & distributed activation
□ Merge update & activate messages
  1. Update the value of replicas
  2. Invite replicas to activate their neighbors
□ *Selective activation (e.g., ALS): extend the message to msg: v|m|s|l with an optional Activation_List
[Figure: same example as the Communication slide, with the extended message format]
Parallelism Improvement
Distributed immutable view opens an opportunity
□ Lock-free sending via private out-queues; messages sorted for good locality
□ Computation threads vs. communication threads can be configured separately
Existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph):
□ w/o dynamic comp. □ high contention □ hard to program □ duplicated edges □ heavy comm. cost
Cyclops(MT) → Distributed Immutable View:
□ w/ dynamic comp. □ no contention □ easy to program □ no duplicated edges □ low comm. cost
What’s Next?
BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partitioning algorithms designed for bipartite graphs and applications
□ Partitions graphs in a differentiated way and loads data according to data affinity
□ Outperforms PowerGraph with the default partitioning by up to 17.75X, and saves up to 96% of network traffic
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
Multicore Support
Two challenges
1. Two-level hierarchical organization
   → Preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize
   → High contention when buffering and parsing messages
   → Poor locality in message parsing
   → Asymmetric degree of parallelism between CPU and NIC