Efficient Graph Processing with Distributed Immutable View
Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
Institute of Parallel and Distributed Systems +
Department of Computer Science *
Shanghai Jiao Tong University
HPDC 2014
Big Data Everywhere
□ 100 hours of video uploaded every minute
□ 1.11 billion users
□ 6 billion photos
□ 400 million tweets per day
How do we understand and use Big Data?
Big Data → Big Learning
□ Machine Learning and Data Mining (e.g., NLP)
It's about the graphs ...
Example: PageRank
A centrality analysis algorithm that measures the relative rank of each element in a linked set.
Characteristics
□ Linked set → data dependence
□ Rank of who links it → local accesses
□ Convergence → iterative computation

$R_i = \alpha + (1-\alpha) \sum_{(j,i)\in E} \omega_{ij} R_j$
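The update rule above can be exercised with a small self-contained sketch (Python for brevity; the system in the talk is Java-based). The toy graph, the uniform weights ω_ij = 1/outdeg(j), and the iteration count are illustrative assumptions, not details from the talk.

```python
# Minimal PageRank sketch matching the slide's formula:
#   R_i = alpha + (1 - alpha) * sum_{(j,i) in E} w_ij * R_j,  with w_ij = 1/outdeg(j)
def pagerank(edges, num_vertices, alpha=0.15, iters=30):
    # edges: list of (src, dst) pairs; rank flows src -> dst
    out_deg = [0] * num_vertices
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0] * num_vertices
    for _ in range(iters):
        acc = [0.0] * num_vertices
        for s, d in edges:
            acc[d] += rank[s] / out_deg[s]   # w_ij * R_j
        rank = [alpha + (1 - alpha) * a for a in acc]
    return rank

# Tiny 3-vertex cycle: every vertex settles at the same rank.
r = pagerank([(0, 1), (1, 2), (2, 0)], 3)
```

On a symmetric cycle the fixed point is R = α + (1-α)·R, i.e., R = 1, which matches the iteration's result.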
Existing Graph-parallel Systems
"Think as a vertex" philosophy
1. aggregate the values of neighbors
2. update its own value
3. activate neighbors

compute(v)  // PageRank
  double sum = 0
  double value, last = v.get()
  foreach (n in v.in_nbrs)
    sum += n.value / n.nedges
  value = 0.15 + 0.85 * sum
  v.set(value)
  activate(v.out_nbrs)
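The compute(v) pseudocode above can be made runnable with a toy synchronous driver. The Vertex class and the driver loop are our own scaffolding, not an actual system API; values are updated in place for brevity.

```python
# "Think as a vertex": each active vertex aggregates its in-neighbors,
# updates itself, and activates its out-neighbors for the next superstep.
class Vertex:
    def __init__(self, vid):
        self.vid, self.value = vid, 1.0
        self.in_nbrs, self.out_nbrs = [], []

def compute(v, activate):
    # 1. aggregate values of in-neighbors (n.nedges on the slide = out-degree)
    total = sum(n.value / len(n.out_nbrs) for n in v.in_nbrs)
    # 2. update own value
    v.value = 0.15 + 0.85 * total
    # 3. activate out-neighbors
    for n in v.out_nbrs:
        activate(n)

def run_bsp(vertices, supersteps):
    active = set(vertices)
    for _ in range(supersteps):
        nxt = set()
        for v in list(active):
            compute(v, nxt.add)
        active = nxt
    return vertices

# Usage: a 3-vertex cycle settles at rank 1.0 for every vertex.
vs = [Vertex(i) for i in range(3)]
for i in range(3):
    vs[i].out_nbrs.append(vs[(i + 1) % 3])
    vs[(i + 1) % 3].in_nbrs.append(vs[i])
run_bsp(vs, 5)
```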
Existing Graph-parallel Systems
"Think as a vertex" philosophy
Execution engine
□ sync: BSP-like model
□ async: distributed scheduling queues
Communication
□ message passing: push value
□ distributed shared memory: sync & pull value

[Figure: comp./comm. timelines contrasting push, pull, and sync with a barrier]
Issues of Existing Systems
Pregel [SIGMOD'10]
→ Sync engine, Edge-cut + Message Passing
→ Issues: w/o dynamic comp., high contention
GraphLab [VLDB'12]
→ Async engine, Edge-cut + DSM (replicas)
→ Issues: high contention, hard to program, duplicated edges, heavy comm. cost
PowerGraph [OSDI'12]
→ (A)Sync engine, Vertex-cut + GAS (replicas)
→ Issues: high contention, heavy comm. cost

[Figure: per-system illustration of masters, replicas, and messages, showing duplicated edges and multiplied message cost per replica]
Contributions
Distributed Immutable View
□ Easy to program/debug
□ Support dynamic computation
□ Minimized communication cost (only 1 msg per replica)
□ Contention (comp. & comm.) immunity
Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improved parallelism and locality
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation
Observation
General idea: for most graph algorithms, a vertex only aggregates its neighbors' data in one direction and activates neighbors in the other direction
□ e.g., PageRank, SSSP, Community Detection, ...
Approach: local aggregation/update & distributed activation
□ Partitioning: avoid duplicated edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages
Graph Organization
Partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Create edges in only one direction (e.g., in-edges) → avoid duplicated edges
□ Create read-only replicas for edges spanning machines
[Figure: a sample graph partitioned across machines M1–M3, with master vertices and read-only replicas]
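The three bullets above can be sketched as one pass over the edge list. This assumes a hash-based edge-cut keyed on the destination vertex; the data layout (per-machine dicts of in-edge lists and replica sets) is illustrative, not the Cyclops representation.

```python
# Cyclops-style graph organization: hash-partition vertices, keep only
# in-edges locally, and create a read-only replica for each source vertex
# that lives on another machine (no duplicated edges).
def build_subgraphs(edges, num_machines):
    part = lambda v: v % num_machines          # hash-based edge-cut
    subs = [{"in_edges": [], "replicas": set()} for _ in range(num_machines)]
    for src, dst in edges:
        m = part(dst)                          # an in-edge lives with its target
        subs[m]["in_edges"].append((src, dst)) # only one direction: in-edges
        if part(src) != m:                     # source is remote:
            subs[m]["replicas"].add(src)       #   read-only replica of it
    return subs

# Usage: 4 edges over 2 machines; edge (0,1) spans machines, so machine 1
# gets a read-only replica of vertex 0.
subs = build_subgraphs([(0, 1), (1, 2), (2, 0), (3, 1)], 2)
```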
Vertex Computation
Local aggregation/update
□ Supports dynamic computation → one-way local semantics
□ Immutable view: read-only access to neighbors → eliminates contention on vertices

[Figure: local computation on M1–M3; replicas are accessed read-only]
Communication
Sync. & distributed activation
□ Merge update & activate messages:
  1. Update the value of replicas
  2. Invite replicas to activate their neighbors

[Figure: a master on M2 with per-vertex state (rlist: W1, l-act, value, msg) sends a merged message "v|m|s", e.g., "8 4 0", to its replicas on M1/M3]
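The merged message can be sketched as follows, reading the figure's "v|m|s" fields as value, aggregated message, and an activation flag; these field meanings and the helper names are our assumptions from the figure, not a documented format.

```python
# One merged message both updates a replica's value and tells it whether
# to activate its local neighbors, so no second activation message is needed.
def make_msg(value, agg_msg, activate):
    return (value, agg_msg, 1 if activate else 0)   # v | m | s

def apply_msg(replica, msg, local_out_nbrs, active_set):
    v, m, s = msg
    replica["value"] = v          # 1. update the replica's value
    replica["msg"] = m
    if s:                         # 2. replica activates its local neighbors
        active_set.update(local_out_nbrs)

# Usage: message "8 4 1" updates the replica and activates neighbors 2 and 3.
replica, active = {}, set()
apply_msg(replica, make_msg(8, 4, True), local_out_nbrs=[2, 3], active_set=active)
```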
Communication
Distributed activation
□ Unidirectional message passing (always master → replicas)
→ A replica will never be activated → contention immunity

[Figure: per-machine in-queues and out-queues; messages flow in one direction only]
Change of Execution Flow
Original execution flow (e.g., Pregel): receiving → parsing → computation → sending
→ High overhead of message parsing and high contention on shared message queues

[Figure: threads on M1 buffer and parse per-vertex messages arriving from M2/M3 before computation]
Change of Execution Flow
Execution flow on distributed immutable view: receiving is lock-free and there is no separate parsing phase, since messages update masters/replicas directly
→ Low overhead, no contention

[Figure: threads on M1 write received values straight into master/replica slots, then compute and send]
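Why receiving can be lock-free: each replica has exactly one writer (its master), so an incoming value can be written straight into a per-vertex slot with no shared in-queue and no parsing. A minimal threaded sketch with illustrative data, not the Cyclops implementation:

```python
import threading

# Each message targets a distinct replica slot owned by a single master,
# so concurrent receiver threads never write the same slot: no lock needed.
def receiver(replicas, inbox):
    for vid, value in inbox:
        replicas[vid] = value     # direct slot write, lock-free

replicas = [0.0, 0.0, 0.0]        # preallocated replica slots
inboxes = [[(0, 8.0), (1, 6.0)], [(2, 3.0)]]   # one inbox per receiver thread
threads = [threading.Thread(target=receiver, args=(replicas, ib)) for ib in inboxes]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The contrast with the original flow is that nothing here buffers or parses: the message payload lands directly in its final location.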
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation
Multicore Support
Two challenges
1. Two-level hierarchical organization → preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize → high contention to buffer and parse messages, and poor locality in message parsing
Hierarchical Model
Design principle
□ Three levels: iteration → worker → thread
□ Only the last-level participants perform actual tasks
□ Parents (i.e., higher-level participants) just wait until all their children finish their tasks

[Figure: Level-0 iteration loop with a global barrier; Level-1 workers and Level-2 threads synchronized by local barriers]
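The three-level hierarchy above can be sketched with nested barriers: threads meet a per-worker local barrier, then one representative per worker meets the global barrier. Worker/thread counts and the task callback are illustrative, not the Cyclops implementation.

```python
import threading

def run_iteration(num_workers, threads_per_worker, task, global_barrier_hits):
    global_barrier = threading.Barrier(num_workers)
    def worker(wid):
        local_barrier = threading.Barrier(threads_per_worker)
        def thread_body(tid):
            task(wid, tid)            # only leaf threads do real work
            local_barrier.wait()      # local barrier: wait for sibling threads
        ts = [threading.Thread(target=thread_body, args=(t,))
              for t in range(threads_per_worker)]
        for t in ts:
            t.start()
        for t in ts:
            t.join()
        global_barrier.wait()         # global barrier: workers sync per iteration
        global_barrier_hits.append(wid)
    ws = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for w in ws:
        w.start()
    for w in ws:
        w.join()

# Usage: 2 workers x 3 threads run 6 leaf tasks, then both workers cross
# the global barrier exactly once.
done, hits = [], []
run_iteration(2, 3, lambda w, t: done.append((w, t)), hits)
```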
Parallelism Improvement
The original BSP-like model is hard to parallelize
→ High contention when buffering and parsing messages in shared queues
→ Poor locality in message parsing, even with private out-queues

[Figure: receiving → parsing → computation → sending over shared in-queues and (private) out-queues on M1–M3]
Parallelism Improvement
Distributed immutable view opens an opportunity
→ Lock-free receiving: messages update masters/replicas in place, with no parsing phase
→ Private out-queues: no interference in sending
→ Sorted messages: good locality in receiving

[Figure: computation → sending → receiving pipeline across M1–M3 with per-thread private out-queues and a sorted message layout]
Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Implementation & Experiment
Implementation
Cyclops(MT)
□ Based on Apache Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provides a mostly compatible user interface
□ Graph ingress and partitioning
→ Compatible I/O interface
→ An additional phase to build replicas
□ Fault tolerance
→ Incremental checkpoint
→ Replication-based FT [DSN'14]
Experiment Settings
Platform
□ 6 x 12-core AMD Opteron (64GB RAM, 1GigE NIC)
Graph algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single-Source Shortest Path (SSSP)
Workload
□ 6 real-world datasets from SNAP¹ and 1 synthetic dataset from GraphLab² (7 in total)

Dataset     |V|     |E|
Amazon      0.4M    3.4M
GWeb        0.9M    5.1M
LJournal    4.8M    69M
Wiki        5.7M    130M
SYN-GL      0.1M    2.7M
DBLP        0.3M    1.0M
RoadCA      1.9M    5.5M

¹ http://snap.stanford.edu/data/
² http://graphlab.org
Overall Performance Improvement

[Figure: normalized speedup over Hama of Cyclops and CyclopsMT for PageRank, ALS, CD, and SSSP (push mode) on all datasets; annotated speedups of 8.69X and 2.06X; 48 workers, and 6 workers with 8 threads each for CyclopsMT]
Performance Scalability

[Figure: normalized speedup of Hama, Cyclops, and CyclopsMT at 6/12/24/48 workers (threads for CyclopsMT) on each dataset; peak annotated speedup 50.2]
Performance Breakdown

[Figure: ratio of execution time spent in PARSE, SEND, COMP, and SYNC for PageRank, ALS, CD, and SSSP on each dataset; plus per-iteration message count (K) and active-vertex count (K) for Hama, Cyclops, and CyclopsMT]
Comparison with PowerGraph¹

[Figure: execution time (sec) and message count (M) of CyclopsMT vs. PowerGraph on Amazon, GWeb, LJournal, and Wiki]

Dataset     COMP%
Amazon      11%
GWeb        15%
LJournal    25%
Wiki        39%

Cyclops-like engine on the GraphLab¹ platform (C++ & Boost RPC lib.)
Preliminary results
[Figure: execution time (sec) on regular vs. natural graphs²]

¹ http://graphlab.org
² synthetic 10-million-vertex regular (even-degree) and power-law (α = 2.0) graphs
Conclusion
Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserves the synchronous and deterministic computation nature (easy to program/debug)
□ Provides efficient vertex computation with significantly fewer messages and contention immunity via the distributed immutable view
□ Further supports multicore-based clusters with a hierarchical processing model and high parallelism
Source code: http://ipads.se.sjtu.edu.cn/projects/cyclops

Questions?
Thanks
Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html
IPADS: Institute of Parallel and Distributed Systems
What's Next?
PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperforms PowerGraph by up to 3.26X for natural graphs
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

[Figure: preliminary results; execution time (sec) of PowerLyra, PowerGraph, and Cyclops on regular (R) and natural (N) graphs]

Power-law: "most vertices have relatively few neighbors while a few have many neighbors"
Generality
Algorithms that aggregate/activate all neighbors
□ e.g., Community Detection (CD)
□ Transform to an undirected graph by duplicating edges
□ Still aggregate in one direction (e.g., in-edges) and activate in the other direction (e.g., out-edges)
□ Preserves all benefits of Cyclops → 1 msg per replica, contention immunity, and good locality

[Figure: the undirected graph and its sub-graphs on M1–M3, with masters and replicas]
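The transformation in the bullets above can be sketched as explicit edge duplication; the edge-set representation is an assumption for illustration.

```python
# For algorithms that touch all neighbors, make the graph undirected by
# duplicating each edge in the reverse direction; the engine then still
# aggregates along in-edges only and activates along out-edges.
def to_undirected(edges):
    und = set()
    for s, d in edges:
        und.add((s, d))
        und.add((d, s))   # duplicated reverse edge
    return sorted(und)

# Usage: a 2-edge path becomes 4 directed edges.
und = to_undirected([(0, 1), (1, 2)])
```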
Generality
Differences between Cyclops and GraphLab
1. How to construct the local sub-graph
2. How to aggregate/activate neighbors

[Figure: local sub-graph construction in Cyclops vs. GraphLab on M1–M3]
Improvement of CyclopsMT

[Figure: execution time (sec), split into SEND, COMP, and SYNC, for Cyclops and CyclopsMT under configurations MxWxT/R, where M = #machines, W = #workers, T = #threads, R = #receivers; e.g., 6x1x1/1 through 6x1x8/8]
Communication Efficiency

[Figure: SEND and PARSE execution time (sec, log scale) of Hama vs. Cyclops for 5M, 25M, and 50M messages; annotations include speedups of 25.6X, 16.2X, and 12.6X and parse-time shares of 55.6%, 25.0%, and 31.5%]

message: (id, data)
Hama: send + buffer + parse (contention), Hadoop RPC lib (Java)
PowerGraph: send + update (contention), Boost RPC lib (C++)
Cyclops: send + update, Hadoop RPC lib (Java)
Using Heuristic Edge-cut (i.e., Metis)

[Figure: normalized speedup over Hama of Cyclops and CyclopsMT for PageRank, ALS, CD, and SSSP on all datasets; annotated speedups of 23.04X and 5.95X; 48 workers, and 6 workers with 8 threads each]
Memory Consumption¹ per Worker (PageRank with the Wiki dataset)

Configuration    Max Cap (GB)   Max Usage (GB)   Young GC² (#)   Full GC² (#)
Hama/48          1.7            1.5              132             69
Cyclops/48       4.0            3.0              45              15
CyclopsMT/6x8    12.6/8         11.0/8           268/8           32/8

¹ measured with jStat
² GC: Concurrent Mark-Sweep

Ingress Time (H = Hama, C = Cyclops)

Dataset     LD (H/C)      REP (H/C)    INIT (H/C)    TOT (H/C)
Amazon      6.2 / 5.9     0.0 / 2.5    1.7 / 1.5     7.9 / 9.9
GWeb        7.1 / 6.8     0.0 / 2.8    2.6 / 1.9     9.7 / 11.4
LJournal    27.1 / 31.0   0.0 / 44.7   17.9 / 9.2    45.0 / 84.9
Wiki        46.7 / 46.7   0.0 / 62.2   33.4 / 20.4   80.0 / 129.3
SYN-GL      4.2 / 4.0     0.0 / 2.6    2.4 / 1.8     6.6 / 8.4
DBLP        4.1 / 4.1     0.0 / 1.5    1.3 / 0.9     5.4 / 6.5
RoadCA      6.4 / 6.2     0.0 / 3.9    0.9 / 0.6     7.3 / 10.7
Selective Activation
Sync. & distributed activation
□ Merge update & activate messages:
  1. Update the value of replicas
  2. Invite replicas to activate their neighbors
□ Selective activation (e.g., ALS): the message carries an activation list (msg: v|m|s|l; option: Activation_List)

[Figure: a master on M2 sends merged messages "v|m|s" (e.g., "8 4 0"), extended to "v|m|s|l" under selective activation, to its replicas on M1/M3]
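The extended message can be sketched as follows, reading "v|m|s|l" as value, aggregated message, activation flag, and activation list; this field layout is our reading of the slide, not a documented format.

```python
# Selective activation: instead of activating all local neighbors, the
# replica activates only those in the message's activation list.
def apply_selective(replica, msg, active_set):
    v, m, s, act_list = msg            # v | m | s | l
    replica["value"] = v
    replica["msg"] = m
    if s:
        active_set.update(act_list)    # only the listed neighbors wake up

# Usage: activate only neighbor 2, even if the replica has more neighbors.
replica, active = {}, set()
apply_selective(replica, (8, 4, 1, [2]), active)
```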
Parallelism Improvement
Distributed immutable view opens an opportunity
→ Lock-free receiving with sorted messages → good locality
→ Computation threads vs. communication threads can be configured separately

[Figure: computation → sending → receiving pipeline with per-thread private out-queues]

Cyclops vs. existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph)
→ Existing systems: w/o dynamic comp., high contention, hard to program, duplicated edges, heavy comm. cost
→ Cyclops(MT) with the distributed immutable view: w/ dynamic comp., no contention, easy to program, no duplicated edges, low comm. cost
What's Next?
BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partitioning algorithms designed for bipartite graphs and their applications
□ Partitions graphs in a differentiated way and loads data according to data affinity
□ Outperforms PowerGraph with its default partitioning by up to 17.75X, and saves up to 96% of network traffic
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
Multicore Support
Two challenges
1. Two-level hierarchical organization → preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize → high contention to buffer and parse messages, poor locality in message parsing, and an asymmetric degree of parallelism between CPU and NIC