GRAPH PROCESSING
Why Graph Processing?
Graphs are everywhere!
Why Distributed Graph Processing?
They are getting bigger!
Road Scale
>24 million vertices, >58 million edges
*Route Planning in Road Networks - 2008
Social Scale
>1 billion vertices, ~1 trillion edges
*Facebook Engineering Blog
~41 million vertices, >1.4 billion edges
*Twitter Graph - 2010
Web Scale
>50 billion vertices, >1 trillion edges
*NSA Big Graph Experiment - 2013
Brain Scale
>100 billion vertices, >100 trillion edges
*NSA Big Graph Experiment - 2013
CHALLENGES IN PARALLEL GRAPH PROCESSING
Lumsdaine, Andrew, et al. "Challenges in Parallel Graph Processing." Parallel Processing Letters 17.01 - 2007.
Challenges
1 Structure-driven computation → data transfer issues
2 Irregular structure → partitioning issues
*Concept borrowed from Cristina Abad’s PhD defense slides
Overcoming the challenges
1 Extend Existing Paradigms
2 BUILD NEW FRAMEWORKS!
Build New Graph Frameworks!
Key Requirements from Graph Processing Frameworks
1 Less pre-processing
2 Low and load-balanced computation
3 Low and load-balanced communication
4 Low memory footprint
5 Scalable with respect to cluster size and graph size
PREGEL
Malewicz, Grzegorz, et al. "Pregel: A System for Large-Scale Graph Processing." ACM SIGMOD - 2010.
Life of a Vertex Program
[Figure: timeline of a vertex program in Pregel: placement of vertices, then iterations 0, 1, … separated by barriers, with computation and communication overlapping within each iteration]
*Concept borrowed from LFGraph Slides
Sample Graph
[Figure: sample graph with vertices A, B, C, D, E distributed across two servers]
*Graph Borrowed from LFGraph Paper
Shortest Path Example
Iteration 0: B=0; A, C, D, E = ∞. B sends message (0+1) to A.
Iteration 1: B=0, A=1; C, D, E = ∞. A sends message (1+1) to D and E.
Iteration 2: B=0, A=1, D=2, E=2; C = ∞.
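The walkthrough above is exactly what a Pregel vertex program computes each superstep. Below is a minimal sketch in Python of an SSSP compute function; the names `vertex.value`, `send_message`, and `vote_to_halt` mirror Pregel's model but are illustrative assumptions, not Pregel's actual C++ API.

```python
# Minimal sketch of a Pregel-style SSSP vertex program (illustrative API,
# not Pregel's actual C++ interface). Edge weights are 1, matching the
# B -> A -> {D, E} walkthrough above.
INF = float("inf")

def compute(vertex, messages):
    # Smallest distance proposed by in-neighbors this superstep
    candidate = min(messages, default=INF)
    if vertex.superstep == 0:
        candidate = 0 if vertex.is_source else INF
    if candidate < vertex.value:
        vertex.value = candidate
        for neighbor in vertex.out_neighbors:
            # Messages to remote neighbors cross edge cuts -> communication
            vertex.send_message(neighbor, vertex.value + 1)
    vertex.vote_to_halt()  # reactivated only if a message arrives later
```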
Can we do better?

GOAL           | PREGEL
Computation    | 1 pass
Communication  | ∝ #edge cuts
Pre-processing | Cheap (hash)
Memory         | High (out-edges + buffered messages)
LFGRAPH – YES, WE CAN!
Hoque, Imranul, and Indranil Gupta. "LFGraph: Simple and Fast Distributed Graph Analytics." TRIOS - 2013.
Features
Cheap Vertex Placement: Hash Based
Low graph initialization time
Features
Publish-subscribe, fetch-once information flow
Low communication overhead
Subscribe
Subscribing to vertex A
Publish
Publish List of Server 1: (Server2, A)
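A hedged sketch of how these subscription/publish lists could be derived during initialization; the server assignment and data structures are my assumptions for illustration, not LFGraph's exact code:

```python
# Sketch: deriving LFGraph-style publish lists (illustrative data
# structures). If any vertex on server S has remote vertex u as an
# in-neighbor, S subscribes to u, and u's owner adds (S, u) to its
# publish list -- so u's value is fetched once per server, not per edge.

def server_of(v, num_servers):
    return hash(v) % num_servers        # cheap hash-based placement

def build_publish_lists(in_neighbors, num_servers):
    publish = {s: set() for s in range(num_servers)}
    for v, preds in in_neighbors.items():
        for u in preds:
            su, sv = server_of(u, num_servers), server_of(v, num_servers)
            if su != sv:
                publish[su].add((sv, u))   # e.g. Server 1: (Server 2, A)
    return publish
```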
LFGraph Model
[Figure: the value of A is sent once from its owning server to the subscribing server]
Features
Only stores in-neighbor vertices
Reduces memory footprint
In-neighbor storage
Local in-neighbor – simply read the value
Remote in-neighbor – read the locally available copy
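In code, both cases reduce to purely local reads; a tiny sketch (the `local_values` / `remote_copies` naming is mine, not LFGraph's):

```python
# Sketch of the gather-time read path (illustrative names). Remote
# in-neighbor values were already delivered to a local duplicate store
# during the previous communication phase, so no network read happens here.
def read_in_neighbor(u, local_values, remote_copies):
    if u in local_values:
        return local_values[u]      # local in-neighbor: direct read
    return remote_copies[u]         # remote in-neighbor: local copy
```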
Iteration 0: B=0; A, C, D, E = ∞.
Iteration 1: A reads value 0 (from B) and ∞ from its other in-neighbors; A becomes 1. The value change is propagated to A's duplicate store on the subscribing server.
Iteration 2: D and E read A=1 via local reads of the duplicate store and become D=2, E=2.
Features
Single Pass Computation
Low computation overhead
Life of a Vertex Program
[Figure: timeline of a vertex program in LFGraph: placement of vertices, then iterations 0, 1, … separated by barriers, each iteration a computation phase followed by a separate communication phase]
*Concept Borrowed from LFGraph Slides
How Everything Works
GRAPHLAB
Low, Yucheng, et al. "GraphLab: A New Framework for Parallel Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI) - 2010.
GraphLab Model
[Figure: the sample graph split across two servers, with ghost copies of A, D, and E replicated on the remote server]
Can we do better?

GOAL           | GRAPHLAB
Computation    | 2 passes
Communication  | ∝ #vertex ghosts
Pre-processing | Cheap (hash)
Memory         | High (in- & out-edges + ghost values)
POWERGRAPH
Gonzalez, Joseph E., et al. "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." USENIX OSDI - 2012.
PowerGraph Model
[Figure: the sample graph with vertex A split into mirrors A1 and A2, one per server]
Can we do better?

GOAL           | POWERGRAPH
Computation    | 2 passes
Communication  | ∝ #vertex mirrors
Pre-processing | Expensive (intelligent)
Memory         | High (in- & out-edges + mirror values)
Communication Analysis
• Pregel: external edge cuts
• GraphLab: ghost vertices (in- and out-neighbors)
• PowerGraph: mirrors (in- and out-neighbors)
• LFGraph: external in-neighbors
Computation Balance Analysis
• Power-law graphs have substantial load imbalance.
• Power-law graphs have degree d with probability proportional to d^(-α).
• Lower α means a denser graph with more high-degree vertices.
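A quick, purely illustrative simulation of that claim (not from the slides; the α values and sizes are arbitrary):

```python
# Sketch: sample vertex degrees with P(d) proportional to d^(-alpha).
# Lower alpha gives a heavier tail (more very-high-degree vertices),
# which is what skews per-server computation load.
import random

def sample_degrees(alpha, n, d_max=10**5):
    degrees = range(1, d_max + 1)
    weights = [d ** -alpha for d in degrees]
    return random.choices(degrees, weights=weights, k=n)

for alpha in (1.8, 2.2):
    degs = sample_degrees(alpha, 50_000)
    print(f"alpha={alpha}: max degree={max(degs)}, mean={sum(degs)/len(degs):.1f}")
```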
Real World vs Power Law
Communication Balance Analysis
PageRank – Runtime without Partitioning
PageRank – Runtime with Partitioning
PageRank – Memory footprint
PageRank – Network Communication
Scalability
X-STREAM
Roy, Amitabha, et al. "X-Stream: Edge-centric Graph Processing using Streaming Partitions." ACM SOSP - 2013.
*Some figures adapted from the authors' presentation
Motivation
• Can sequential access be used instead of random access?!
• Can large graph processing be done on a single machine?!
→ X-Stream
Sequential Access: Key to Performance!
Speedup of sequential over random access in different media:

Medium        | Read: Random | Read: Sequential | Speedup | Write: Random | Write: Sequential | Speedup
RAM (1 core)  | 567 MB/s     | 2605 MB/s        | 4.6x    | 1057 MB/s     | 2248 MB/s         | 2.2x
RAM (16 core) | 14198 MB/s   | 25658 MB/s       | 1.9x    | 10044 MB/s    | 13384 MB/s        | 1.4x
SSD           | 22.5 MB/s    | 667.69 MB/s      | 29.7x   | 48.6 MB/s     | 576.5 MB/s        | 11.9x
Magnetic Disk | 0.6 MB/s     | 328 MB/s         | 546.7x  | 2 MB/s        | 316.3 MB/s        | 158.2x

Test bed: 64 GB RAM + 200 GB SSD + 3 TB magnetic drive
How to Use Sequential Access?
Sequential access …
Edge-Centric Processing
Vertex-Centric Scatter
for each vertex v:
    if v's state has updated:
        for each output edge e of v:
            scatter update on e
[Figure: vertex U{state} emits updates u1, u2, …, un along its out-edges]
Vertex-Centric Gather
for each vertex v:
    for each input edge e of v:
        if e has an update:
            apply update on v's state
[Figure: updates u1, u2, …, un on in-edges update V{state} to V{state2}]
BFS, Vertex-Centric
[Figure: example graph with vertices 1–8 and its edge list sorted by source, with a lookup index over the vertex array]
Edge list (SOURCE → DEST): 1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6
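The index is the problem: finding an updated vertex's out-edges means sorting the edge list by source and doing random-access lookups. A small illustrative sketch (not X-Stream code), using the edge list from the slide:

```python
# Sketch: vertex-centric scatter needs the edge list sorted by source plus
# an index, and every lookup is a random access.
import bisect

edges = sorted([(1, 3), (1, 5), (2, 7), (2, 4), (3, 2), (3, 8), (4, 3),
                (4, 7), (4, 8), (5, 6), (6, 1), (8, 5), (8, 6)])
sources = [s for s, _ in edges]          # lookup index over sorted sources

def out_edges(v):
    lo = bisect.bisect_left(sources, v)  # random access into the index
    return edges[lo:bisect.bisect_right(sources, v)]

print(out_edges(4))                      # [(4, 3), (4, 7), (4, 8)]
```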
Edge-Centric Scatter
for each edge e:
    if e.source has updated:
        scatter update on e
[Figure: edges from updated vertices emit updates u1, …, un]
Edge-Centric Gather
for each update u on edge e:
    apply update u to e.destination
[Figure: updates u1, u2, …, un are applied to their destination vertices]
Sequential Access via Edge-Centric!
Fast and Slow Storage: the vertex state lives in fast storage; the edge stream and update stream live in slow storage and are read and written sequentially.
BFS, Edge-Centric
[Figure: the same example graph and edge list; every iteration streams the full edge list, touching edges whose sources carry no update]
Lots of wasted reads!
Most real-world graphs have a small diameter.
A large diameter makes X-Stream slow and wasteful.
Edge list: 1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6
=
Edge list: 1→3, 8→6, 5→6, 2→4, 3→2, 4→7, 4→3, 3→8, 4→8, 2→7, 6→1, 8→5, 1→5
Order is not important!
No pre-processing (sorting and indexing) needed!
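To make the contrast concrete, here is a runnable edge-centric BFS sketch over the slide's unsorted edge list; plain Python lists stand in for X-Stream's streamed buffers:

```python
# Sketch: edge-centric BFS over an UNSORTED edge list (the slide's graph).
# Every iteration streams all edges sequentially -- no sort, no index --
# scattering updates only from vertices whose level changed last round.
INF = float("inf")
edges = [(1, 3), (8, 6), (5, 6), (2, 4), (3, 2), (4, 7), (4, 3), (3, 8),
         (4, 8), (2, 7), (6, 1), (8, 5), (1, 5)]   # any order works

level = {v: INF for v in range(1, 9)}
level[1] = 0
frontier = {1}
while frontier:
    # Scatter: one sequential pass over every edge (wasted reads included)
    updates = [(dst, level[src] + 1) for src, dst in edges if src in frontier]
    frontier = set()
    for dst, lvl in updates:            # Gather: apply updates to vertices
        if lvl < level[dst]:
            level[dst] = lvl
            frontier.add(dst)
print(level)   # BFS levels from vertex 1
```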
But, still …
• Random access for vertices
• Vertices may not fit into fast storage
Streaming Partitions
• V = a subset of vertices (vertex sets are mutually disjoint across partitions and constant)
• E = outgoing edges of V (constant set)
• U = incoming updates to V (refilled in each scatter phase)
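A sketch of what building these partitions might look like; the hash assignment is an assumption for illustration, since X-Stream only requires disjoint vertex sets that fit in fast memory:

```python
# Sketch: building k streaming partitions. Vertex sets are disjoint and
# constant; each partition's edge list holds the out-edges of its vertices
# (also constant); its update list U is refilled during every scatter phase.
def make_partitions(vertices, edges, k):
    part_of = {v: hash(v) % k for v in vertices}
    parts = [{"V": set(), "E": [], "U": []} for _ in range(k)]
    for v in vertices:
        parts[part_of[v]]["V"].add(v)
    for src, dst in edges:
        parts[part_of[src]]["E"].append((src, dst))   # edges follow source
    return parts, part_of
```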
Scatter and Shuffle
[Figure: for a partition (V1, E1, U1): load its vertex set v1, v2, v3, … into fast memory; stream edges e1, e2, e3, … from the input buffer, read each edge's source vertex, and append updates u'1, u'2, u'3, … to an output buffer; shuffle the output buffer into a stream buffer with k partitions, appending each update to its destination partition's update list]
Gather
[Figure: for a partition (V1, E1, U1): load its vertex set into fast memory; stream updates u1, u2, u3, … from the update buffer and apply each one to its destination vertex. No output is produced]
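Putting scatter, shuffle, and gather together, one iteration might look like the following sketch; in-memory lists stand in for the streamed disk buffers, and `scatter_fn`/`gather_fn` are placeholders for the algorithm's per-edge and per-update functions:

```python
# Sketch of one X-Stream iteration over the partitions built above.
# Lists stand in for sequentially streamed disk buffers.
def iterate(parts, part_of, state, scatter_fn, gather_fn):
    buffers = [[] for _ in parts]               # k-way stream buffer
    for p in parts:                             # scatter + shuffle
        # p's vertex set is assumed loaded into fast memory here
        for src, dst in p["E"]:                 # sequential edge stream
            upd = scatter_fn(state[src], dst)   # read source vertex state
            if upd is not None:
                buffers[part_of[dst]].append((dst, upd))  # shuffle
    for p, buf in zip(parts, buffers):          # gather: no output
        p["U"] = buf                            # sequential update stream
        for dst, upd in p["U"]:
            state[dst] = gather_fn(state[dst], upd)
```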
Parallelism
• State is stored in vertices
• Partitions have disjoint vertex sets
→ Compute partitions in parallel: parallel scatter and gather
Experimental Results
X-Stream Speedup over Graphchi
[Chart: speedup (0–6x) on Netflix/ALS, Twitter/Pagerank, Twitter/Belief Propagation, RMAT27/WCC]
Mean speedup = 2.3
Speedup without counting Graphchi's pre-processing time
X-Stream Speedup over Graphchi
[Chart: speedup (0–6x) on the same four benchmarks]
Mean speedup = 3.7
Speedup counting Graphchi's pre-processing time
X-Stream Runtime vs Graphchi Sharding
[Chart: time in seconds (0–3000) per benchmark (Netflix/ALS, Twitter/Pagerank, Twitter/Belief Propagation, RMAT27/WCC): Graphchi sharding time vs. X-Stream total runtime]
Disk Transfer Rates

Metric        | X-Stream | Graphchi
Data moved    | 224 GB   | 322 GB
Time taken    | 398 s    | 2613 s
Transfer rate | 578 MB/s | 126 MB/s

SSD sustained rates: reads 667 MB/s, writes 576 MB/s
Data transfer rates for PageRank on the Twitter workload
Scalability on Input Data Size
[Chart: runtime (HH:MM:SS, log scale) vs. input size from 384 MB to 1.5 TB, across RAM, SSD, and magnetic disk]
• 8 million V, 128 million E: 8 seconds
• 256 million V, 4 billion E: 33 minutes
• 4 billion V, 64 billion E: 26 hours
Discussion
• Features like global values, aggregation functions, and asynchronous computation are missing from LFGraph. Will the overhead of adding them slow it down?
• LFGraph assumes all edge values are the same. If they are not, either the receiving vertices or the server must incorporate the edge value. What are the overheads?
• LFGraph uses one-pass computation, but it executes the vertex program at every vertex, active or inactive. What is the trade-off?
Discussion
• Independent computation and communication rounds may not always be preferred; why not use bandwidth when it is available?
• Fault tolerance is another feature missing from LFGraph. Overheads?
• Only three benchmarks are used in the experiments. Is that enough evaluation?
• Scalability is compared against Pregel under different experimental settings, and memory against PowerGraph using heap values from logs. Are these fair experiments?
Discussion
• Could the system become asynchronous?
• Could the scatter and gather phases be combined into one?
• X-Stream does not support iterating over the edges/updates of a single vertex. Can this be added?
• How well do they determine the number of partitions?
• Can the shuffle be optimized by counting each partition's updates during scatter?
Thank you for listening!
Questions?
Backup Slides
Reason for Improvement
Qualitative Comparison
GOAL           | PREGEL                               | GRAPHLAB                              | POWERGRAPH                             | LFGRAPH
Computation    | 2 passes, combiners                  | 2 passes                              | 2 passes                               | 1 pass
Communication  | ∝ #edge cuts                         | ∝ #vertex ghosts                      | ∝ #vertex mirrors                      | ∝ #external in-neighbors
Pre-processing | Cheap (hash)                         | Cheap (hash)                          | Expensive (intelligent)                | Cheap (hash)
Memory         | High (out-edges + buffered messages) | High (in- & out-edges + ghost values) | High (in- & out-edges + mirror values) | Low (in-edges + remote values)
Backup Slides
Read Bandwidth - SSD
[Chart: read bandwidth (0–1000 MB/s) over a 5-minute window, X-Stream vs. Graphchi]
Write Bandwidth - SSD
[Chart: write bandwidth (0–800 MB/s) over a 5-minute window, X-Stream vs. Graphchi]
Scalability on Thread Count
Scalability on Number of I/O Devices
Sharding-Computing Breakdown in Graphchi
[Chart: fraction of Graphchi runtime (0–0.9) spent on compute + I/O vs. re-sorting shards, per benchmark (Netflix/ALS, Twitter/Pagerank, Twitter/Belief Propagation, RMAT27/WCC)]
X-Stream not Always Perfect
Large diameter makes X-Stream slow!
In-Memory X-Stream Performance
[Chart: runtime in seconds (lower is better) vs. thread count (1–16) for BFS on 32M vertices / 256M edges: X-Stream vs. BFS-1 [HPC 2010] and BFS-2 [PACT 2011]]
Ligra vs. X-Stream
Discussion
• The current implementation runs on a single machine; can it be extended to clusters?
  – Would it still perform well?
  – How would fault tolerance and synchronization be provided?
• The waste rate is high (~65%). Could this be improved?
• Can partitioning be more intelligent? Dynamic partitioning?
• Could all vertex-centric programs be converted to edge-centric ones?
• When does streaming outperform random access?