GRAPH PROCESSING
Why Graph Processing?
Graphs are everywhere!
Why Distributed Graph Processing?
They are getting bigger!
Road Scale
>24 million vertices, >58 million edges
*Route Planning in Road Networks - 2008
Social Scale
>1 billion vertices, ~1 trillion edges
*Facebook Engineering Blog
~41 million vertices, >1.4 billion edges
*Twitter Graph - 2010
Web Scale
>50 billion vertices, >1 trillion edges
*NSA Big Graph Experiment - 2013
Brain Scale
>100 billion vertices, >100 trillion edges
*NSA Big Graph Experiment - 2013
CHALLENGES IN PARALLEL GRAPH PROCESSING
Lumsdaine, Andrew, et al. "Challenges in Parallel Graph Processing." Parallel Processing Letters 17.01 - 2007.
Challenges
1 Structure-driven computation → data transfer issues
2 Irregular structure → partitioning issues
*Concept borrowed from Cristina Abad’s PhD defense slides
Overcoming the challenges
1 Extend Existing Paradigms
2 BUILD NEW FRAMEWORKS!
Build New Graph Frameworks!
Key Requirements from Graph Processing Frameworks
1 Less pre-processing
2 Low and load-balanced computation
3 Low and load-balanced communication
4 Low memory footprint
5 Scalable with respect to cluster size and graph size
PREGEL
Malewicz, Grzegorz, et al. "Pregel: A System for Large-Scale Graph Processing." ACM SIGMOD - 2010.
Life of a Vertex Program
[Figure: timeline of a vertex program in Pregel: placement of vertices, then iterations 0, 1, … separated by barriers, with computation and communication overlapping within each iteration]
*Concept borrowed from LFGraph Slides
Sample Graph
[Figure: sample graph with vertices A, B, C, D, E distributed across two servers]
*Graph Borrowed from LFGraph Paper
Shortest Path Example
Iteration 0: B=0; A, C, D, E = ∞. B sends message (0+1) to A.
Iteration 1: B=0, A=1; C, D, E = ∞. A sends message (1+1) to D and E.
Iteration 2: B=0, A=1, D=2, E=2; C = ∞.
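The walkthrough above is exactly what a Pregel vertex program computes each superstep. Below is a minimal sketch in Python of an SSSP compute function; the names `vertex.value`, `send_message`, and `vote_to_halt` mirror Pregel's model but are illustrative assumptions, not Pregel's actual C++ API.

```python
# Minimal sketch of a Pregel-style SSSP vertex program (illustrative API,
# not Pregel's actual C++ interface). Edge weights are 1, matching the
# B -> A -> {D, E} walkthrough above.
INF = float("inf")

def compute(vertex, messages):
    # Smallest distance proposed by in-neighbors this superstep
    candidate = min(messages, default=INF)
    if vertex.superstep == 0:
        candidate = 0 if vertex.is_source else INF
    if candidate < vertex.value:
        vertex.value = candidate
        for neighbor in vertex.out_neighbors:
            # Messages to remote neighbors cross edge cuts -> communication
            vertex.send_message(neighbor, vertex.value + 1)
    vertex.vote_to_halt()  # reactivated only if a message arrives later
```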
Can we do better?

GOAL           | PREGEL
Computation    | 1 pass
Communication  | ∝ #edge cuts
Pre-processing | Cheap (hash)
Memory         | High (out-edges + buffered messages)
LFGRAPH – YES, WE CAN!
Hoque, Imranul, and Indranil Gupta. "LFGraph: Simple and Fast Distributed Graph Analytics." TRIOS - 2013.
Features
Cheap Vertex Placement: Hash Based
Low graph initialization time
Features
Publish-subscribe, fetch-once information flow
Low communication overhead
Subscribe
Subscribing to vertex A
Publish
Publish List of Server 1: (Server2, A)
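A hedged sketch of how these subscription/publish lists could be derived during initialization; the server assignment and data structures are my assumptions for illustration, not LFGraph's exact code:

```python
# Sketch: deriving LFGraph-style publish lists (illustrative data
# structures). If any vertex on server S has remote vertex u as an
# in-neighbor, S subscribes to u, and u's owner adds (S, u) to its
# publish list -- so u's value is fetched once per server, not per edge.

def server_of(v, num_servers):
    return hash(v) % num_servers        # cheap hash-based placement

def build_publish_lists(in_neighbors, num_servers):
    publish = {s: set() for s in range(num_servers)}
    for v, preds in in_neighbors.items():
        for u in preds:
            su, sv = server_of(u, num_servers), server_of(v, num_servers)
            if su != sv:
                publish[su].add((sv, u))   # e.g. Server 1: (Server 2, A)
    return publish
```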
LFGraph Model
[Figure: the value of A is sent once from its owning server to the subscribing server]
Features
Only stores in-neighbor vertices
Reduces memory footprint
In-neighbor storage
Local in-neighbor – simply read the value
Remote in-neighbor – read the locally available copy
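In code, both cases reduce to purely local reads; a tiny sketch (the `local_values` / `remote_copies` naming is mine, not LFGraph's):

```python
# Sketch of the gather-time read path (illustrative names). Remote
# in-neighbor values were already delivered to a local duplicate store
# during the previous communication phase, so no network read happens here.
def read_in_neighbor(u, local_values, remote_copies):
    if u in local_values:
        return local_values[u]      # local in-neighbor: direct read
    return remote_copies[u]         # remote in-neighbor: local copy
```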
Iteration 0: B=0; A, C, D, E = ∞.
Iteration 1: A reads value 0 (from B) and ∞ from its other in-neighbors; A becomes 1. The value change is propagated to A's duplicate store on the subscribing server.
Iteration 2: D and E read A=1 via local reads of the duplicate store and become D=2, E=2.
Features
Single Pass Computation
Low computation overhead
Life of a Vertex Program
[Figure: timeline of a vertex program in LFGraph: placement of vertices, then iterations 0, 1, … separated by barriers, each iteration a computation phase followed by a separate communication phase]
*Concept Borrowed from LFGraph Slides
How Everything Works
GRAPHLAB
Low, Yucheng, et al. "GraphLab: A New Framework for Parallel Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI) - 2010.
GraphLab Model
[Figure: the sample graph split across two servers, with ghost copies of A, D, and E replicated on the remote server]
Can we do better?

GOAL           | GRAPHLAB
Computation    | 2 passes
Communication  | ∝ #vertex ghosts
Pre-processing | Cheap (hash)
Memory         | High (in- & out-edges + ghost values)
POWERGRAPH
Gonzalez, Joseph E., et al. "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." USENIX OSDI - 2012.
PowerGraph Model
[Figure: the sample graph with vertex A split into mirrors A1 and A2, one per server]
Can we do better?

GOAL           | POWERGRAPH
Computation    | 2 passes
Communication  | ∝ #vertex mirrors
Pre-processing | Expensive (intelligent)
Memory         | High (in- & out-edges + mirror values)
Communication Analysis
• Pregel: external edge cuts
• GraphLab: ghost vertices (in- and out-neighbors)
• PowerGraph: mirrors (in- and out-neighbors)
• LFGraph: external in-neighbors
Computation Balance Analysis
• Power-law graphs have substantial load imbalance.
• Power-law graphs have degree d with probability proportional to d^(-α).
• Lower α means a denser graph with more high-degree vertices.
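A quick, purely illustrative simulation of that claim (not from the slides; the α values and sizes are arbitrary):

```python
# Sketch: sample vertex degrees with P(d) proportional to d^(-alpha).
# Lower alpha gives a heavier tail (more very-high-degree vertices),
# which is what skews per-server computation load.
import random

def sample_degrees(alpha, n, d_max=10**5):
    degrees = range(1, d_max + 1)
    weights = [d ** -alpha for d in degrees]
    return random.choices(degrees, weights=weights, k=n)

for alpha in (1.8, 2.2):
    degs = sample_degrees(alpha, 50_000)
    print(f"alpha={alpha}: max degree={max(degs)}, mean={sum(degs)/len(degs):.1f}")
```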
Real World vs Power Law
Communication Balance Analysis
PageRank – Runtime without Partitioning
PageRank – Runtime with Partitioning
PageRank – Memory footprint
PageRank – Network Communication
Scalability
X-STREAM
Roy, Amitabha, et al. "X-Stream: Edge-centric Graph Processing using Streaming Partitions." ACM SOSP - 2013.
*Some figures adapted from the authors' presentation
Motivation
• Can sequential access be used instead of random access?!
• Can large graph processing be done on a single machine?!
→ X-Stream
Sequential Access: Key to Performance!
Speedup of sequential over random access in different media:

Medium        | Read: Random | Read: Sequential | Speedup | Write: Random | Write: Sequential | Speedup
RAM (1 core)  | 567 MB/s     | 2605 MB/s        | 4.6x    | 1057 MB/s     | 2248 MB/s         | 2.2x
RAM (16 core) | 14198 MB/s   | 25658 MB/s       | 1.9x    | 10044 MB/s    | 13384 MB/s        | 1.4x
SSD           | 22.5 MB/s    | 667.69 MB/s      | 29.7x   | 48.6 MB/s     | 576.5 MB/s        | 11.9x
Magnetic Disk | 0.6 MB/s     | 328 MB/s         | 546.7x  | 2 MB/s        | 316.3 MB/s        | 158.2x

Test bed: 64 GB RAM + 200 GB SSD + 3 TB magnetic drive
How to Use Sequential Access?
Sequential access …
Edge-Centric Processing
Vertex-Centric Scatter
for each vertex v:
    if v's state has updated:
        for each output edge e of v:
            scatter update on e
[Figure: vertex U{state} emits updates u1, u2, …, un along its out-edges]
Vertex-Centric Gather
for each vertex v:
    for each input edge e of v:
        if e has an update:
            apply update on v's state
[Figure: updates u1, u2, …, un on in-edges update V{state} to V{state2}]
BFS, Vertex-Centric
[Figure: example graph with vertices 1–8 and its edge list sorted by source, with a lookup index over the vertex array]
Edge list (SOURCE → DEST): 1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6
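The index is the problem: finding an updated vertex's out-edges means sorting the edge list by source and doing random-access lookups. A small illustrative sketch (not X-Stream code), using the edge list from the slide:

```python
# Sketch: vertex-centric scatter needs the edge list sorted by source plus
# an index, and every lookup is a random access.
import bisect

edges = sorted([(1, 3), (1, 5), (2, 7), (2, 4), (3, 2), (3, 8), (4, 3),
                (4, 7), (4, 8), (5, 6), (6, 1), (8, 5), (8, 6)])
sources = [s for s, _ in edges]          # lookup index over sorted sources

def out_edges(v):
    lo = bisect.bisect_left(sources, v)  # random access into the index
    return edges[lo:bisect.bisect_right(sources, v)]

print(out_edges(4))                      # [(4, 3), (4, 7), (4, 8)]
```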
Edge-Centric Scatter
for each edge e:
    if e.source has updated:
        scatter update on e
[Figure: edges from updated vertices emit updates u1, …, un]
Edge-Centric Gather
for each update u on edge e:
    apply update u to e.destination
[Figure: updates u1, u2, …, un are applied to their destination vertices]
Sequential Access via Edge-Centric!
Fast and Slow Storage: the vertex state lives in fast storage; the edge stream and update stream live in slow storage and are read and written sequentially.
BFS, Edge-Centric
[Figure: the same example graph and edge list; every iteration streams the full edge list, touching edges whose sources carry no update]
Lots of wasted reads!
Most real-world graphs have a small diameter.
A large diameter makes X-Stream slow and wasteful.
Edge list: 1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6
=
Edge list: 1→3, 8→6, 5→6, 2→4, 3→2, 4→7, 4→3, 3→8, 4→8, 2→7, 6→1, 8→5, 1→5
Order is not important!
No pre-processing (sorting and indexing) needed!
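To make the contrast concrete, here is a runnable edge-centric BFS sketch over the slide's unsorted edge list; plain Python lists stand in for X-Stream's streamed buffers:

```python
# Sketch: edge-centric BFS over an UNSORTED edge list (the slide's graph).
# Every iteration streams all edges sequentially -- no sort, no index --
# scattering updates only from vertices whose level changed last round.
INF = float("inf")
edges = [(1, 3), (8, 6), (5, 6), (2, 4), (3, 2), (4, 7), (4, 3), (3, 8),
         (4, 8), (2, 7), (6, 1), (8, 5), (1, 5)]   # any order works

level = {v: INF for v in range(1, 9)}
level[1] = 0
frontier = {1}
while frontier:
    # Scatter: one sequential pass over every edge (wasted reads included)
    updates = [(dst, level[src] + 1) for src, dst in edges if src in frontier]
    frontier = set()
    for dst, lvl in updates:            # Gather: apply updates to vertices
        if lvl < level[dst]:
            level[dst] = lvl
            frontier.add(dst)
print(level)   # BFS levels from vertex 1
```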
But, still …
• Random access for vertices
• Vertices may not fit into fast storage
Streaming Partitions
• V = a subset of vertices (vertex sets are mutually disjoint across partitions and constant)
• E = outgoing edges of V (constant set)
• U = incoming updates to V (refilled in each scatter phase)
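A sketch of what building these partitions might look like; the hash assignment is an assumption for illustration, since X-Stream only requires disjoint vertex sets that fit in fast memory:

```python
# Sketch: building k streaming partitions. Vertex sets are disjoint and
# constant; each partition's edge list holds the out-edges of its vertices
# (also constant); its update list U is refilled during every scatter phase.
def make_partitions(vertices, edges, k):
    part_of = {v: hash(v) % k for v in vertices}
    parts = [{"V": set(), "E": [], "U": []} for _ in range(k)]
    for v in vertices:
        parts[part_of[v]]["V"].add(v)
    for src, dst in edges:
        parts[part_of[src]]["E"].append((src, dst))   # edges follow source
    return parts, part_of
```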
Scatter and Shuffle
[Figure: for a partition (V1, E1, U1): load its vertex set v1, v2, v3, … into fast memory; stream edges e1, e2, e3, … from the input buffer, read each edge's source vertex, and append updates u'1, u'2, u'3, … to an output buffer; shuffle the output buffer into a stream buffer with k partitions, appending each update to its destination partition's update list]
Gather
[Figure: for a partition (V1, E1, U1): load its vertex set into fast memory; stream updates u1, u2, u3, … from the update buffer and apply each one to its destination vertex. No output is produced]
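Putting scatter, shuffle, and gather together, one iteration might look like the following sketch; in-memory lists stand in for the streamed disk buffers, and `scatter_fn`/`gather_fn` are placeholders for the algorithm's per-edge and per-update functions:

```python
# Sketch of one X-Stream iteration over the partitions built above.
# Lists stand in for sequentially streamed disk buffers.
def iterate(parts, part_of, state, scatter_fn, gather_fn):
    buffers = [[] for _ in parts]               # k-way stream buffer
    for p in parts:                             # scatter + shuffle
        # p's vertex set is assumed loaded into fast memory here
        for src, dst in p["E"]:                 # sequential edge stream
            upd = scatter_fn(state[src], dst)   # read source vertex state
            if upd is not None:
                buffers[part_of[dst]].append((dst, upd))  # shuffle
    for p, buf in zip(parts, buffers):          # gather: no output
        p["U"] = buf                            # sequential update stream
        for dst, upd in p["U"]:
            state[dst] = gather_fn(state[dst], upd)
```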
Parallelism
• State is stored in vertices
• Partitions have disjoint vertex sets
→ Compute partitions in parallel: parallel scatter and gather
Experimental Results
X-Stream Speedup over Graphchi
[Chart: speedup (0–6x) on Netflix/ALS, Twitter/Pagerank, Twitter/Belief Propagation, RMAT27/WCC]
Mean speedup = 2.3
Speedup without counting Graphchi's pre-processing time
X-Stream Speedup over Graphchi
[Chart: speedup (0–6x) on the same four benchmarks]
Mean speedup = 3.7
Speedup counting Graphchi's pre-processing time
X-Stream Runtime vs Graphchi Sharding
[Chart: time in seconds (0–3000) per benchmark (Netflix/ALS, Twitter/Pagerank, Twitter/Belief Propagation, RMAT27/WCC): Graphchi sharding time vs. X-Stream total runtime]
Disk Transfer Rates

Metric        | X-Stream | Graphchi
Data moved    | 224 GB   | 322 GB
Time taken    | 398 s    | 2613 s
Transfer rate | 578 MB/s | 126 MB/s

SSD sustained rates: reads 667 MB/s, writes 576 MB/s
Data transfer rates for PageRank on the Twitter workload
Scalability on Input Data Size
[Chart: runtime (HH:MM:SS, log scale) vs. input size from 384 MB to 1.5 TB, across RAM, SSD, and magnetic disk]
• 8 million V, 128 million E: 8 seconds
• 256 million V, 4 billion E: 33 minutes
• 4 billion V, 64 billion E: 26 hours
Discussion
• Features like global values, aggregation functions, and asynchronous computation are missing from LFGraph. Will the overhead of adding them slow it down?
• LFGraph assumes all edge values are the same. If they are not, either the receiving vertices or the server must incorporate the edge value. What are the overheads?
• LFGraph uses one-pass computation, but it executes the vertex program at every vertex, active or inactive. What is the trade-off?
Discussion
• Independent computation and communication rounds may not always be preferred; why not use bandwidth when it is available?
• Fault tolerance is another feature missing from LFGraph. Overheads?
• Only three benchmarks are used in the experiments. Is that enough evaluation?
• Scalability is compared against Pregel under different experimental settings, and memory against PowerGraph using heap values from logs. Are these fair experiments?
Discussion
• Could the system become asynchronous?
• Could the scatter and gather phases be combined into one?
• X-Stream does not support iterating over the edges/updates of a single vertex. Can this be added?
• How well do they determine the number of partitions?
• Can the shuffle be optimized by counting each partition's updates during scatter?
Thank you for listening!
Questions?
Backup Slides
Reason for Improvement
Qualitative Comparison
GOAL           | PREGEL                               | GRAPHLAB                              | POWERGRAPH                             | LFGRAPH
Computation    | 2 passes, combiners                  | 2 passes                              | 2 passes                               | 1 pass
Communication  | ∝ #edge cuts                         | ∝ #vertex ghosts                      | ∝ #vertex mirrors                      | ∝ #external in-neighbors
Pre-processing | Cheap (hash)                         | Cheap (hash)                          | Expensive (intelligent)                | Cheap (hash)
Memory         | High (out-edges + buffered messages) | High (in- & out-edges + ghost values) | High (in- & out-edges + mirror values) | Low (in-edges + remote values)
Backup Slides
Read Bandwidth - SSD
[Chart: read bandwidth (0–1000 MB/s) over a 5-minute window, X-Stream vs. Graphchi]
Write Bandwidth - SSD
[Chart: write bandwidth (0–800 MB/s) over a 5-minute window, X-Stream vs. Graphchi]
Scalability on Thread Count
Scalability on Number of I/O Devices
Sharding-Computing Breakdown in Graphchi
[Chart: fraction of Graphchi runtime (0–0.9) spent on compute + I/O vs. re-sorting shards, per benchmark (Netflix/ALS, Twitter/Pagerank, Twitter/Belief Propagation, RMAT27/WCC)]
X-Stream not Always Perfect
Large diameter makes X-Stream slow!
In-Memory X-Stream Performance
[Chart: runtime in seconds (lower is better) vs. thread count (1–16) for BFS on 32M vertices / 256M edges: X-Stream vs. BFS-1 [HPC 2010] and BFS-2 [PACT 2011]]
Ligra vs. X-Stream
Discussion
• The current implementation runs on a single machine; can it be extended to clusters?
  – Would it still perform well?
  – How would fault tolerance and synchronization be provided?
• The waste rate is high (~65%). Could this be improved?
• Can partitioning be more intelligent? Dynamic partitioning?
• Could all vertex-centric programs be converted to edge-centric ones?
• When does streaming outperform random access?