comparing pregel related systems

Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems December 2, 2013 Joshua Woo, Prashant Raghav, Vishnu Prathish David R. Cheriton School of Computer Science University of Waterloo

Upload: prashant-raaghav

Post on 11-Nov-2014


DESCRIPTION

Comparison of open source implementations of Pregel and related systems. Covers installation of Hadoop and the Pregel-related systems. Datasets range from very small to very large; the large datasets have around 30 million vertices and 50 million edges. Experiments ran on 1-, 4-, and 8-node Amazon EC2 clusters. Four algorithms: PageRank, shortest path, k-means, collaborative filtering.

TRANSCRIPT

Page 1: Comparing pregel related systems

Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems

December 2, 2013

Joshua Woo, Prashant Raghav, Vishnu Prathish

David R. Cheriton School of Computer Science, University of Waterloo

Page 2: Comparing pregel related systems

Outline

● Motivation
● Our Project
● Setup
● Preliminary Results
● Preliminary Analysis
● In-Progress
● References

Page 3: Comparing pregel related systems

Motivation

Recall: Pregel
● Large-scale graph processing system
● Fault-tolerant framework for graph algorithms
● MapReduce for graph operations?
● Vertex-centric model (“think like a vertex”)
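To make the "think like a vertex" idea concrete, here is a minimal single-process sketch of a vertex-centric, superstep-based shortest-path computation. It is not code from Pregel or from any of the systems compared in these slides; the function name `sssp_supersteps` and the dictionary-based graph layout are assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def sssp_supersteps(edges, source):
    """Simulate a vertex-centric SSSP run in one process (illustrative only).

    edges: dict mapping vertex -> list of (neighbor, weight) pairs.
    Returns (distances, number_of_supersteps).
    """
    vertices = set(edges)
    for nbrs in edges.values():
        vertices.update(n for n, _ in nbrs)
    dist = {v: math.inf for v in vertices}

    # Superstep 0: only the source has a pending message (distance 0).
    inbox = {source: [0.0]}
    supersteps = 0
    while inbox:  # stop when no messages are in flight
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            # Each vertex only sees its own value and its incoming messages.
            candidate = min(messages)
            if candidate < dist[v]:
                dist[v] = candidate
                for nbr, weight in edges.get(v, []):
                    outbox[nbr].append(candidate + weight)
        # Barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
        supersteps += 1
    return dist, supersteps


if __name__ == "__main__":
    graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
    print(sssp_supersteps(graph, "a"))
```

A vertex that receives no messages does no work in a superstep, which mirrors the "vote to halt" behaviour of the model; real systems distribute the inbox across workers and add fault tolerance, both omitted here.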

Page 4: Comparing pregel related systems

Motivation

● Pregel is proprietary
● Many open source graph processing systems
  ○ Pregel clones
  ○ Pregel-inspired
  ○ BSP

Page 5: Comparing pregel related systems

Motivation

● Apache Hama
● Signal/Collect
● Apache Giraph
● GPS
● GraphLab
● Phoebus
● GoldenOrb
● HipG
● Mizan

Page 6: Comparing pregel related systems

Motivation

System           Impl. Language   Type
Apache Hama      Java             Pure BSP framework
Signal/Collect   Scala            Pregel inspired
Apache Giraph    Java             Pregel clone
GPS              Java             Advanced Pregel clone
GraphLab         C++              Pregel inspired
Phoebus          Erlang           Pregel clone
GoldenOrb        Java             Pregel clone
HipG             Java             Advanced Pregel clone
Mizan            C++              Advanced Pregel clone

Page 7: Comparing pregel related systems

Motivation

● How do these systems compare?
  ○ In terms of performance (runtime)?
  ○ In terms of memory footprint?
  ○ In terms of network utilization (num. messages)?
  ○ Variables:
    ■ Algorithm
    ■ Graph size (number of vertices)
    ■ Cluster size

Page 8: Comparing pregel related systems

Our Project

● Compare at least 3 systems
  ○ Apache Hama - general BSP framework
  ○ Apache Giraph - Hadoop map-only job, Facebook
  ○ GPS - +dynamic repartitioning, +multi vertex-centric
  ○ Signal/Collect - +edges, +async computations
  ○ GraphLab
  ○ Mizan

Page 9: Comparing pregel related systems

Our Project

● Measure the runtime of at least two algorithms on each system
  ○ PageRank
    ■ Fixed number of supersteps = 30
  ○ Single Source Shortest Path (SSSP)
  ○ k-means clustering
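As a rough illustration of what "PageRank with a fixed number of supersteps" means, the sketch below runs 30 synchronous rounds on an in-memory graph. It is not the implementation benchmarked on Hama, Giraph, or GPS. The damping factor of 0.85 is an assumption (the slides do not state one), and the function name `pagerank` and the input layout are hypothetical.

```python
def pagerank(out_edges, supersteps=30, damping=0.85):
    """Synchronous PageRank for a fixed number of rounds (illustrative only).

    out_edges: dict mapping vertex -> list of target vertices.
    damping:   assumed value; not taken from the slides.
    Note: rank held by vertices without out-edges is simply dropped here.
    """
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        # "Message passing": every vertex sends rank / out_degree to each target.
        incoming = {v: 0.0 for v in vertices}
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
        # All vertices update at once, as after a superstep barrier.
        rank = {v: (1.0 - damping) / n + damping * incoming[v] for v in vertices}
    return rank


if __name__ == "__main__":
    cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(pagerank(cycle))
```

Fixing the superstep count instead of testing for convergence makes every system do the same number of rounds, which keeps runtimes comparable across systems.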

Page 10: Comparing pregel related systems

Setup

● Experiments on AWS
  ○ Ubuntu 12.04 m1.medium EC2 instances
    ■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance
    ■ 8 GiB EBS volume per instance
  ○ Cluster sizes:
    ■ Single-node cluster
    ■ 4-node cluster
    ■ 8-node cluster

Page 11: Comparing pregel related systems

Setup

● Experiments on AWS
  ○ 5 runs per dataset per algorithm per cluster
    ■ 35 runs per algorithm per cluster
    ■ 70 runs per cluster
    ■ 140 runs in total (single-node, 4-node)
● TODO: another 70 runs (8-node)
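The run counts above follow from simple multiplication, spelled out in the short sketch below. The count of two algorithms is inferred from the slide's own figures (70 / 35) and is not stated on this slide; the totals also appear to be per system. All variable names are illustrative.

```python
RUNS_PER_DATASET = 5
NUM_DATASETS = 7
NUM_ALGORITHMS = 2  # assumption: inferred from 70 runs per cluster / 35 per algorithm
CLUSTERS_DONE = ["single-node", "4-node"]
CLUSTERS_TODO = ["8-node"]

runs_per_algorithm_per_cluster = RUNS_PER_DATASET * NUM_DATASETS
runs_per_cluster = runs_per_algorithm_per_cluster * NUM_ALGORITHMS
runs_done = runs_per_cluster * len(CLUSTERS_DONE)
runs_todo = runs_per_cluster * len(CLUSTERS_TODO)

print(runs_per_algorithm_per_cluster, runs_per_cluster, runs_done, runs_todo)
```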

Page 12: Comparing pregel related systems

Setup

● Dataset
  ○ 7 datasets
    ■ tinyEWD: 8 vertices, 15 edges
    ■ mediumEWD: 250 vertices, 2,546 edges
    ■ 1000EWD: 1,000 vertices, 16,866 edges
    ■ rome99: 3,353 vertices, 8,870 edges
    ■ 10000EWD: 10,000 vertices, 16,866 edges
    ■ NYC: 264,346 vertices, 733,846 edges
    ■ largeEWD: 1,000,000 vertices, 15,172,126 edges
  ○ Source: http://algs4.cs.princeton.edu/44sp/

Page 13: Comparing pregel related systems

Setup

● Systems
  ○ Hama
    ■ Hadoop 1.03.0
    ■ Hama 0.6.3
  ○ Giraph
    ■ Hadoop 0.20.203rc1
    ■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a)
  ○ GPS
    ■ Hadoop 0.20.203rc1
    ■ GPS (trunk@Revision 112)

Page 14: Comparing pregel related systems

Setup

● Input Graph
  ○ Source files converted into a format suitable for each system
    ■ Time for this conversion excluded from results:
      ● Conversion done before algorithms are run (pre-processing?)
      ● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
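The conversion step could look roughly like the sketch below. It assumes the edge-weighted digraph text layout used for the EWD files (vertex count, then edge count, then one `v w weight` triple per edge). The output layout, one line per vertex with `target:weight` pairs, is a hypothetical stand-in and not the actual input format of Hama, Giraph, or GPS; `parse_ewd` and `to_adjacency_lines` are illustrative names.

```python
def parse_ewd(text):
    """Parse an edge-weighted digraph: V, then E, then E 'v w weight' triples."""
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    body = tokens[2:]
    if len(body) != 3 * num_edges:
        raise ValueError("edge count does not match the header")
    adjacency = {v: [] for v in range(num_vertices)}
    for i in range(0, len(body), 3):
        v, w, weight = int(body[i]), int(body[i + 1]), float(body[i + 2])
        adjacency[v].append((w, weight))
    return adjacency


def to_adjacency_lines(adjacency):
    """Emit one line per vertex: 'v<TAB>w1:weight1 w2:weight2' (hypothetical format)."""
    lines = []
    for v in sorted(adjacency):
        targets = " ".join(f"{w}:{weight}" for w, weight in adjacency[v])
        # Vertices without out-edges are written as the bare vertex id.
        lines.append(f"{v}\t{targets}".rstrip())
    return lines


if __name__ == "__main__":
    sample = "3\n2\n0 1 0.5\n1 2 1.25\n"
    for line in to_adjacency_lines(parse_ewd(sample)):
        print(line)
```

Because this runs once before any algorithm starts, its cost can be kept out of the measured runtimes, as the slide describes.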

Page 15: Comparing pregel related systems

Preliminary Results

Average SSSP runtime on 4-node cluster (in seconds)

Dataset         Hama     Giraph      GPS
tinyEWD        14.17      41.60    14.40
mediumEWD      16.36      44.00    36.00
1000EWD        18.06      48.80    46.60
rome99         22.95      66.00    50.00
10000EWD       25.32      67.40    55.00
NYC           165.01     267.00   310.00
largeEWD    6,109.20     602.80   618.70

Page 16: Comparing pregel related systems

Preliminary Results

[Figure: SSSP runtime vs. graph size (num. vertices)]

Page 17: Comparing pregel related systems

Preliminary Results

Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)

Dataset         Hama     Giraph        GPS
tinyEWD        29.36      49.40      58.57
mediumEWD      30.26      53.40      60.42
1000EWD        37.86      54.60      61.03
rome99         29.35      56.20      61.80
10000EWD      302.33      61.80      64.80
NYC         1,001.24     134.40      68.69
largeEWD      Failed   2,100.00   1,213.56

Page 18: Comparing pregel related systems

Preliminary Results

[Figure: PageRank runtime vs. graph size (num. vertices)]

Page 19: Comparing pregel related systems

Preliminary Analysis

● A point of resource crunch
  ○ No significant change in performance until a point
● Hama does not scale well (vertices ~10^4)
● Giraph and GPS scale better
● In general, PageRank runtime > SSSP runtime
● GPS input reader does not guarantee true partitioning for large datasets
● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
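One way to put a number on the scaling observation is to compare each system's SSSP runtime on largeEWD with its runtime on 10000EWD, using the 4-node averages reported in the results table. The sketch below does only that arithmetic; the variable names are illustrative, and a single ratio is of course a crude summary of scaling behaviour.

```python
# Average SSSP runtimes on the 4-node cluster, in seconds (from the results table).
sssp_runtime = {
    "Hama":   {"10000EWD": 25.32, "largeEWD": 6109.20},
    "Giraph": {"10000EWD": 67.40, "largeEWD": 602.80},
    "GPS":    {"10000EWD": 55.00, "largeEWD": 618.70},
}

# Growth factor: how many times slower the largest graph is than the 10^4-vertex one.
growth = {
    system: times["largeEWD"] / times["10000EWD"]
    for system, times in sssp_runtime.items()
}

for system, factor in sorted(growth.items(), key=lambda item: item[1]):
    print(f"{system}: {factor:.1f}x")
```

On these figures Hama's runtime grows by a factor in the hundreds while Giraph and GPS grow by roughly an order of magnitude, which is consistent with the bullets above.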

Page 20: Comparing pregel related systems

In-Progress

● Output validation
● Memory footprint
● Network utilization (num. messages)
● GraphLab and Signal/Collect
● Green-Marl?
  ○ (DSL) → [Compiler] → (Giraph, GPS)

Page 21: Comparing pregel related systems

Questions?

Page 22: Comparing pregel related systems

Extras

Page 23: Comparing pregel related systems

Preliminary Results

Number of supersteps for SSSP

Dataset      Hama   Giraph   GPS
tinyEWD        10        7     7
mediumEWD      16       13    18
1000EWD        27       25    23
rome99        105      102    18
10000EWD       85       80    64
NYC           671      905   438
largeEWD      806      670   730

Page 24: Comparing pregel related systems

Preliminary Results

[Figure: Number of supersteps for SSSP]

Page 25: Comparing pregel related systems

Really, really Preliminary

PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated

Dataset        Native   Green-Marl generated
tinyEWD         58.57                  60.20
mediumEWD       60.42                  60.11
1000EWD         61.03                  62.30
rome99          61.80                  62.32
10000EWD        64.80                  65.78
NYC             68.69                  71.34
largeEWD     1,213.56                      -
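The gap between native and Green-Marl generated code can be expressed as a relative overhead per dataset. The sketch below computes it from the table values only. largeEWD is left out because no Green-Marl figure is reported for it, and the names are illustrative.

```python
# PageRank runtime on GPS in seconds: (native, Green-Marl generated), from the table.
runtimes = {
    "tinyEWD":   (58.57, 60.20),
    "mediumEWD": (60.42, 60.11),
    "1000EWD":   (61.03, 62.30),
    "rome99":    (61.80, 62.32),
    "10000EWD":  (64.80, 65.78),
    "NYC":       (68.69, 71.34),
}

# Relative overhead of generated code, in percent (negative means it was faster).
overhead_pct = {
    name: 100.0 * (generated - native) / native
    for name, (native, generated) in runtimes.items()
}

for name, pct in overhead_pct.items():
    print(f"{name}: {pct:+.2f}%")
```

On these numbers the generated code stays within a few percent of the native implementation, though with so few data points that is a tentative reading at best.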

Page 26: Comparing pregel related systems

Really, really Preliminary

[Figure: PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated]