Comparing Pregel-Related Systems

DESCRIPTION

A comparison of open-source implementations of Pregel and related systems. Covers installation of Hadoop and the Pregel-related systems, and experiments on datasets ranging from very small to very large (the large datasets have around 30 million vertices and 50 million edges). Experiments were run on 1-, 4-, and 8-node Amazon EC2 clusters with four algorithms: PageRank, shortest path, k-means, and collaborative filtering.

TRANSCRIPT

  • 1. Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems
    December 2, 2013
    Joshua Woo, Prashant Raghav, Vishnu Prathish
    David R. Cheriton School of Computer Science, University of Waterloo
  • 2. Outline
    - Motivation
    - Our Project
    - Setup
    - Preliminary Results
    - Preliminary Analysis
    - In-Progress
    - References
  • 3. Motivation
    Recall: Pregel
    - Large-scale graph processing system
    - Fault-tolerant framework for graph algorithms
    - MapReduce for graph operations?
    - Vertex-centric model ("think like a vertex")
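
    The vertex-centric model is easiest to see in code. Below is a minimal sketch of
    Pregel-style Single-Source Shortest Path (one of the algorithms benchmarked later in
    this deck), written in Java against a hypothetical Vertex/Context API; the names here
    are illustrative, not any real system's interface, but Hama, Giraph, and GPS all
    expose an equivalent per-vertex compute() hook.

        // Hypothetical vertex-centric API; real systems differ in names,
        // but each gives a vertex this same view of the computation.
        interface Context {
            long superstep();                              // current superstep number
            void sendMessage(long targetId, double msg);   // delivered next superstep
            void voteToHalt();                             // inactive until messaged
        }

        class Edge {
            final long targetId;
            final double weight;
            Edge(long targetId, double weight) { this.targetId = targetId; this.weight = weight; }
        }

        // Pregel-style SSSP: each vertex keeps a tentative distance and
        // relaxes its out-edges whenever that distance improves.
        class SsspVertex {
            final long id;
            double distance = Double.POSITIVE_INFINITY;
            final java.util.List<Edge> outEdges = new java.util.ArrayList<>();

            SsspVertex(long id) { this.id = id; }

            // Called once per superstep with all messages sent to this
            // vertex during the previous superstep.
            void compute(Iterable<Double> messages, Context ctx, long sourceId) {
                double minDist = (id == sourceId) ? 0.0 : Double.POSITIVE_INFINITY;
                for (double m : messages) minDist = Math.min(minDist, m);
                if (minDist < distance) {
                    distance = minDist;                     // shorter path found
                    for (Edge e : outEdges)                 // relax outgoing edges
                        ctx.sendMessage(e.targetId, distance + e.weight);
                }
                ctx.voteToHalt();  // reactivated automatically if a message arrives
            }
        }

    The computation terminates when every vertex has voted to halt and no messages are in
    flight, which is what makes the superstep counts reported in the extras slides meaningful.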
  • 4. Motivation
    Pregel itself is proprietary, but there are many open-source graph processing systems:
    - Pregel clones
    - Pregel-inspired systems
    - BSP frameworks
  • 5. Motivation
    Apache Hama, Signal/Collect, Apache Giraph, GPS, GraphLab, Phoebus, GoldenOrb, HipG, Mizan
  • 6. Motivation

    System           Impl. Language   Type
    Apache Hama      Java             Pure BSP framework
    Signal/Collect   Scala            Pregel-inspired
    Apache Giraph    Java             Pregel clone
    GPS              Java             Advanced Pregel clone
    GraphLab         C++              Pregel-inspired
    Phoebus          Erlang           Pregel clone
    GoldenOrb        Java             Pregel clone
    HipG             Java             Advanced Pregel clone
    Mizan            C++              Advanced Pregel clone
  • 7. Motivation
    How do these systems compare?
    - In terms of performance (runtime)?
    - In terms of memory footprint?
    - In terms of network utilization (number of messages)?
    Variables: algorithm, graph size (number of vertices), cluster size
  • 8. Our Project
    Compare at least 3 systems:
    - Apache Hama: general BSP framework
    - Apache Giraph: runs as a Hadoop map-only job; used at Facebook
    - GPS: +dynamic repartitioning, +multi-vertex-centric
    - Signal/Collect: +edges, +async computations
    - GraphLab
    - Mizan
  • 9. Our Project
    Measure the runtime of at least two algorithms on each system:
    - PageRank (fixed number of supersteps = 30; see the sketch below)
    - Single Source Shortest Path (SSSP)
    - k-means clustering
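
    As a concrete illustration of the fixed-superstep PageRank used in these experiments,
    the sketch below runs exactly 30 supersteps and then halts. The API (PrContext and its
    methods) is hypothetical, mirroring the shape of the systems' real vertex-centric
    interfaces, and the 0.85 damping factor is the conventional default, assumed here
    because the deck does not state it.

        // Hypothetical vertex-centric API for the PageRank sketch.
        interface PrContext {
            long superstep();
            long totalNumVertices();
            void sendMessageToAllNeighbors(double msg);  // one copy per out-edge
            void voteToHalt();
        }

        class PageRankVertex {
            double rank;        // current PageRank mass of this vertex
            int numOutEdges;    // out-degree, used to split rank among neighbours

            void compute(Iterable<Double> messages, PrContext ctx) {
                long n = ctx.totalNumVertices();
                if (ctx.superstep() == 0) {
                    rank = 1.0 / n;                      // uniform initial rank
                } else {
                    double sum = 0.0;
                    for (double m : messages) sum += m;  // incoming rank mass
                    rank = 0.15 / n + 0.85 * sum;        // assumed damping = 0.85
                }
                if (ctx.superstep() < 30) {
                    // Keep iterating: spread this vertex's rank over its out-edges.
                    ctx.sendMessageToAllNeighbors(rank / numOutEdges);
                } else {
                    ctx.voteToHalt();                    // fixed budget exhausted
                }
            }
        }

    Fixing the superstep count, rather than iterating to convergence, keeps the amount of
    work identical across systems, which is what makes the PageRank runtimes comparable.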
  • 10. Setup: Experiments on AWS
    - Ubuntu 12.04 on m1.medium EC2 instances (2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance)
    - 8 GiB EBS volume per instance
    - Cluster sizes: single-node, 4-node, 8-node
  • 11. Setup: Experiments on AWS
    - 5 runs per dataset per algorithm per cluster
    - 35 runs per algorithm per cluster (5 runs × 7 datasets)
    - 70 runs per cluster (2 algorithms)
    - 140 runs so far (single-node and 4-node clusters)
    - TODO: another 70 runs (8-node cluster)
  • 12. Setup: Datasets
    7 datasets:
    - tinyEWD: 8 vertices, 15 edges
    - mediumEWD: 250 vertices, 2,546 edges
    - 1000EWD: 1,000 vertices, 16,866 edges
    - rome99: 3,353 vertices, 8,870 edges
    - 10000EWD: 10,000 vertices, 123,462 edges
    - NYC: 264,346 vertices, 733,846 edges
    - largeEWD: 1,000,000 vertices, 15,172,126 edges
    Source: http://algs4.cs.princeton.edu/44sp/
  • 13. Setup: Systems
    - Hama: Hadoop 1.0.3, Hama 0.6.3
    - Giraph: Hadoop 0.20.203rc1, Giraph trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a
    - GPS: Hadoop 0.20.203rc1, GPS trunk@Revision 112
  • 14. Setup: Input Graph
    - Source files converted into a format suitable for each system (see the sketch below)
    - Time for this conversion excluded from results:
      conversion is done before the algorithms run (pre-processing?),
      and is negligible even for largeEWD (1,000,000 vertices, 15,172,126 edges)
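
    For context, the algs4 source files use the EWD format: the first line is the vertex
    count, the second the edge count, followed by one "src dst weight" triple per line.
    The converter below is only a sketch of this kind of pre-processing (the class name
    EwdToAdjacency and the tab-separated output layout are assumptions, not the project's
    actual tool); in practice each system defines its own input format.

        import java.io.*;
        import java.util.*;

        // Reads an algs4 EWD file and writes one adjacency-list line per vertex:
        //   id<TAB>dst:weight dst:weight ...
        public class EwdToAdjacency {
            public static void main(String[] args) throws IOException {
                try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                     PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
                    int numVertices = Integer.parseInt(in.readLine().trim());
                    int numEdges = Integer.parseInt(in.readLine().trim());
                    // Group out-edges by source vertex.
                    List<List<String>> adj = new ArrayList<>();
                    for (int v = 0; v < numVertices; v++) adj.add(new ArrayList<>());
                    for (int e = 0; e < numEdges; e++) {
                        String[] t = in.readLine().trim().split("\\s+");
                        adj.get(Integer.parseInt(t[0])).add(t[1] + ":" + t[2]);
                    }
                    // One line per vertex, edges as dst:weight pairs.
                    for (int v = 0; v < numVertices; v++)
                        out.println(v + "\t" + String.join(" ", adj.get(v)));
                }
            }
        }

    Run as, e.g., java EwdToAdjacency tinyEWD.txt tinyEWD.adj; a single pass like this is
    linear in the edge count, consistent with conversion time being negligible even for largeEWD.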
  • 15. Preliminary Results: average SSSP runtime on 4-node cluster (in seconds)

    Dataset         Hama    Giraph      GPS
    tinyEWD        14.17     41.60    14.40
    mediumEWD      16.36     44.00    36.00
    1000EWD        18.06     48.80    46.60
    rome99         22.95     66.00    50.00
    10000EWD       25.32     67.40    55.00
    NYC           165.01    267.00   310.00
    largeEWD    6,109.20    602.80   618.70
  • 16. Preliminary Results: SSSP runtime vs. graph size (num. vertices) [chart]
  • 17. Preliminary Results: average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)

    Dataset         Hama     Giraph       GPS
    tinyEWD        29.36      49.40     58.57
    mediumEWD      30.26      53.40     60.42
    1000EWD        37.86      54.60     61.03
    rome99         29.35      56.20     61.80
    10000EWD      302.33      61.80     64.80
    NYC         1,001.24     134.40     68.69
    largeEWD      Failed   2,100.00  1,213.56
  • 18. Preliminary Results: PageRank runtime vs. graph size (num. vertices) [chart]
  • 19. Preliminary Analysis
    - There is a point of resource crunch: no significant change in performance until that point
    - Hama does not scale well beyond ~10^4 vertices; Giraph and GPS scale better
    - In general, PageRank runtime > SSSP runtime
    - The GPS input reader does not guarantee true partitioning for large datasets
    - Which knobs to keep constant? Optimization vs. comparability
  • 20. In-Progress
    - Output validation
    - Memory footprint
    - Network utilization (number of messages)
    - GraphLab and Signal/Collect
    - Green-Marl? (a graph DSL whose compiler targets Giraph and GPS)
  • 21. Questions?
  • 22. Extras
  • 23. Preliminary Results: number of supersteps for SSSP

    Dataset     Hama  Giraph  GPS
    tinyEWD       10       7    7
    mediumEWD     16      13   18
    1000EWD       27      25   23
    rome99       105     102   18
    10000EWD      85      80   64
    NYC          671     905  438
    largeEWD     806     670  730
  • 24. Preliminary Results: number of supersteps for SSSP [chart]
  • 25. Really, Really Preliminary: PageRank runtime (in seconds) on GPS, native vs. Green-Marl generated

    Dataset       Native   Green-Marl generated
    tinyEWD        58.57   60.20
    mediumEWD      60.42   60.11
    1000EWD        61.03   62.30
    rome99         61.80   62.32
    10000EWD       64.80   65.78
    NYC            68.69   71.34
    largeEWD    1,213.56   -
  • 26. Really, Really Preliminary: PageRank runtime on GPS, native vs. Green-Marl generated [chart]
  • 27. References
    - Our Project Proposal
    - http://algs4.cs.princeton.edu/44sp/
    - https://github.com/apache/hadoop-common
    - https://github.com/apache/giraph
    - https://subversion.assembla.com/svn/phd-projects/gps/trunk/
    - http://ppl.stanford.edu/main/green_marl.html