The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015

Page 1: April 20, 2015 for Big Data Analytics - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/...April 20, 2015. Stratosphere. Stratosphere. Stratosphere. Big Data Analytics

The Stratosphere Platform for Big Data Analytics

Hongyao Ma, Franco Solleza

April 20, 2015

Stratosphere

Big Data Analytics

● “BIG Data”

● Heterogeneous datasets: structured / unstructured / semi-structured

● Users have different needs for declarativity and expressivity

What we have covered so far

● Polybase

● Shark

● MLBase

● SharedDB

● BlinkDB

The Promises

● Declarative, high-level language

● “In situ” data analysis

● Richer set of primitives than MapReduce

● Treat UDFs as first-class citizens

● Automated parallelization and optimization

● Support for iterative programs

● Includes external memory query processing algorithms to support arbitrarily long programs

Outline

● Meteor & Sopremo

● PACT

● Nephele

● Experiment Results

● Future work & Discussions

Sopremo

Meteor Script

● Declarative interface

● High-level script

Meteor Translates To Sopremo

[Figure: Sopremo operator graph with nodes Lineitem, Filter, ComputeRevenue, Join, Supplier, Group, Output]
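As a rough illustration of what an operator graph with these nodes could compute, the sketch below chains the named operators over in-memory Python lists. The slide only names the operators, so the record fields (`suppkey`, `price`, `discount`, `shipdate`), the filter date, the sample rows, and the wiring order (Filter, ComputeRevenue, Group, then Join with Supplier) are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample inputs; field names are assumptions, not taken from the slides.
lineitem = [
    {"suppkey": 1, "price": 100.0, "discount": 0.25, "shipdate": "1996-02-01"},
    {"suppkey": 1, "price": 50.0, "discount": 0.0, "shipdate": "1996-03-15"},
    {"suppkey": 2, "price": 80.0, "discount": 0.5, "shipdate": "1995-12-31"},
]
supplier = [{"suppkey": 1, "name": "Acme"}, {"suppkey": 2, "name": "Globex"}]


def filter_op(rows, start):
    """Filter: keep line items shipped on or after `start` (ISO dates compare as strings)."""
    return [r for r in rows if r["shipdate"] >= start]


def compute_revenue(rows):
    """ComputeRevenue: project each line item to (suppkey, revenue)."""
    return [{"suppkey": r["suppkey"], "revenue": r["price"] * (1 - r["discount"])}
            for r in rows]


def group_op(rows):
    """Group: sum the revenue per supplier key."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["suppkey"]] += r["revenue"]
    return dict(totals)


def join_op(suppliers, totals):
    """Join: attach the supplier name to each revenue total."""
    return [{"name": s["name"], "revenue": totals[s["suppkey"]]}
            for s in suppliers if s["suppkey"] in totals]


def run_plan(lineitem_rows, supplier_rows, start="1996-01-01"):
    """Output: evaluate the whole (assumed) operator chain."""
    filtered = filter_op(lineitem_rows, start)
    return join_op(supplier_rows, group_op(compute_revenue(filtered)))


result = run_plan(lineitem, supplier)
```

Each operator here is an ordinary function from collections to collections, which is the property that lets a layer like Sopremo treat operators as composable building blocks.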

Sopremo

● Modular and extensible

● Composable

Sopremo compiled to PACT

[Figure: PACT program with nodes Lineitem, Filter, ComputeRevenue, Join, Supplier, Group, Output]

PACT

PACT

● Programmer makes a “pact” with the system

● Uses one of 5 functions: Map, Reduce, Cross, Match, Co-group

What’s a PACT?

● Data and a function

● Specifies how data are partitioned across the system

● An atomic(?) operation on all specified data
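To make the five functions concrete, here is a minimal single-machine sketch of Map, Reduce, Cross, Match, and Co-group as Python higher-order functions over lists of `(key, value)` pairs. The function names and signatures are hypothetical and chosen for this example; they are not the Stratosphere API. The point is that each second-order function fixes how records are grouped before the user-defined function (UDF) is called, which is the "pact" that lets a system decide the partitioning.

```python
from collections import defaultdict
from itertools import product


def _group(records):
    """Group (key, value) pairs into {key: [values]}, keeping first-seen key order."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def pact_map(udf, records):
    """Map: call the UDF once per record, independently of all others."""
    return [out for rec in records for out in udf(rec)]


def pact_reduce(udf, records):
    """Reduce: call the UDF once per key with all values of that key."""
    return [out for key, values in _group(records).items() for out in udf(key, values)]


def pact_cross(udf, left, right):
    """Cross: call the UDF on every pair of the Cartesian product."""
    return [out for l, r in product(left, right) for out in udf(l, r)]


def pact_match(udf, left, right):
    """Match: call the UDF once per pair of records with equal keys (an equi-join)."""
    right_groups = _group(right)
    return [out
            for key, lval in left
            for rval in right_groups.get(key, [])
            for out in udf(key, lval, rval)]


def pact_cogroup(udf, left, right):
    """Co-group: call the UDF once per key with that key's values from both inputs."""
    lg, rg = _group(left), _group(right)
    keys = list(lg) + [k for k in rg if k not in lg]
    return [out for k in keys for out in udf(k, lg.get(k, []), rg.get(k, []))]


# Usage: word count expressed as a Reduce.
words = [("a", 1), ("b", 1), ("a", 1)]
counts = pact_reduce(lambda key, values: [(key, sum(values))], words)
```

In this sketch Map needs no grouping at all, Reduce and Co-group need all records of a key in one place, Match needs only matching pairs to meet, and Cross needs every pair to meet. Those differing requirements are what an optimizer can exploit when choosing how to partition the data.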

Iterative PACT Programs

● Implicitly, iteration mutates state

● How to do iteration without explicit mutation of state?

● Bulk iteration

○ Starts with a solution set

○ Sends group by label to neighbors

○ Finds the minimum among those neighbors

○ Outputs an incremental solution set

○ Incremental solution set becomes input to the next iteration
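The bulk iteration steps on these slides read like label propagation for connected components, so the sketch below assumes that workload: every vertex starts with its own id as its label, labels are sent to neighbors, each vertex keeps the minimum it sees, and the whole label set is fed into the next pass until nothing changes. The function name, the edge-list input format, and the `max_iterations` guard are assumptions for the example.

```python
def bulk_connected_components(vertices, edges, max_iterations=100):
    """Label every vertex with the smallest vertex id in its component."""
    # Solution set: each vertex starts labeled with its own id.
    labels = {v: v for v in vertices}

    # Treat edges as undirected by storing both directions.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    for _ in range(max_iterations):
        # Every vertex sends its label to its neighbors; candidates are
        # grouped by the receiving vertex (which also keeps its own label).
        candidates = {v: [labels[v]] for v in vertices}
        for v in vertices:
            for n in neighbors[v]:
                candidates[n].append(labels[v])

        # Each vertex takes the minimum candidate label.
        new_labels = {v: min(c) for v, c in candidates.items()}
        if new_labels == labels:
            break  # fixpoint reached
        labels = new_labels  # the full solution set is the next pass's input
    return labels


components = bulk_connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Note that no state is mutated in place: each pass builds a new label set from the previous one, and every vertex is recomputed in every pass even if its label stopped changing long ago.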

● Incremental iteration

○ Starts with a workset and a solution set

○ Calculates the min for a group

○ Merges the workset with the solution set and checks if the label changed

○ If the label is new, it becomes part of the delta set...

○ ...which gets sent back to the next iteration

○ If changed, it also gets matched to the neighbors...

○ ...and those matches become the new workset
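The incremental iteration steps can be sketched the same way, again assuming the workload is connected components by label propagation. The workset holds `(vertex, candidate_label)` pairs; only vertices whose label actually improves enter the delta set, and only their neighbors receive new candidates. The function name and input format are assumptions for the example, and the in-place `dict.update` merely stands in for merging the delta set into the solution set.

```python
def incremental_connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Solution set: each vertex starts labeled with its own id.
    solution = {v: v for v in vertices}
    # Initial workset: every vertex offers its own label to each neighbor.
    workset = [(n, v) for v in vertices for n in neighbors[v]]

    while workset:
        # Calculate the minimum candidate label per vertex (the "group").
        best = {}
        for vertex, candidate in workset:
            best[vertex] = min(candidate, best.get(vertex, candidate))

        # Merge with the solution set; only real improvements form the delta set.
        delta = {v: c for v, c in best.items() if c < solution[v]}
        solution.update(delta)

        # Changed vertices are matched with their neighbors to build the next workset.
        workset = [(n, label) for v, label in delta.items() for n in neighbors[v]]
    return solution


components = incremental_connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Compared with the bulk variant, the work per pass shrinks as the computation converges, because vertices whose labels are already final drop out of the workset.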

PACT Optimization

Nephele

Nephele Execution

● Tasks, channels, scheduling

○ Each task, together with all local pipelines associated with it, is pushed to the slaves

○ Tasks can request to send data over the network (only when necessary or ready)

● Fault tolerance

○ Conceptually follows the same idea as lineage (RDDs), but...

[Figure labels: Intermediate; Blocking operator model]

[Figure labels: Intermediate; Non-blocking operator model]

● Runtime operators

Does it deliver?

● Maybe - what do the experiments say?

● What’s old?

○ A lot of things

● What’s new?

○ Second-order functions that abstract parallelization

○ Optimization in a UDF-heavy environment

○ Integrated iterative processing

○ An extensible query language and underlying operator model

Experimental Evaluation

Experimental Setup

Setup:

● 1 master + 25 slave machines

● 16 cores @ 2.0GHz with 32GB of RAM (29GB of operating memory)

● 80TB HDFS in plain ASCII, 4 SATA drives at 500MB/s read/write per node

● 8 parallel tasks per slave, total DOP 40-200
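The DOP range can be related to the cluster size with a line of arithmetic. The sketch below assumes that the degree of parallelism is simply the number of slaves times the parallel tasks per slave; the slide does not state this formula. Under that assumption the upper end, 200, is 25 slaves x 8 tasks, and the lower end, 40, would correspond to 5 slaves.

```python
TASKS_PER_SLAVE = 8   # from the setup slide
MAX_SLAVES = 25       # from the setup slide


def degree_of_parallelism(slaves, tasks_per_slave=TASKS_PER_SLAVE):
    """Assumed model: DOP = number of slaves x parallel tasks per slave."""
    return slaves * tasks_per_slave


def slaves_for(dop, tasks_per_slave=TASKS_PER_SLAVE):
    """Invert the assumed model; reject DOPs that do not divide evenly."""
    slaves, remainder = divmod(dop, tasks_per_slave)
    if remainder:
        raise ValueError("DOP is not a multiple of tasks per slave")
    return slaves


max_dop = degree_of_parallelism(MAX_SLAVES)
min_slaves = slaves_for(40)
```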

Comparison with Hadoop

● Vanilla MapReduce engine

● Apache Hive

● Apache Giraph

Summary of Results

● Stratosphere achieves linear speedup and similar performance to Hadoop for simple tasks (TeraSort, Word Count)

● Stratosphere is 5 times faster than Hive and Hadoop on more complex tasks such as TPC-H and triangle enumeration, though it gains nothing from a higher DOP

● Stratosphere performed worse on Connected Components than Giraph due to the better tuned implementation of the latter

● Checkpointing adds little overhead and saves much time when failure occurs

TeraSort --- Stratosphere vs. Hadoop

Stratosphere achieves performance similar to Hadoop's, with linear speedup

Word Count --- Stratosphere vs. Hadoop

Stratosphere is 20% faster than Hadoop and achieves linear speedup

Triangle Enumeration: Reducer 1

Triangle Enumeration: Reducer 2

Triangle Enumeration: PACT

Triangle Enumeration

Stratosphere is 5x faster than Hadoop, though parallelism does not help
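The code shown on the triangle enumeration slides did not survive in this transcript. For readers unfamiliar with the task, the sketch below shows one common two-step formulation: first group edges by their smaller endpoint and build open "triads" (pairs of edges sharing that vertex), then keep the triads whose closing edge exists. This is a generic version written for illustration, not the reducers or the PACT program from the slides, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_triangles(edges):
    """Return all triangles (u, v, w) with u < v < w in an undirected graph."""
    # Normalize to undirected edges (u < v); drop duplicates and self-loops.
    edge_set = {(min(a, b), max(a, b)) for a, b in edges if a != b}

    # Step 1 (a grouping on the smaller endpoint): build open triads.
    by_low = defaultdict(list)
    for u, v in edge_set:
        by_low[u].append(v)
    triads = [(u, v, w)
              for u, highs in by_low.items()
              for v, w in combinations(sorted(highs), 2)]

    # Step 2 (a join of triads with edges): keep triads whose closing edge exists.
    return sorted(t for t in triads if (t[1], t[2]) in edge_set)


triangles = enumerate_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)])
```

The second step is a join between triads and edges, which is the kind of operation a dedicated join primitive such as Match expresses directly, whereas plain MapReduce has to encode it as another reduce job.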

TPC-H Query

TPC-H --- Stratosphere vs. Hive

Parallelism does not seem to help; however, Stratosphere is 5x faster

Connected Components

Giraph is faster, due to a better-tuned implementation

CC --- Execution time per superstep

Fault Tolerance

Checkpointing adds little overhead and saves much time when a failure occurs
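Why a checkpoint saves time after a failure can be shown with a toy model. The sketch below runs a chain of stages, optionally materializes the output of some of them, and simulates a single failure; recovery restarts from the most recent materialized result. It is a simplified illustration under assumed names (`run_with_recovery`, `checkpoint_after`, `fail_at`) and is not Nephele's actual recovery mechanism.

```python
def run_with_recovery(stages, data, checkpoint_after=(), fail_at=None):
    """Run `stages` in order and return (result, number_of_stage_executions).

    `checkpoint_after` holds the indices of stages whose output is materialized.
    `fail_at` is the index of a stage that fails once, just before it runs.
    """
    checkpoints = {-1: data}  # the job input itself is always available
    executed = 0
    current = data
    failed = False
    i = 0
    while i < len(stages):
        if i == fail_at and not failed:
            failed = True
            last = max(checkpoints)      # most recent materialized result
            current = checkpoints[last]
            i = last + 1                 # resume right after that checkpoint
            continue
        current = stages[i](current)
        executed += 1
        if i in checkpoint_after:
            checkpoints[i] = current
        i += 1
    return current, executed


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
without_checkpoint = run_with_recovery(stages, 5, fail_at=2)
with_checkpoint = run_with_recovery(stages, 5, checkpoint_after={1}, fail_at=2)
```

In this toy run the job without checkpoints re-executes the first two stages after the failure, while the checkpointed job resumes directly at the failed stage.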

What Else Do We Want to See?

For presented experiments:

● Breakdown of execution time to distinguish bottlenecks

● What happens with even smaller DOP?

● What happens with more/fewer tasks on each core?

Further:

● What happens with even larger data? Current size does fit into RAM

● Comparison with MPP, or split query processing systems like Polybase, or Shark given the size of the tested data

The Promises?

● Declarative, high-level language

● “In situ” data analysis

● Richer set of primitives than MapReduce

● Treat UDFs as first-class citizens

● Automated parallelization and optimization

● Support for iterative programs

● Includes external memory query processing algorithms to support arbitrarily long programs

Ongoing and Future Work

● One-pass optimizer unifying PACT and Sopremo layers

● Strengthening fault-tolerant capabilities

● Improving scalability and efficiency of Nephele

● Design, compilation and optimization of higher-level languages

● Scalable, efficient, and adaptive algorithms and architecture

● “Stateful” systems for fast ingestion and low-latency data analysis

Discussions and Questions

● Declarativity - expressiveness tradeoff

○ More declarative -> less expressive, but easier to optimize

● Run-time optimization is the way to go?

○ Skewed data distribution may become a bottleneck for such systems

○ Detecting performance bottleneck on the fly

QED

THANKS!