the stratosphere big data analytics platform

55
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44

Upload: amir-payberah

Post on 07-Aug-2015

62 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: The Stratosphere Big Data Analytics Platform

The Stratosphere Big Data Analytics Platform

Amir H. PayberahSwedish Institute of Computer Science

[email protected] 4, 2014

Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44

Page 2: The Stratosphere Big Data Analytics Platform

Big Data

small data big data

Amir H. Payberah (SICS) Stratosphere June 4, 2014 2 / 44

Page 3: The Stratosphere Big Data Analytics Platform

I Big Data refers to datasets and flows largeenough that has outpaced our capability tostore, process, analyze, and understand.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 3 / 44

Page 4: The Stratosphere Big Data Analytics Platform

Where DoesBig Data Come From?

Amir H. Payberah (SICS) Stratosphere June 4, 2014 4 / 44

Page 5: The Stratosphere Big Data Analytics Platform

Big Data Market Driving Factors

The number of web pages indexed by Google, which were aroundone million in 1998, have exceeded one trillion in 2008, and itsexpansion is accelerated by appearance of the social networks.∗

∗“Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]

Amir H. Payberah (SICS) Stratosphere June 4, 2014 5 / 44

Page 6: The Stratosphere Big Data Analytics Platform

Big Data Market Driving Factors

The amount of mobile data traffic is expected to grow to 10.8Exabyte per month by 2016.∗

∗“Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]

Amir H. Payberah (SICS) Stratosphere June 4, 2014 6 / 44

Page 7: The Stratosphere Big Data Analytics Platform

Big Data Market Driving Factors

More than 65 billion devices were connected to the Internet by2010, and this number will go up to 230 billion by 2020.∗

∗“The Internet of Things Is Coming” [John Mahoney et al., 2013]

Amir H. Payberah (SICS) Stratosphere June 4, 2014 7 / 44

Page 8: The Stratosphere Big Data Analytics Platform

Big Data Market Driving Factors

Many companies are moving towards using Cloud services toaccess Big Data analytical tools.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 8 / 44

Page 9: The Stratosphere Big Data Analytics Platform

Big Data Market Driving Factors

Open source communities

Amir H. Payberah (SICS) Stratosphere June 4, 2014 9 / 44

Page 10: The Stratosphere Big Data Analytics Platform

Amir H. Payberah (SICS) Stratosphere June 4, 2014 10 / 44

Page 11: The Stratosphere Big Data Analytics Platform

Big Data Analytics Stack

Amir H. Payberah (SICS) Stratosphere June 4, 2014 11 / 44

Page 12: The Stratosphere Big Data Analytics Platform

Hadoop Big Data Analytics Stack

Amir H. Payberah (SICS) Stratosphere June 4, 2014 12 / 44

Page 13: The Stratosphere Big Data Analytics Platform

Spark Big Data Analytics Stack

Amir H. Payberah (SICS) Stratosphere June 4, 2014 13 / 44

Page 14: The Stratosphere Big Data Analytics Platform

Stratosphere Big Data Analytics Stack

* Under development

** Stratosphere community is thinking to integrate with Mahout

Amir H. Payberah (SICS) Stratosphere June 4, 2014 14 / 44

Page 15: The Stratosphere Big Data Analytics Platform

Amir H. Payberah (SICS) Stratosphere June 4, 2014 15 / 44

Page 16: The Stratosphere Big Data Analytics Platform

What is Stratosphere?

I An efficient distributed general-purpose data analysis platform.

I Built on top of HDFS and YARN.

I Focusing on ease of programming.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 16 / 44

Page 17: The Stratosphere Big Data Analytics Platform

Project Status

I Research project started in 2009 by TU Berlin, HU Berlin, and HPI

I An Apache Incubator project

I 25 contributors

I v0.5 is released

Amir H. Payberah (SICS) Stratosphere June 4, 2014 17 / 44

Page 18: The Stratosphere Big Data Analytics Platform

Motivation

I MapReduce programming model has not been designed for complexoperations, e.g., data mining.

I Very expensive, i.e., always goes to disk and HDFS.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 18 / 44

Page 19: The Stratosphere Big Data Analytics Platform

Solution

I Extends MapReduce with more operators.

I Support for advanced data flow graphs.

I In-memory and out-of-core processing.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 19 / 44

Page 20: The Stratosphere Big Data Analytics Platform

StratosphereProgramming Model

Amir H. Payberah (SICS) Stratosphere June 4, 2014 20 / 44

Page 21: The Stratosphere Big Data Analytics Platform

Stratosphere Programming Model

I A program is expressed as an arbitrary data flow consisting oftransformations, sources and sinks.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 21 / 44

Page 22: The Stratosphere Big Data Analytics Platform

Transformations

I Higher-order functions that execute user-defined functions in parallelon the input data.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 22 / 44

Page 23: The Stratosphere Big Data Analytics Platform

Transformations

I Higher-order functions that execute user-defined functions in parallelon the input data.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 22 / 44

Page 24: The Stratosphere Big Data Analytics Platform

Transformations: Map

I All pairs are independently processed.

val input: DataSet[(Int, String)] = ...

val mapped = input.flatMap { case (value, words) => words.split(" ") }

Amir H. Payberah (SICS) Stratosphere June 4, 2014 23 / 44

Page 25: The Stratosphere Big Data Analytics Platform

Transformations: Map

I All pairs are independently processed.

val input: DataSet[(Int, String)] = ...

val mapped = input.flatMap { case (value, words) => words.split(" ") }

Amir H. Payberah (SICS) Stratosphere June 4, 2014 23 / 44

Page 26: The Stratosphere Big Data Analytics Platform

Transformations: Reduce

I Pairs with identical key are grouped.

I Groups are independently processed.

val input: DataSet[(String, Int)] = ...

val reduced = input.groupBy(_._1)

.reduceGroup(words => words.minBy(_._2))

Amir H. Payberah (SICS) Stratosphere June 4, 2014 24 / 44

Page 27: The Stratosphere Big Data Analytics Platform

Transformations: Reduce

I Pairs with identical key are grouped.

I Groups are independently processed.

val input: DataSet[(String, Int)] = ...

val reduced = input.groupBy(_._1)

.reduceGroup(words => words.minBy(_._2))

Amir H. Payberah (SICS) Stratosphere June 4, 2014 24 / 44

Page 28: The Stratosphere Big Data Analytics Platform

Transformations: Join

I Performs an equi-join on the key.

I Join candidates are independently processed.

val counts: DataSet[(String, Int)] = ...

val names: DataSet[(Int, String)] = ...

val join = counts.join(names)

.where(_._2)

.isEqualTo(_._1)

.map { case (l, r) => l._1 + "and" + r._2 }

Amir H. Payberah (SICS) Stratosphere June 4, 2014 25 / 44

Page 29: The Stratosphere Big Data Analytics Platform

Transformations: Join

I Performs an equi-join on the key.

I Join candidates are independently processed.

val counts: DataSet[(String, Int)] = ...

val names: DataSet[(Int, String)] = ...

val join = counts.join(names)

.where(_._2)

.isEqualTo(_._1)

.map { case (l, r) => l._1 + "and" + r._2 }

Amir H. Payberah (SICS) Stratosphere June 4, 2014 25 / 44

Page 30: The Stratosphere Big Data Analytics Platform

Transformations: and More ...

Groups each input on key

Builds a Cartesian Product

Merges two or more input data sets, keeping duplicates

Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

Page 31: The Stratosphere Big Data Analytics Platform

Transformations: and More ...

Groups each input on key Builds a Cartesian Product

Merges two or more input data sets, keeping duplicates

Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

Page 32: The Stratosphere Big Data Analytics Platform

Transformations: and More ...

Groups each input on key Builds a Cartesian Product

Merges two or more input data sets, keeping duplicates

Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

Page 33: The Stratosphere Big Data Analytics Platform

Example: Word Count

val input = TextFile(textInput)

val words = input.flatMap { line => line.split(" ")

map(word => (word, 1)) }

val counts = words.groupBy(_._1)

.reduce((w1, w2) => (w1._1, w1._2 + w2._2))

val output = counts.write(wordsOutput, CsvOutputFormat())

Amir H. Payberah (SICS) Stratosphere June 4, 2014 27 / 44

Page 34: The Stratosphere Big Data Analytics Platform

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Page 35: The Stratosphere Big Data Analytics Platform

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(1) Partitioned hash-join- Hash/Sort tables into N buckets on join key

- Send buckets to parallel machines- Join every pair of buckets

Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Page 36: The Stratosphere Big Data Analytics Platform

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(2) Broadcast hash-join- Broadcast the smaller table

- Avoids the network overhead

Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Page 37: The Stratosphere Big Data Analytics Platform

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(3) Grouping/Aggregation reuses the partitioning from step (1) - no shuffle

Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Page 38: The Stratosphere Big Data Analytics Platform

What about Iteration?

Amir H. Payberah (SICS) Stratosphere June 4, 2014 29 / 44

Page 39: The Stratosphere Big Data Analytics Platform

Iterative Algorithms

I Algorithms that need iterations:

• Clustering, e.g., K-Means

• Gradient descent, e.g., Logistic Regression

• Graph algorithms, e.g., PageRank

• ...

Amir H. Payberah (SICS) Stratosphere June 4, 2014 30 / 44

Page 40: The Stratosphere Big Data Analytics Platform

Iteration

I Loop over the working data multiple times.

I Iterations with hadoop• Slow: using HDFS• Everything has to be read over and over again

Amir H. Payberah (SICS) Stratosphere June 4, 2014 31 / 44

Page 41: The Stratosphere Big Data Analytics Platform

Iteration

I Loop over the working data multiple times.

I Iterations with hadoop• Slow: using HDFS• Everything has to be read over and over again

Amir H. Payberah (SICS) Stratosphere June 4, 2014 31 / 44

Page 42: The Stratosphere Big Data Analytics Platform

Types of Iterations

I Two types of iteration at stratosphere:• Bulk iteration• Delta iteration

I Both operators repeatedly invoke the step function on the currentiteration state until a certain termination condition is reached.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 32 / 44

Page 43: The Stratosphere Big Data Analytics Platform

Bulk Iteration

I In each iteration, the step function consumes the entire input, andcomputes the next version of the partial solution.

I A new version of the entire model in each iteration.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 33 / 44

Page 44: The Stratosphere Big Data Analytics Platform

Bulk Iteration - Example

// 1st 2nd 10th

map(1) -> 2 map(2) -> 3 ... map(10) -> 11

map(2) -> 3 map(3) -> 4 ... map(11) -> 12

map(3) -> 4 map(4) -> 5 ... map(12) -> 13

map(4) -> 5 map(5) -> 6 ... map(13) -> 14

map(5) -> 6 map(6) -> 7 ... map(14) -> 15

val input: DataSet[Int] = ...

def step(partial: DataSet[Int]) = partial.map(a => a + 1)

val numIter = 10;

val iter = input.iterate(numIter, step)

Amir H. Payberah (SICS) Stratosphere June 4, 2014 34 / 44

Page 45: The Stratosphere Big Data Analytics Platform

Bulk Iteration - Example

// 1st 2nd 10th

map(1) -> 2 map(2) -> 3 ... map(10) -> 11

map(2) -> 3 map(3) -> 4 ... map(11) -> 12

map(3) -> 4 map(4) -> 5 ... map(12) -> 13

map(4) -> 5 map(5) -> 6 ... map(13) -> 14

map(5) -> 6 map(6) -> 7 ... map(14) -> 15

val input: DataSet[Int] = ...

def step(partial: DataSet[Int]) = partial.map(a => a + 1)

val numIter = 10;

val iter = input.iterate(numIter, step)

Amir H. Payberah (SICS) Stratosphere June 4, 2014 34 / 44

Page 46: The Stratosphere Big Data Analytics Platform

Delta Iteration

I Only parts of the model change in each iteration.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 35 / 44

Page 47: The Stratosphere Big Data Analytics Platform

Delta Iteration - Example

Amir H. Payberah (SICS) Stratosphere June 4, 2014 36 / 44

Page 48: The Stratosphere Big Data Analytics Platform

Delta Iteration vs. Bulk Iteration

I Computations performed in each iteration for connected communi-ties of a social graph.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 37 / 44

Page 49: The Stratosphere Big Data Analytics Platform

StratosphereExecution Engine

Amir H. Payberah (SICS) Stratosphere June 4, 2014 38 / 44

Page 50: The Stratosphere Big Data Analytics Platform

Stratosphere Architecture

I Master-worker model

I Job Manager: handles job submissionand scheduling.

I Task Manager: executes tasks receivedfrom JM.

I All operators start in-memory andgradually go out-of-core.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 39 / 44

Page 51: The Stratosphere Big Data Analytics Platform

Job Graph and Execution Graph

I Jobs are expressed as data flows.

I Job graphs are transformed into the execution graph.

I Execution graphs consist information to schedule and execute a job.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 40 / 44

Page 52: The Stratosphere Big Data Analytics Platform

Channels

I Channels transfer serialized records in buffers.

I Pipelined• Online transfer to receiver.

I Materialized• Sender writes result to disk, and

afterwards it transferred to receiver.• Used in recovery.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 41 / 44

Page 53: The Stratosphere Big Data Analytics Platform

Fault Tolerance

I Task failure compensated by backup task deployment.

I Track the execution graph back to the latest available result, andrecompute the failed tasks.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 42 / 44

Page 54: The Stratosphere Big Data Analytics Platform

Amir H. Payberah (SICS) Stratosphere June 4, 2014 43 / 44

Page 55: The Stratosphere Big Data Analytics Platform

http://stratosphere.eu

Amir H. Payberah (SICS) Stratosphere June 4, 2014 44 / 44