processing a unified model for batch and streaming data dataflow/apache...

37
Dataflow/Apache Beam A Unified Model for Batch and Streaming Data Processing Eugene Kirpichov, Google STREAM 2016

Upload: others

Post on 28-May-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Dataflow/Apache BeamA Unified Model for Batch and Streaming Data ProcessingEugene Kirpichov, Google

STREAM 2016

Page 2: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Google’s Data Processing Story

Philosophy of the Beam programming model

Agenda

1

2

Apache Beam project3

Page 3: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Google’s Data Processing Story

1

Page 4: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Data Processing @ Google

Page 5: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

(distributed output dataset)

MapReduce: SELECT + GROUP BY

(distributed input dataset)

Shuffle (GROUP BY)

Map (SELECT)

Reduce (SELECT)

Page 6: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Data Processing @ Google

Page 7: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

FlumeJava Pipelines

• A Pipeline represents a graph of data processing transformations

• PCollections flow through the pipeline

• Optimized and executed as a unit for efficiency

Page 8: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

// Collection of raw eventsPCollection<SensorEvent> raw = ...;// Element-wise extract location/temperature pairsPCollection<KV<String, Double>> input = raw.apply(ParDo.of(new ParseFn()))// Composite transformation containing an aggregationPCollection<KV<String, Double>> output = input .apply(Mean.<Double>perKey());// Write outputoutput.apply(BigtableIO.Write.to(...));

Example: Computing mean temperature

What Where When How

Page 9: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

So, people used FJ to process data...

Page 10: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

...big data...

Page 11: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

...really, really big...

TuesdayWednesday

Thursday

Page 12: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Batch failure mode #1

Latency

Page 13: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

MapReduce

TuesdayWednesday

Batch failure mode #2: Sessions

Jose

Lisa

Ingo

Asha

Cheryl

Ari

WednesdayTuesday

Page 14: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Continuous & Unbounded

9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00

Page 15: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Historicalevents

Exacthistorical

model

Periodic batchprocessing

Approximate real-time

model

Stream processing

systemContinuous

updates

State of the art until recently: Lambda Architecture

Page 16: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Data Processing @ Google

Page 17: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

MillWheel: Deterministic, low-latency streaming

● Framework for building low-latency data-processing applications

● User provides a DAG of computations to be performed

● System manages state and persistent flow of elements

Page 18: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Streaming or Batch?

1 + 1 = 2Correctness Latency

Why not both?

Page 19: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 20: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Where in event time?

What Where When How

● Windowing divides data into event-time-based finite chunks.

● Required when doing aggregations over unbounded data.

Page 21: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

What Where When How

When in Processing Time?

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.Pr

oces

sing

Tim

e

Event Time

Watermark

Page 22: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

PCollection<KV<String, Integer>> output = input .apply(Window.into(Sessions.withGapDuration(Minutes(1))) .trigger(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetracting()) .apply(new Sum());

What Where When How

How do refinements relate?

Page 23: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Data Processing @ Google

Page 24: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Cloud DataflowA fully-managed cloud service and programming model for batch and streaming big data processing.

Google Cloud Dataflow

Page 25: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and
Page 26: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Dataflow SDK

● Portable API to construct and run a pipeline.● Available in Java and Python (alpha)● Pipelines can run…

○ On your development machine○ On the Dataflow Service on Google Cloud Platform ○ On third party environments like Spark or Flink.

Page 27: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Dataflow ⇒ Apache Beam3

Page 28: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))

.apply(FlatMapElements.via(

word → Arrays.asList(word.split("[^a-zA-Z']+"))))

.apply(Filter.byPredicate(word → !word.isEmpty()))

.apply(Count.perElement())

.apply(MapElements.via(

count → count.getKey() + ": " + count.getValue())

.apply(TextIO.Write.to("gs://.../..."));

p.run();

Page 29: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Apache Beam ecosystem

End-user's pipeline

Libraries: transforms, sources/sinks etc.

Language-specific SDK

Beam model (ParDo, GBK, Windowing…)

Runner

Execution environment

Java ...Python

Page 30: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Apache Beam ecosystem

End-user's pipeline

Libraries: transforms, sources/sinks etc.

Language-specific SDK

Beam model (ParDo, GBK, Windowing…)

Runner

Execution environment

Java ...Python

Page 31: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Apache Beam Roadmap

02/01/2016Enter Apache

Incubator

Early 2016Design for use cases,

begin refactoring

Mid 2016Slight chaos

Late 2016Multiple runners execute Beam

pipelines

02/25/20161st commit to ASF repository

Page 32: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Runner capability matrix

Page 33: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

• Multiple SDKswith shared pipeline representation

• Language-agnostic runnersimplementing the model

• Fn Runnersrun language-specific code

Technical Vision: Still more modular

Beam Model: Fn Runners

Runner A Runner B

Beam Model: Pipeline Construction

OtherLanguagesBeam Java Beam

Python

Execution Execution

Cloud Dataflow

Execution

Page 34: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Recap: Timeline of ideas2004 MapReduce (SELECT / GROUP BY)

Library > DSLAbstract away fault tolerance & distribution

2010 FlumeJava: High-level API (typed DAG)

2013 MillWheel: Deterministic stream processing

2015 Dataflow: Unified batch/streaming modelWindowing, Triggers, Retractions

2016 Beam: Portable programming modelLanguage-agnostic runners

Page 36: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Google confidential │ Do not distribute

Thank you

Page 37: Processing A Unified Model for Batch and Streaming Data Dataflow/Apache Beamstreamingsystems.org/Slides/Eugene Kirpichov - STREAM... · 2016-04-05 · A Unified Model for Batch and

Google confidential │ Do not distribute