Apache Beam (incubating)

Apache Beam (incubating) Kenneth Knowles [email protected] @KennKnowles Apache Apex Meetup, 2016-06-27 https://goo.gl/LTLjKt

Upload: datatorrent

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Apache Beam (incubating)

Apache Beam (incubating)

Kenneth Knowles [email protected] @KennKnowles

Apache Apex Meetup, 2016-06-27

https://goo.gl/LTLjKt

Page 2: Apache Beam (incubating)

Agenda

1. Motivation

2. Beam Model

3. Beam Project / Technical Vision

Page 3: Apache Beam (incubating)

1. Motivation

Page 4: Apache Beam (incubating)

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

Page 5: Apache Beam (incubating)

Unbounded, delayed, out of order

[Figure: timeline from 1:00 to 14:00, with three elements stamped 8:00]

Page 6: Apache Beam (incubating)

Incoming!

Score per user?

Page 7: Apache Beam (incubating)

Organizing the stream

[Figure: three elements stamped 8:00]

Page 8: Apache Beam (incubating)

Data Processing Tradeoffs

Completeness

Latency

Cost ($$$)

Page 9: Apache Beam (incubating)

What is important for your application?

Completeness Low Latency Low Cost

Important

Not Important

$$$

Page 10: Apache Beam (incubating)

Monthly Billing

Completeness Low Latency Low Cost

Important

Not Important

$$$

Page 11: Apache Beam (incubating)

Billing estimate

Completeness Low Latency Low Cost

Important

Not Important

$$$

Page 12: Apache Beam (incubating)

Abuse Detection

Completeness Low Latency Low Cost

Important

Not Important

$$$

Page 13: Apache Beam (incubating)

2. The Beam Model

Page 14: Apache Beam (incubating)

The Beam Model

Pipeline

PTransform

PCollection

Page 15: Apache Beam (incubating)

The Beam Vision (for users)

Sum Per Key

input.apply(Sum.integersPerKey())

Java

input | Sum.PerKey()

Python

Apache Flink

Apache Spark

Cloud Dataflow

⋮ ⋮

Apache Apex

Apache Gearpump (incubating)

Page 16: Apache Beam (incubating)

What your (Java) Code Looks Like

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));

p.run();

Page 17: Apache Beam (incubating)

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

Page 18: Apache Beam (incubating)

The Beam Model: Asking the Right Questions

What are you computing? (Aggregations, transformations, ...)

Where in event time?

When in processing time are results produced?

How do refinements relate?

Page 19: Apache Beam (incubating)

The Beam Model: What are you computing?

Sum Per User

Page 20: Apache Beam (incubating)

The Beam Model: What are you computing?

Sum Per Key

input.apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Java

input | Sum.PerKey() | Write(BigQuerySink(...))

Python

http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
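As a rough illustration of what "Sum Per Key" computes, here is a plain-Python sketch that groups (key, value) pairs and adds up the values for each key. It does not use the Beam SDK; `sum_per_key` and the sample scores are hypothetical names and data chosen only for this example.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Return a dict mapping each key to the sum of its values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical (user, score) events.
    scores = [("alice", 3), ("bob", 4), ("alice", 5)]
    print(sum_per_key(scores))
```

In Beam the same aggregation is expressed once and left to the runner to parallelize; the sketch only shows the result being asked for.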

Page 21: Apache Beam (incubating)

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time? (Event time windowing)

When in processing time are results produced?

How do refinements relate?

Page 22: Apache Beam (incubating)

The Beam Model: Where in Event Time?

[Figure: three elements stamped 8:00]

Page 23: Apache Beam (incubating)

Processing Time vs Event Time

Event Time = Processing Time ??

Page 24: Apache Beam (incubating)

Processing Time vs Event Time

[Figure axis: Processing Time]

Page 25: Apache Beam (incubating)

Processing Time vs Event Time

[Figure axis: Processing Time]

Realtime

This is not possible

Page 26: Apache Beam (incubating)

Processing Time vs Event Time

[Figure axis: Processing Time]

Processing Delay

Page 27: Apache Beam (incubating)

Processing Time vs Event Time

Very delayed

[Figure: Processing Time plotted against Event Time]

Page 28: Apache Beam (incubating)

Processing Time windows (probably are not what you want)

[Figure: Processing Time plotted against Event Time]

Page 29: Apache Beam (incubating)

Event Time Windows

[Figure: Processing Time plotted against Event Time]

Page 30: Apache Beam (incubating)

Event Time Windows (implementing processing time windows)

[Figure: Processing Time plotted against Event Time]

Just throw away your data's timestamps and replace them with "now()"
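The timestamp-replacement trick can be sketched in plain Python. This is not Beam code: the helper names, the one-hour window size, and the sample element are assumptions made for illustration. The same element lands in a different window depending on whether its own timestamp or its arrival time is used.

```python
import time

WINDOW_SECONDS = 3600  # one-hour fixed windows, an illustrative choice


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the fixed window containing `timestamp` (seconds)."""
    return timestamp - (timestamp % size)


def event_time_window(element):
    """Window by the timestamp carried in the data itself."""
    return window_start(element["event_time"])


def processing_time_window(element, now=time.time):
    """Throw away the element's timestamp and stamp it with now() instead."""
    restamped = dict(element, event_time=now())
    return window_start(restamped["event_time"])


if __name__ == "__main__":
    # Hypothetical element: happened at 8:00, arrived at 13:30.
    element = {"user": "alice", "score": 5, "event_time": 8 * 3600}
    arrival = lambda: 13.5 * 3600
    print(event_time_window(element))                    # start of the 8:00 window
    print(processing_time_window(element, now=arrival))  # start of the 13:00 window
```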

Page 31: Apache Beam (incubating)

The Beam Model: Where in Event Time?

Window Into

Sum Per Key

input | WindowInto(FixedWindows(3600)) | Sum.PerKey() | Write(BigQuerySink(...))

Python

input.apply(Window.into(FixedWindows.of(Duration.standardHours(1))))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...))

Java
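To make the effect of windowing on the aggregation concrete, the sketch below sums per key within one-hour event-time windows using only the Python standard library. It is a simplified stand-in, not the Beam API; `fixed_window`, `windowed_sum_per_key`, and the sample events are hypothetical.

```python
from collections import defaultdict


def fixed_window(timestamp, size):
    """Return the (start, end) of the fixed window containing `timestamp`."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def windowed_sum_per_key(events, size=3600):
    """Sum values per (key, window); events are (key, value, event_time_seconds)."""
    totals = defaultdict(int)
    for key, value, event_time in events:
        totals[(key, fixed_window(event_time, size))] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical scores; two of alice's fall in the 8:00 window, one in 9:00.
    events = [
        ("alice", 3, 8 * 3600 + 10),
        ("alice", 4, 8 * 3600 + 3000),
        ("alice", 5, 9 * 3600 + 1),
        ("bob", 2, 8 * 3600),
    ]
    for (key, window), total in sorted(windowed_sum_per_key(events).items()):
        print(key, window, total)
```

The key point is that the grouping key becomes (key, window) rather than key alone, so an unbounded input still yields finite per-window results.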

Page 32: Apache Beam (incubating)

So that's what and where...

Page 33: Apache Beam (incubating)

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced? (Watermarks & Triggers)

How do refinements relate?

Page 34: Apache Beam (incubating)

Event time windows

[Figure: Processing Time plotted against Event Time]

Page 35: Apache Beam (incubating)

Fixed cutoff (we can do better)

[Figure: Processing Time plotted against Event Time]

Allowed delay

Concurrent windows

Page 36: Apache Beam (incubating)

Perfect watermark

[Figure: Processing Time plotted against Event Time]

Check out Slava's slides from Strata London 2016 talk on watermarks: https://goo.gl/K4FnqQ

Page 37: Apache Beam (incubating)

Heuristic Watermark

[Figure: Processing Time plotted against Event Time]

Page 38: Apache Beam (incubating)

Heuristic Watermark

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 39: Apache Beam (incubating)

Heuristic Watermark

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 40: Apache Beam (incubating)

Heuristic Watermark

[Figure: Processing Time plotted against Event Time]

Current processing time

Late data
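One very simple heuristic, assumed here purely for illustration, is to take the largest event time seen so far and subtract a fixed skew. Real runners derive watermarks from their sources in more sophisticated ways; the class name and the skew value below are hypothetical. The sketch shows how an element that arrives behind such an estimate is classified as late data.

```python
class HeuristicWatermark:
    """Watermark estimate: max event time seen so far minus a fixed allowed skew."""

    def __init__(self, allowed_skew):
        self.allowed_skew = allowed_skew
        self.max_event_time = None

    def current(self):
        """Return the current watermark, or None before any element is seen."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_skew

    def observe(self, event_time):
        """Record an element; return True if it arrived behind the watermark."""
        watermark = self.current()
        late = watermark is not None and event_time < watermark
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return late


if __name__ == "__main__":
    wm = HeuristicWatermark(allowed_skew=60)
    for event_time in [1000, 950, 1200, 1000]:
        print(event_time, "late" if wm.observe(event_time) else "on time", wm.current())
```

Because the estimate is a guess, it can advance past elements that have not arrived yet, which is exactly how late data arises.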

Page 41: Apache Beam (incubating)

Watermarks measure completeness

? Running Total

✔ Monthly billing

? Abuse Detection

Page 42: Apache Beam (incubating)

The Beam Model: When in Processing Time?

Window Into

Sum Per Key

Trigger after end of window

input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))

Java

input | WindowInto(FixedWindows(3600), trigger=AfterWatermark()) | Sum.PerKey() | Write(BigQuerySink(...))

Python

Page 43: Apache Beam (incubating)

AfterWatermark.pastEndOfWindow()

[Figure: Processing Time plotted against Event Time]

Page 44: Apache Beam (incubating)

AfterWatermark.pastEndOfWindow()

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 45: Apache Beam (incubating)

AfterWatermark.pastEndOfWindow()

[Figure: Processing Time plotted against Event Time]

Current processing time

Late data

Page 46: Apache Beam (incubating)

AfterWatermark.pastEndOfWindow()

[Figure: Processing Time plotted against Event Time]

High completeness

Potentially high latency

Low cost
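The behaviour of a watermark trigger can be approximated with a small plain-Python simulation. This is a sketch under stated assumptions, not Beam's implementation: the stream encoding, the function name, and the choice to set late elements aside (as if no lateness were allowed) are all made up for the example. A single window emits one pane when the watermark passes its end.

```python
def run_after_watermark(window_end, stream):
    """Simulate one window that fires once when the watermark passes its end.

    `stream` is a list of ("element", event_time, value) or ("watermark", time)
    tuples in arrival order. Returns (panes, late_values).
    """
    buffered = 0
    panes = []
    late = []
    closed = False
    for kind, *payload in stream:
        if kind == "watermark":
            (watermark,) = payload
            if not closed and watermark >= window_end:
                panes.append(buffered)  # single on-time pane
                closed = True
        else:
            event_time, value = payload
            if event_time >= window_end:
                continue  # belongs to a later window, ignored here
            if closed:
                late.append(value)  # arrived after the pane was emitted
            else:
                buffered += value
    return panes, late


if __name__ == "__main__":
    stream = [
        ("element", 10, 3),
        ("watermark", 30),
        ("element", 40, 4),
        ("watermark", 60),
        ("element", 50, 5),  # event time inside the window, but arrives late
    ]
    print(run_after_watermark(60, stream))
```

Nothing is emitted until the watermark says the window is complete, which is why this trigger trades latency for completeness and a single output per window.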

Page 47: Apache Beam (incubating)

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time plotted against Event Time]

Page 48: Apache Beam (incubating)

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 49: Apache Beam (incubating)

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 50: Apache Beam (incubating)

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 51: Apache Beam (incubating)

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time plotted against Event Time]

Current processing time

Page 52: Apache Beam (incubating)

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time plotted against Event Time]

Low completeness

Low latency

Cost driven by input
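A count-based repeating trigger can be sketched in a few lines of plain Python. The function name is hypothetical, and the sketch assumes each pane reports only the elements that arrived since the previous firing; the slide itself does not state an accumulation mode.

```python
def fire_every_n(values, n=2):
    """Emit a pane (the sum of its elements) each time `n` elements have arrived.

    Returns (panes, pending), where `pending` holds elements still waiting
    for enough company to trigger another firing.
    """
    panes = []
    pending = []
    for value in values:
        pending.append(value)
        if len(pending) >= n:
            panes.append(sum(pending))
            pending = []
    return panes, pending


if __name__ == "__main__":
    print(fire_every_n([5, 9, 7, 8, 3], n=2))
```

Output follows the arrival of data rather than the watermark, so results appear quickly but say nothing about whether a window is complete, and the number of firings grows with the input.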

Page 53: Apache Beam (incubating)

Build a finely tuned trigger for your use case

AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1))

Bill at end of month

Near real-time estimates

Immediate corrections
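The three kinds of firing can be imitated with a small plain-Python simulation. It is only a sketch: the stream encoding and function name are hypothetical, running totals are assumed to accumulate across panes, and the early timer is checked only when the next item arrives rather than at the exact minute.

```python
def simulate_early_on_time_late(window_end, stream, early_delay=60):
    """Simulate early, on-time, and late firings for a single window.

    `stream` items, ordered by processing time, are either
    (processing_time, "element", event_time, value) or
    (processing_time, "watermark", watermark).
    Returns a list of (label, running_total) panes.
    """
    total = 0
    panes = []
    first_in_pane = None  # processing time of first element since last firing
    on_time_fired = False
    for item in stream:
        now, kind = item[0], item[1]
        # Early firing: a fixed delay after the first element in the pane.
        if (not on_time_fired and first_in_pane is not None
                and now >= first_in_pane + early_delay):
            panes.append(("early", total))
            first_in_pane = None
        if kind == "watermark":
            if not on_time_fired and item[2] >= window_end:
                panes.append(("on-time", total))
                on_time_fired = True
            continue
        event_time, value = item[2], item[3]
        if event_time >= window_end:
            continue  # belongs to a later window, ignored here
        total += value
        if on_time_fired:
            panes.append(("late", total))  # immediate correction
        elif first_in_pane is None:
            first_in_pane = now
    return panes


if __name__ == "__main__":
    stream = [
        (0, "element", 10, 3),
        (30, "element", 20, 4),
        (70, "element", 30, 5),
        (100, "watermark", 60),
        (130, "element", 40, 1),
    ]
    print(simulate_early_on_time_late(60, stream))
```

Read against the billing example: the early pane is the near real-time estimate, the on-time pane is the bill, and each late pane is a correction.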

Page 54: Apache Beam (incubating)

[Figure: Processing Time plotted against Event Time]

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Page 55: Apache Beam (incubating)

[Figure: Processing Time plotted against Event Time]

Current processing time

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Page 56: Apache Beam (incubating)

[Figure: Processing Time plotted against Event Time]

Current processing time

Low completeness

Low latency

Low cost, driven by time

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Page 57: Apache Beam (incubating)

[Figure: Processing Time plotted against Event Time]

Current processing time

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Page 58: Apache Beam (incubating)

[Figure: Processing Time plotted against Event Time]

Current processing time

Late output

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Page 59: Apache Beam (incubating)

[Figure: Processing Time plotted against Event Time]

Late output

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Page 60: Apache Beam (incubating)

Trigger Catalogue

Basic Triggers

AfterEndOfWindow()

AfterCount(n)

AfterProcessingTimeDelay(Δ)

Composite Triggers

AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)

AfterAny(A, B)

AfterAll(A, B)

Repeat(A)

Sequence(A, B)

Page 61: Apache Beam (incubating)

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate? (Accumulation Mode)

Page 62: Apache Beam (incubating)

The Beam Model: How do refinements relate?

input
    .apply(Window.into(...).triggering(...).discardingFiredPanes())
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))

[Figure: discarding vs accumulating; values shown are 1, 3, 7, 4, 10, 5 under "discarding" and 1, 3, 7, 4, 10, 15 under "accumulating"]
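The difference between the two accumulation modes can be shown with a short plain-Python sketch. The function name and the input panes are illustrative assumptions, not values taken from the slide: in discarding mode each pane reports only what arrived since the last firing, while in accumulating mode each pane reports the running total.

```python
def emit_panes(pane_inputs, accumulating):
    """Return the value emitted at each firing for a sum aggregation.

    `pane_inputs` is a list of lists: the elements that arrived between
    consecutive firings of the trigger.
    """
    outputs = []
    running = 0
    for pane in pane_inputs:
        pane_sum = sum(pane)
        running += pane_sum
        outputs.append(running if accumulating else pane_sum)
    return outputs


if __name__ == "__main__":
    panes = [[1, 2], [3, 4], [5]]  # hypothetical arrivals between firings
    print("discarding:  ", emit_panes(panes, accumulating=False))
    print("accumulating:", emit_panes(panes, accumulating=True))
```

A downstream consumer that adds panes together wants discarding output; one that overwrites the previous value wants accumulating output.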

Page 63: Apache Beam (incubating)

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

Page 64: Apache Beam (incubating)

3. Beam Project / Technical Vision

Page 65: Apache Beam (incubating)

The Beam Vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to run Beam pipelines.

[Diagram labels]

Beam Java, Beam Python, Other Languages

Beam Runner API: Build and submit a pipeline

Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)

Beam Fn API: Invoke user-definable functions

Execution

Page 66: Apache Beam (incubating)

Project Setup (vision meets code)

GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, dataArtisans/flink-dataflow

apache/incubator-beam

SDKs

Runners: Direct (on your laptop), Google Cloud Dataflow, Flink, Spark. In pull request: Apex, Gearpump

I/O Connectors: HDFS, Kafka, BigQuery, Google Cloud Storage, Pubsub, Bigtable, Datastore. In pull request: JMS, Cassandra. Proposed: Sqoop, Parquet, JDBC, SocketStream, ...

Examples

Integration tests

sharing

Page 67: Apache Beam (incubating)

Committers from Google, Data Artisans, Cloudera, Talend, Paypal
● ~40 commits/week
● Rigorous code review for every commit

Contributors [with GitHub badges] from: Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>

● Improvements to existing I/O connectors
● Improvements to Spark runner
● Utility classes for users
● Documentation fixes
● Bug diagnoses
● New I/O connectors
● Gearpump runner PoC
● Apex runner PoC!

… and it has been awesome

apache/incubator-beam

Page 68: Apache Beam (incubating)

Java SDK: Transition from Dataflow

Dataflow Java 1.x

Apache Beam Java 0.x

Apache Beam Java 2.x

Bug Fix

Feature

Breaking Change

We are here

Feb 2016

Late 2016

Page 69: Apache Beam (incubating)

Understanding: Capability Matrix

http://beam.incubator.apache.org/capability-matrix/

Page 70: Apache Beam (incubating)

Why Apache Beam?

Unified - One model handles batch and streaming use cases.

Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.

Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.

Page 71: Apache Beam (incubating)

Why Apache Beam?

http://data-artisans.com/why-apache-beam/

"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."

- Kostas Tzoumas (Data Artisans)

https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective

"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."

- Tyler Akidau (Google)

Page 72: Apache Beam (incubating)

Creating an Apache Beam Community

Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.

Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.

We love contributions. Join us!

Page 74: Apache Beam (incubating)

END