stephan ewen - stream processing as a foundational paradigm and apache flink's approach to it

Stream Processingand Apache Flink®'s approach to it@StephanEwen

Apache Flink PMCCTO @ data Artisans

About meDatabase systems, TU Berlin, IBM, MicrosoftCo-bootstrapped Stratosphere project's runtimeApache Flink created from a (partial) Stratosphere forkApache Flink community founded data ArtisansNow Flink PMC and CTO at data Artisans

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Hint: you already have streaming data

3

Streaming Subsumes Batch

4

2016-3-112:00 am

2016-3-11:00 am

2016-3-12:00 am

2016-3-1111:00pm

2016-3-1212:00am

2016-3-121:00am

2016-3-1110:00pm

2016-3-122:00am

2016-3-123:00am…

partition

partition


5

2016-3-112:00 am

2016-3-11:00 am

2016-3-12:00 am

2016-3-1111:00pm

2016-3-1212:00am

2016-3-121:00am

2016-3-1110:00pm

2016-3-122:00am

2016-3-123:00am…

partition

partition

Stream (low latency)

Stream (high latency)


6

2016-3-112:00 am

2016-3-11:00 am

2016-3-12:00 am

2016-3-1111:00pm

2016-3-1212:00am

2016-3-121:00am

2016-3-1110:00pm

2016-3-122:00am

2016-3-123:00am…

partition

partition

Stream (low latency)

Batch(bounded stream)Stream (high latency)

Stream Processing Decouples

7

Database(State)

App a

App b

App c

App a

App b

App c

Applications build their own stateState managed centralized

Time Travel

8

Process a period ofhistoric data

partition

partition

Process latest datawith low latency(tail of the log)

Reprocess stream(historic data first, catches up with realtime data)

9

Latency

Volume/Throughput

State &Accuracy

10

Latency

Volume/Throughput

State &Accuracy

Exactly-once semanticsEvent time processing

10s of millions evts/secfor stateful applications

Latency down tothe milliseconds

Apache Flink was the first open-source system to eliminate these

tradeoffs

Streaming Architecture Blueprint

11

collect log analyze serve & store

Flink's Approach

12

Stateful Steam Processing

Fluent API, Windows, Event Time

Table API

Stream SQL

Core API

Declarative DSL

High-level Language

Building Block


13

Source Filter /Transform

Stateread/write Sink


14

Scalable embedded state Access at memory speed &scales with parallel operators


15

Re-load state

Reset positionsin input streams

Rolling back computationRe-processing


16

Restore to differentprograms

Bugfixes, Upgrades, A/B testing, etc

Versioning the state of applications

17

Savepoint

Savepoint

Savepoint

App. A

App. B

App. C

Time

Savepoint

Flink's Approach

18


Fluent API, Windows, Event Time

Table API

Stream SQL

Core API

Declarative DSL

High-level Language

Building Block

Event Time / Out-of-Order

19

1977 1980 1983 1999 2002 2005 2015

Processing Time

EpisodeIV

EpisodeV

EpisodeVI

EpisodeI

EpisodeII

EpisodeIII

EpisodeVII

Event Time

(Stream) SQL & Table API

20

Table API

// convert stream into Tableval sensorTable: Table = sensorData .toTable(tableEnv, 'location, 'time, 'tempF)

// define query on Tableval avgTempCTable: Table = sensorTable .groupBy('location) .window(Tumble over 1.days on 'rowtime as 'w) .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC) .where('location like "room%")

SQL

sensorTable.sql(""" SELECT day, location, avg((tempF - 32) * 0.556) AS avgTempC

FROM sensorData WHERE location LIKE 'room%'GROUP BY day, location

""")

What can you do with that?

21

10 billion events (2TB) processed daily across multiple Flink jobs for the telco network control center.

Ad-hoc realtime queries, > 30 operators, processing 30 billion events daily, maintaining state of 100s of GB inside Flink with exactly-once guarantees

Jobs with > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second

Flink's Streams playing at Batch

22

TeraSort

Relational Join

Classic Batch Jobs

GraphProcessing

LinearAlgebra

23

What can we expect next ?

Queryable State

24

Streaming Architecture Blueprint

25

collect log analyze &serve & store

Other Services

Full SQL on Streams

26

Continuous queriesincremental results

Windows, event time,processing time

Consistent with SQL on bounded data https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU

Elastic Parallelism

27

Maintaining exactly-oncestate consistency

No extra effort for the userNo need to carefully planpartitions

Very large state

28

Terabytes of state inside the stream processor

Maintaining fast checkpoints and recoveryE.g., long histories of windows, large join tablesState at local memory speed

We are hiring!

data-artisans.com/careers

stephan ewen - stream processing as a foundational paradigm and apache flink's approach to it

Data & Analytics