Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink's Approach to It
TRANSCRIPT
Stream Processing and Apache Flink®'s approach to it
@StephanEwen
Apache Flink PMC
CTO @ data Artisans
About me
Database systems, TU Berlin, IBM, Microsoft
Co-bootstrapped Stratosphere project's runtime
Apache Flink created from a (partial) Stratosphere fork
Apache Flink community founded data Artisans
Now Flink PMC and CTO at data Artisans
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
Hint: you already have streaming data
Streaming Subsumes Batch
[Figure: a log of events with timestamps 2016-3-1 12:00 am, 2016-3-1 1:00 am, 2016-3-1 2:00 am, 2016-3-11 10:00 pm, 2016-3-11 11:00 pm, 2016-3-12 12:00 am, 2016-3-12 1:00 am, 2016-3-12 2:00 am, 2016-3-12 3:00 am, …, with two "partition" labels]
Stream (low latency)
Stream (high latency)
Batch (bounded stream)
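One way to read "batch (bounded stream)" is that the same logic can run over input that ends and over input that does not. The following is a minimal sketch in plain Scala, not Flink code; `Event`, `runningCountPerHour`, `batchCounts`, and `unbounded` are hypothetical names chosen for illustration.

```scala
object BoundedStreamSketch {
  // Hypothetical event type: an hour bucket plus a payload.
  final case class Event(hour: Int, payload: String)

  // One piece of logic, written against Iterator, which may or may not end.
  // For every event it emits the running count of its hour bucket.
  def runningCountPerHour(events: Iterator[Event]): Iterator[(Int, Long)] = {
    val counts = scala.collection.mutable.Map.empty[Int, Long].withDefaultValue(0L)
    events.map { e =>
      counts(e.hour) += 1
      (e.hour, counts(e.hour))
    }
  }

  // "Batch": a bounded stream consumed to the end.
  // toMap keeps the last (final) count emitted for each hour.
  def batchCounts(events: Seq[Event]): Map[Int, Long] =
    runningCountPerHour(events.iterator).toMap

  // "Stream": an unbounded source; only a prefix can ever be observed.
  def unbounded: Iterator[Event] =
    Iterator.from(0).map(i => Event(i % 3, s"e$i"))
}
```

With the unbounded source, a caller would take a prefix, for example `BoundedStreamSketch.runningCountPerHour(BoundedStreamSketch.unbounded).take(4)`, while the batch variant simply runs the same function until the input is exhausted.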
Stream Processing Decouples
[Figure: App a, App b, and App c shown twice: once with a shared Database (State), labeled "State managed centralized", and once labeled "Applications build their own state"]
Time Travel
[Figure: a log with two partitions]
Process a period of historic data
Process latest data with low latency (tail of the log)
Reprocess stream (historic data first, catches up with realtime data)
Latency: down to the milliseconds
Volume/Throughput: 10s of millions evts/sec for stateful applications
State & Accuracy: exactly-once semantics, event time processing
Apache Flink was the first open-source system to eliminate these tradeoffs
Streaming Architecture Blueprint
collect log analyze serve & store
Flink's Approach
Stream SQL: High-level Language
Table API: Declarative DSL
Fluent API, Windows, Event Time: Core API
Stateful Stream Processing: Building Block
Stateful Stream Processing
[Figure: dataflow of operators: Source, Filter / Transform, State read/write, Sink]
Scalable embedded state
Access at memory speed & scales with parallel operators
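The idea of embedded state that scales with parallel operators can be sketched as follows: each parallel instance keeps its own in-memory map, and every key is routed to exactly one instance. This is plain Scala under stated assumptions, not Flink's state API; `CountingOperator`, `instanceFor`, and `run` are hypothetical names.

```scala
object KeyedStateSketch {
  // One parallel operator instance with its own local (embedded) state.
  final class CountingOperator {
    private val state = scala.collection.mutable.Map.empty[String, Long]

    // A state read followed by a state write, both against local memory.
    def process(key: String): Long = {
      val updated = state.getOrElse(key, 0L) + 1
      state(key) = updated
      updated
    }

    // Immutable copy of the current state of this instance.
    def snapshot: Map[String, Long] = state.toMap
  }

  // Route each key to exactly one instance, so state is never shared.
  def instanceFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Feed all keys through `parallelism` instances and return their states.
  def run(keys: Seq[String], parallelism: Int): Vector[Map[String, Long]] = {
    val operators = Vector.fill(parallelism)(new CountingOperator)
    keys.foreach(k => operators(instanceFor(k, parallelism)).process(k))
    operators.map(_.snapshot)
  }
}
```

Because a key always lands on the same instance, adding instances spreads the keys (and their state) over more local maps without any instance needing remote access.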
Re-load state
Reset positions in input streams
Rolling back computation
Re-processing
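The recovery steps on this slide (re-load state, reset input positions, re-process) can be illustrated with a small sketch. It assumes a replayable input, modeled here as a `Vector`, and a state that is just a running sum; `Checkpoint`, `process`, and `recover` are hypothetical names, and this is not how Flink's checkpointing is implemented.

```scala
object CheckpointSketch {
  // A checkpoint pairs a copy of the state with the input position it reflects.
  final case class Checkpoint(offset: Int, sum: Long)

  // Process input elements from the checkpointed position up to `until`,
  // starting from the checkpointed state.
  def process(input: Vector[Long], from: Checkpoint, until: Int): Checkpoint =
    Checkpoint(until, input.slice(from.offset, until).foldLeft(from.sum)(_ + _))

  // Recovery: re-load the state, reset the input position, re-process the rest.
  def recover(input: Vector[Long], last: Checkpoint): Checkpoint =
    process(input, last, input.length)
}
```

Any work done after the last checkpoint is discarded on failure. Since state and input position are rolled back together, re-processing counts each input element exactly once in the final state.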
Restore to different programs
Bugfixes, upgrades, A/B testing, etc.
Versioning the state of applications
[Figure: App. A, App. B, and App. C on a Time axis, with Savepoints marking versions of the application state]
Event Time / Out-of-Order
Processing Time   1977        1980       1983        1999       2002        2005         2015
Event Time        Episode IV  Episode V  Episode VI  Episode I  Episode II  Episode III  Episode VII
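The slide's example can be turned into a small sketch of out-of-order handling: events arrive in processing-time order (release year) but carry an event time (episode number). The code below buffers events and releases them once a simplified watermark has passed them. It is plain Scala, not Flink's watermark API; `Film`, `reorder`, and `maxOutOfOrder` are hypothetical names.

```scala
object EventTimeSketch {
  // released = processing time, episode = event time.
  final case class Film(released: Int, episode: Int)

  // Arrival (processing-time) order, as on the slide.
  val arrivals: Vector[Film] = Vector(
    Film(1977, 4), Film(1980, 5), Film(1983, 6),
    Film(1999, 1), Film(2002, 2), Film(2005, 3), Film(2015, 7))

  // Buffer events and release them in event-time order once the watermark
  // (highest episode seen so far minus maxOutOfOrder) has passed them.
  def reorder(in: Seq[Film], maxOutOfOrder: Int): Vector[Film] = {
    var buffer = Vector.empty[Film]
    var maxSeen = Int.MinValue
    val out = Vector.newBuilder[Film]
    in.foreach { f =>
      buffer :+= f
      maxSeen = math.max(maxSeen, f.episode)
      val watermark = maxSeen - maxOutOfOrder
      val (ready, waiting) = buffer.partition(_.episode <= watermark)
      out ++= ready.sortBy(_.episode)
      buffer = waiting
    }
    out ++= buffer.sortBy(_.episode) // end of input: flush what is left
    out.result()
  }
}
```

With `maxOutOfOrder = 5` the episodes come out as I to VII. With `maxOutOfOrder = 0` nothing is held back and the output stays in arrival order, which shows the tradeoff between waiting for late events and emitting results early.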
(Stream) SQL & Table API
Table API
// convert stream into Table
val sensorTable: Table = sensorData
  .toTable(tableEnv, 'location, 'time, 'tempF)
// define query on Table
val avgTempCTable: Table = sensorTable
  .groupBy('location)
  .window(Tumble over 1.days on 'rowtime as 'w)
  .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")
SQL
sensorTable.sql("""
  SELECT day, location, avg((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY day, location
""")
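To make the arithmetic of the query concrete, here is the same computation over an in-memory collection, following the SQL variant. This is a sketch in plain Scala collections, not the Table API; `Reading` and `avgTempC` are hypothetical names, and representing the day as a string is an assumption. The factor 0.556 is the slide's approximation of 5/9.

```scala
object AvgTempSketch {
  // Hypothetical reading: location, day (as a string), temperature in Fahrenheit.
  final case class Reading(location: String, day: String, tempF: Double)

  // Average temperature in Celsius per (day, location) for "room..." locations.
  def avgTempC(readings: Seq[Reading]): Map[(String, String), Double] =
    readings
      .filter(_.location.startsWith("room"))   // WHERE location LIKE 'room%'
      .groupBy(r => (r.day, r.location))       // GROUP BY day, location
      .map { case (key, rs) =>
        // avg((tempF - 32) * 0.556)
        key -> rs.map(r => (r.tempF - 32) * 0.556).sum / rs.size
      }
}
```

The Table API version converts the average while the SQL version averages the converted values. Since the conversion is linear, both give the same result up to floating-point rounding.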
What can you do with that?
10 billion events (2TB) processed daily across multiple Flink jobs for the telco network control center.
Ad-hoc realtime queries, > 30 operators, processing 30 billion events daily, maintaining state of 100s of GB inside Flink with exactly-once guarantees
Jobs with > 20 operators, running on > 5000 vCores in a 1000-node cluster, processing millions of events per second
Flink's Streams playing at Batch
TeraSort
Relational Join
Classic Batch Jobs
Graph Processing
Linear Algebra
What can we expect next?
Queryable State
Streaming Architecture Blueprint
collect log analyze & serve & store
Other Services
Full SQL on Streams
Continuous queries, incremental results
Windows, event time, processing time
Consistent with SQL on bounded data
https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU
Elastic Parallelism
Maintaining exactly-once state consistency
No extra effort for the user
No need to carefully plan partitions
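What changing parallelism means for keyed state can be sketched as a redistribution step: all key/value pairs are reassigned to the instance that owns the key under the new parallelism, and no value is changed or lost. This is a simplified illustration in plain Scala, not a description of Flink's actual rescaling mechanism; `owner` and `rescale` are hypothetical names.

```scala
object RescaleSketch {
  // Which instance owns a key for a given parallelism.
  def owner(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Redistribute per-instance keyed state for a new parallelism.
  // Assumes each key lives on exactly one instance before rescaling.
  def rescale(state: Vector[Map[String, Long]], newParallelism: Int): Vector[Map[String, Long]] = {
    val all = state.flatten
    Vector.tabulate(newParallelism) { i =>
      all.filter { case (k, _) => owner(k, newParallelism) == i }.toMap
    }
  }
}
```

The user-facing point of the slide is that this reassignment happens inside the system, so the application does not have to plan its partitions for a fixed degree of parallelism.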
Very large state
Terabytes of state inside the stream processor
Maintaining fast checkpoints and recovery
E.g., long histories of windows, large join tables
State at local memory speed
We are hiring!
data-artisans.com/careers