State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing
Paris Carbone - KTH Royal Institute of Technology
Stephan Ewen - data Artisans
Gyula Fóra - King Digital Entertainment Ltd
Seif Haridi - KTH Royal Institute of Technology
Stefan Richter - data Artisans
Kostas Tzoumas - data Artisans
@vldb17
Overview
• The Apache Flink System Architecture
• Pipelined Consistent Snapshots
• Operations with Snapshots
• Large Scale Deployments and Evaluation
The Apache Flink Framework
• Libraries: SQL, Table, CEP, Graphs, ML
• Core API: DataStream, DataSet
• Runner: Dataflow Runtime
• Setup: Cluster, Backend, Metrics
Distributed Architecture
• Client: sends the optimised logical graph to the Job Manager
• Job Manager: scheduling, state partitioning, snapshot coordination
• Task Managers: memory management, local snapshot execution, flow control; run physical long-running tasks with locally managed state
• Zookeeper: passive failover, snapshot metadata
• External Snapshot Store (e.g., HDFS): partial snapshots
Snapshots
1. End-to-End Guarantees
2. Reconfiguration
3. Version Control
4. Isolation
Stateful Processing
• Tasks are invoked per input record: state = f(input)
• Tasks read and write managed state through logical operations (collections)
• The Local State Backend performs the physical operations: In-Memory (Heap), or an Embedded Off-heap + Disk Key-Value Store (RocksDB)
• A stream processor consumes input streams and keeps local states
• Divide computation into epochs
• Capture all local states after completing an epoch
• Can roll back input and state to a captured point in the past
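The epoch idea above can be illustrated with a minimal sketch. It assumes a replayable input (a plain Python list addressed by offset) and a toy per-key counter; the class name `CountingProcessor` and the `epoch_size` parameter are hypothetical and not part of Flink.

```python
from dataclasses import dataclass, field


@dataclass
class CountingProcessor:
    """Toy stream processor: counts records per key, snapshots per epoch."""
    epoch_size: int                       # records per epoch (hypothetical knob)
    state: dict = field(default_factory=dict)
    offset: int = 0                       # position in the replayable input
    snapshots: list = field(default_factory=list)

    def process(self, stream):
        """Consume the stream from the current offset to its end."""
        while self.offset < len(stream):
            key = stream[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.epoch_size == 0:
                # Epoch complete: capture state together with the input position.
                self.snapshots.append((self.offset, dict(self.state)))

    def rollback(self):
        """Restore the latest captured (offset, state) pair."""
        self.offset, saved = self.snapshots[-1] if self.snapshots else (0, {})
        self.state = dict(saved)


stream = ["a", "b", "a", "c", "a", "b", "c"]
p = CountingProcessor(epoch_size=3)
p.process(stream)            # snapshots are taken at offsets 3 and 6
expected = dict(p.state)

p.state["a"] = 999           # simulate corrupted state after a failure
p.rollback()                 # back to the snapshot at offset 6
p.process(stream)            # replay only the records after that snapshot
assert p.state == expected
```

Because the state and the input offset are captured together, replaying from the restored offset reproduces the same state as a failure-free run.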
A Synchronous Approach
• A master coordinates the tasks
• Drain epoch 1, then copy states to the Snapshot Store
• Drain epoch 2, then copy states to the Snapshot Store
Synchronous Snapshots
• In use: Storm Trident and Spark Streaming
• A conservative approach, equivalent to batching
• Can cause unnecessary latency (master coordination)
• Processing is no longer continuous
• Forces many tasks to be idle
• Instead, in Apache Flink snapshots are pipelined
Pipelined Snapshots
• Insert markers into the input streams
• Epoch alignment at tasks
• Async state copy to the Snapshot Store
• Snapshot completes once all task states are copied
[Figure: dataflow graph of tasks A, B, C, D, E; task states reach the Snapshot Store in the order B, A, C, D, E]
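A minimal sketch of epoch alignment at a single task, assuming FIFO input channels modelled as Python lists and a toy running sum as the task state. The `MARKER` token and the function `run_aligned_task` are hypothetical names for illustration; real alignment buffers blocked channels and copies state asynchronously, which this sketch does not model.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical epoch marker token


def run_aligned_task(channels):
    """Sum integers from several FIFO channels; snapshot on aligned markers.

    Returns (snapshots, final_sum). Each snapshot is the running sum that
    reflects exactly the records preceding the marker on every channel.
    """
    queues = [deque(c) for c in channels]
    blocked = [False] * len(queues)
    total, snapshots = 0, []

    while any(queues):
        progressed = False
        for i, q in enumerate(queues):
            if not q or blocked[i]:
                continue                   # nothing to read, or waiting for alignment
            item = q.popleft()
            progressed = True
            if item == MARKER:
                blocked[i] = True          # hold this channel until all markers arrive
            else:
                total += item
            if all(blocked):
                snapshots.append(total)    # aligned: capture the state
                blocked = [False] * len(queues)
        if not progressed:
            break                          # a marker is missing on an exhausted channel
    return snapshots, total


snaps, total = run_aligned_task([[MARKER, 100], [1, 2, MARKER]])
assert snaps == [3]      # 100 arrived after the marker, so it is not in the snapshot
assert total == 103
```

Blocking a channel after its marker is what keeps post-marker records out of the captured state; dropping that block would give the weaker at-least-once behaviour.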
Pipelined Snapshots (cycles)
Problem: we cannot wait indefinitely for records in cycles
Solution: log the in-flight records within a cycle as part of the snapshot, and replay them upon recovery.
Technique Highlights
• Offers exactly-once processing guarantees
• Issued periodically or externally by the user
• Naturally respects flow control mechanisms
• Channel state logging limited to cycles only
• Multiple epoch snapshots can be pipelined
• Can offer weaker at-least-once processing guarantees by simply dropping alignment (no alignment cost)
Snapshot Usages
1. End-to-End Guarantees
2. Reconfiguration
3. Version Control
4. Isolation
Exactly-Once: Input and Processing
Important Assumptions
• Input streams are persisted with offset indexes (e.g., Kafka, Kinesis)
• Data Channels are FIFO and reliable (no loss)
Each epoch either completes or repeats
Exactly-Once Output
• Idempotency ~ repeated operations can be tolerated after recovery/rollback (works for mutable stores).
• Transactional Processing ~ requires a two-phase coordination. A snapshot completion eventually leads to an external commit (e.g., Flink’s HDFS RollingSink*)
[Figure: epoch n in-progress; epochs n-1 and n-2 pending; epoch n-3 committed]
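The two-phase idea can be sketched as follows. This is a toy model, not Flink's sink API: the class `TransactionalSink` and its method names are hypothetical, and recovery is simplified to discarding every epoch whose completion was never observed.

```python
class TransactionalSink:
    """Toy sink: output becomes visible only after its epoch's snapshot completes."""

    def __init__(self):
        self.in_progress = []     # records of the current epoch
        self.pending = {}         # epoch id -> records awaiting commit
        self.committed = []       # externally visible output

    def write(self, record):
        self.in_progress.append(record)

    def on_snapshot(self, epoch):
        """Phase 1: the epoch's marker arrived; seal its records as pending."""
        self.pending[epoch] = self.in_progress
        self.in_progress = []

    def on_snapshot_complete(self, epoch):
        """Phase 2: the global snapshot completed; commit up to this epoch."""
        for e in sorted(k for k in self.pending if k <= epoch):
            self.committed.extend(self.pending.pop(e))

    def recover(self):
        """Rollback (simplified): discard output of epochs that never completed."""
        self.in_progress = []
        self.pending.clear()


sink = TransactionalSink()
sink.write("a")
sink.on_snapshot(1)
sink.write("b")
sink.on_snapshot_complete(1)      # "a" becomes visible
sink.on_snapshot(2)               # "b" is pending
sink.write("c")
sink.recover()                    # failure before snapshot 2 completes
sink.write("b")                   # replayed input regenerates "b"
sink.on_snapshot(2)
sink.on_snapshot_complete(2)
assert sink.committed == ["a", "b"]   # no duplicate despite the replay
```

An idempotent sink would reach the same result differently: it would accept the repeated write of "b" and overwrite the earlier value rather than withholding it.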
Dataflow Reconfiguration
• Snapshots taken over time: snap-1, snap-2, snap-3, …
• Stop, change parallelism
Problem: How is state repartitioned from a snapshot?
Reconfiguration: The Issue
Snapshot contents before reconfiguring: 0x100: bob … … … … 0x449: alice
• Case I (full scan): Scan Remote Storage for Responsible Keys. Too slow.
• Case II: Include Key Locations in Snapshot Metadata (bob: 0x100, carol: 0x344, …; alice: 0x449, chuck: 0x630, …). Too much metadata.
Reconfiguration: Key Groups
Pre-partition state in hash(K) space, into key-groups
[Figure: keys (bob, …, alice) placed into key-groups]
• Snapshot Metadata: Contains a reference per stored Key-Group (less metadata)
• Reconfiguration: Contiguous key-group allocation to available tasks (less IO)
Note: the number of key groups controls the trade-off between the metadata to keep and reconfiguration speed
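A minimal sketch of key-group assignment, under stated assumptions: keys are hashed with CRC32 (chosen here only because it is stable across runs), and contiguous key-group ranges are assigned with integer arithmetic. The function names are hypothetical, and the exact hash and rounding used by Flink may differ.

```python
import zlib


def key_group(key: str, num_key_groups: int) -> int:
    """Map a key to a fixed key-group via a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_key_groups


def task_for_group(group: int, num_key_groups: int, parallelism: int) -> int:
    """Task that owns a key-group; consecutive groups map to the same task."""
    return group * parallelism // num_key_groups


def groups_for_task(task: int, num_key_groups: int, parallelism: int) -> range:
    """Contiguous range of key-groups owned by a task (inverse of task_for_group)."""
    start = -(-task * num_key_groups // parallelism)        # ceiling division
    end = -(-(task + 1) * num_key_groups // parallelism)
    return range(start, end)


NUM_GROUPS = 8
# A key's group never changes; only the group-to-task mapping does.
group_of_bob = key_group("bob", NUM_GROUPS)

# Rescaling from 2 to 3 tasks moves whole contiguous ranges of key-groups.
assert [groups_for_task(t, NUM_GROUPS, 2) for t in range(2)] == [range(0, 4), range(4, 8)]
assert [groups_for_task(t, NUM_GROUPS, 3) for t in range(3)] == [
    range(0, 3), range(3, 6), range(6, 8)]
```

With this layout the snapshot metadata needs one reference per key-group rather than per key, and a restarted task reads only the contiguous range it owns.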
Version Control
• Pipeline v.1
• Fork and update: Pipeline v.2, Pipeline v.3
Isolation Levels
External query: select from facebook.userID, clients.name … inner join clients on …
• read-committed (snapshot)
• read-uncommitted (dirty read on latest state)
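The two isolation levels can be sketched with a toy queryable counter. The class `QueryableCounter` and its methods are hypothetical and only illustrate the distinction: a read-committed query sees the last completed snapshot, while a read-uncommitted query sees the latest live state, which may later be rolled back.

```python
class QueryableCounter:
    """Toy per-key counter that external queries can read at two isolation levels."""

    def __init__(self):
        self.live = {}          # latest state, may be rolled back
        self.committed = {}     # state as of the last completed snapshot

    def update(self, key):
        self.live[key] = self.live.get(key, 0) + 1

    def snapshot_complete(self):
        self.committed = dict(self.live)

    def rollback(self):
        self.live = dict(self.committed)

    def query(self, key, isolation="read-committed"):
        source = self.committed if isolation == "read-committed" else self.live
        return source.get(key, 0)


c = QueryableCounter()
c.update("alice")
c.snapshot_complete()
c.update("alice")                                   # not yet covered by a snapshot
assert c.query("alice") == 1                        # read-committed
assert c.query("alice", "read-uncommitted") == 2    # dirty read on latest state
c.rollback()                                        # the dirty value disappears
assert c.query("alice", "read-uncommitted") == 1
```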
Large Scale Deployment at King
[Figure: Total Snapshotting Time (sec) vs. Global State Size (GB); total time / snapshot (alignment + async copies); ~runtime overhead]
[Figure: Total Alignment Time (msec) vs. Parallelism (30, 50, 70), for PROC, WIN, OUT; alignment cost]
• #shuffles (keyby) • parallelism
Teaser: More paper highlights
• We can use the same technique to coordinate externally managed state with snapshots.
• Epoch markers can act as on-the-fly reconfiguration points.
• Internals of asynchronous and incremental snapshots.