Introduction to Stateful Stream Processing with Apache Flink

Kostas Kloudas @kkloudas, HUG @ Warsaw, July 3, 2017: Stateful Stream Processing with Apache Flink®

Upload: konstantinos-kloudas

Post on 22-Jan-2018

TRANSCRIPT

Page 1: Introduction to Stateful Stream Processing with Apache Flink

Kostas Kloudas (@kkloudas)

HUG @ Warsaw, July 3, 2017

Stateful Stream Processing with Apache Flink®

Page 2: Introduction to Stateful Stream Processing with Apache Flink

Original creators of Apache Flink®

Providers of the dA Platform

Page 3: Introduction to Stateful Stream Processing with Apache Flink

3 questions and some history

▪ What is stateful stream processing?

▪ Why care about it?

▪ How does Flink do it?

▪ The evolution of Flink.

Page 4: Introduction to Stateful Stream Processing with Apache Flink

Stateful Stream Processing

Page 5: Introduction to Stateful Stream Processing with Apache Flink

Stateful Stream Processing

Continuous Processing for Continuously Arriving Data

Page 6: Introduction to Stateful Stream Processing with Apache Flink

The ol’ traditional batch way

[Figure: batch jobs over hourly files on a timeline t: ..., 2017-06-13 10:00pm, 2017-06-13 11:00pm, 2017-06-14 00:00am, 2017-06-14 01:00am]

● Continuously ingesting data

● Time-bounded batch files

● Periodic batch jobs

Page 7: Introduction to Stateful Stream Processing with Apache Flink

The ol’ traditional batch way

[Figure: the same timeline t of hourly files (..., 2017-06-13 10:00pm, 2017-06-13 11:00pm, 2017-06-14 00:00am, 2017-06-14 01:00am), with intermediate state between batches]

● Compute a counter: #(A) per hour / 2 min

● What if:

  ● interval crosses batch boundaries? → carry intermediate results to next batch

  ● events out of order? → ???

Page 8: Introduction to Stateful Stream Processing with Apache Flink

The ol’ traditional batch way

▪ So, for a simple counting program:

  • Custom logic for handling state

  • Custom logic for handling time

  • Custom logic for fault tolerance

Page 9: Introduction to Stateful Stream Processing with Apache Flink

The ol’ traditional batch way

▪ So, for a simple counting program:

  • Custom logic for handling state

  • Custom logic for handling time

  • Custom logic for fault tolerance

Difficult, and it has nothing to do with your program.

Page 10: Introduction to Stateful Stream Processing with Apache Flink

Why should we care?

▪ ...this is just for continuous data, right?

Page 11: Introduction to Stateful Stream Processing with Apache Flink

Why should we care?

▪ ...this is just for continuous data, right?

Most datasets are continuously arriving streams.

Page 12: Introduction to Stateful Stream Processing with Apache Flink

Stream Processing

Computation over an endless stream of data

[Figure: an endless stream (...) flowing through "Your Code"]

Page 13: Introduction to Stateful Stream Processing with Apache Flink

Distributed Stream Processing

[Figure: partitioned streams (...) flowing into three parallel instances of "Your Code"]

● Partitions input by some key

● Distributes computation across multiple instances

● Each instance is responsible for some keys
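The key partitioning described on this slide can be sketched in plain Java. This is a minimal illustration, not Flink's actual routing scheme: the class `KeyPartitioner` and its methods are hypothetical names, and a simple hash-modulo assignment is assumed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: routes records to parallel instances by key hash. */
final class KeyPartitioner {
    /** Returns the instance index in [0, parallelism) responsible for the key. */
    static int instanceFor(String key, int parallelism) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Splits the input into one list per instance, preserving arrival order. */
    static List<List<String>> partition(List<String> keys, int parallelism) {
        List<List<String>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new ArrayList<>());
        }
        for (String key : keys) {
            instances.get(instanceFor(key, parallelism)).add(key);
        }
        return instances;
    }
}
```

Because the assignment depends only on the key, every record with the same key reaches the same instance, which is what later allows state to be kept locally per key.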

Page 14: Introduction to Stateful Stream Processing with Apache Flink

Stateful Stream Processing

[Figure: streams (...) flowing into two parallel instances of "Your Code", each updating local variables/structures]

var x = …
if (condition(x)) {
  …
}

Page 15: Introduction to Stateful Stream Processing with Apache Flink

Stateful Stream Processing

[Figure: streams (...) flowing into two parallel instances of "Your Code", each updating local variables/structures]

var x = …
if (condition(x)) {
  …
}

● Embedded local state

● State co-partitioned with the input stream by key
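As a rough sketch of embedded, key-partitioned state, one parallel instance could hold a per-key counter in a local map. The class `KeyedCounter` is a hypothetical name for illustration and does not use Flink's state API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical operator instance that keeps one counter per key it owns. */
final class KeyedCounter {
    // Embedded local state: lives inside the instance, next to the computation
    private final Map<String, Long> counts = new HashMap<>();

    /** Processes one event and returns the updated count for its key. */
    long onEvent(String key) {
        return counts.merge(key, 1L, Long::sum);
    }

    /** Returns the current count for a key, or 0 if the key was never seen. */
    long countFor(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```

Since the input is partitioned by key, each instance only ever sees its own keys, so no lookup in a remote store is needed to update a counter.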

Page 16: Introduction to Stateful Stream Processing with Apache Flink

A practical stream processor

state

● Fault-tolerance

● Scalability

● Efficiency

time

● Event-time (out-of-order events)

● Allows you to work in event-time (e.g. timers)

Page 17: Introduction to Stateful Stream Processing with Apache Flink

A practical stream processor

Stateful Stream Processor that handles Large Distributed State and Time / Order / Completeness consistently, robustly, and efficiently

● Stateful stream processing as a new paradigm to continuously process continuously arriving data

● Produce accurate results

● Real-time is only a natural consequence of the model

Page 18: Introduction to Stateful Stream Processing with Apache Flink

This is where Flink shines...

▪ Supports out-of-order streams

▪ Manages state transparently

  • exactly-once processing

▪ Offers high throughput and low latency

▪ Scales to large deployments

  • https://data-artisans.com/blog/blink-flink-alibaba-search

  • https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king

Page 19: Introduction to Stateful Stream Processing with Apache Flink

Apache Flink®

Page 20: Introduction to Stateful Stream Processing with Apache Flink

About time ...

[Figure: streams (...) flowing into two parallel instances of "Your Code"]

When are my results complete?

Page 21: Introduction to Stateful Stream Processing with Apache Flink

About time ...

[Figure: streams (...) flowing into two parallel instances of "Your Code"]

When are my results complete?

Processing Time drawbacks:

• Incorrect results

• Irreproducible results

Page 22: Introduction to Stateful Stream Processing with Apache Flink

About time ...

Page 23: Introduction to Stateful Stream Processing with Apache Flink

Event Time: Watermarks

● Special markers, called Watermarks

● Flow with elements

● A watermark of timestamp t means that no records with timestamp < t should be expected
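The watermark rule above can be illustrated with a small event-time window counter in plain Java. This is a sketch under stated assumptions: `EventTimeWindowCounter` is a hypothetical class rather than Flink's windowing API, windows are fixed-size and aligned to multiples of the window size, and records arriving after their window has been emitted are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical event-time window counter driven by watermarks (not Flink's API). */
final class EventTimeWindowCounter {
    private final long windowSize;
    // Window start -> number of events with timestamp in [start, start + windowSize)
    private final TreeMap<Long, Long> openWindows = new TreeMap<>();
    private final List<String> results = new ArrayList<>();

    EventTimeWindowCounter(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Assigns the event to its window by timestamp, regardless of arrival order. */
    void onEvent(long timestamp) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        openWindows.merge(start, 1L, Long::sum);
    }

    /** Watermark t: no records with timestamp < t should be expected any more. */
    void onWatermark(long t) {
        // A window [start, start + windowSize) is complete once its end is <= t
        while (!openWindows.isEmpty() && openWindows.firstKey() + windowSize <= t) {
            Map.Entry<Long, Long> done = openWindows.pollFirstEntry();
            long start = done.getKey();
            results.add("[" + start + "," + (start + windowSize) + ")=" + done.getValue());
        }
    }

    /** Returns the window results emitted so far, in emission order. */
    List<String> results() {
        return results;
    }
}
```

With a window size of 10, events with timestamps 3, 12, 7 (out of order) followed by a watermark of 10 emit the result for window [0,10) with a count of 2, while window [10,20) stays open until a watermark of at least 20 arrives.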

Page 24: Introduction to Stateful Stream Processing with Apache Flink

Event Time: Watermarks

Page 25: Introduction to Stateful Stream Processing with Apache Flink

Event Time

Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html

Page 26: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance

▪ How to ensure exactly-once semantics?

Page 27: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance: simple case

[Figure: an event log feeding a single process that keeps its state in main memory]

Periodically take a snapshot of the memory

Page 28: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance: simple case

[Figure: an event log, which persists events (temporarily), feeding a single process that keeps its state in main memory]

Recovery: restore snapshot and replay events since snapshot
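The single-process case (snapshot the memory, then restore and replay on recovery) can be sketched as follows. `SnapshottingCounter` and its method names are hypothetical, the event log is modelled as an in-memory list, and the snapshot is kept in the same object purely for illustration; a real system would store it durably.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-process counter with snapshot and replay-based recovery. */
final class SnapshottingCounter {
    private Map<String, Long> memory = new HashMap<>(); // state in main memory
    private int position = 0;                           // next log index to read

    // Last snapshot: a copy of the memory plus the log position it corresponds to
    private Map<String, Long> snapshotMemory = new HashMap<>();
    private int snapshotPosition = 0;

    /** Consumes log entries from the current position up to (excluding) end. */
    void processUpTo(List<String> eventLog, int end) {
        while (position < end) {
            memory.merge(eventLog.get(position), 1L, Long::sum);
            position++;
        }
    }

    /** Takes a snapshot of the memory together with the read position. */
    void takeSnapshot() {
        snapshotMemory = new HashMap<>(memory);
        snapshotPosition = position;
    }

    /** Simulates a failure: restores the snapshot and rewinds the read position. */
    void recover() {
        memory = new HashMap<>(snapshotMemory);
        position = snapshotPosition;
    }

    long count(String key) {
        return memory.getOrDefault(key, 0L);
    }
}
```

Restoring state and read position together is what makes each event count exactly once in the state: events after the snapshot are replayed from the log, events before it are already reflected in the restored memory.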

Page 29: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance: distributed

▪ How to create consistent snapshots of distributed state?

▪ How to do it efficiently?

Page 30: Introduction to Stateful Stream Processing with Apache Flink

Distributed Snapshots

Coordination via markers, injected into the streams

Page 31: Introduction to Stateful Stream Processing with Apache Flink

Distributed Snapshots

[Figure: source → stateful operation, backed by a state index (Hash Table or RocksDB)]

Events flow without replication or synchronous writes

Page 32: Introduction to Stateful Stream Processing with Apache Flink

Distributed Snapshots

[Figure: source → stateful operation]

Trigger checkpoint

Inject checkpoint barrier

Page 33: Introduction to Stateful Stream Processing with Apache Flink

Distributed Snapshots

[Figure: source → stateful operation]

Take state snapshot

Trigger state copy-on-write

Page 34: Introduction to Stateful Stream Processing with Apache Flink

Distributed Snapshots

[Figure: source → stateful operation, with snapshots written to a DFS]

Durably persist snapshots asynchronously

Processing pipeline continues
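The barrier mechanism from the preceding slides can be sketched for one stateful operator with a single input. Everything here is an illustrative assumption: `BarrierDemo`, `Element`, and `CountingOperator` are hypothetical names, the state copy is synchronous rather than copy-on-write, an in-memory map stands in for the DFS, and barrier alignment across multiple inputs is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-input sketch of barrier-triggered state snapshots. */
final class BarrierDemo {
    /** Stream element: either a data record or a checkpoint barrier. */
    static final class Element {
        final String key;        // null for barriers
        final long checkpointId; // only meaningful for barriers

        private Element(String key, long checkpointId) {
            this.key = key;
            this.checkpointId = checkpointId;
        }

        static Element record(String key) {
            return new Element(key, -1L);
        }

        static Element barrier(long checkpointId) {
            return new Element(null, checkpointId);
        }

        boolean isBarrier() {
            return key == null;
        }
    }

    /** Counting operator that snapshots its state whenever a barrier arrives. */
    static final class CountingOperator {
        private final Map<String, Long> counts = new HashMap<>();
        // Checkpoint id -> copy of the state as of that barrier (stands in for the DFS)
        private final Map<Long, Map<String, Long>> snapshots = new HashMap<>();

        void process(Element element) {
            if (element.isBarrier()) {
                // The snapshot reflects exactly the records that preceded the barrier
                snapshots.put(element.checkpointId, new HashMap<>(counts));
            } else {
                counts.merge(element.key, 1L, Long::sum);
            }
        }

        long currentCount(String key) {
            return counts.getOrDefault(key, 0L);
        }

        Map<String, Long> snapshot(long checkpointId) {
            return snapshots.get(checkpointId);
        }
    }
}
```

Because the barrier travels in the stream with the records, the operator needs no global pause: records after the barrier keep flowing and simply belong to the next checkpoint.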

Page 35: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance

[Figure: a stream (...) flowing through four instances of "Your Code", each with its own State]

● Consistent snapshotting:

Page 36: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance

[Figure: a stream (...) flowing through four instances of "Your Code", each with its own State; checkpointed state is written to a File System (Checkpoint)]

● Consistent snapshotting:

Page 37: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance

[Figure: a stream (...) flowing through four instances of "Your Code", each with its own State; checkpointed state is read back from the File System (Restore)]

● Recover all embedded state

● Reset position in input stream

Page 38: Introduction to Stateful Stream Processing with Apache Flink

Fault tolerance

Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html

Page 39: Introduction to Stateful Stream Processing with Apache Flink

State Management: misc.

▪ Savepoints

▪ Rescaling

▪ Queryable State

Page 40: Introduction to Stateful Stream Processing with Apache Flink

Apache Flink Ecosystem

[Figure: integration diagram; surviving labels: POSIX, Java/Scala Collections]

Page 41: Introduction to Stateful Stream Processing with Apache Flink

Apache Flink Stack

DataStream API: Stream Processing

DataSet API: Batch Processing

Runtime: Distributed Streaming Data Flow

Libraries

Streaming and batch as first class citizens.

Page 42: Introduction to Stateful Stream Processing with Apache Flink

Levels of abstraction

Process Function (events, state, time): low-level (stateful stream processing)

DataStream API (streams, windows): stream processing & analytics

Table API (dynamic tables): declarative DSL

Stream SQL: high-level language

Page 43: Introduction to Stateful Stream Processing with Apache Flink

API and Execution

// Source: read lines from Kafka
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer010(…));

// Transformation: parse each line into an Event
DataStream<Event> events = lines.map(line -> parse(line));

// Transformation: aggregate per key over 5-second windows
DataStream<Statistic> stats = events
    .keyBy("id")
    .timeWindow(Time.seconds(5))
    .apply(new MyAggregationFunction());

// Sink: write the results to files
stats.addSink(new BucketingSink(path));

[Figure: streaming dataflow: Source [1], Source [2] → map() [1], map() [2] → keyBy()/window()/apply() [1], keyBy()/window()/apply() [2] → Sink [1]]

Page 44: Introduction to Stateful Stream Processing with Apache Flink

Evolution of Flink

Page 45: Introduction to Stateful Stream Processing with Apache Flink

Programming APIs

Page 46: Introduction to Stateful Stream Processing with Apache Flink

Large State Handling

Page 47: Introduction to Stateful Stream Processing with Apache Flink

Conclusion

Page 48: Introduction to Stateful Stream Processing with Apache Flink

TL;DR

▪ Stateful stream processing as a paradigm for continuous data processing

▪ Flink is a sophisticated and tested stateful stream processor

▪ Efficiency, management, and operational issues for state are taken very seriously

Page 49: Introduction to Stateful Stream Processing with Apache Flink

Thank you!

@kkloudas, @ApacheFlink, @dataArtisans

Page 50: Introduction to Stateful Stream Processing with Apache Flink

Stream Processing and Apache Flink®'s approach to it

@StephanEwen, Apache Flink PMC, CTO @ data Artisans

FLINK FORWARD IS COMING BACK TO BERLIN, SEPTEMBER 11-13, 2017

BERLIN.FLINK-FORWARD.ORG

Page 51: Introduction to Stateful Stream Processing with Apache Flink

We are hiring! data-artisans.com/careers