an introduction to distributed data...

68
An Introduction to Distributed Data Streaming Elements and Systems Paris Carbone<[email protected]> PhD Candidate KTH Royal Institute of Technology 1

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

An Introduction to Distributed Data StreamingElements and Systems

Paris Carbone<[email protected]> PhD Candidate

KTH Royal Institute of Technology

1

Page 2: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

Page 3: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

Page 4: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

Page 5: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

how to avoid this?

Page 6: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

how to avoid this?

Page 7: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

how to avoid this?

Q

Page 8: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

2

how to avoid this?

Q = +

Q

Page 9: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

3

Q = +

Page 10: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

3

Q = +

Page 11: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

3

Q Q

Q = +

Page 12: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

3

Q Q

Q = +

Page 13: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

3

Q Q

Q = +

Page 14: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

4

Q

Standing Query

Page 15: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

4

Q

Standing Query

Page 16: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

4

Q

Standing Query

Page 17: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Motivation

4

Q

Standing Query

Page 18: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Preliminaries

• Data Streaming Paradigm

• Incoming data is unbound - continuous arrival

• Standing queries are evaluated continuously

• Queries operate on the full data stream or on the most recent views of the stream ~ windows

5

Page 19: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Data Streams Basics• Events/Tuples : elements of computation - respect a schema

• Data Streams : unbounded sequences of events

• Stream Operators: consume streams and generate new ones.

• Events are consumed once - no backtracking!

6

f

S1

S2

So

S’1

S’2

Page 20: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Streaming Pipelines

7

stream1

stream2

approximations predictions alerts ……

Q

sources

sinks

Page 21: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Core Abstractions

• Windows

• Synopses (summary state)

• Partitioning

8

Page 22: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Windows

Discussion

Why do we need windows?

9

Page 23: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Windows• We are often interested only in fresh data

• f = “average temperature over the last minute every 20 sec”

• Range: Most data stream processing systems allow window operations on the most recent history (eg. 1 minute, 1000 tuples)

• Slide: The frequency/granularity f is evaluated on a given range

10

#seconds40 80

Average #3

Average #2

0

Average #1

20 60 100

f

W: 1min, 20sec

Page 24: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Window Types

11

#sec40 80

Average #2

0

Average #1

20 60 100

#sec40 80

Average #3

Average #2

0

Average #1

20 60 100

#sec40 80

Average #2

0

Average #1

20 60 100

120

120

Sliding

Tumbling

Jumping

range > slide

range = slide

range < slide

Page 25: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

SynopsesWe cannot infinitely store all events seen

• Synopsis: A summary of an infinite stream

• It is in principle any streaming operator state

• Examples: samples, histograms, sketches, state machines…

12

f

sa summary of everything

seen so far1. process t, s 2. update s 3. produce t’

t t’

What about window synopses?

Page 26: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Synopses-Aggregations

• Discussion - Rolling Aggregations

• Propose a synopsis, s=? when

• f= max

• f= ArithmeticMean

• f= stDev

13

Page 27: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Synopses-Approximations

14

• Discussion - Approximate Results

• Propose a synopsis, s=? when

• f= uniform random sample of k records over the whole stream

• f= filter distinct records over windows of 1000 records with a 5% error

Page 28: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Synopses-ML and Graphs

15

• Examples of cool synopses to check out

• Sparsifiers/Spanners - approximating graph properties such as shortest paths

• Change detectors - detecting concept drift

• Incremental decision trees - continuous stream training and classification

Page 29: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Partitioning• One stream operator is not enough

• Data might be too large to process

• e.g. very high input rate, too many stream sources

• State could possibly not fit in memory

16

fs

fs

fs

parallel instances

How do we partition the input streams?

fs

Page 30: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Partitioning• Partitioning defines how we allocate events to each

parallel instance. Typical partitioners are:

• Broadcast

• Shuffle

• Key-based

17

fs

fs

fs

fs

fs

fs

P

P

P

bycolor

Page 31: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Putting Everything Together

18

Fire Detection Pipeline

{area,temp}

{area,smoke} {loc,alert!}

• operators • synopses • windows • partitioning

trigger on detection

trigger periodically

?

Page 32: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Operators

19

As

Fs

Rolling Arithmetic Mean of Temperatures

State Machine-based Fire Alarm

{area,temp} {area,avgTemp}

{alarm}

Src

Sensor Data Sources{area,temp}

Src

{area}Periodic Temperature Updates

Smoke Detections

trivial…

What is the state and its transitions?

Page 33: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Partitioning• We are only interested in correlating smoke and

high temperature within the same area

• Events carry area information so we can partition our computation by area

20

Src Pkey:area

Page 34: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Windowing• Individual sensor data could be potentially faulty

• We need to gather data from all temperature sensors of an area and produce an average

• We want fresh average temperatures

21

Src Pkey:area{area,temp} A

s

As

w

w

w = ?

Page 35: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

The Fire Alarm

22

Page 36: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

The Fire Alarm

22

Fs

Page 37: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

The Fire Alarm

22

Fs

T : avgTemp>40T : avgTemp<40S : Smoke

Page 38: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

The Fire Alarm

22

Fs

T : avgTemp>40T : avgTemp<40 …TTTSTTSTTTT….S : Smoke

Page 39: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

The Fire Alarm

22

Fs

T : avgTemp>40T : avgTemp<40 …TTTSTTSTTTT….

OK

HOT

SMOKE

FIRE

T

T

TS

S

T

T

S : Smoke

Page 40: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

The Fire Alarm

22

Fs

T : avgTemp>40T : avgTemp<40 …TTTSTTSTTTT….

OK

HOT

SMOKE

FIRE

T

T

TS

S

T

T

synopsis= 1 state

S : Smoke

Page 41: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Putting Everything Together

23

{area,temp}

{area,smoke}

Src

Src

P

P

As

As

key:area

key:area

w

w

Fs

Fs

Pkey:area

{area, alert}

{area,avg_temp}

{area,smoke}

Page 42: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Systems: The Big Picture

24

Proprietary Open Source

Google DataFlow

IBM Infosphere

Microsoft Azure

Flink

Storm

Samza

Spark

Page 43: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Evolution

25

’95 Materialised

Views

’01 Complex

Event Processing

’03 TelegraphCQ

’03 STREAM

’05 Borealis

’15 User-Defined

Windows

’12 Policy-Based Windowing

’88 Active

DataBases

’88 HiPac

’12 Twitter Storm

’12 IBM

System S

’13 Spark

Streaming

’14 Apache

Flink

’13 Parallel

Recovery

’05 Decentralised

Stream Queries

’05 High Availability

on Streaming

concepts

systems’13

Google Millwheel

’13 Discretized

Streams

’00 Eddies

02 Aurora ’12

Twitter Storm

Page 44: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Programming Models

26

Compositional Declarative

• Offer basic building blocks for composing custom operators and topologies

• Advanced behaviour such as windowing is often missing

• Custom Optimisation

• Expose a high-level API • Operators are higher order

functions on abstract data stream types

• Advanced behaviour such as windowing is supported

• Self-Optimisation

Page 45: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Programming Model Types

27

DStream, DataStream, PCollection…

• Direct access to the execution graph / topology

• Suitable for engineers

• Transformations abstract operator details

• Suitable for engineers and data analysts

Page 46: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Standing Queries with Apache Storm

28

• Step1: Implement input (Spouts) and intermediate operators (Bolts)

• Step 2: Construct a Topology by combining operators

Spout Bolt Bolt

Spouts are the topology sources

The listen to data feeds

Bolts represent all intermediate computation vertices of the topology

They do arbitrary data manipulation

Each operator can emit/subscribe to Streams (computation results)

Page 47: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Example: Topology Definition

29

numbers new_numbers

numbers new_numbers

toFile

Page 48: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Standing Queries with Apache Flink

30

Flink Runtime

Flink Job Graph Builder/Optimiser

Flink Client

Streaming Program

• Operator fusion • Window Pre-aggregates

• Deploy Long Running Tasks • Monitor Execution

Page 49: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Distributed Stream Execution Paradigms

31(Hadoop, Spark) (Spark Streaming)

1) Real Streaming (Distributed Data Flow)

LONG-LIVED TASK EXECUTION STATE IS KEPT INSIDE TASKS

2) Batched Execution

Page 50: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Windows in Action

32

• DStreams are already partitioned in time windows

• Only time windows supported

• Windows decomposed into policies

• Policies can be user-defined too

range

slide

Page 51: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Windows on Storm?

33src-http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

Page 52: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Partitioning in Action

34

forward() shuffle() broadcast() keyBy()

partitionCustom()

shuffleGrouping() allGrouping() fieldsGrouping()

customGrouping()

repartition(num) reduceByKey() updateStateByKey()

no fine-grained control full control

Page 53: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Synopses in Action

35

implementing a rolling max per key

Page 54: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

State in Spark?

36

• Streams are partitioned into small batches • There is practically no state kept in workers (stateless) • How do we keep state??

(Spark Streaming)

put new states in output RDDdstream.updateStateByKey(…)

In S’

Page 55: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Implementing the alarm in Flink

37

Page 56: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

So everything works

38

{area,temp}

{area,smoke}

Src

Src

P

P

As

As

key:area

key:area

w

w

Fs

Fs

Pkey:area

{area,avg_temp}

{area,smoke}

or…

Page 57: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Unreliable Sources

39

Standing Query

Q

Page 58: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Unreliable Sources

39

Standing Query

Q

Page 59: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Unreliable Sources

39

Standing Query

Q

Page 60: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Unreliable Sources

39

Standing Query

add more sensors

Q

Page 61: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Unreliable Processing

40

Standing Query

Q

Page 62: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Unreliable Processing

40

Standing Query

Q

Page 63: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

recovered!

Unreliable Processing

40

Standing Query

Q

Page 64: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

recovered!

Unreliable Processing

40

Standing Query

Qlost smoke events

Page 65: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Resilient Brokers

Main Features

• Topic-based partitioned queues

• Strongly consistent offset mapping to records

41

Page 66: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Processing Guarantees• Kafka solves the source consistency problem

• How about the rest of the states of the computation ? (e.g. alert operator state)

• Each system offers different guarantees

42

Guarantees Technique

Storm at least once event dependency tracking

Spark exactly once source upstream backup

Flink exactly once periodic snapshots

Page 67: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

43

Q

Standing Query

Mission Accomplished

Page 68: An Introduction to Distributed Data Streamingictlabs-summer-school.sics.se/2015/slides/flink_stream1.pdfAn Introduction to Distributed Data Streaming Elements and Systems Paris Carbone

Research Topics at KTH/SICS

• Exactly-Once-Output Guarantees

• State management and auto-scaling

• Streaming ML pipelines

• Streaming Graphs

44