A Data Streaming Architecture with Apache Flink

Robert Metzger, [email protected]

Berlin Buzzwords, June 7, 2016

Talk overview

My take on the stream processing space, and how it changes the way we think about data

Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)

Demo

Apache Flink

Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed

Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production

Entering the streaming era

Streaming is the biggest change in data infrastructure since Hadoop

1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch

Real-world data is produced in a continuous fashion.

New systems like Flink and Kafka embrace the streaming nature of data.

[Diagram: Web server → Kafka topic → Stream processor]

Apache Flink stack

[Diagram: the Apache Flink stack]
• Libraries and compatibility layers: Gelly, Table / SQL, ML, SAMOA, Apache Beam, Table / StreamSQL, Cascading, Storm API, Hadoop M/R, Zeppelin, CEP
• Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)
• Runtime: Streaming dataflow runtime
• Deployment: Local, Cluster, YARN

What makes Flink flink?

True streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)

Event time
• Make more sense of data
• Works on real-time and historic data

Stateful streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state

APIs and libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing

Moving existing (batch) data analysis into streaming

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

[Diagram, built up over four slides]
• Sources: server logs, mobile, IoT
• Tier 0: Raw data, stored in HDFS / S3 (the “Data Lake”)
• Tier 1: Normalized, cleansed data, written by periodic jobs as Parquet / ORC in HDFS
• Tier 2: Aggregated data (the “Data Warehouse”), written by further periodic jobs
• Users appear at the Tier 1 and Tier 2 stages

Extract, Transform, Load (Streaming ETL)

ETL: Move data from A to B and transform it on the way

Streaming approach:

[Diagram, built up over five slides]
• Sources: server logs, mobile, IoT
• Tier 0: Raw data
• Stream processor stages: cleansing, transformation, time windows, alerts
• Connectors shown: Kafka Connect, ES Connect, rolling file sink, JDBC sink, Cassandra sink
• Also shown: the “Data Lake”, batch processing, Tier 2: Aggregated data, and users
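The stream processor stages named in the diagram (cleansing, transformation, time window, alerts) can be sketched as a chain of Python generators. This is a minimal illustration and not Flink code: the log format, the 60-second window, the alert threshold, and all function names are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw log format: "<epoch_seconds> <status_code> <path>"
RAW_EVENTS = [
    "0 200 /index",
    "12 500 /checkout",
    "garbage line",      # malformed record, dropped by cleansing
    "35 500 /checkout",
    "61 200 /index",
    "75 500 /cart",
]

def cleanse(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Cleansing: drop records that do not parse."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        yield int(parts[0]), int(parts[1]), parts[2]

def transform(events: Iterable[Tuple[int, int, str]]) -> Iterator[Dict]:
    """Transformation: normalize each record into a dict."""
    for ts, status, path in events:
        yield {"ts": ts, "is_error": status >= 500, "path": path}

def window_counts(events: Iterable[Dict], size_s: int = 60) -> Iterator[Tuple[int, int]]:
    """Time window: count errors per tumbling window of size_s seconds.
    Simplification: assumes events arrive in timestamp order."""
    current = None
    errors = 0
    for e in events:
        start = e["ts"] - e["ts"] % size_s
        if current is not None and start != current:
            yield current, errors   # previous window is complete
            errors = 0
        current = start
        errors += int(e["is_error"])
    if current is not None:
        yield current, errors       # flush the last open window

def alerts(windows: Iterable[Tuple[int, int]], threshold: int = 2) -> Iterator[str]:
    """Alerts: flag windows whose error count reaches the threshold."""
    for start, errors in windows:
        if errors >= threshold:
            yield f"ALERT window={start} errors={errors}"

def run_pipeline(lines: Iterable[str]) -> Tuple[List[Tuple[int, int]], List[str]]:
    windows = list(window_counts(transform(cleanse(lines))))
    return windows, list(alerts(windows))
```

Calling `run_pipeline(RAW_EVENTS)` yields two windows, starting at 0 and 60 seconds, and one alert for the first window. Each record flows through every stage as soon as it arrives, which is the property the streaming approach relies on; a real deployment would replace the in-memory list with a source such as a Kafka topic and the returned lists with sinks.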

Streaming ETL: Low Latency

• Events are processed immediately
• No need to wait until the next “load” batch job is running

[Figure: latency by approach, on a scale from hours through minutes and seconds down to milliseconds. Approaches shown: periodic batch job, batch processor with micro-batches, stream processor. Annotations: “Less than 500 ms*” and “Less than 250 ms*”.]

* Your mileage may vary. These are rule-of-thumb estimates.

Streaming ETL: Event-time aware

Events derived from the same real-world activity might arrive out of order in the system

Flink is event-time aware

[Diagram: events from the same real-world activity, timestamped 11:28 and 11:29]

Causes shown:
• Out-of-sync clocks
• Network delays
• Machine failures
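To make the idea of event-time awareness concrete, here is a small Python sketch that assigns events to one-minute windows by the timestamp they carry rather than by arrival order. It is a simplified stand-in for what a stream processor does, not Flink's implementation: the watermark rule (largest timestamp seen minus a fixed allowed delay), the parameter names, and the choice to drop too-late events are all assumptions for illustration.

```python
from collections import defaultdict

def t(h, m, s=0):
    """Helper: time of day as seconds since midnight."""
    return h * 3600 + m * 60 + s

def event_time_windows(events, size_s=60, max_delay_s=30):
    """events: iterable of (event_time_s, key) in ARRIVAL order.
    Yields (window_start, {key: count}) once the watermark passes the
    end of a window, then flushes whatever is still open at the end."""
    open_windows = defaultdict(lambda: defaultdict(int))
    watermark = float("-inf")
    for ts, key in events:
        start = ts - ts % size_s
        if start + size_s <= watermark:
            continue  # window already emitted; this sketch drops the event
        open_windows[start][key] += 1
        # Assumed watermark rule: trail the largest timestamp by max_delay_s.
        watermark = max(watermark, ts - max_delay_s)
        closable = sorted(w for w in open_windows if w + size_s <= watermark)
        for w in closable:
            yield w, dict(open_windows.pop(w))
    for w in sorted(open_windows):
        yield w, dict(open_windows[w])

# The third event belongs to the 11:28 window but arrives after an 11:29 event.
arrivals = [
    (t(11, 28, 10), "click"),
    (t(11, 29, 5), "click"),
    (t(11, 28, 50), "click"),
    (t(11, 30, 40), "click"),
]
results = list(event_time_windows(arrivals))
```

With these inputs the late 11:28:50 event is still counted in the 11:28 window, because that window stays open until the watermark passes 11:29:00. A processing-time approach would have attributed it to whatever window was active when it arrived.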

Demo

Job Overview

[Diagram: Flink Twitter Source, Data Ingestion Job, “Streaming ETL” Job]

Operators shown: (rolling) file sink, filter operations, aggregation to ElasticSearch, streaming WordCount, TopN operator
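Two of the operators named above, the streaming WordCount and the TopN operator, can be illustrated in plain Python. This is a sketch of the general technique and not the demo's actual code (which lives in the GitHub repository linked in the talk); the tokenizer, the function names, and the sample texts are hypothetical.

```python
import heapq
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#@_]+", text.lower())

def streaming_word_count(texts):
    """Update a running count for every incoming text and yield the
    current state, so downstream consumers always see fresh counts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
        yield counts

def top_n(counts, n):
    """Return the n most frequent words; ties are broken alphabetically
    so the result is deterministic."""
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample input standing in for a stream of tweets.
tweets = [
    "Flink streams data",
    "Kafka streams data too",
    "flink and kafka",
]
final_counts = None
for final_counts in streaming_word_count(tweets):
    pass  # a real job would emit each intermediate result to a sink
```

After the last tweet, `top_n(final_counts, 2)` returns the two leading words. The counter here grows without bound; a production job would typically scope the counts to a time window.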

Demo code @ GitHub

https://github.com/rmetzger/flink-streaming-etl




Closing

https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016

http://www.flink-forward.org/

We are hiring! data-artisans.com/careers

Questions?

Ask now!
eMail: [email protected]
Twitter: @rmetzger_

Follow: @ApacheFlink
Read: flink.apache.org/blog, data-artisans.com/blog/
Mailing lists: (news | user | dev)@flink.apache.org

Appendix

Sources

“Large scale ETL with Hadoop”: http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop


