A Data Streaming Architecture with Apache Flink
Robert Metzger
Berlin Buzzwords, June 7, 2016
Talk overview
My take on the stream processing space, and how it changes the way we think about data
Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)
Demo
Apache Flink
Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed
Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production
Entering the streaming era
Streaming is the biggest change in data infrastructure since Hadoop
1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch
Real-world data is produced in a continuous fashion.
New systems like Flink and Kafka embrace the streaming nature of data.
Web server → Kafka topic → Stream processor
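The web server, Kafka topic, and stream processor chain can be pictured with a small in-memory model. This is only a sketch in plain Python, not the Kafka or Flink API: the `Topic` and `StreamProcessor` classes and the sample paths are hypothetical stand-ins. It assumes the topic behaves as an append-only log and that the processor remembers how far it has read.

```python
class Topic:
    """Hypothetical in-memory stand-in for a Kafka topic: an append-only log."""

    def __init__(self):
        self._log = []

    def append(self, record):
        # Producers (the web servers) only ever add to the end of the log.
        self._log.append(record)
        return len(self._log) - 1  # offset of the new record

    def read_from(self, offset):
        # Consumers read everything from a given offset onward.
        return self._log[offset:]


class StreamProcessor:
    """Hypothetical consumer that keeps a running count of hits per path."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0          # position of the next unread record
        self.hits_per_path = {}  # state that is updated continuously

    def poll(self):
        records = self.topic.read_from(self.offset)
        for path in records:
            self.hits_per_path[path] = self.hits_per_path.get(path, 0) + 1
        self.offset += len(records)
        return len(records)


topic = Topic()
processor = StreamProcessor(topic)

# The "web server" keeps producing; the processor picks up whatever is new.
topic.append("/index")
topic.append("/login")
processor.poll()
topic.append("/index")
processor.poll()
print(processor.hits_per_path)
```

The state is updated as records arrive rather than being recomputed from scratch, which is the property the slide attributes to streaming systems.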
Apache Flink stack
Libraries and APIs: Gelly, Table / SQL, ML, SAMOA, Apache Beam, Table / Stream SQL, Cascading, Storm API, Zeppelin, CEP, Hadoop M/R
Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Cluster, YARN
What makes Flink flink?
True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)
Event Time
• Make more sense of data
• Works on real-time and historic data
Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state
APIs / Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing
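To make the window types named above (time, count, session) concrete, here is a minimal sketch in plain Python. It does not use Flink's window API; the three helper functions, the sample timestamps, and the exact boundary rules are assumptions chosen for illustration.

```python
def tumbling_time_windows(timestamps, size):
    """Assign each timestamp to a fixed-size, non-overlapping time window."""
    windows = {}
    for ts in timestamps:
        start = ts - ts % size  # start of the window this timestamp falls into
        windows.setdefault(start, []).append(ts)
    return dict(sorted(windows.items()))


def count_windows(items, size):
    """Cut the stream into consecutive groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new one.

    The "<= gap" rule is this sketch's own choice of boundary behavior.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


events = [1, 2, 3, 11, 12, 30]  # made-up event timestamps in seconds

print(tumbling_time_windows(events, size=10))
print(count_windows(events, size=4))
print(session_windows(events, gap=5))
```

The same six events end up grouped three different ways, which is why the choice of window type depends on the question being asked of the data.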
Moving existing (batch) data analysis into streaming
Extract, Transform, Load (ETL)
ETL: Move data from A to B and transform it on the way
Old approach:
• Sources: Server logs, Mobile, IoT
• Tier 0: Raw data in HDFS / S3 (“Data Lake”)
• Periodic jobs
• Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), read by users
• Periodic jobs
• Tier 2: Aggregated data (“Data Warehouse”), read by users
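The tiered batch approach can be sketched as two functions that each stand for one periodic job. This is plain Python with made-up log lines and a hypothetical log format; the slides do not specify what the jobs do beyond normalizing, cleansing, and aggregating.

```python
from collections import Counter

# Tier 0: raw lines as they landed in the data lake (made-up sample data).
RAW_LOGS = [
    "2016-06-07T11:28:03 GET /Index 200",
    "2016-06-07T11:28:09 get /login 500",
    "corrupt line",
    "2016-06-07T11:29:41 GET /index 200",
]


def normalize(lines):
    """Periodic job 1 (Tier 0 -> Tier 1): parse, drop malformed lines, unify case."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # cleansing: skip lines that do not match the assumed format
        ts, method, path, status = parts
        records.append({
            "ts": ts,
            "method": method.upper(),
            "path": path.lower(),
            "status": int(status),
        })
    return records


def aggregate(records):
    """Periodic job 2 (Tier 1 -> Tier 2): count requests per path."""
    return Counter(record["path"] for record in records)


tier1 = normalize(RAW_LOGS)  # runs only when the first job is scheduled
tier2 = aggregate(tier1)     # runs only after the first job has finished
print(tier2)
```

Each tier only becomes fresh when its job runs, so a record has to wait for two schedules before it shows up in the aggregated data.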
Extract, Transform, Load (Streaming ETL)
ETL: Move data from A to B and transform it on the way
Streaming approach:
• Sources: Server logs, Mobile, IoT
• Ingestion: Kafka Connect (or an alternative)
• Tier 0: Raw data
• Stream processor: Cleansing, Transformation, Time-Window, Alerts
• Sinks: Rolling file sink, ES Connect (or an alternative), JDBC sink, Cassandra sink
• Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS, “Data Lake”)
• Tier 2: Aggregated data
• Consumers: Users, Batch Processing
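The cleansing, transformation, time-window, and alert stages can be sketched as a single handler that is called once per event. This is plain Python, not a Flink job: the `StreamingEtl` class, the event fields, the 60-second window, and the "status >= 500" alert rule are all assumptions made for illustration.

```python
from collections import defaultdict


class StreamingEtl:
    """Hypothetical per-event pipeline: cleanse, transform, window, alert."""

    WINDOW_SECONDS = 60  # assumed tumbling window size

    def __init__(self):
        self.window_counts = defaultdict(int)  # (window start, path) -> count
        self.alerts = []                       # records that triggered an alert

    def on_event(self, event):
        # Cleansing: drop events that lack a required field.
        if "ts" not in event or "path" not in event or "status" not in event:
            return
        # Transformation: bring fields into one consistent shape.
        record = {
            "ts": event["ts"],
            "path": event["path"].lower(),
            "status": int(event["status"]),
        }
        # Time-window: update the running count for the window of this event.
        window_start = record["ts"] - record["ts"] % self.WINDOW_SECONDS
        self.window_counts[(window_start, record["path"])] += 1
        # Alerts: react as soon as the offending event is seen.
        if record["status"] >= 500:
            self.alerts.append(record)


etl = StreamingEtl()
for event in [
    {"ts": 5, "path": "/Index", "status": "200"},
    {"ts": 42, "path": "/index", "status": 500},
    {"path": "/broken"},  # missing fields, removed by the cleansing step
    {"ts": 61, "path": "/login", "status": 200},
]:
    etl.on_event(event)

print(dict(etl.window_counts))
print(etl.alerts)
```

In contrast to periodic jobs, every stage runs for each event as it arrives, so the window counts and the alert list are always up to date.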
Streaming ETL: Low Latency
• Events are processed immediately
• No need to wait until the next “load” batch job is running
Chart: latency (hours, minutes, seconds, milliseconds) by approach (periodic batch job, batch processor with micro-batches, stream processor). Annotations: less than 500 ms*, less than 250 ms*
* Your mileage may vary. These are rule-of-thumb estimates.
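The cost of waiting for the next "load" job can be estimated with simple arithmetic. The sketch below is plain Python under two stated assumptions that are not from the slides: events arrive evenly spread over the job interval, and results become visible only when the job finishes. The function name and the example numbers are hypothetical.

```python
def batch_latency_seconds(interval, job_runtime):
    """Return (average, worst-case) delay until an event shows up in the output.

    interval:    seconds between two starts of the periodic job
    job_runtime: seconds the job needs to finish
    """
    # An event waits half an interval on average for the next job start,
    # and a full interval if it arrives just after a job has started.
    average = interval / 2 + job_runtime
    worst = interval + job_runtime
    return average, worst


# Example: an hourly load job that runs for ten minutes.
average, worst = batch_latency_seconds(interval=3600, job_runtime=600)
print(average, worst)
```

Under these assumptions the hourly job leaves data 40 minutes old on average and up to 70 minutes old in the worst case, while a stream processor handles each event on arrival.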
Streaming ETL: Event-time aware
• Events derived from the same real-world activity might arrive out of order in the system
• Flink is event-time aware
Figure: events from the same real-world activity, timestamped 11:28 and 11:29, arrive out of order because of out-of-sync clocks, network delays, and machine failures.
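The difference between grouping by event time and grouping by arrival time can be shown with a few late events. This is a plain-Python sketch with made-up timestamps, not Flink's event-time API; the `minute` helper and the `EVENTS` list are hypothetical.

```python
from collections import Counter


def minute(timestamp):
    """Truncate "HH:MM:SS" to its one-minute window, e.g. "11:28:50" -> "11:28"."""
    return timestamp[:5]


# (event time, arrival time) pairs; made-up data with delayed deliveries.
EVENTS = [
    ("11:28:10", "11:28:12"),
    ("11:28:50", "11:29:05"),  # delayed, for example by the network
    ("11:29:20", "11:29:21"),
    ("11:28:30", "11:29:40"),  # delayed, for example by a machine failure
]

# Event-time windows: group by when the activity actually happened.
by_event_time = Counter(minute(event_ts) for event_ts, _ in EVENTS)

# Arrival-time windows: group by when the system happened to see the event.
by_arrival_time = Counter(minute(arrival_ts) for _, arrival_ts in EVENTS)

print(by_event_time)
print(by_arrival_time)
```

Grouping by arrival time moves two of the three 11:28 events into the 11:29 window, so the counts no longer reflect what happened. A real event-time processor also has to decide when a window is complete; that part is omitted here.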
Demo
Job Overview
Components: Flink Twitter Source, Data Ingestion Job, “Streaming ETL” Job
Operators: Filter operations, (Rolling) file sink, Aggregation to ElasticSearch, Streaming WordCount, TopN operator
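The filter, streaming WordCount, and TopN operators can be illustrated with a small stand-alone sketch. It is plain Python and is not taken from the demo repository; the `StreamingWordCount` class, the keyword filter, and the sample tweets are assumptions made for illustration.

```python
from collections import Counter


class StreamingWordCount:
    """Hypothetical operator chain: filter, running word count, top-N query."""

    def __init__(self, keyword):
        self.keyword = keyword.lower()
        self.counts = Counter()  # running state, updated once per tweet

    def on_tweet(self, text):
        # Filter operation: keep only tweets that mention the keyword.
        if self.keyword not in text.lower():
            return
        # Streaming WordCount: update the counts incrementally.
        for word in text.lower().split():
            self.counts[word] += 1

    def top_n(self, n):
        # TopN operator: the n most frequent words seen so far.
        return self.counts.most_common(n)


job = StreamingWordCount(keyword="flink")
for tweet in [
    "Flink streaming",
    "streaming ETL with Flink",
    "unrelated tweet",  # removed by the filter
    "flink demo",
]:
    job.on_tweet(tweet)

print(job.top_n(2))
```

Because the counts are kept as running state, the top-N result can be read at any time without reprocessing earlier tweets.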
Demo code @ GitHub
https://github.com/rmetzger/flink-streaming-etl
Closing
https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring! data-artisans.com/careers
Questions?
Ask now!
eMail: [email protected]
Twitter: @rmetzger_
Follow: @ApacheFlink
Read: flink.apache.org/blog, data-artisans.com/blog/
Mailing lists: (news | user | dev)@flink.apache.org
Appendix
Sources
“Large scale ETL with Hadoop” http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop