Aljoscha Krettek, @aljoscha
Big Data Spain, November 17, 2016
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
What I’d Like to Talk About
Streaming architecture and Flink
IoT and event-time stream processing
Use-case examples
Original creators of Apache Flink®
Providers of the dA Platform, a supported Flink distribution
Intro: The Streaming Architecture
Big Data Architecture
Collect events in HDFS (or similar)
Periodically run (batch) jobs to process
Problems:
• Huge latency
• Natural boundaries in data don’t match batch boundaries
Rethinking Data Architecture
Real-time reaction to events
Continuous applications
Process both real-time and historical data
What is (Distributed) Streaming
Streaming: Computations on never-ending “streams” of data records (“events”)
Distributed: Computation spread across many machines
[Diagram: four parallel instances of "Your code"]
What is Stateful Streaming
Result depends on history of stream
A stateful stream processor should give you the tools to manage state
• Recover, roll back, version, upgrade, etc.
[Diagram: "Your code" with attached state]
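To make "result depends on history" and "recover, roll back" a little more concrete, here is a minimal Python sketch. It is not Flink's API: the class `RunningCounter` and its `snapshot`/`restore` methods are hypothetical names chosen for illustration only.

```python
import copy
from collections import defaultdict


class RunningCounter:
    """Hypothetical stateful operator: counts events per key."""

    def __init__(self):
        # The result for each new event depends on everything seen so far.
        self.state = defaultdict(int)

    def process(self, key):
        self.state[key] += 1
        return self.state[key]

    def snapshot(self):
        # Copy the state so it can be restored after a failure.
        return copy.deepcopy(dict(self.state))

    def restore(self, snap):
        # Roll back to a previously taken snapshot.
        self.state = defaultdict(int, snap)


counter = RunningCounter()
for key in ["sensor-a", "sensor-b", "sensor-a"]:
    counter.process(key)

checkpoint = counter.snapshot()               # state at this point in the stream
counter.process("sensor-a")                   # state moves on
counter.restore(checkpoint)                   # simulate recovery after a failure
after_recovery = counter.process("sensor-a")  # replaying the lost event
```

After the restore, replaying the lost event yields the same count as before the simulated failure, which is the property a state-management toolbox is meant to provide.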
What is Event-Time Streaming
Events have timestamps
Processing depends on timestamps
An event-time stream processor should give you the tools to reason about time
• Handle streams that are out of order
[Diagram: "Your code" with state; events arriving in the order t3, t1, t2, t4 are grouped into windows t1-t2 and t3-t4]
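The grouping of out-of-order events into event-time windows can be sketched in a few lines of Python. This is an illustration under assumptions, not Flink code: the integer timestamps 0 to 3 stand in for t1 to t4, and the window length of 2 is an arbitrary choice.

```python
from collections import defaultdict

WINDOW_SIZE = 2  # assumed window length in time units


def window_start(timestamp, size=WINDOW_SIZE):
    """Start of the tumbling window that contains the timestamp."""
    return timestamp - (timestamp % size)


def assign_to_windows(events, size=WINDOW_SIZE):
    """Group (timestamp, payload) pairs by event-time window, in any arrival order."""
    windows = defaultdict(list)
    for timestamp, payload in events:
        windows[window_start(timestamp, size)].append(payload)
    return {start: sorted(items) for start, items in sorted(windows.items())}


# Arrival order t3, t1, t2, t4; timestamps 0..3 stand in for t1..t4.
arrivals = [(2, "t3"), (0, "t1"), (1, "t2"), (3, "t4")]
result = assign_to_windows(arrivals)
```

Because the window is chosen from the timestamp carried by the event rather than from the arrival position, the arrival order does not change which window an event lands in.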
[Diagram: an event log feeding applications that each hold app state, plus a query service]
Recap: What is Streaming?
Continuous processing of data that is continuously generated
I.e., pretty much all “big” data
It’s all about state and time
Flink does all of that
IoT and Event-time Stream Processing
The 'Internet Of Everything' Will Generate $14.4 Trillion Of Value Over The Next Decade. [1]
[1] read.bi/1yDOQQ3
Example Event Sources
A Simple Definition
IoT use cases from the system’s perspective:
A large number of (distributed) things continuously generating a large amount of data.
IoT: Some Insights
Data is continuously produced → Stream Processing
Events have a timestamp → Event-time based processing
Data/Events can arrive with huge delays/out-of-order
Most analyses happen on time windows
What Is Event-Time Processing
Processing Time (release year): 1977  1980  1983  1999  2002  2005  2015
Event Time (episode):           IV    V     VI    I     II    III   VII
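The same timeline can be written down as data. In this small Python sketch the release year plays the role of processing time and the episode number plays the role of event time; sorting by one or the other gives two different orders.

```python
# (episode number, release year): the episode number is the "event time",
# the release year is the "processing time".
releases = [(4, 1977), (5, 1980), (6, 1983), (1, 1999), (2, 2002), (3, 2005), (7, 2015)]

# Order in which the audience "processed" the episodes.
by_processing_time = [ep for ep, year in sorted(releases, key=lambda r: r[1])]

# Order in which the story actually happens.
by_event_time = [ep for ep, year in sorted(releases, key=lambda r: r[0])]
```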
What Is Event-Time Processing
[Diagram: a message queue holding events with timestamps 13, 12, 7, 3, 5, 9, 6, 11, 12, read along a processing-time axis from 1 to 14]
What’s The Problem?
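[Diagram: the events from the message queue (timestamps 13, 12, 7, 3, 5, 9, 6, 11, 12) along a processing-time axis from 1 to 14, grouped once into processing-time windows and once into event-time windows; the two groupings contain different events]
Mismatch between event time and processing time.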
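A small Python sketch shows how the two kinds of windows can disagree. The event list below is invented for illustration (it is not the data from the slide), and the window length of 5 is an assumption.

```python
from collections import Counter

WINDOW = 5  # assumed tumbling-window length


def count_per_window(times, size=WINDOW):
    """Count how many of the given times fall into each tumbling window."""
    return dict(sorted(Counter((t // size) * size for t in times).items()))


# Hypothetical events as (event_time, processing_time) pairs.
events = [(1, 2), (3, 4), (4, 7), (6, 6), (8, 9), (2, 8)]

event_time_counts = count_per_window(et for et, _ in events)
processing_time_counts = count_per_window(pt for _, pt in events)
```

With this data, four events belong to the first window by event time, but only two of them arrive during that window, so a processing-time count attributes the other two to the wrong window.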
Sources of Time Mismatch
Big Mismatch
• Network disconnects
• Slow network
Small Mismatch
• The nature of distributed systems
• Differing system clock time
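One common way to tolerate a small mismatch is to keep a window open for a bounded extra delay and to treat anything that arrives after that as late. The Python sketch below illustrates that general idea only; it is not Flink's API, and the function name `windowed_counts` and the parameters `window_size` and `max_delay` are hypothetical.

```python
from collections import defaultdict


def windowed_counts(timestamps, window_size, max_delay):
    """Count events per tumbling event-time window.

    A window [start, start + window_size) is emitted once the watermark
    (highest timestamp seen so far minus max_delay) reaches its end.
    Events whose window has already passed the watermark are reported as late.
    """
    open_windows = defaultdict(int)
    emitted, late = {}, []
    watermark = float("-inf")
    for ts in timestamps:
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            late.append(ts)  # the window for this event is already closed
            continue
        open_windows[start] += 1
        watermark = max(watermark, ts - max_delay)
        # Emit every open window whose end the watermark has reached.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                emitted[s] = open_windows.pop(s)
    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        emitted[s] = open_windows.pop(s)
    return emitted, late


# Hypothetical input: slightly out of order, plus one very late event (3).
counts, late_events = windowed_counts([1, 2, 6, 4, 7, 12, 3, 11], window_size=5, max_delay=2)
```

In this run the event with timestamp 4 arrives after 6 but is still counted in its window, while the event with timestamp 3 arrives after its window was emitted and ends up in the late list. A larger `max_delay` tolerates more disorder at the cost of higher latency.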
Small Event-Time Mismatch
Robust Stream Processing with Apache Flink®: A Simple Walkthrough
http://data-artisans.com/robust-stream-processing-flink-walkthrough/
Recap: Event-Time
IoT use cases need event-time processing
Even a small mismatch of event time/processing time will lead to wrong results
Use-Case Examples
30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily
Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
King
Challenges:
• Many games (Candy Crush, Farm Heroes, Pet Rescue, and Bubble Witch…)
• 300 million monthly unique users
• 30 billion events received every day
Need event-time based statistics
https://techblog.king.com/rbea-scalable-real-time-analytics-king/
Solution: RBEA
Multiplexing of multiple data scientist requests into a single Flink job
Groovy as language for analysis scripts
Event-time windowing
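The multiplexing idea can be sketched as one pass over the event stream that feeds every registered analysis script and aggregates results per event-time window. This is only an illustration, not King's implementation: plain Python functions stand in for the Groovy scripts, and the event fields (`ts`, `type`, `amount`), the script names, and the function `run_multiplexed` are all hypothetical.

```python
from collections import defaultdict


def run_multiplexed(events, scripts, window_size):
    """Feed one event stream to many analysis scripts in a single pass.

    `scripts` maps a script name to a function(event) -> number or None.
    Results are summed per script and per event-time window.
    """
    results = {name: defaultdict(float) for name in scripts}
    for event in events:
        window = (event["ts"] // window_size) * window_size
        for name, script in scripts.items():
            value = script(event)
            if value is not None:
                results[name][window] += value
    return {name: dict(sorted(w.items())) for name, w in results.items()}


# Two hypothetical "data scientist requests" sharing the same stream.
scripts = {
    "levels_completed": lambda e: 1 if e["type"] == "level_done" else None,
    "revenue": lambda e: e["amount"] if e["type"] == "purchase" else None,
}
events = [
    {"ts": 3, "type": "level_done"},
    {"ts": 12, "type": "purchase", "amount": 2.0},
    {"ts": 1, "type": "purchase", "amount": 1.5},
    {"ts": 11, "type": "level_done"},
    {"ts": 4, "type": "level_done"},
]
report = run_multiplexed(events, scripts, window_size=10)
```

The stream is read once regardless of how many scripts are registered, which is the point of packing many requests into a single job.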
Bouygues Telecom
                         2015         2016
Users*                   ~120         ~300
Flink production apps    5            30
Storage                  750 TB       2 PB
Events/day               4 billion    10 billion
* Users of the information system
http://flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/
Bouygues: Challenges
Low latency & streaming fashion counters
Massive amounts of data + bursty loads
Reliability
Multiple flow correlation
Time management:
• Out of order & late events → our worst enemies
In Summary
If you need to ask: you already have a streaming use case!
IoT requires Proper Time Management
Apache Flink has done that for a long time now*
* Since version 0.10
Thank you!
@aljoscha @ApacheFlink @dataArtisans
One day of hands-on Flink training
One day of conference
Tickets are on sale
Call for Papers is already open
Please visit our website: http://sf.flink-forward.org
Follow us on Twitter: @FlinkForward
We are hiring!
data-artisans.com/careers