Apache Flink Stream Processing
Suneel Marthi @suneelmarthi
Washington DC Apache Flink Meetup, Capital One, Vienna, VA
November 19, 2015
Flink Stack
▪ Specialized Abstractions / APIs
▪ Core APIs
▪ Flink Core Runtime (streaming dataflow runtime)
▪ Deployment
The Full Flink Stack
▪ Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow (WiP), MRQL, Cascading, Storm (WiP), Zeppelin
▪ Core APIs: DataSet (Java/Scala), DataStream
▪ Runtime: Streaming dataflow runtime
▪ Deployment: Local, Cluster, YARN, Tez, Embedded
Stream Processing ?
▪ Real-world data doesn't originate in micro-batches; it is pushed through systems as a continuous stream.
▪ Stream analysis today is largely an extension of the batch paradigm.
▪ Recent frameworks and platforms, such as Apache Flink and Confluent, are built to handle streaming data.

(Diagram: Web server → Kafka topic)
Requirements for a Stream Processor
▪ Low latency: quick results (milliseconds)
▪ High throughput: able to handle millions of events/sec
▪ Exactly-once guarantees: deliver correct results even in failure scenarios
Fault Tolerance in Streaming
▪ At least once: all operators see all events. Storm re-processes the entire stream in failure scenarios.
▪ Exactly once: operators do not perform duplicate updates to their state. Flink achieves this with distributed snapshots; Spark with micro-batches.
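The difference between the two guarantees can be illustrated without any framework. This toy counter (all names made up) shows what happens to operator state when a failure causes events to be replayed: plain at-least-once counting double-counts the replayed events, while tracking the last processed sequence number makes the update effectively exactly-once. This is only a deduplication sketch of the *effect* on state; Flink's actual mechanism is the distributed-snapshot algorithm mentioned above.

```java
import java.util.Arrays;
import java.util.List;

public class DeliverySemanticsSketch {
    // Toy counter operator consuming events identified by sequence numbers.
    // With at-least-once delivery, replayed events after a failure are
    // counted twice; remembering the highest sequence number seen so far
    // lets the operator drop replays and update its state exactly once.
    static int count(List<Integer> seqNums, boolean dedupe) {
        int count = 0;
        long lastSeen = -1;
        for (int seq : seqNums) {
            if (dedupe && seq <= lastSeen) continue; // drop replayed event
            lastSeen = Math.max(lastSeen, seq);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // events 0..4 arrive, then a failure causes 3 and 4 to be replayed
        List<Integer> withReplay = Arrays.asList(0, 1, 2, 3, 4, 3, 4);
        System.out.println(count(withReplay, false)); // at-least-once: 7
        System.out.println(count(withReplay, true));  // exactly-once: 5
    }
}
```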
Batch is an extension of Streaming
▪ Batch: process a bounded stream (DataSet) on a stream processor
▪ Form a Global Window over the entire DataSet for join or grouping operations
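The "batch is a bounded stream" view can be sketched without the Flink runtime: treating a finite input as one global window means a single grouping pass over all of its elements. A minimal plain-Java word count over a bounded "stream" (class and data are illustrative):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GlobalWindowSketch {
    // Group the entire bounded input as one window and count words,
    // mirroring what a global window + sum does over a DataSet.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(wordCount(lines)); // {to=2, be=2, or=1, not=1}
    }
}
```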
Flink Window Processing
Courtesy: Data Artisans
What is a Window?
▪ Grouping of elements into finite buckets: by timestamps or by record counts
▪ A window has a maximum timestamp, which means that, at some point, all elements that need to be assigned to that window will have arrived.
Why Window?
▪ To process subsets of streams, based on timestamps or on record counts
Different Window Schemes
▪ Global Windows: all incoming elements are assigned to the same window.
  stream.window(GlobalWindows.create());
▪ Tumbling time windows: elements are assigned, based on their timestamp, to a window of a certain size (5 seconds below); each element is assigned to exactly one window.
  keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS));
▪ Sliding time windows: elements are assigned, based on their timestamp, to windows of a certain size; windows "slide" by the provided value and hence overlap.
  stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)));
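The overlap of sliding time windows can be computed standalone, without Flink: for window size w and slide s, an element with timestamp t belongs to every window whose start lies in (t - w, t], stepping by s. Tumbling windows are the special case where the slide equals the size, so exactly one window matches. A small sketch (names are illustrative, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowAssigner {
    // Return the start timestamps of all sliding windows that contain an
    // element with timestamp t (size and slide in the same unit, e.g. ms).
    static List<Long> windowStarts(long t, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = t - (t % slide);        // latest window covering t
        for (long s = lastStart; s > t - size; s -= slide) {
            starts.add(s);                       // window [s, s + size)
        }
        return starts;
    }

    public static void main(String[] args) {
        // size = 5s, slide = 1s: each element falls into 5 overlapping windows
        System.out.println(windowStarts(7_000, 5_000, 1_000));
        // → [7000, 6000, 5000, 4000, 3000]
    }
}
```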
Different Window Schemes
▪ Tumbling count windows: defines a window of 1000 elements that "tumbles". Elements are grouped, in arrival order, into groups of 1000; each element belongs to exactly one window.
  stream.countWindow(1000);
▪ Sliding count windows: defines a window of 1000 elements that slides every 100 elements; elements can belong to multiple windows.
  stream.countWindow(1000, 100);
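The count-window semantics above can be simulated in a few lines of plain Java. This is only a sketch of the behavior, not Flink's implementation: a window fires every `slide` elements and contains the last up-to-`size` elements; with size == slide it reproduces the tumbling case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CountWindowSketch {
    // Emit a window (as a sub-list) every 'slide' elements, containing the
    // last 'size' elements seen. With size == slide this is a tumbling
    // count window; with slide < size successive windows overlap.
    static List<List<Integer>> countWindows(List<Integer> input, int size, int slide) {
        List<List<Integer>> windows = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        int seen = 0;
        for (int e : input) {
            buffer.add(e);
            if (buffer.size() > size) buffer.remove(0); // keep last 'size'
            seen++;
            if (seen % slide == 0) windows.add(new ArrayList<>(buffer));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Integer> in = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(countWindows(in, 3, 3)); // [[1, 2, 3], [4, 5, 6]]
        System.out.println(countWindows(in, 3, 2)); // [[1, 2], [2, 3, 4], [4, 5, 6]]
    }
}
```

The second call matches the sliding animation that follows: size 3, sliding every 2 elements, so middle elements appear in two windows.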
Count Windows Animation

Courtesy: Data Artisans

(Animation frames: first a tumbling count window of size 3, where each element lands in exactly one window; then a sliding count window of size 3 that slides every 2 elements, so windows overlap.)
Flink Streaming API
Flink DataStream API
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");

    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0)   // group by the first element of the Tuple
        .sum(1);

    counts.print();
    env.execute("Execute Streaming Word Counts"); // execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<>(word, 1));
      }
    }
  }
}

Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/StreamingWordCount.java
Streaming WordCount (Explained)
▪ Obtain a StreamExecutionEnvironment
▪ Connect to a DataSource
▪ Specify transformations on the DataStreams
▪ Specify output for the processed data
▪ Execute the program
Flink Window API
Keyed Windows (Grouped by Key)
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");

    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0)   // group by the first element of the Tuple
        // create a window of 'windowSize' records and slide the
        // window by 'slideSize' records
        .countWindow(windowSize, slideSize)
        .sum(1);

    counts.print();
    env.execute("Execute Streaming Word Counts"); // execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<>(word, 1));
      }
    }
  }
}

Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Keyed Windows
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");

    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0)   // group by the first element of the Tuple
        // converts KeyedStream -> WindowedStream
        .timeWindow(Time.of(1, TimeUnit.SECONDS))
        .sum(1);

    counts.print();
    env.execute("Execute Streaming Word Counts"); // execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<>(word, 1));
      }
    }
  }
}

Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Global Windows
All incoming elements of a given key are assigned to the same window.

lines.flatMap(new LineSplitter())
    // group by the tuple field "0"
    .keyBy(0)
    // all records for a given key are assigned to the same window
    .window(GlobalWindows.create())
    // sum up tuple field "1"
    .sum(1)
    // consider only word counts > 1
    .filter(new WordCountFilter());
Flink Streaming API (Tumbling Windows)
• All incoming elements are assigned to a window of a certain size based on their timestamp.
• Each element is assigned to exactly one window.
Flink Streaming API (Tumbling Window)
The same WindowWordCount as in the keyed-windows example, with a one-second tumbling window:

    .keyBy(0)                                  // group by the first element of the Tuple
    .timeWindow(Time.of(1, TimeUnit.SECONDS))  // tumbling window
    .sum(1);

Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Demos
Twitter + Flink Streaming
37
• Create a Flink DataStream from live Twitter feed • Split the Stream into multiple DataStreams based
on some criterion • Persist the respective streams to Storage
https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/twitter
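The split-by-criterion step can be sketched outside Flink as a simple partition of records by a key. Everything here is illustrative (the record shape, a (language, text) pair, is hypothetical); it only shows the shape of routing one stream into several side streams, each of which would then go to its own sink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitStreamSketch {
    // Partition an incoming "stream" of (lang, text) records into one list
    // per language key, mirroring a DataStream split where each side
    // stream is persisted separately.
    static Map<String, List<String>> splitByLang(List<String[]> tweets) {
        Map<String, List<String>> byLang = new HashMap<>();
        for (String[] t : tweets) {
            byLang.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t[1]);
        }
        return byLang;
    }

    public static void main(String[] args) {
        List<String[]> tweets = Arrays.asList(
            new String[]{"en", "hello world"},
            new String[]{"de", "hallo welt"},
            new String[]{"en", "good morning"});
        System.out.println(splitByLang(tweets).get("en")); // [hello world, good morning]
    }
}
```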
Flink Event Processing: Animation
Courtesy: Ufuk Celebi and Stephan Ewen, Data Artisans
Tumbling Windows of 4 Seconds

(Animation frames: timestamped events are assigned to tumbling event-time windows of 4 seconds, e.g. 0-3, 4-7, 8-11, 20-23, 24-27, 32-35, and each window is aggregated independently.)
tl;dr
• Event-time processing is unique to Apache Flink
• Flink provides exactly-once guarantees
• With release 0.10.0, Flink supports streaming windows, sessions, triggers, multi-triggers, deltas, and event time
References
• Data Streaming Fault Tolerance in Flink (Flink documentation)
• Lightweight Asynchronous Snapshots for Distributed Dataflows: http://arxiv.org/pdf/1506.08603.pdf
• Google Dataflow paper
Acknowledgements
Thanks to the following folks from Data Artisans for their help and feedback:
• Ufuk Celebi
• Till Rohrmann
• Stephan Ewen
• Marton Balassi
• Robert Metzger
• Fabian Hueske
• Kostas Tzoumas

Questions?