Márton Balassi, mbalassi@apache.org, @MartonBalassi
Stream processing by example
2016-06-14 Budapest Data Forum 2
Real-Time Player Statistics
• Compute real-time, queryable statistics
• Billions of events / day
• Millions of active users / day
• State quickly grows beyond memory
• Complex event processing logic
• Strong consistency requirements
Real-time dashboard of telco network
• Example query: Download speed heatmap of premium users in the last 5 minutes
• Dependent on ~1 TB of slowly changing enrichment data
• Multi GB/s input rate
• Some of the use cases require complex windowing logic
Open source stream processors
Apache Streaming Landscape
Apache Storm
• Pioneer of real-time analytics
• Distributed dataflow abstraction with low-level control
• Time windowing and state introduced recently
When to use Storm
• Very low latency requirements
• No need for advanced state/windowing
• At-least-once is acceptable
Apache Samza
• Builds heavily on Kafka’s log based philosophy
• Pluggable components, but runs best with Kafka
• Scalable operator state with RocksDB
• Basic time windowing
When to use Samza
• Join streams with large states
• At-least-once is acceptable
Kafka Streams
• Streaming library on top of Apache Kafka
• Similar features to Samza but nicer API
• Big win for operational simplicity
When to use Kafka Streams
• Kafka based data infrastructure
• Join streams with large states
• At-least-once is acceptable
Apache Spark
• Unified batch and stream processing over a batch runtime
• Good integration with batch programs
• Lags behind recent streaming advancements but evolving quickly
• Spark 2.0 comes with new streaming engine
When to use Spark
• Simpler data exploration
• Combine with (Spark) batch analytics
• Medium latency is acceptable
Apache Flink
• Unified batch and stream processing over dataflow engine
• Leader of open source streaming innovation
• Highly flexible and robust stateful and windowing computations
• Savepoints for state management
When to use Flink
• Advanced streaming analytics
• Complex windowing/state
• Need for high throughput and low latency
[Figure: Flink processes stream data (Kafka, RabbitMQ ...) and batch data (HDFS, JDBC ...)]
Apache Apex
• Native streaming engine built natively on YARN
• Stateful operators with checkpointing to HDFS
• Advanced partitioning support with locality optimizations
When to use Apex
• Advanced streaming analytics
• Very low latency requirements
• Need extensive operator library
System comparison
                 Storm               Spark               Samza               Flink               Apex
Model            Native              Micro-batch         Native              Native              Native
API              Compositional       Declarative         Compositional       Declarative         Compositional
Fault tolerance  Record ACKs         RDD-based           Log-based           Checkpoints         Checkpoints
Guarantee        At-least-once       Exactly-once        At-least-once       Exactly-once        Exactly-once
State            Stateful operators  State as DStream    Stateful operators  Stateful operators  Stateful operators
Windowing        Time based          Time based          Time based          Flexible            Time based
Latency          Very-Low            High                Low                 Low                 Very-Low
Throughput       Medium              Very-High           High                Very-High           Very-High
Under the hood
Native Streaming
Distributed dataflow runtime
• Long standing operators
• Pipelined execution
• Usually possible to create cyclic flows
Pros
• Full expressivity
• Low-latency execution
• Stateful operators
Cons
• Fault-tolerance is hard
• Throughput may suffer
• Load balancing is an issue
Micro-batching
Micro-batch runtime
• Computation broken down to time intervals
• Load aware scheduling
• Easy interaction with batch
Pros
• Easy to reason about
• High-throughput
• FT comes for “free”
• Dynamic load balancing
Cons
• Latency depends on batch size
• Limited expressivity
• Stateless by nature
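To make "latency depends on batch size" concrete, here is a minimal, framework-free Java sketch. It assumes a simplified model that is not taken from any of the systems above: events carry a non-negative arrival time in milliseconds, and each batch is processed instantly when its interval ends, so the numbers are a lower bound on latency. The class and method names are illustrative.

```java
import java.util.Arrays;

// Illustrative model of micro-batching: an event waits until its batch interval closes.
class MicroBatchLatency {

    // For each arrival time, returns the wait until the end of the batch interval
    // that contains it. Assumes arrival times are >= 0 and batchMillis > 0.
    static long[] latencies(long[] arrivalMillis, long batchMillis) {
        long[] result = new long[arrivalMillis.length];
        for (int i = 0; i < arrivalMillis.length; i++) {
            long batchIndex = arrivalMillis[i] / batchMillis;   // which interval the event falls into
            long batchEnd = (batchIndex + 1) * batchMillis;     // when that interval is closed
            result[i] = batchEnd - arrivalMillis[i];
        }
        return result;
    }

    public static void main(String[] args) {
        long[] arrivals = {0, 120, 450, 999, 1000};
        // The same events under two batch sizes: larger batches mean longer waits.
        System.out.println("500 ms batches:  " + Arrays.toString(latencies(arrivals, 500)));
        System.out.println("1000 ms batches: " + Arrays.toString(latencies(arrivals, 1000)));
    }
}
```

In this model an event can wait up to one full batch interval before it is even looked at, which is why shrinking the interval is the main latency lever in a micro-batch runtime.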
Programming models
Hierarchy of Streaming APIs
Declarative (DataStream, DStream)
• Transformations, abstract operators
• For both engineers and data analysts
• Allows (some) automatic query optimization
Compositional (Spout, Consumer, Bolt, Task, Topology)
• Direct access to the execution graph
• Suitable for engineers
• Fine grained access but lower productivity
Apache Beam
• One API to rule them all: combined batch and streaming analytics
• Open sourced by Google, based on DataFlow
• Advanced windowing
• Runners on different systems
  • Google Cloud
  • Flink
  • Spark
  • (Others to follow…)
• Useful for benchmarking?
Apache Beam
• What results are calculated?
• Where in event time are results calculated?
• When in processing time are results materialized?
• How do refinements of results relate?
Counting words…
WordCount
Input:
storm dublin flink
apache storm spark
streaming samza storm
flink apache flink
bigdata storm
flink streaming
Output:
(storm, 4)
(dublin, 1)
(flink, 4)
(apache, 2)
(spark, 1)
(streaming, 2)
(samza, 1)
(bigdata, 1)
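As a baseline before the framework versions, the same count can be sketched in plain Java with no streaming engine at all. It assumes the slide's input arrives as six short lines of words. The class and method names are illustrative and not part of any of the systems discussed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Framework-free rolling word count; a reference point for the engine-specific versions.
class RollingWordCount {

    // Counts every word across the given lines.
    // LinkedHashMap keeps words in first-seen order, which makes the printout easy to follow.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);   // insert 1 or add 1 to the running count
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(
            "storm dublin flink",
            "apache storm spark",
            "streaming samza storm",
            "flink apache flink",
            "bigdata storm",
            "flink streaming");
        counts.forEach((word, n) -> System.out.println("(" + word + ", " + n + ")"));
    }
}
```

The streaming versions that follow differ mainly in where the `counts` map lives and how it survives a failure.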
Storm
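Assembling the topology
TopologyBuilder builder = new TopologyBuilder();
// 5 spout instances emit sentences
builder.setSpout("spout", new SentenceSpout(), 5);
// 8 splitters; shuffleGrouping distributes sentences randomly among them
builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
// 12 counters; fieldsGrouping routes equal words to the same counter instance
builder.setBolt("count", new Counter(), 12)
    .fieldsGrouping("split", new Fields("word"));

Rolling word count bolt
public class Counter extends BaseBasicBolt {
    // In-memory state: lost if the worker fails
    Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    // Required by the bolt interface: names the fields of the emitted tuples
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}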
Samza
Rolling word count task
public class WordCountTask implements StreamTask {
    // Local key-value store; its initialization is omitted on the slide
    private KeyValueStore<String, Integer> store;

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String word = (String) envelope.getMessage();
        Integer count = store.get(word);
        if (count == null) { count = 0; }
        count = count + 1;
        store.put(word, count);
        // Emit the updated count to the "wc" stream of the "kafka" system
        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "wc"), Tuple2.of(word, count)));
    }
}
Apex
Flink
case class Word(word: String, frequency: Int)

Rolling word count
val lines: DataStream[String] = env.socketTextStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .keyBy("word")
  .sum("frequency")
  .print()

Window word count
val lines: DataStream[String] = env.socketTextStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .keyBy("word")
  .timeWindow(Time.seconds(5))
  .sum("frequency")
  .print()
Spark
Window word count
// ssc is the StreamingContext
val lines = ssc.socketTextStream(...)
val words = lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
// reduceByKey on a DStream counts within each batch interval
val wordCounts = words.reduceByKey(_ + _)
wordCounts.print()

Rolling word count (new feature)
val func = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  val output = (word, sum)
  state.update(sum)
  output
}
// wordDstream is a stream of (word, 1) pairs; initialRDD holds the initial (word, count) pairs
val stateDstream = wordDstream.mapWithState(
  StateSpec.function(func).initialState(initialRDD))
stateDstream.print()
Beam
Window word count (minimalistic version)
// Assign each input line to a fixed 5-minute window
PCollection<String> windowedLines = input
    .apply(Window.<String>into(
        FixedWindows.of(Duration.standardMinutes(5))));

// Split lines into words, then count each word per window
PCollection<KV<String, Long>> wordCounts = windowedLines
    .apply(ParDo.of(new DoFn<String, String>() {
        @Override
        public void processElement(ProcessContext c) {
            for (String word : c.element().split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }))
    .apply(Count.<String>perElement());
Fault tolerance and stateful processing
Fault tolerance intro
• Fault-tolerance in streaming systems is inherently harder than in batch
  • Can’t just restart computation
  • State is a problem
  • Fast recovery is crucial
  • Streaming topologies run 24/7 for a long period
• Fault-tolerance is a complex issue
  • No single point of failure is allowed
  • Guaranteeing input processing
  • Consistent operator state
  • Fast recovery
  • At-least-once vs Exactly-once semantics
Storm record acknowledgements
• Track the lineage of tuples as they are processed (anchors and acks)
• Special “acker” bolts track each lineage DAG (efficient xor based algorithm)
• Replay the root of failed (or timed out) tuples
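The XOR trick can be sketched in a few lines of plain Java. This is an illustration of the idea only, not Storm's actual acker code: the class, its method names and the tuple ids are made up for the example. Each tuple id is XORed into a per-root value once when the tuple is created and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative XOR-based lineage tracker: one 64-bit value per root tuple.
class XorAcker {
    private final Map<Long, Long> ackValues = new HashMap<>();   // root id -> running XOR

    // The spout registers the first tuple of a new tree.
    void track(long rootId, long tupleId) {
        ackValues.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // A bolt acks one tuple and reports the ids of the tuples it emitted anchored to it.
    // Sending both in one update avoids the value hitting zero while children are in flight.
    // Returns true when the whole tree rooted at rootId is fully processed.
    boolean ack(long rootId, long ackedId, long... childIds) {
        long delta = ackedId;
        for (long child : childIds) {
            delta ^= child;
        }
        long value = ackValues.getOrDefault(rootId, 0L) ^ delta;
        if (value == 0L) {
            ackValues.remove(rootId);
            return true;
        }
        ackValues.put(rootId, value);
        return false;
    }

    // Trees still pending after a timeout would have their root replayed.
    boolean isPending(long rootId) {
        return ackValues.containsKey(rootId);
    }

    public static void main(String[] args) {
        XorAcker acker = new XorAcker();
        long root = 1L;
        long sentence = 0x5DEECE66DL;            // ids would be random 64-bit values in practice
        long wordA = 0x9E3779B97F4A7C15L;
        long wordB = 0xC2B2AE3D27D4EB4FL;

        acker.track(root, sentence);
        System.out.println(acker.ack(root, sentence, wordA, wordB)); // splitter done, two words in flight
        System.out.println(acker.ack(root, wordA));                  // one word still pending
        System.out.println(acker.ack(root, wordB));                  // tree complete
    }
}
```

The appeal of the scheme is constant memory per root tuple regardless of how large the lineage DAG grows. With random 64-bit ids, an accidental zero is possible in principle but extremely unlikely.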
Samza offset tracking
• Exploits the properties of a durable, offset based messaging layer
• Each task maintains its current offset, which moves forward as it processes elements
• The offset is checkpointed and restored on failure (some messages might be repeated)
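A small plain-Java simulation shows why offset checkpointing yields at-least-once delivery. It is a sketch under simplifying assumptions and does not use the Samza API: the log is an in-memory list, the checkpoint is a field, and a failure is simulated by resetting the read position. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative task that reads an offset-addressable log and checkpoints its position.
class OffsetTrackingTask {
    private final List<String> log;                           // durable, replayable input
    private final List<String> emitted = new ArrayList<>();   // everything sent downstream
    private int offset = 0;                                   // next offset to read
    private int checkpointedOffset = 0;                       // last durably stored offset

    OffsetTrackingTask(List<String> log) {
        this.log = log;
    }

    // Processes up to n messages, moving the offset forward.
    void process(int n) {
        for (int i = 0; i < n && offset < log.size(); i++) {
            emitted.add(log.get(offset));
            offset++;
        }
    }

    void checkpoint() {
        checkpointedOffset = offset;
    }

    // Simulated crash and restart: progress since the last checkpoint is forgotten.
    void failAndRestore() {
        offset = checkpointedOffset;
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        OffsetTrackingTask task =
            new OffsetTrackingTask(Arrays.asList("storm", "flink", "samza", "apex"));
        task.process(2);        // storm, flink
        task.checkpoint();      // offset 2 is now durable
        task.process(1);        // samza, processed but not yet checkpointed
        task.failAndRestore();  // restart from offset 2
        task.process(2);        // samza again, then apex
        System.out.println(task.emitted());
    }
}
```

No message is lost, but "samza" reaches the downstream consumer twice. That duplicate is exactly the repetition the slide refers to.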
Spark RDD recomputation
• Immutable data model with repeatable computation
• Failed RDDs are recomputed using their lineage
• Checkpoint RDDs to reduce lineage length
• Parallel recovery of failed RDDs
• Exactly-once semantics
Flink state checkpointing
• Consistent global snapshots with exactly-once semantics
• Algorithm designed for stateful dataflows (minimal runtime overhead)
• Pluggable state backends: Memory, FS, RocksDB, MySQL…
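The core idea behind exactly-once state can be illustrated with a single operator in plain Java: snapshot the operator state together with the input position, and restore both on failure. This is a deliberately simplified sketch and not Flink's distributed snapshot algorithm, which coordinates many operators. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative counter whose state and input offset are snapshotted as one unit.
class SnapshottingCounter {
    private final List<String> log;                        // replayable input
    private Map<String, Integer> counts = new HashMap<>();
    private int offset = 0;

    private Map<String, Integer> snapshotCounts = new HashMap<>();
    private int snapshotOffset = 0;

    SnapshottingCounter(List<String> log) {
        this.log = log;
    }

    // Processes up to n records, updating the counts and the read position.
    void process(int n) {
        for (int i = 0; i < n && offset < log.size(); i++) {
            counts.merge(log.get(offset), 1, Integer::sum);
            offset++;
        }
    }

    // State and offset are captured together, so they always describe the same point in the input.
    void snapshot() {
        snapshotCounts = new HashMap<>(counts);
        snapshotOffset = offset;
    }

    // Simulated failure: roll back both the state and the read position.
    void failAndRestore() {
        counts = new HashMap<>(snapshotCounts);
        offset = snapshotOffset;
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        SnapshottingCounter counter = new SnapshottingCounter(
            Arrays.asList("storm", "flink", "storm", "flink", "storm"));
        counter.process(2);
        counter.snapshot();
        counter.process(2);        // this progress is lost in the failure
        counter.failAndRestore();
        counter.process(3);        // replayed records are applied to the rolled-back state
        System.out.println(counter.counts().get("storm") + " " + counter.counts().get("flink"));
    }
}
```

Because the replayed records are applied to the rolled-back counts, every record affects the state exactly once, even though some were processed twice. Output already emitted before the failure is a separate concern that this sketch does not address.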
Apex state checkpointing
• Algorithm similar to Flink’s, but it also buffers output windows
• Larger memory overhead but faster, granular recovery
• Pluggable checkpoint backend, HDFS by default
Performance
How much does all this matter?
The winners of last year's Twitter hack week managed to reduce the resources needed for a specific job by 99%. [1]
There are many recent benchmarks out there
• Storm, Flink & Spark by Yahoo [2]
• Apex by DataTorrent [3,4]
• Flink by data Artisans [1,5]
[1] http://data-artisans.com/extending-the-yahoo-streaming-benchmark
[2] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[3] https://www.datatorrent.com/blog/blog-implementing-linear-road-benchmark-in-apex/
[4] https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
[5] http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
Next steps for streaming
• Dynamic scaling (with state)
• Rolling upgrades
• Better state handling
• More Beam runners
• Libraries: CEP, ML
• Better batch integration
Closing
Summary
• Streaming systems are gaining popularity with many businesses migrating some of their infrastructure
• The open source space sees a lot of innovation
• When choosing an application, consider your specific use cases; do not just follow the herd
• We have a recommended reading section :)
Thank you!
Recommended reading
Apache Beam
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Apache Spark Streaming
• https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html
• http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
Apache Flink
• http://flink.apache.org/news/2015/12/04/Introducing-windows.html
• http://data-artisans.com/flink-1-0-0/
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
Apache Storm
• https://community.hortonworks.com/articles/14171/windowing-and-state-checkpointing-in-apache-storm.html
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Samza / Kafka Streams
• http://docs.confluent.io/2.1.0-alpha1/streams/architecture.html
• http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
• http://docs.confluent.io/2.1.0-alpha1/streams/index.html
• http://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends
• http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html
Apache Apex
• http://docs.datatorrent.com/application_development/#apache-apex-platform-overview
• http://docs.datatorrent.com/application_development/#fault-tolerance
• https://github.com/apache/incubator-apex-malhar/tree/master/demos
• https://www.datatorrent.com/introducing-apache-apex-incubating/
List of Figures (in order of usage)
• https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/CPT-FSM-abcd.svg/326px-CPT-FSM-abcd.svg.png
• https://storm.apache.org/images/topology.png
• https://databricks.com/wp-content/uploads/2015/07/image11-1024x655.png
• https://databricks.com/wp-content/uploads/2015/07/image21-1024x734.png
• https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf, page 2.
• http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014, page 69-71.
• http://samza.apache.org/img/0.9/learn/documentation/container/checkpointing.svg
• https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png
• https://storm.apache.org/documentation/images/spout-vs-state.png
• http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job.png