Trivento Summercamp Masterclass, 9/9/2016
TRANSCRIPT
Lambda Architecture And Beyond
Stavros Kontopoulos, Senior Software Engineer @ Lightbend, M.Sc.
Trivento Summercamp 2016, Amersfoort
De oude Prodentfabriek
Introduction
2
Introduction: Who Am I?
Agenda
A bit of history of Big Data Processing
Batch Systems vs Streaming Systems
What is Lambda Architecture?
Advantages, Disadvantages?
Use cases
Data Lakes, Data Silos etc...
Implementing Lambda Architecture, ML support, Implementation Tips
Beyond the Lambda Architecture (Kappa, FastData, Zeta etc)
3
Last warning...
4
Data Processing
Batch processing: processing done on a bounded dataset.
Stream processing (streaming): processing done on an unbounded dataset. Data items are pushed or pulled.
Two categories of systems: batch vs streaming systems.
5
Big Data - The story
Internet scale apps moved data size from Gigabytes to Petabytes.
Once upon a time there were traditional RDBMS like Oracle and Data Warehouses but volume, velocity and variety changed the game.
6
Big Data - The story
MapReduce was a major breakthrough (Google published the seminal paper in 2004).
The Nutch project already had an implementation in 2005.
In 2006 it became a subproject of Lucene under the name Hadoop.
In 2008 Yahoo brought Hadoop to production on a 10,000-core cluster. The same year it became a top-level Apache project.
Hadoop is good for batch processing.
Big Data - The story
Word Count example - Inverted Index.
8
[Diagram: MapReduce word count. Input is divided into splits (Split 1 ... Split N) holding documents (doc1, doc2, ..., doc300, doc100). MAP emits (word, 1) pairs such as (w1,1)...(w20,1) and (w41,1)...(w1,1). The Shuffle phase groups pairs by word, e.g. (w1, (1,1,1,...)), (w41, (1,1,...)). REDUCE sums each group into counts such as (w1, 13).]
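The flow in the diagram can be sketched with plain Scala collections (a toy sketch, not the Hadoop API; the split/map/shuffle/reduce stages mirror the figure):

```scala
// Toy MapReduce word count over two input "splits" (plain Scala, not the Hadoop API).
val splits = Seq("the cat sat", "the dog sat down")

// MAP: each split emits (word, 1) pairs.
val mapped: Seq[(String, Int)] = splits.flatMap(_.split(" ").map(w => (w, 1)))

// SHUFFLE: group the pairs by key, e.g. ("the", Seq(1, 1)).
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

// REDUCE: sum the counts per word.
val counts: Map[String, Int] = shuffled.map { case (w, ones) => (w, ones.sum) }
```

In Hadoop the same three stages run distributed over the splits, with the shuffle moving data between map and reduce tasks.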
Big Data - The story
Giuseppe DeCandia et al., ”Dynamo: Amazon's highly available key-value store” changed the database world in 2007.
NoSQL databases, along with general-purpose systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs.
Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs over more powerful CPUs.
9
Big Data - The story
There is a major shift in the industry, as batch processing is not enough any more.
Batch jobs usually take hours if not days to complete; in many applications that is not acceptable.
10
Big Data - The story
The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics: Fast Data (data in motion) & Big Data (data at rest) at the same time.
The enterprise needs to get smarter; all major players across industries use ML on top of massive datasets to make better decisions.
11
Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530 https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg
Big Data - The story
OpsClarity report:
92% plan to increase their investment in stream processing applications in the next year
79% plan to reduce or eliminate investment in batch processing
32% use real-time analysis to power core customer-facing applications
44% agreed that it is tedious to correlate issues across the pipeline
68% identified lack of experience and the underlying complexity of new data frameworks as their barrier to adoption
http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
12
Big Data - The story
13 Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
Big Data - The story
14
In OpsClarity report:
● Apache Kafka is the most popular broker technology (ingestion queue)
● HDFS the most used data sink
● Apache Spark is the most popular data processing tool.
Big Data Landscape
15 Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
Big Data System
A Big Data System must have at least the following components at its core:
DFS: a distributed file system (e.g. S3, HDFS) or a distributed database system (DDS).
A distributed data processing tool such as Spark, Hadoop etc.
Tools and services to manage the previous systems.
16
Big Data System - Layered View
A Big Data System has at least an infrastructure layer and an application layer.
17
Big Data System Design Considerations / Problems
Data Locality
Data Versioning
Code change
Resource allocation
Deployment/Operation
Integration
Backup/Failover Strategy
Scaling Strategy
Security
Monitoring/Logging
Orchestration
Output Validation in data pipelines
18
Big Data System Quality
A Big Data System should be:
fault-tolerant
easy to debug
generic enough
scalable
extensible
able to support ad-hoc queries
high throughput
able to support low latency reads/writes
simple to operate
secure
19
Big Data and Immutable Data
Immutable data provides the following benefits:
Fault tolerance to human error (you can always replay history and fix things)
Simplicity: no index is needed for retrieval and update; just append newly arrived data.
20
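The idea can be sketched in a few lines of plain Scala (a toy sketch with a hypothetical Deposit event): state is never updated in place, events are only appended, and any view is a function of the whole log, so it can always be rebuilt with corrected code.

```scala
// Immutable, append-only event log: fix mistakes by replaying history,
// not by mutating state.
final case class Deposit(account: String, amount: Long)

var log: Vector[Deposit] = Vector.empty           // append-only master dataset
def append(e: Deposit): Unit = log = log :+ e     // no index, no in-place update

// Any view = function(all data); a bug in the view logic is fixed by
// replaying the same log with corrected code.
def balance(account: String): Long =
  log.filter(_.account == account).map(_.amount).sum

append(Deposit("acc-1", 100))
append(Deposit("acc-1", 50))
append(Deposit("acc-2", 70))
```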
Big Data System - Delivery/Processing Semantics
21
In distributed systems failure is part of the game. What semantics can I achieve for message delivery?
at-most-once delivery: for each message sent, that message is delivered zero or one times.
at-least-once delivery: for each message sent, potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost.
exactly-once delivery: for each message sent, exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.
In theory it is impossible to have exactly-once delivery.
In practice we might care more about exactly-once state changes combined with at-least-once delivery. Example: keeping state at some operator of the streaming graph.
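The practical combination can be sketched as at-least-once delivery plus deduplication on the consumer side, which makes the state change effectively exactly-once (a toy sketch with hypothetical message ids, not any particular broker's API):

```scala
// At-least-once delivery: the same message may arrive more than once.
// Tracking processed ids makes the STATE CHANGE exactly-once.
final case class Msg(id: Long, value: Int)

var seen: Set[Long] = Set.empty
var total: Int = 0                 // operator state

def onMessage(m: Msg): Unit =
  if (!seen(m.id)) {               // drop duplicates
    seen += m.id
    total += m.value               // state changes at most once per message id
  }

// Message id = 2 is redelivered, but the state is updated only once for it.
Seq(Msg(1, 10), Msg(2, 5), Msg(2, 5), Msg(3, 1)).foreach(onMessage)
```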
Batch Systems - The Hadoop Ecosystem
22
Yarn (Yet Another Resource Negotiator) deployed in production at Yahoo in March 2013.
Same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce.
Image: Lightbend Inc.
Batch Systems - The Hadoop Ecosystem
Hadoop clusters have been the gold standard for big data from ~2008 to the present.
Strengths:
Lowest CapEx system for Big Data.
Excellent for ingesting and integrating diverse datasets.
Flexible: from classic analytics (aggregations and data warehousing) to machine learning.
23
Batch Systems - The Hadoop Ecosystem
Weaknesses:
Complex administration.
YARN can’t manage all distributed services.
MapReduce has poor performance, a difficult programming model, and doesn’t support stream processing.
24
Analyzing Infinite Data Streams
25
What does it mean to run a SQL query on an unbounded data set?
How should I deal with the late data that I see?
What kind of time measurement should I use: event time, processing time or ingestion time?
Accuracy of computations on bounded vs unbounded datasets?
Algorithms for streaming computations?
Analyzing Infinite Data Streams
26
Two cases for processing:
Single event processing: event transformation, trigger an alarm on an error event
Event aggregations: summary statistics, group-by, join and similar queries. For example compute the average temperature for the last 5 minutes from a sensor data stream.
Analyzing Infinite Data Streams
27
Event aggregation introduces the concept of windowing with respect to the notion of time selected:
Event time (the time that events happen): important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.
Processing time (the time events are observed during processing): use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.
System arrival or ingestion time (the time that events arrived at the streaming system).
Ideally event time = processing time. In reality, there is skew.
Analyzing Infinite Data Streams
28
Windows come in different flavors:
Tumbling windows discretize a stream into non-overlapping windows.
Sliding Windows: slide over the stream of data.
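The difference can be made concrete by computing which window(s) a given event timestamp falls into (a sketch in plain Scala, not a framework API; windows are identified here by their start time):

```scala
// Which window(s) does an event with timestamp ts belong to?
def tumbling(ts: Long, size: Long): Seq[Long] =
  Seq(ts - ts % size)                       // exactly one, non-overlapping window

def sliding(ts: Long, size: Long, slide: Long): Seq[Long] = {
  val last = ts - ts % slide                // last window start at or before ts
  (last - size + slide) to last by slide    // every window still covering ts
}

// An event at t = 12 ms with 10 ms windows:
val t1 = tumbling(12, 10)                   // one window starting at 10
val s1 = sliding(12, 10, 5)                 // two overlapping windows: 5 and 10
```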
Analyzing Infinite Data Streams
29
Watermarks: a watermark indicates that no elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data.
Triggers: decide when the window is evaluated or purged.
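A toy sketch of how a watermark drives window evaluation (hypothetical types, not the Flink API): events are buffered per 10 ms tumbling window, and when a watermark passes a window's end, the window is evaluated and purged.

```scala
import scala.collection.mutable

// Buffer events per 10 ms tumbling window; a watermark fires every window
// whose end is at or before the watermark timestamp.
val size = 10L
val buffers = mutable.Map.empty[Long, List[Double]]  // window start -> values
var fired = Map.empty[Long, Double]                  // window start -> max value

def onEvent(ts: Long, value: Double): Unit = {
  val start = ts - ts % size
  buffers(start) = value :: buffers.getOrElse(start, Nil)
}

def onWatermark(wm: Long): Unit =
  buffers.keys.toList.filter(_ + size <= wm).foreach { start =>
    fired += start -> buffers(start).max             // evaluate the window...
    buffers -= start                                 // ...then purge it
  }

onEvent(3, 1.0); onEvent(7, 4.0); onEvent(12, 2.0)
onWatermark(10)  // fires window [0, 10); window [10, 20) stays buffered
```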
Analyzing Infinite Data Streams
30
Given the advances in streaming we can:
Trade-off latency with cost and accuracy
In certain use-cases replace batch processing with streaming
Analyzing Infinite Data Streams
31
Recent advances in streaming are a result of the pioneering work:
MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803
Analyzing Infinite Data Streams
32
Apache Beam is the open source successor of Google’s Dataflow.
It is becoming the standard API for streaming. It provides the advanced semantics needed for current streaming applications.
Streaming Systems Architecture
33
The user provides a graph of computations through a high-level API; data flows on the edges of this graph. Each vertex is an operator which executes a user-defined computation. For example: stream.map().keyBy()...
Operators can run in multiple instances and preserve state (unlike batch processing, where we have immutable datasets).
State can be persisted and restored in the presence of failures.
Analyzing Infinite Data Streams - Flink Example
34
sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }
case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
35
class SensorDataSource(val sensorType: SensorType,
                       val numberOfSensors: Int,
                       val watermarkTag: Int,
                       val numberOfElements: Int = -1) extends SourceFunction[SensorData] {
  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0
  val randomGen = Random

  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  lazy val initialReading: Double = sensorType match {
    case TemperatureSensor => 27.0
    case HumiditySensor => 0.75
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {
    val counterCondition = {
      if (numberOfElements == -1) { x: Int => isRunning }
      else { x: Int => isRunning && counter <= x }
    }

    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds
      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString,
        initialReading + Random.nextGaussian() / initialReading,
        sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp) // time starts at 0, in millisecs
      timestamp = timestamp + 1
      if (timestamp % watermarkTag == 0) { // watermark should be mod 0
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = { // No cleanup needed
    isRunning = false
  }
}
The Source
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
36
object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000

    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors,
        watermarkTag, numberOfElements))

    sensorDataStream.writeAsText("inputData.txt")

    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))

    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")

    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")

    env.execute("Sensor Data Simple Statistics")
  }
}

class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData],
            out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}
The Job
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
37
Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap etc.
Two instances of the same operator with parallelism 2 (previous example).
[Diagram: two operator instances (Operator 1, Operator 2) consume interleaved events; watermarks (Watermark 1 at 10, ..., Watermark N at 10*N) flow with the data; an event-time axis from 0 to 22 ms shows window 1 and window 2 being written out as file1 and file2.]
Streaming vs Batch Systems
38
Data size per job: TB to PB (batch); MB to TB in flight (streaming)
Time between data arrival and processing: many minutes to hours (batch); microseconds to minutes (streaming)
Job execution times: minutes to hours (batch); microseconds to minutes (streaming)
World of Patterns
A pattern (in general) is a perceptible regularity or a template (Wikipedia).
Software patterns: a well-defined, reusable solution to a commonly occurring problem in software design, e.g. Template Method, Singleton.
Software architecture patterns: an architectural pattern is a general, reusable solution to a commonly occurring problem in software architecture within a given context (Wikipedia), e.g. client-server, n-tier.
39
World of Patterns
Software Architecture vs Software Design.
We use them everywhere but… they are not a silver bullet. Why?
40
Software Architecture before Lambda Architecture
Many definitions for software architecture.
“Architecture: ⟨system⟩ fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution”. (ISO/IEC/IEEE 42010).
“Software architecture refers to the fundamental structures of a software system, the discipline of creating such
structures, and the documentation of these structures. These structures are needed to reason about the software
system.” Wikipedia
“It is about structure and vision”. Software architecture for developers, Simon Brown.
“The highest-level breakdown of a system into its parts; the decisions that are hard to change; there are multiple
architectures in a system; what is architecturally significant can change over a system's lifetime; and, in the end,
architecture boils down to whatever the important stuff is.” Patterns of Enterprise Application Architecture, Martin Fowler
41
Software Architecture is important
Architectural decisions are decisions that have non-local consequences, and they serve specific goals, e.g. to achieve a performance goal like high throughput I decided to use buffering within my system.
Architectural decisions are important for your in-house project or your proposal if you are a consultant.
42
Sound Architecture Principles: Why I Need it?
Scalability/Elasticity
Extensibility: requirements will change; expect that
Minimized costs
Security awareness
Well-designed APIs for integration
Well-tested: don’t go to production and cross your fingers.
43
Follow common sense...
At the end of the day, expect to throw everything out of the window under some circumstances. Business matters the most.
Example: non-functional requirements changed since the load is huge and you are becoming successful; maybe you are the next Facebook.
44
Software Architecture is important
...because there is a high cost to not making specific decisions, or to not making them early enough.
45
Software Architecture is important
How about the wrong decisions?
Image: http://www.awesomeinventions.com/wp-content/uploads/2014/10/balcony.jpg
46
Software Architecture is important
Many more benefits where architecture is present:
A documented architecture assists communication
Guides implementation by imposing constraints
Assists in technology decisions
Assists in cost and time estimation
Influences the structure of your organization and vice versa
47
Software Architecture LifeCycle
Steps:
Architectural Requirements
Architectural Design
Architectural Documentation
Architectural Evaluation / Implementation
48
Lambda Architecture - Intro
“Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system. The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.”
49
Nathan Marz and James Warren, Big Data: Principles and best practices of scalable real-time data systems, Manning Publications.
Photo: https://images-na.ssl-images-amazon.com/images/I/51Bd93AGuOL._SX258_BO1,204,203,200_.jpg
Lambda Architecture - Cont’d (1/5)
50
Image: http://lambda-architecture.net/img/la-overview_small.png
Batch Layer: perfect accuracy, indexed batch views
Serving Layer: random access query support based on batch & real-time views
Speed Layer: process real-time streams, provides real-time views, lower accuracy
Master dataset: append-only, immutable set of raw data
Lambda Architecture - Cont’d (2/5)
Example components for each part:
Batch layer: Hadoop
Batch Output Indexing: Druid, Impala etc
Speed Output Indexing: Druid, Cassandra, HBase etc
Speed processing: Spark, Flink etc
51
Lambda Architecture - Cont’d (3/5)
Basic functions:
batch view = function (all data) <- high latency, high throughput
realtime view = function (realtime view, new data) <- low latency, low throughput
query = function (batch view, realtime view ) <- eventual accuracy
52
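The three functions can be sketched over a simple counting metric (a toy sketch: the batch view is recomputed from scratch over the master dataset, the realtime view is updated incrementally from new data, and the query merges both):

```scala
// Lambda views over (key, count) increments.
type View = Map[String, Int]

def merge(a: View, b: View): View =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap

// batch view = function(all data): high latency, recomputed from scratch
def batchView(allData: Seq[(String, Int)]): View =
  allData.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// realtime view = function(realtime view, new data): low latency, incremental
def realtimeView(current: View, event: (String, Int)): View =
  merge(current, Map(event._1 -> event._2))

val batch = batchView(Seq("a" -> 1, "b" -> 2, "a" -> 3))  // data the batch layer has absorbed
val speed = Seq("a" -> 1, "c" -> 5).foldLeft(Map.empty[String, Int])(realtimeView)

// query = function(batch view, realtime view): eventually accurate
val query = merge(batch, speed)
```

Once a new batch run absorbs the speed-layer data, the corresponding realtime view is discarded, which is the "eventual accuracy" property of the next slide.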
Lambda Architecture - Cont’d (4/5)
Key Properties:
Eventual Accuracy
The batch layer is always behind in time; it continuously produces batch outputs, and whenever a new batch output is available it replaces the latest one. Eventually the batch layer will catch up with the speed layer.
Complexity Isolation
53
Lambda Architecture - Cont’d (5/5)
Advantages:
Immutable data.
Reprocessing takes care of code changes, human error etc.
Disadvantages:
Operating/maintaining two different systems (batch & streaming) is hard.
Programming in two different paradigms makes the code-base complex.
54
What about Data Lakes?
A data lake accumulates data from different applications.
It does not transform data in any way.
Access from multiple users, no data silos; data is not hidden in special systems.
There is no schema imposed on the data, only raw data; a schema is applied when the data is read (schema-on-read).
Includes structured, semi-structured, and unstructured data
55
Data Lakes Categories
Data reservoirs: Governed accumulation of data for later use. Data are secured and go under the process of ingestion, cleansing, profiling and indexing.
Exploratory lakes: Accumulation of data without governance for ad-hoc analysis by data scientists et al to gain insights.
Analytical lakes: Ingest your data to feed data pipelines for analytics.
56
Data Lakes vs Data Warehouse
A data lake can replace a data warehouse in several scenarios, when that makes sense.
57
Schema: schema-on-read (data lake) vs schema-on-write (data warehouse)
Users: data scientists and people who need ad-hoc analysis (data lake) vs business analysts (data warehouse)
Data: structured, semi-structured, unstructured (data lake) vs rigid structure (data warehouse)
Flexibility: high, reprocessing is easy (data lake) vs low, tied to business processes (data warehouse)
Data Lakes usually fail!
Most projects fail... you have been warned! Your next data lake can become a big data swamp.
58
Image: http://www.sharenator.com/Demotivationals_pt_3_P/
Data Lakes extended with a Lambda Architecture
You can always use your Lambda Architecture on top of a data lake if that makes sense. A data lake can be your DFS with specific services built around it, like metadata management. It can make things easy, especially when you start small and are trying to figure out what you need.
It can be very simple: you use the batch layer only for loading the data from a source for streaming. No presentation layer is needed.
How about Kafka?
59
Azure Data Lake
60
Image: https://azure.microsoft.com/en-us/solutions/data-lake/
How about Data Silos?
Separate containers of data.
The big data platform or the big data system at hand should unify business information, development teams and data in a business useful way.
Think about a scenario with microservices, event sourcing and analytics.
61
Use Cases
Yahoo
Netflix
Flickr
62
Flickr’s Use case - The Problem
Magic View feature: a computer vision pipeline generates a set of computer vision tags, and reverse indexes are created per user along with aggregated tag info.
Initially batch only; then a streaming layer was added for a live experience.
Backfills were needed because of photos missed by the streaming layer (approximation errors) and code changes.
Backfills via streaming were slow due to the nature of the read-modify-write (RMW) access pattern.
63
Flickr’s Use case - Solution
64
Result = Combiner(Query(data))
Implementing The Lambda Architecture
Smack stack based Lambda Architecture:
65
mesos
Spark
hdfs
Spark or Flink
Kafka Cassandra Query app
Akka driven apps user
Machine Learning Support for Lambda Architecture
Build a model and serve it. Simple models vs complex models.
Spark for model building and Flink for model serving.
Parameter servers:
https://issues.apache.org/jira/browse/SPARK-6932
https://github.com/rjagerman/glint
http://parameterserver.org/
http://www.petuum.com/bosen.html
https://github.com/JohnLangford/vowpal_wabbit/wiki
66
Real World Implementation Tips
JVM-based technologies like Cassandra and Kafka need correct GC settings.
Monitoring is a must. Cassandra, Kafka etc. provide JMX interfaces to get the counter values you need. You need to know and understand which ones are worth monitoring closely.
It is not wise to co-locate everything; you need to be careful about component requirements. For example, ZooKeeper should run on its own box, but if co-located it should have its own high-speed volume assigned for its commit log.
Vendors offer specific requirements for production, which stem from experience using the technology in production.
https://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html
http://www.confluent.io/blog/design-and-deployment-considerations-for-deploying-apache-kafka-on-aws/
67
Real World Implementation Tips
OS settings.
Don't misuse technologies. Example: Kafka is not a database.
Design decisions. Example: time-series data on Cassandra.
Data locality and data movement. Example: Kafka rebalance.
Logging. How do I monitor my job? Log correlation?
For batch processing you need a flexible orchestration tool like: https://github.com/apache/incubator-airflow
Within your data center vs across data centers. On cloud: availability zones vs regions.
Learn your technology.
68
Beyond the Lambda Architecture
Kappa Architecture (2014)
Zeta Architecture (2015)
IoT-A Architecture (2010- 2013)
Butterfly Architecture (~2015)
Fast Data architecture (~2016)
69
Kappa Architecture
Introduced by Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent, in 2014.
See https://www.oreilly.com/ideas/questioning-the-lambda-architecture
The Lambda Architecture is good, but keeping two layers in sync is too much to ask, and in practice it is hard to achieve:
“The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.”
Batch processing is a subset of stream processing. Different technologies want to take advantage of this fact and provide a holistic solution:
Flink, http://data-artisans.com/batch-is-a-special-case-of-streaming/
Spark, https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
70
Kappa Architecture
1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.
Re-processing is done only when code changes.
71
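The four steps can be simulated with a retained log and two job versions (a toy sketch in plain Scala; the job versions and tables are hypothetical, not Kafka APIs):

```scala
// Kappa reprocessing: replay the retained log with new code into a NEW table,
// then switch the application over and delete the old table.
val retainedLog = Vector("a", "b", "a", "c", "b", "a")   // e.g. 30 days of Kafka retention

def runJob(log: Seq[String], f: String => String): Map[String, Int] =
  log.map(f).groupBy(identity).map { case (k, vs) => k -> vs.size }

val tableV1 = runJob(retainedLog, identity)              // old job's output table
var serving = tableV1                                    // the application reads this table

// Code change: the new job version replays the log from the beginning
// into a new output table...
val tableV2 = runJob(retainedLog, _.toUpperCase)
// ...and once it has caught up, the application switches to the new table.
serving = tableV2
```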
Image: https://dmgpayxepw99m.cloudfront.net/kappa-61d0afc292912b61ce62517fa2bd4309.png
Kappa Architecture Pros & Cons
72
Pros:
● Develop and maintain only one streaming system.
● Reprocessing only when code changes.
Cons:
● Need temporary storage for the reprocessing streaming job.
Kappa Architecture - When to use?
73
● Algorithms for streaming and batch processing are the same.
● Batch and real-time outputs can be the same.
Zeta Architecture
Introduced by MapR for supporting as-it-happens business (March 2015).
Goals:
Exploit all existing hardware in the data center.
Back-up and disaster recovery support for real-time continuity
Tolerance for human mistake
End-to-End Security
Support Google-scale systems
74
Zeta Architecture - Components
Seven pluggable components:
Distributed File System: All applications write here.
Real-time Data Storage: Needed for high-speed business applications.
Pluggable Compute Model / Execution Engine: different needs require different engines.
Deployment / Container Management: Allows for a common way to deploy resources.
75
Zeta Architecture - Components
Seven pluggable components:
Solution Architecture: Focuses on solving a specific business problem.
Enterprise Applications: Used to drive the architecture. Now they are realized via existing components.
Dynamic and Global Resource Management: Allows dynamic allocation of resources which fits the business needs each time.
76
Zeta Architecture
Components and reference applications
77
Image: https://www.mapr.com/zeta-architecture
Zeta Architecture Example
78
Images: https://www.mapr.com/zeta-architecture
IoT-A Architecture
Targets IoT applications; proposed by Michael Hausenblas (MapR, Mesosphere) in 2015.
IoT leads to a Big Data architecture because:
High volume of data from sensors
Time-series data format, or other types of formats.
Data are generated at high speed and the business needs real-time processing.
79
IoT-A Architecture
Basic Architecture:
Message Queue / Streaming Block (MQ/SP)
DB: A real-time DB for indexing sensor data. Low Latency.
DFS: The distributed file system where batch jobs can be run and batch reports can be created.
80
IoT-A Architecture
81
http://iot-a.info/
IoT-A Architecture - Implementation Technologies
82
http://iot-a.info/
Butterfly Architecture
83
● Introduced by Milind Bhandarkar (Pivotal).
● The weak point of the Lambda Architecture lies in the distributed file system, which cannot serve all layers.
● They propose the use of memory technologies other than DRAM (like storage-class memory) to implement an efficient object storage engine.
● They use different abstractions compared to the files or directories of a DFS: datasets, dataframes, eventstreams.
[Diagram: storage abstractions arranged along two axes, mutable vs immutable and unmanaged vs managed: log, publish, data frames, data sets, storage, ETL.]
Butterfly Image: http://sketch2draw.com/wp-content/uploads/2013/05/butterfly_thumb.jpg
A Fast Data Architecture
84
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
Example IoT Application
85
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
Streaming Implementations Status
86
Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. It is still based on micro-batches, but event-time support was added.
Apache Flink: SQL API supported from v0.9 onwards. Important features are still on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.
Picking the Right Tool for Streaming
87
Criteria to choose:
Processing semantics (strong consistency is needed for correctness)
Latency guarantees
Deployment / Operation
Ecosystem build around it
Complex event processing (CEP)
Batch & Streaming API support
Community & Support
Picking the Right Tool for Streaming
88
Some tips:
Pick Flink if you need sub-second latency and Beam support.
Pick Spark Streaming for its integration with Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities.
Pick Gearpump for materializing Akka Streams in a distributed fashion.
Pick Kafka Streams for low-level simple transformations of Kafka messages (it is a distributed solution out of the box). (Check the Confluent Platform for many useful tools around Kafka.)
Questions?
Thank you!
89
References
Books:
Bhushan Lakhe, Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL.
Humberto Cervantes and Rick Kazman, Designing Software Architectures: A Practical Approach (SEI Series in Software Engineering).
Nathan Marz and James Warren, Big Data: Principles and best practices of scalable realtime data systems, Manning Publications.
90
References - Cont’d
Web resources/Articles:
Questioning the Lambda Architecture - O'Reilly Media
Structured Streaming In Apache Spark | Databricks Blog
The world beyond batch: Streaming 101 - O'Reilly Media
The world beyond batch: Streaming 102 - O'Reilly Media
Data Centric Enterprise | MapR
Why local state is a fundamental primitive in stream processing - O'Reilly Media
Data processing architectures – Lambda and Kappa - Ericsson Research Blog
2016 State of Fast Data Survey | OpsClarity
Zeta Architecture | MapR
Is Big Data Still a Thing? (The 2016 Big Data Landscape) – Matt Turck
IoT-a (MapR)
Powering Flickr’s Magic view by fusing bulk and real-time compute | code.flickr.com
Data Lake vs Data Warehouse: Key Differences
Don't Let Your Data Lake Turn into a Swamp
Extending Data Lake using the Lambda Architecture June 2015
Azure Data Lake
Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things
Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
How Apache Flink™ enables new streaming applications – data Artisans
91