trivento summercamp masterclass 9/9/2016

Lambda Architecture And Beyond Stavros Kontopoulos Senior Software Engineer @ Lightbend, M.Sc. Trivento Summercamp 2016 Amersfoort De oude Prodentfabriek

Upload: stavros-kontopoulos

Post on 15-Feb-2017

TRANSCRIPT

Page 1: Trivento summercamp masterclass 9/9/2016

Lambda Architecture And Beyond

Stavros Kontopoulos

Senior Software Engineer @ Lightbend, M.Sc.

Trivento Summercamp 2016 Amersfoort

De oude Prodentfabriek

Page 2: Trivento summercamp masterclass 9/9/2016

Introduction


Introduction: Who Am I?

Page 3: Trivento summercamp masterclass 9/9/2016

Agenda

A bit of history of Big Data Processing

Batch Systems vs Streaming Systems

What is Lambda Architecture?

Advantages, Disadvantages?

Use cases

Data Lakes, Data Silos etc...

Implementing Lambda Architecture, ML support, Implementation Tips

Beyond the Lambda Architecture (Kappa, FastData, Zeta etc)

Page 4: Trivento summercamp masterclass 9/9/2016

Last warning...


Page 5: Trivento summercamp masterclass 9/9/2016

Data Processing

Batch processing: processing done on a bounded dataset.

Stream Processing (Streaming): processing done on an unbounded dataset. Data items are pushed or pulled.

Two categories of systems: batch vs streaming systems.
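The distinction can be sketched in a few lines of plain Scala. This is only an illustration, not code from the talk: the object and method names are made up, and an Iterator stands in for an unbounded source. The batch version needs the whole dataset before it answers; the streaming version keeps a small running state and emits an updated answer per element.

```scala
object BatchVsStreaming {
  // Batch: the whole (bounded) dataset is available before we compute.
  def batchMean(data: Seq[Double]): Double = data.sum / data.size

  // Streaming: elements arrive one at a time; we keep a running (sum, count)
  // and emit an updated mean after every element.
  def streamingMeans(data: Iterator[Double]): Iterator[Double] =
    data
      .scanLeft((0.0, 0L)) { case ((sum, n), x) => (sum + x, n + 1) }
      .drop(1) // skip the initial empty state
      .map { case (sum, n) => sum / n }
}
```

Because the streaming version is lazy, it also works on a source that never ends, for example `BatchVsStreaming.streamingMeans(Iterator.continually(1.0)).take(3)`.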


Page 6: Trivento summercamp masterclass 9/9/2016

Big Data - The story

Internet scale apps moved data size from Gigabytes to Petabytes.

Once upon a time there were traditional RDBMS like Oracle and Data Warehouses but volume, velocity and variety changed the game.


Page 7: Trivento summercamp masterclass 9/9/2016

Big Data - The story

MapReduce was a major breakthrough (Google published the seminal paper in 2004).

The Nutch project already had an implementation in 2005.

In 2006 it becomes a subproject of Lucene under the name Hadoop.

In 2008 Yahoo brings Hadoop to production on a 10K-core cluster. The same year it becomes a top-level Apache project.

Hadoop is good for batch processing.

Page 8: Trivento summercamp masterclass 9/9/2016

Big Data - The story

Word Count example - Inverted Index.

[Diagram: MapReduce data flow]

Input: Split 1 ... Split N (doc1, doc2 ... doc300, doc100)

Map: (w1,1)…(w20,1) ... (w41,1)…(w1,1)

Shuffle: (w1, (1,1,1…))... (w41, (1,1,…))...

Reduce: (w1, 13)... (w1, 3)...
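The three phases in the diagram can be imitated on a single machine with ordinary Scala collections. This is a sketch of the programming model only, not of Hadoop's API; the object and method names are illustrative.

```scala
object WordCount {
  // Map phase: each document emits a (word, 1) pair per word occurrence.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all emitted values by key (the word).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

  // Reduce phase: sum the grouped ones for every word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (w, ones) => (w, ones.sum) }

  def count(docs: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(docs.flatMap(mapPhase)))
}
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network; here everything happens in memory.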

Page 9: Trivento summercamp masterclass 9/9/2016

Big Data - The story

Giuseppe DeCandia et al., “Dynamo: Amazon's Highly Available Key-value Store” changed the database world in 2007.

NoSQL databases, along with general-purpose systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs.

Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs rather than more powerful CPUs.

Page 10: Trivento summercamp masterclass 9/9/2016

Big Data - The story

There is a major shift in the industry as batch processing is not enough any more.

Batch jobs usually take hours, if not days, to complete; in many applications that is not acceptable.

Page 11: Trivento summercamp masterclass 9/9/2016

Big Data - The story

The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics. Fast Data (data in motion) & Big Data (data at rest) at the same time.

The enterprise needs to get smarter: all major players across industries use ML on top of massive datasets to make better decisions.

Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530 https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg

Page 12: Trivento summercamp masterclass 9/9/2016

Big Data - The story

OpsClarity report:

92% plan to increase their investment in stream processing applications in the next year

79% plan to reduce or eliminate investment in batch processing

32% use real time analysis to power core customer-facing applications

44% agreed that it is tedious to correlate issues across the pipeline

68% identified lack of experience and underlying complexity of new data frameworks as their barrier to adoption

http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

Page 13: Trivento summercamp masterclass 9/9/2016

Big Data - The story

Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

Page 14: Trivento summercamp masterclass 9/9/2016

Big Data - The story


In OpsClarity report:

● Apache Kafka is the most popular broker technology (ingestion queue)

● HDFS the most used data sink

● Apache Spark is the most popular data processing tool.

Page 15: Trivento summercamp masterclass 9/9/2016

Big Data Landscape

Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

Page 16: Trivento summercamp masterclass 9/9/2016

Big Data System

A Big Data System must have at least the following components at its core:

DFS - a distributed file system (e.g. S3, HDFS) or a distributed database system (DDS).

A distributed data processing tool such as Spark or Hadoop.

Tools and services to manage the previous systems.

Page 17: Trivento summercamp masterclass 9/9/2016

Big Data System - Layered View

A Big Data System has at least an infrastructure layer and an application layer.

Page 18: Trivento summercamp masterclass 9/9/2016

Big Data System Design Considerations / Problems

Data Locality

Data Versioning

Code change

Resource allocation

Deployment/Operation

Integration

Backup/Failover Strategy

Scaling Strategy

Security

Monitoring/Logging

Orchestration

Output Validation in data pipelines


Page 19: Trivento summercamp masterclass 9/9/2016

Big Data System Quality

A Big Data System should be:

fault-tolerant

easy to debug

generic enough

scalable

extensible

able to support ad-hoc queries

high throughput

able to support low latency reads/writes

simple to operate

secure


Page 20: Trivento summercamp masterclass 9/9/2016

Big Data and Immutable Data

Immutable data provide the following benefits:

Fault-tolerance to human error (you can always replay history and fix things)

Simplicity: no index is needed for retrieval and update; just append newly arrived data.
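Both benefits can be shown with a small in-memory sketch. The names below (PageView, the two view functions) are hypothetical and only serve the illustration: the log is append-only, a derived view is a function of the whole log, and a bug in that function is repaired by replaying history with the fixed code.

```scala
object ImmutableLog {
  final case class PageView(user: String, url: String)

  // The master dataset is append-only: nothing is updated or deleted in place.
  def append(log: Vector[PageView], event: PageView): Vector[PageView] = log :+ event

  // A view is a function of the whole log, so it can be rebuilt at will.
  // This first version contains a human error: it groups by user, not by url.
  def buggyViewsPerUrl(log: Seq[PageView]): Map[String, Int] =
    log.groupBy(_.user).map { case (k, v) => (k, v.size) }

  // Fixing the code and replaying the same immutable log repairs the view.
  def fixedViewsPerUrl(log: Seq[PageView]): Map[String, Int] =
    log.groupBy(_.url).map { case (k, v) => (k, v.size) }
}
```

Had the buggy code updated counters in place, the raw events would be gone and the damage could not be undone.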


Page 21: Trivento summercamp masterclass 9/9/2016

Big Data System - Delivery/Processing Semantics

In distributed systems failure is part of the game. What semantics can I achieve for message delivery?

at-most-once delivery: for each message sent, that message is delivered zero or one times.

at-least-once delivery: for each message sent potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost.

exactly-once delivery: for each message sent exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.

In theory it is impossible to have exactly once delivery.

In practice we might care more for exactly-once state changes and at-least once delivery. Example: Keeping state at some operator of the streaming graph.
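One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent. The sketch below assumes every message carries a unique id and that the operator remembers the ids it has already applied; both the message shape and the names are assumptions made for the example.

```scala
object ExactlyOnceState {
  final case class Message(id: Long, amount: Int)

  // Operator state: the running total plus the ids already applied.
  final case class State(total: Int, seen: Set[Long])

  // Applying a message is idempotent: a redelivered id leaves the state unchanged.
  def applyMsg(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else State(state.total + msg.amount, state.seen + msg.id)

  // Fold over everything that was delivered, duplicates included.
  def process(deliveries: Seq[Message]): State =
    deliveries.foldLeft(State(0, Set.empty[Long]))(applyMsg)
}
```

A production system would also have to bound the set of remembered ids and persist it together with the total; this sketch leaves both out.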

Page 22: Trivento summercamp masterclass 9/9/2016

Batch Systems - The Hadoop Ecosystem

YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in March 2013.

The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce.

Image: Lightbend Inc.

Page 23: Trivento summercamp masterclass 9/9/2016

Batch Systems - The Hadoop Ecosystem

Hadoop clusters, the gold standard for big data from ~2008 to the present.

Strengths:

Lowest CapEx system for Big Data.

Excellent for ingesting and integrating diverse datasets.

Flexible: from classic analytics (aggregations and data warehousing) to machine learning.


Page 24: Trivento summercamp masterclass 9/9/2016

Batch Systems - The Hadoop Ecosystem

Weaknesses:

Complex administration.

YARN can’t manage all distributed services.

MapReduce has poor performance, a difficult programming model, and doesn’t support stream processing.

Page 25: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams

What does it mean to run a SQL query on an unbounded data set?

How should I deal with late data?

What kind of time measurement should I use? Event-time, Processing time or Ingestion time?

Accuracy of computations on bounded datasets vs on unbounded datasets

Algorithms for streaming computations?

Page 26: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams


Two cases for processing:

Single event processing: event transformation, trigger an alarm on an error event

Event aggregations: summary statistics, group-by, join and similar queries. For example compute the average temperature for the last 5 minutes from a sensor data stream.

Page 27: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams

Event aggregation introduces the concept of windowing with respect to the notion of time selected:

Event time (the time that events happen): Important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.

Processing time (the time they are observed during processing): Use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.

System Arrival or Ingestion time (the time that events arrived at the streaming system).

Ideally event time = Processing time. Reality is: there is skew.
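The effect of skew is easy to see by counting the same events per one-second window under both clocks. In this sketch the Click type and its two timestamp fields are invented for the example; timestamps are in milliseconds.

```scala
object TimeNotions {
  // eventTime: when the click happened; processingTime: when the system saw it.
  final case class Click(eventTime: Long, processingTime: Long)

  // Count clicks per window of sizeMs, keyed by window start, for a chosen clock.
  def countPerWindow(clicks: Seq[Click], sizeMs: Long)(clock: Click => Long): Map[Long, Int] =
    clicks
      .groupBy(c => clock(c) / sizeMs * sizeMs)
      .map { case (windowStart, cs) => (windowStart, cs.size) }
}
```

With clicks `(100, 150)`, `(900, 1200)` and `(1100, 1250)`, the second click is counted in window 0 by event time but in window 1000 by processing time, so the two notions give different per-window results for identical data.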

Page 28: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams


Windows come in different flavors:

Tumbling windows discretize a stream into non-overlapping windows.

Sliding Windows: slide over the stream of data.
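Window assignment can be written down as two small functions. This is a sketch under simplifying assumptions (non-negative timestamps, windows aligned at 0, names chosen for the example) and is not how any particular engine implements it.

```scala
object Windows {
  // Tumbling: each timestamp belongs to exactly one window [start, start + size).
  def tumbling(ts: Long, size: Long): Long = ts - (ts % size)

  // Sliding: a window of length size starts every slide time units, so a
  // timestamp can belong to several overlapping windows. Returns their starts.
  def sliding(ts: Long, size: Long, slide: Long): Seq[Long] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size && start >= 0)
      .toList
      .reverse
  }
}
```

For example, with size 10 and slide 5 the timestamp 12 falls into the windows starting at 5 and at 10, while a tumbling window of size 10 places it only in the window starting at 10.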

Page 29: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams

Watermarks: a watermark indicates that no more elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data.

Triggers: decide when the window is evaluated or purged.
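How the two interact can be simulated without a streaming engine. The sketch below makes several assumptions that are not from the talk: tumbling event-time windows, a trigger that fires and purges a window once the watermark reaches the window's last timestamp, and late elements that are simply dropped. All type names are local to the example.

```scala
object WatermarkDemo {
  sealed trait Item
  final case class Event(ts: Long, value: Double) extends Item
  final case class Watermark(ts: Long) extends Item

  // Returns (windowStart, max value) for each window, in firing order.
  def maxPerWindow(items: Seq[Item], size: Long): Vector[(Long, Double)] = {
    var open = Map.empty[Long, Double] // window start -> current max
    var watermark = Long.MinValue
    var fired = Vector.empty[(Long, Double)]
    items.foreach {
      case Event(ts, v) =>
        val start = ts - (ts % size)
        if (start + size - 1 > watermark) // window still open: update its max
          open = open.updated(start, math.max(v, open.getOrElse(start, v)))
        // otherwise the element is late and is dropped in this sketch
      case Watermark(w) =>
        watermark = w
        // Trigger: fire and purge every window whose last timestamp is <= w.
        val (ready, pending) = open.partition { case (start, _) => start + size - 1 <= w }
        fired = fired ++ ready.toVector.sortBy(_._1)
        open = pending
    }
    fired
  }
}
```

Real engines offer more choices here, such as allowed lateness or early firings; the sketch shows only the simplest policy.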

Page 30: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams


Given the advances in streaming we can:

Trade-off latency with cost and accuracy

In certain use-cases replace batch processing with streaming

Page 31: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams

Recent advances in streaming are a result of pioneering work:

MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803

Page 32: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams

Apache Beam is the open source successor of Google’s DataFlow.

It is becoming the standard API for streaming. It provides the advanced semantics needed by current streaming applications.

Page 33: Trivento summercamp masterclass 9/9/2016

Streaming Systems Architecture

The user provides a graph of computations through a high level API, where data flows on the edges of this graph. Each vertex is an operator which executes a user operation/computation. For example: stream.map().keyBy()...

Operators can run in multiple instances and preserve state (unlike batch processing where we have immutable datasets).

State can be persisted and restored in the presence of failures.
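A stateful operator with snapshot and restore can be reduced to a few lines. This is a toy model with invented names, not the API of Flink or Spark: the operator keeps a per-key count, exposes its state as an immutable snapshot, and a replacement instance can be started from that snapshot after a failure.

```scala
object StatefulOperator {
  // Keyed running-count operator: state is a map from key to count.
  final class Counter(private var state: Map[String, Long] = Map.empty) {
    def process(key: String): (String, Long) = {
      val n = state.getOrElse(key, 0L) + 1
      state = state.updated(key, n)
      (key, n)
    }
    // An immutable copy of the state; a real system would write it to durable storage.
    def snapshot: Map[String, Long] = state
  }

  // After a failure, a new operator instance resumes from the last snapshot.
  def restore(snapshot: Map[String, Long]): Counter = new Counter(snapshot)
}
```

Elements processed after the last snapshot would have to be replayed from the source, which is where the delivery semantics discussed earlier come in.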

Page 34: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams - Flink Example

sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }

case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)

https://github.com/skonto/trivento-summercamp-2016

Page 35: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams - Flink Example


import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark
import scala.util.Random

class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                       val watermarkTag: Int, val numberOfElements: Int = -1)
    extends SourceFunction[SensorData] {
  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0
  val randomGen = Random

  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  lazy val initialReading: Double = {
    sensorType match {
      case TemperatureSensor => 27.0
      case HumiditySensor => 0.75
    }
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {

    // numberOfElements == -1 means "run until cancelled"
    val counterCondition = {
      if (numberOfElements == -1) { x: Int => isRunning }
      else { x: Int => isRunning && counter <= x }
    }

    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds

      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString, initialReading + Random.nextGaussian()/initialReading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp) // time starts at 0 in millisecs
      timestamp = timestamp + 1

      if (timestamp % watermarkTag == 0) { // emit a watermark every watermarkTag elements
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = { // No cleanup needed
    isRunning = false
  }
}

The Source

https://github.com/skonto/trivento-summercamp-2016

Page 36: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams - Flink Example


import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000

    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))

    sensorDataStream.writeAsText("inputData.txt")

    // one 10 ms event-time window per sensor id
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))

    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")

    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")
    env.execute("Sensor Data Simple Statistics")
  }
}

class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      // emit one record per window carrying the average value
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}

The Job

https://github.com/skonto/trivento-summercamp-2016

Page 37: Trivento summercamp masterclass 9/9/2016

Analyzing Infinite Data Streams - Flink Example

Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap etc.

Two instances of the same operator with parallelism 2 (previous example).

[Figure: Operator 1 and Operator 2 with outputs file1 and file2; a time axis 0, 1, 2, ... 22...; Watermark 1 (10) through Watermark N (10*N); window 1 and window 2.]

Page 38: Trivento summercamp masterclass 9/9/2016

Streaming vs Batch Systems

Metric | Batch | Streaming

Data size per job | TB to PB | MB to TB (in flight)

Time between data arrival and processing | Many minutes to hours | Microseconds to minutes

Job execution times | Minutes to hours | Microseconds to minutes

Page 39: Trivento summercamp masterclass 9/9/2016

World of Patterns

Pattern (in general) … is a perceptible regularity or a template (Wikipedia).

Software Patterns: a well-defined, reusable solution to a commonly occurring problem in software design, e.g. Template Method, Singleton etc.

Software Architecture Patterns: An architectural pattern is a general, reusable solution to a commonly occurring problem in software architecture within a given context (Wikipedia), e.g. client-server, n-tier.

Page 40: Trivento summercamp masterclass 9/9/2016

World of Patterns

Software Architecture vs Software Design.

We use them everywhere but… they are not a silver bullet. Why?


Page 41: Trivento summercamp masterclass 9/9/2016

Software Architecture before Lambda Architecture

Many definitions for software architecture.

“Architecture: ⟨system⟩ fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution”. (ISO/IEC/IEEE 42010).

“Software architecture refers to the fundamental structures of a software system, the discipline of creating such structures, and the documentation of these structures. These structures are needed to reason about the software system.” Wikipedia

“It is about structure and vision”. Software architecture for developers, Simon Brown.

“The highest-level breakdown of a system into its parts; the decisions that are hard to change; there are multiple architectures in a system; what is architecturally significant can change over a system's lifetime; and, in the end, architecture boils down to whatever the important stuff is.” Patterns of Enterprise Application Architecture, Martin Fowler

Page 42: Trivento summercamp masterclass 9/9/2016

Software Architecture is important

Architectural decisions are decisions that have non-local consequences and serve specific goals. Example: to achieve a performance goal like high throughput, I decide to use buffering within my system.

Architectural decisions are important for your in-house project or your proposal if you are a consultant.

Page 43: Trivento summercamp masterclass 9/9/2016

Sound Architecture Principles: Why Do I Need Them?

Scalability/Elasticity

Extensibility: requirements will change; expect that.

Minimized costs

Security awareness

Well designed APIs for integration

Well-tested: don’t go to production and cross your fingers.

Page 44: Trivento summercamp masterclass 9/9/2016

Follow common sense...

At the end of the day expect to throw everything out of the window under some circumstances. Business matters the most.

Example: non-functional requirements change because load is huge and you are becoming successful; maybe you are the next Facebook.

Page 45: Trivento summercamp masterclass 9/9/2016

Software Architecture is important

...because there is a high cost to not making specific decisions, or to not making them early enough.

Page 46: Trivento summercamp masterclass 9/9/2016

Software Architecture is important

How about the wrong decisions?

Image: http://www.awesomeinventions.com/wp-content/uploads/2014/10/balcony.jpg


Page 47: Trivento summercamp masterclass 9/9/2016

Software Architecture is important

Many more benefits where architecture is present:

A documented architecture assists communication

Guides implementation imposing constraints

Assists in technology decisions

Assists in cost and time estimation

Influences the structure of your organization and vice versa


Page 48: Trivento summercamp masterclass 9/9/2016

Software Architecture LifeCycle

Steps:

Architectural Requirements

Architectural Design

Architectural Documentation

Architectural Evaluation / Implementation


Page 49: Trivento summercamp masterclass 9/9/2016

Lambda Architecture - Intro

“Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system. The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.”


Nathan Marz and James Warren, Big Data: Principles and best practices of scalable real-time data systems, Manning Publications.

Photo: https://images-na.ssl-images-amazon.com/images/I/51Bd93AGuOL._SX258_BO1,204,203,200_.jpg

Page 50: Trivento summercamp masterclass 9/9/2016

Lambda Architecture - Cont’d (1/5)

Image: http://lambda-architecture.net/img/la-overview_small.png

Batch Layer: perfect accuracy, indexed batch views

Serving Layer: random access query support based on batch & real-time views

Speed Layer: process real-time streams, provides real-time views, lower accuracy

Master dataset: append-only, immutable set of raw data

Page 51: Trivento summercamp masterclass 9/9/2016

Lambda Architecture - Cont’d (2/5)

Example components for each part:

Batch layer: Hadoop

Batch Output Indexing: Druid, Impala etc

Speed Output Indexing: Druid, Cassandra, HBase etc

Speed processing: Spark, Flink etc


Page 52: Trivento summercamp masterclass 9/9/2016

Lambda Architecture - Cont’d (3/5)

Basic functions:

batch view = function (all data) <- high latency, high throughput

realtime view = function (realtime view, new data) <- low latency, low throughput

query = function (batch view, realtime view ) <- eventual accuracy
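The three functions translate almost literally into code. The sketch below assumes a page-view counting use case, and it assumes the real-time view only covers events that the latest batch view has not absorbed yet; both assumptions, and all names, are illustrative.

```scala
object LambdaViews {
  type View = Map[String, Long] // e.g. page -> view count

  // batch view = function(all data): recomputed from scratch over the master dataset
  def batchView(allData: Seq[String]): View =
    allData.groupBy(identity).map { case (page, hits) => (page, hits.size.toLong) }

  // realtime view = function(realtime view, new data): incremental update per event
  def updateRealtime(view: View, newEvent: String): View =
    view.updated(newEvent, view.getOrElse(newEvent, 0L) + 1)

  // query = function(batch view, realtime view): merge both at read time
  def query(batch: View, realtime: View, key: String): Long =
    batch.getOrElse(key, 0L) + realtime.getOrElse(key, 0L)
}
```

When a new batch view is published, the real-time entries it already covers have to be discarded, otherwise the merged result double counts.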


Page 53: Trivento summercamp masterclass 9/9/2016

Lambda Architecture - Cont’d (4/5)

Key Properties:

Eventual Accuracy

The batch layer is always behind in time and continuously produces batch outputs. Whenever a new batch output becomes available, it replaces the previous one. Eventually the batch layer catches up with the speed layer.

Complexity Isolation

Page 54: Trivento summercamp masterclass 9/9/2016

Lambda Architecture - Cont’d (5/5)

Advantages:

Immutable data.

Reprocessing takes care of code changes, human error etc.

Disadvantages:

Operating and maintaining two different systems (batch & streaming) is hard.

Programming in two different paradigms makes the code-base complex.

Page 55: Trivento summercamp masterclass 9/9/2016

What about Data Lakes?

A data lake accumulates data from different applications.

It does not transform data in any way.

Access from multiple users, no data silos, data is not hidden in special systems.

There is no schema following the data, only raw data. We apply a schema when we read the data.

Includes structured, semi-structured, and unstructured data
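Schema-on-read can be sketched as a parser that is applied only when a consumer reads the raw lines. The record type and the comma-separated layout below are assumptions made for the example; another consumer could read the same raw lines with a different schema.

```scala
object SchemaOnRead {
  final case class Reading(sensorId: String, value: Double)

  // The lake stores raw lines untouched; structure is imposed only when reading.
  def parse(raw: String): Option[Reading] = raw.split(",").map(_.trim) match {
    case Array(id, v) => scala.util.Try(Reading(id, v.toDouble)).toOption
    case _            => None // does not fit this schema: skipped by this reader
  }

  // Lines that do not match are ignored by this reader but remain in the lake.
  def read(rawLines: Seq[String]): Seq[Reading] =
    rawLines.flatMap(line => parse(line).toList)
}
```

With schema-on-write, the malformed lines would have been rejected or transformed at ingestion time instead.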


Page 56: Trivento summercamp masterclass 9/9/2016

Data Lakes Categories

Data reservoirs: Governed accumulation of data for later use. Data are secured and go through the process of ingestion, cleansing, profiling and indexing.

Exploratory lakes: Accumulation of data without governance for ad-hoc analysis by data scientists et al. to gain insights.

Analytical lakes: Ingest your data to feed data pipelines for analytics.

Page 57: Trivento summercamp masterclass 9/9/2016

Data Lakes vs Data Warehouse

A data lake can replace a data warehouse in scenarios where that makes sense.

Aspect | Data Lake | Data Warehouse

Schema | Schema on-read | Schema on-write

Users | Data scientists, people who need ad hoc analysis | Business analysts

Data | Structured, semi-structured, unstructured | Rigid structure

Flexibility | High, reprocessing is easy. | Low, tied to business processes.

Page 58: Trivento summercamp masterclass 9/9/2016

Data Lakes usually fail!

Most projects fail... you have been warned! Your next data lake can become a big data swamp.

Image: http://www.sharenator.com/Demotivationals_pt_3_P/

Page 59: Trivento summercamp masterclass 9/9/2016

Data Lakes extended with a Lambda Architecture

You can always use your Lambda Architecture on top of a data lake if that makes sense. A data lake can be your DFS with specific services built around it, like metadata management. It can make things easy, especially when you start small and try to figure out what you need.

It can be very simple where you use the batch layer for loading the data from a source for streaming only. No presentation layer is needed.

How about Kafka?


Page 60: Trivento summercamp masterclass 9/9/2016

Azure Data Lake


Image: https://azure.microsoft.com/en-us/solutions/data-lake/

Page 61: Trivento summercamp masterclass 9/9/2016

How about Data Silos?

Separate containers of data.

The big data platform or the big data system at hand should unify business information, development teams and data in a way that is useful to the business.

Think about a scenario with microservices, event sourcing and analytics.

Page 62: Trivento summercamp masterclass 9/9/2016

Use Cases

Yahoo

Netflix

Flickr


Page 63: Trivento summercamp masterclass 9/9/2016

Flickr’s Use case - The Problem

Magic View Feature: a computer vision pipeline generates a set of computer vision tags; reverse indexes are created per user, along with aggregated tag info.

Initially batch only; a streaming layer was then added for a live experience.

Backfills were needed because of photos missed by the streaming layer (approximation errors) and because of code changes.

Backfills via streaming were slow due to the nature of the read-modify-write (RMW) access pattern.

Page 64: Trivento summercamp masterclass 9/9/2016

Flickr’s Use case - Solution


Result = Combiner(Query(data))
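A read-time combiner can be sketched as follows. This is not Flickr's code: the tag maps, the names and the policy "a real-time entry overrides the bulk entry for the same photo" are assumptions chosen to make the formula concrete.

```scala
object Combiner {
  // Hypothetical per-photo tag sets coming from the two compute paths.
  type Tags = Map[String, Set[String]] // photoId -> tags

  // Query: fetch the entries for the requested photos from one store.
  def query(store: Tags, photoIds: Seq[String]): Tags =
    photoIds.flatMap(id => store.get(id).map(tags => (id, tags)).toList).toMap

  // Combine at read time: a real-time entry, when present, overrides the bulk one.
  def combine(bulk: Tags, realtime: Tags): Tags = bulk ++ realtime
}
```

Because the merge happens on read, the bulk path can overwrite its own data during a backfill without coordinating with the real-time path.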

Page 65: Trivento summercamp masterclass 9/9/2016

Implementing The Lambda Architecture

SMACK stack based Lambda Architecture:

[Diagram components: Mesos; Spark; HDFS; Spark or Flink; Kafka; Cassandra; Query app; Akka driven apps; user]

Page 66: Trivento summercamp masterclass 9/9/2016

Machine Learning Support for Lambda Architecture

Build a model and serve it. Simple models vs complex models.

Spark for model building and Flink for model serving.

Parameter servers:

https://issues.apache.org/jira/browse/SPARK-6932

https://github.com/rjagerman/glint

http://parameterserver.org/

http://www.petuum.com/bosen.html

https://github.com/JohnLangford/vowpal_wabbit/wiki


Page 67: Trivento summercamp masterclass 9/9/2016

Real World Implementation Tips

JVM-based technologies like Cassandra and Kafka need correct GC settings.

Monitoring is a must. Cassandra, Kafka etc. provide JMX interfaces to get the counter values you need. You need to know and understand which ones are useful to monitor closely.

It is not wise to co-locate everything; you need to be careful about each component's requirements. For example, ZooKeeper should run on its own box, but if co-located it should have its own high-speed volume assigned for its commit log.

Vendors publish specific requirements for production, which stem from experience using the technology in production.

https://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html

http://www.confluent.io/blog/design-and-deployment-considerations-for-deploying-apache-kafka-on-aws/


Page 68: Trivento summercamp masterclass 9/9/2016

Real World Implementation Tips

OS settings.

Misuse of technologies. Example: Kafka is not a database.

Design decisions. Example: Time series data on Cassandra.

Data locality and data move. Example: Kafka rebalance.

Logging. How do I monitor my job? Log correlation?

For batch processing you need a flexible orchestration tool like: https://github.com/apache/incubator-airflow

Within your data-center vs across data-centers. On cloud: Availability zones vs regions.

Learn your technology.

Page 69: Trivento summercamp masterclass 9/9/2016

Beyond the Lambda Architecture

Kappa Architecture (2014)

Zeta Architecture (2015)

IoT-A Architecture (2010-2013)

Butterfly Architecture (~2015)

Fast Data architecture (~2016)


Page 70: Trivento summercamp masterclass 9/9/2016

Kappa Architecture

Introduced in 2014 by Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent.

See https://www.oreilly.com/ideas/questioning-the-lambda-architecture

The Lambda architecture is good, but keeping two layers in sync is a lot of work and in practice it is hard to achieve.

“The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.”

Batch processing is a subset of stream processing. Different technologies want to take advantage of this fact and provide a holistic solution:

Flink, http://data-artisans.com/batch-is-a-special-case-of-streaming/

Spark, https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html


Page 71: Trivento summercamp masterclass 9/9/2016

Kappa Architecture

1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.

2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.

3. When the second job has caught up, switch the application to read from the new table.

4. Stop the old version of the job, and delete the old output table.

Re-processing is done only when code changes.
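The four steps can be mimicked in memory. In this sketch a Vector stands in for the retained Kafka log, a Map for an output table and a mutable reference for the switch performed by the serving application; none of these names come from Kafka or from the article.

```scala
object KappaReprocessing {
  // Step 1: a retained, replayable log of raw events (stands in for a Kafka topic).
  final case class Log(events: Vector[Int]) {
    def from(offset: Int): Iterator[Int] = events.iterator.drop(offset)
  }

  // Step 2: a job replays the log from the beginning into its own output table,
  // keyed here by the offset of the input event.
  def runJob(log: Log, logic: Int => Int): Map[Int, Int] =
    log.from(0).zipWithIndex.map { case (event, offset) => (offset, logic(event)) }.toMap

  // Step 3: the serving application reads whichever table is currently active.
  final class Serving(var active: Map[Int, Int]) {
    def switchTo(table: Map[Int, Int]): Unit = active = table
  }
}
```

Running `runJob` with the new logic, calling `switchTo` with its result and then dropping the old table corresponds to steps 2 to 4; the old table keeps serving reads until the switch.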

Image: https://dmgpayxepw99m.cloudfront.net/kappa-61d0afc292912b61ce62517fa2bd4309.png

Page 72: Trivento summercamp masterclass 9/9/2016

Kappa Architecture Pros & Cons

Pros:

● Develop and maintain only one streaming system.

● Reprocessing only when code changes.

Cons:

● Need temp storage for the reprocessing streaming job.

Page 73: Trivento summercamp masterclass 9/9/2016

Kappa Architecture - When to use?

● Algorithms of streaming and batch processing are the same.

● Batch and real-time outputs can be the same.

Page 74: Trivento summercamp masterclass 9/9/2016

Zeta Architecture

Introduced by MapR for supporting as-it-happens business (March 2015).

Goals:

Exploit all existing hardware in the data center.

Back-up and disaster recovery support for real-time continuity

Tolerance for human mistake

End-to-End Security

Support Google-scale systems

Page 75: Trivento summercamp masterclass 9/9/2016

Zeta Architecture - Components

Seven pluggable components:

Distributed File System: All applications write here.

Real-time Data Storage: Needed for high-speed business applications.

Pluggable Compute Model / Execution Engine: Different needs need different engines.

Deployment / Container Management: Allows for a common way to deploy resources.


Page 76: Trivento summercamp masterclass 9/9/2016

Zeta Architecture - Components

Seven pluggable components:

Solution Architecture: Focuses on solving a specific business problem.

Enterprise Applications: Used to drive the architecture. Now they are realized via existing components.

Dynamic and Global Resource Management: Allows dynamic allocation of resources which fits the business needs each time.


Page 77: Trivento summercamp masterclass 9/9/2016

Zeta Architecture

Components and reference applications


Image: https://www.mapr.com/zeta-architecture

Page 78: Trivento summercamp masterclass 9/9/2016

Zeta Architecture Example

Images: https://www.mapr.com/zeta-architecture

Page 79: Trivento summercamp masterclass 9/9/2016

IoT-A Architecture

Targets IoT applications; proposed by Michael Hausenblas (MapR, Mesosphere) in 2015.

IoT leads to a Big Data architecture because:

High volume of data from sensors

Time-Series format of data or other type of formats.

Data are generated at high-speed and business needs real-time processing.


Page 80: Trivento summercamp masterclass 9/9/2016

IoT-A Architecture

Basic Architecture:

Message Queue / Streaming Block (MQ/SP)

DB: A real-time DB for indexing sensor data. Low Latency.

DFS: The distributed file system where batch jobs can be run and batch reports can be created.


Page 81: Trivento summercamp masterclass 9/9/2016

IoT-A Architecture


http://iot-a.info/

Page 82: Trivento summercamp masterclass 9/9/2016

IoT-A Architecture - Implementation Technologies


http://iot-a.info/

Page 83: Trivento summercamp masterclass 9/9/2016

Butterfly Architecture

● Introduced by Milind Bhandarkar (Pivotal).

● The weak point of the Lambda architecture lies in the distributed file system which cannot serve all layers.

● They propose the use of different memory technologies than DRAM (like storage class memory) to implement an efficient object storage engine.

● They use different abstractions compared to files or dirs of DFS: datasets, dataframes, eventstreams.

[Diagram labels: mutable, immutable; unmanaged, managed; log, publish; Data frames; Data sets; Storage; ETL]

Butterfly Image: http://sketch2draw.com/wp-content/uploads/2013/05/butterfly_thumb.jpg

Page 84: Trivento summercamp masterclass 9/9/2016

A Fast Data Architecture

Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Page 85: Trivento summercamp masterclass 9/9/2016

Example IoT Application

Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Page 86: Trivento summercamp masterclass 9/9/2016

Streaming Implementations Status

Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. Still based on micro-batches but event-time support was added.

Apache Flink: SQL API supported from v0.9 onwards. Important features are still on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.

Page 87: Trivento summercamp masterclass 9/9/2016

Picking the Right Tool for Streaming

Criteria to choose:

Processing semantics (strong consistency is needed for correctness)

Latency guarantees

Deployment / Operation

Ecosystem built around it

Complex event processing (CEP)

Batch & Streaming API support

Community & Support

Page 88: Trivento summercamp masterclass 9/9/2016

Picking the Right Tool for Streaming

Some tips:

Pick Flink if you need sub-second latency and Beam support.

Pick Spark Streaming for its integration with Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities.

Pick Gearpump for materializing Akka Streams in a distributed fashion.

Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed solution out of the box). (Check Confluent Platform for many useful tools around Kafka.)

Page 89: Trivento summercamp masterclass 9/9/2016

Questions?

Thank you!


Page 91: Trivento summercamp masterclass 9/9/2016

References - Cont’d

Web resources/Articles:

Questioning the Lambda Architecture - O'Reilly Media

Structured Streaming In Apache Spark | Databricks Blog

The world beyond batch: Streaming 101 - O'Reilly Media

The world beyond batch: Streaming 102 - O'Reilly Media

Data Centric Enterprise | MapR

Why local state is a fundamental primitive in stream processing - O'Reilly Media

Data processing architectures – Lambda and Kappa - Ericsson Research Blog

2016 State of Fast Data Survey | OpsClarity

Zeta Architecture | MapR

Is Big Data Still a Thing? (The 2016 Big Data Landscape) – Matt Turck

IoT-a (MapR)

Powering Flickr’s Magic view by fusing bulk and real-time compute | code.flickr.com

Data Lake vs Data Warehouse: Key Differences

Don't Let Your Data Lake Turn into a Swamp

Extending Data Lake using the Lambda Architecture June 2015

Azure Data Lake

Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things

Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

How Apache Flink™ enables new streaming applications – data Artisans
