big data and fast data - big and fast combined, is it possible?

2013 © Trivadis

BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

WELCOME

Big Data and Fast Data: big and fast combined – is it possible?

Guido Schmutz und Albert Blarer

24. April 2013


Guido Schmutz

•  Working for Trivadis for more than 16 years

•  Oracle ACE Director for Fusion Middleware and SOA

•  Co-author of several books

•  Consultant, Trainer, Software Architect for Java, Oracle, SOA and EDA

•  Member of Trivadis Architecture Board

•  Technology Manager @ Trivadis

•  More than 25 years of software development experience

•  Contact: [email protected] •  Blog: http://guidoschmutz.wordpress.com •  Twitter: gschmutz


Trivadis – the company

With more than 600 IT and business experts on site with you.

•  11 Trivadis offices with more than 600 employees
•  200 service level agreements
•  More than 4,000 training participants
•  Research and development budget: CHF 5.0 million / EUR 4 million
•  Financially independent and sustainably profitable
•  Experience from more than 1,900 projects per year for over 800 customers

As of 12/2012

Offices (map): Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Stuttgart, Wien, Basel, Zürich, Bern, Lausanne

Credits

Nathan Marz

Author of "Big Data – Principles and best practices of scalable realtime data systems" (Manning)

Previously worked at BackType and Twitter

Creator of

•  Storm

•  Cascalog

•  ElephantDB



Agenda

1.  Big Data, what is it?

2.  Motivation

3.  The Lambda Architecture

4.  Implementing the Lambda Architecture

5.  Summary



Big Data Definition (Gartner et al)


Velocity

Tera-, Peta-, Exa-, Zetta-, Yottabytes and constantly growing

“Traditional” computing in RDBMS is not scalable enough.

We search for “linear scalability”

“Only … structured information is not enough” – “95% of produced data is unstructured”

Characteristics of Big Data: its Volume, Velocity and Variety in combination

+ Veracity (IBM): information uncertainty

+ Time to action? Big Data + Event Processing = Fast Data


Big Data Emerging Technologies



§  MapReduce (e.g. Apache Hadoop)

§  Event Stream Processing & CEP (e.g. Storm or Esper)

§  New messaging systems (e.g. Apache Kafka)

§  Integration tools (e.g. Spring or Camus)

§  New database paradigms (e.g. NoSQL or NewSQL)

§  Data mining tools (e.g. Apache Mahout)

§  Data extraction and detection tools (e.g. Apache Tika)


Volume Development

[Chart: Global Data Volume in Exabytes (axis 0 to 8000) and Aggregate Uncertainty % (axis 0 to 100), plotted by Year from 2005 to 2015]

Sensors: “internet of things”

Social Media: video, audio, text

VoIP: Skype, MSN, ICQ, ...

Enterprise Data: data dictionary, ERD, ...



Velocity


Velocity requirement examples:
§  Recommendation Engine
§  Predictive Analytics
§  Marketing Campaign Analysis
§  Customer Retention and Churn Analysis
§  Social Graph Analysis
§  Capital Markets Analysis
§  Risk Management
§  Rogue Trading
§  Fraud Detection
§  Retail Banking
§  Network Monitoring
§  Research and Development


What is a data system?

•  A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure and every human mistake ever made.

•  A data system answers questions based on information that was acquired in the past

•  Not all bits of information are equal

•  Some information is derived from other information



Desired Properties of a (Big) Data System

Robust and fault-tolerant

Low latency reads and updates

Scalable

General

Extensible

Allows ad hoc queries

Minimal maintenance

Debuggable



Typical problem in today’s architecture/systems

Bugs will be deployed to production over the lifetime of a data system

Operational mistakes will be made

Humans are part of the overall system
•  Just like hard disks, CPUs, memory, software
•  Design for human error like you design for any other fault

Examples of human error
•  Deploy a bug that increments counters by two instead of by one
•  Accidentally delete data from database
•  Accidental DOS on important internal service

Worst two consequences: data loss or data corruption

As long as an error doesn't lose or corrupt good data, you can fix what went wrong



Lack of Human Fault Tolerance


Mutability

The U and D in CRUD

A mutable system updates the current state of the world

Mutable systems inherently lack human fault-tolerance

Easy to corrupt or lose data


Capturing change traditionally

Before the update:

Name    City
Guido   Berne
Albert  Zurich

After the update:

Name    City
Guido   Basel
Albert  Zurich

Immutability

An immutable system captures historical records of events

Each event happens at a particular time and is always true


Capturing change by storing events

Initial events:

Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988

After the change, a new event is appended:

Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988
Guido   Basel   1.4.2013
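The idea of capturing change as events can be sketched in a few lines of Python. This is only an illustration: the `Fact` record and the helpers `current_city` and `city_history` are hypothetical names that do not appear on the slides, and the timestamps are read as day.month.year.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One immutable record: a person lives in a city since a given date."""
    name: str
    city: str
    since: date


# Append-only log holding the rows from the event table.
facts = [
    Fact("Guido", "Berne", date(1999, 8, 1)),
    Fact("Albert", "Zurich", date(1988, 5, 10)),
]
facts.append(Fact("Guido", "Basel", date(2013, 4, 1)))  # a move is a new fact


def current_city(log, name):
    """Derive the current city as the most recent fact for this person."""
    own = [f for f in log if f.name == name]
    return max(own, key=lambda f: f.since).city


def city_history(log, name):
    """The full history stays available because nothing is overwritten."""
    return [f.city for f in sorted(log, key=lambda f: f.since) if f.name == name]
```

With this log, `current_city(facts, "Guido")` yields the latest city while `city_history(facts, "Guido")` still returns the earlier one, which a mutable table would have overwritten.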


Immutability

Immutability greatly restricts the range of errors that can cause data loss or data corruption

Vastly more human fault-tolerant

Much easier to reason about systems based on immutability

Conclusion: Your source of truth should always be immutable



What about traditional/today’s architectures ?

Source of Truth is mutable!

Rather than build systems like this ….


[Diagram: the Application (Query) from Mobile, Web, RIA and Rich Client talks directly to a Mutable Database (RDBMS, NoSQL, NewSQL), which is the Source of Truth]

A different kind of architecture with immutable source of truth

… why not build them like this?

[Diagram: the Application (Query) from Mobile, Web, RIA and Rich Client reads a View on Data; the views are derived from Immutable data, which is the Source of Truth. Storage labels shown: HDFS, NoSQL, NewSQL, RDBMS]

How to create the views on the Immutable data?

On the fly?

Materialized, i.e. pre-computed?

[Diagram: Query → View → Immutable data, compared with Query → Pre-computed Views → Immutable data]

Data = the most raw information

Data is information which is not derived from anywhere else
•  The most raw form of information
•  Data is the special information from which everything else is derived

Questions on data can be answered by running functions that take data as input

The most general purpose data system can answer questions by running functions that take the entire dataset as input

query = function (all data)

The Lambda Architecture provides a general-purpose approach for implementing arbitrary functions on arbitrary datasets



Data = the most raw information

Favorite Product List Changes (raw information => data)

Date     Action  Product
1.2.13   Add     iPAD 64GB
10.3.13  Add     Sony RX-100
11.3.13  Add     Canon GX-10
11.3.13  Remove  Sony RX-100
12.3.13  Add     Nikon S-100
14.4.13  Add     BoseQC-15
15.4.13  Add     MacBook Pro 15
20.4.13  Remove  Canon GX-10

derive => Current Favorite Product List (information => derived): iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15

derive => Current Product Count (information => derived): 4
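The derivation on this slide is an instance of query = function(all data). A minimal Python sketch, assuming the change events are held as simple tuples and the product is written as "Canon GX-10" throughout (the slide spells it two ways); the function names are illustrative only:

```python
# Raw data: the favorite-list change events from the slide, in order.
events = [
    ("1.2.13", "Add", "iPAD 64GB"),
    ("10.3.13", "Add", "Sony RX-100"),
    ("11.3.13", "Add", "Canon GX-10"),
    ("11.3.13", "Remove", "Sony RX-100"),
    ("12.3.13", "Add", "Nikon S-100"),
    ("14.4.13", "Add", "BoseQC-15"),
    ("15.4.13", "Add", "MacBook Pro 15"),
    ("20.4.13", "Remove", "Canon GX-10"),
]


def current_favorites(all_events):
    """query = function(all data): replay every event to get the list."""
    favorites = []
    for _date, action, product in all_events:
        if action == "Add":
            favorites.append(product)
        elif action == "Remove":
            favorites.remove(product)
    return favorites


def favorite_count(all_events):
    """A second derived view, computed from the same raw data."""
    return len(current_favorites(all_events))
```

Both views can be thrown away and recomputed at any time, because the events themselves are never changed.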


Big Data and Batch Processing


[Diagram: Incoming Data → Immutable data → ? → Batch View → ? → Query]

How to compute the batch views?

How to compute queries from the views?

Big Data and Batch Processing


[Timeline: data up to the end of the last full batch period is fully processed (batch-processed data); data that arrived during the time for the batch job, up to now, is non-processed data]

§  Using only batch processing always leaves you with a portion of non-processed data.

Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20

But we are not done yet …


Adding Real-Time Processing


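[Diagram: Incoming Data is stored as Immutable data, from which Batch Views are computed, and is also passed on as a Data Stream, from which Realtime Views are computed (?); the Query reads both kinds of views]

How to compute queries from the views?

How to compute real-time views?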

Adding Real-Time Processing


Favorite Product List Changes (immutable data)

Date     Action  Product
1.2.13   Add     iPAD 64GB
10.3.13  Add     Sony RX-100
11.3.13  Add     Canon GX-10
11.3.13  Remove  Sony RX-100
12.3.13  Add     Nikon S-100
14.4.13  Add     BoseQC-15
15.4.13  Add     MacBook Pro 15
20.4.13  Remove  Canon GX-10
Now      Add     Canon Scanner

Stream of Favorite Product List Changes (data stream): Now, Add Canon Scanner

compute => Current Favorite Product List (views): iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15, and from the stream: Canon Scanner

compute => Current Product Count (views): 5

The Query reads the views.
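The "Now: Add Canon Scanner" event shows what the real-time path has to do: apply a single new event to an existing view instead of recomputing everything. A small sketch in Python, where `apply_increment` and the dictionary layout are assumptions made for illustration:

```python
# Batch view: precomputed from the events that were already processed.
batch_view = {
    "favorites": ["iPAD 64GB", "Nikon S-100", "BoseQC-15", "MacBook Pro 15"],
    "count": 4,
}


def apply_increment(view, action, product):
    """Return a new view with one stream event applied incrementally."""
    favorites = list(view["favorites"])  # copy, so the batch view stays intact
    if action == "Add":
        favorites.append(product)
    elif action == "Remove":
        favorites.remove(product)
    return {"favorites": favorites, "count": len(favorites)}


# The event that arrived "now" and has not been seen by any batch run.
realtime_view = apply_increment(batch_view, "Add", "Canon Scanner")
```

The batch view still reports a count of 4, while the real-time view already includes the new product.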


Big Data and Real Time Processing


[Timeline: batch processing (e.g. Hadoop) worked fine for the fully processed data up to the end of the last full batch period; real-time processing works for the data since then, covering the time for the batch job up to now; the end user gets a blended view]

Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20


Lambda Architecture


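[Diagram: Incoming Data flows into the Batch Layer (Immutable data, Batch View) and into the Speed Layer (Data Stream, Realtime View); the Serving Layer answers the Query from both views. The labels A to G mark the steps described in the list that follows.]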

Lambda Architecture

A.  All data is sent to both the batch and speed layer

B.  Master data set is an immutable, append-only set of data

C.  Batch layer pre-computes query functions from scratch, result is called Batch Views. Batch layer constantly re-computes the batch views.

D.  Batch views are indexed and stored in a scalable database so that particular values can be retrieved very quickly. The serving layer swaps in new batch views when they are available.

E.  Speed layer compensates for the high latency of updates to the Batch Views in the Serving layer.

F.  Speed layer uses fast incremental algorithms and read/write databases to produce real-time views

G.  Queries are resolved by getting results from both batch and real-time views
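The steps A to G can be condensed into a toy, single-process sketch. It counts occurrences of a key, which is an example chosen here and not taken from the slides; the class and method names are hypothetical, and a real system would run the layers as separate distributed components.

```python
from collections import Counter


class LambdaSketch:
    def __init__(self):
        self.master = []                # B: immutable, append-only master dataset
        self.batch_view = Counter()     # C/D: precomputed batch view being served
        self.realtime_view = Counter()  # E/F: incrementally updated real-time view

    def ingest(self, key):
        """A: every record is sent to both the batch and the speed layer."""
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """C/D: recompute the batch view from scratch and swap it in."""
        snapshot = list(self.master)
        self.batch_view = Counter(snapshot)
        # Keep only records the batch run has not covered. In this
        # single-threaded sketch that is always empty; in a real system
        # records keep arriving while the batch job runs.
        self.realtime_view = Counter(self.master[len(snapshot):])

    def query(self, key):
        """G: resolve a query by merging batch and real-time results."""
        return self.batch_view[key] + self.realtime_view[key]
```

A query always sees every ingested record: either through the batch view, or through the real-time view until the next batch run has absorbed it.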



Layered Architecture

Batch Layer
•  Stores the immutable, constantly growing dataset
•  Computes arbitrary views from this dataset using Big Data technologies (can take hours)
•  Views can always be recreated

Serving Layer
•  Responsible for indexing and exposing the pre-computed batch views so that they can be queried
•  Exposes the incremented real-time views
•  Merges the batch and the real-time views into a consistent result

Speed Layer
•  Computes the views from the constant stream of data it receives
•  Needed to compensate for the high latency of the batch layer
•  Incremental model and views are transient


Lambda Architecture


[Diagram: Incoming Data goes to the Batch Layer (All data → Precompute Views → Precomputed information, via batch recompute) and to the Speed Layer (Process stream → Incremented information, via realtime increment). The Serving Layer holds the batch views and the real time views, which are merged to answer a query.]

Source: Marz, N. & Warren, J. (2013) Big Data. Manning.

Lambda Architecture


One possible product/framework mapping

[Diagram: the same Lambda Architecture picture (Incoming Data, Batch Layer with Precompute Views and batch recompute, Speed Layer with realtime increment, Serving Layer with batch views and real time views merged for the query), annotated with products and frameworks]

Implementing Batch Layer

Immutable Data

•  Append only

•  Normalized

•  Stores master copy of all data

Pre-computed information

•  Function that takes all data as input

query = function(all-data)

•  High Latency, Batch processing

•  Unrestrained computation

•  Horizontally scalable

[Diagram: in the Batch Layer, Batch Views are computed from the Immutable data (Precompute Views, batch recompute) and handed over to the Serving Layer]

Apache Hadoop HDFS

HDFS = the Hadoop Distributed File System

A distributed file storage system

Redundant storage

Designed to reliably store data using commodity hardware

Designed to expect hardware failures

Intended for large files

Designed for batch inserts



Apache Hadoop Map Reduce


§  Hadoop Map Reduce is an open source implementation of the MapReduce framework.

§  Map Reduce is
   §  A programming model, introduced by Google, for processing large data sets in a distributed environment
   §  The de-facto standard to compute huge amounts of data
   §  An execution framework for organizing and performing such computations

[Diagram: a master node splits the problem data across worker nodes 1 to 3 (MAP) and the partial results are combined into the solution data (REDUCE)]
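The programming model itself is independent of Hadoop and can be shown in plain Python. The sketch below is the classic word count, run in a single process; it uses no Hadoop API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """MAP: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """REDUCE: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after the other over all input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls run in parallel on the worker nodes that hold the input splits, and the shuffle moves data between nodes; only the two user-supplied functions stay the same.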


Hadoop MapReduce Flow



Source: Bill Graham, Twitter Inc.


Hadoop MapReduce



Cascading

Application framework for Java developers to easily develop robust data analytics and data management applications on Apache Hadoop

Adds an abstraction layer over the Hadoop API

Core concepts of the Cascading API:
•  Pipe: a series of processing steps (parsing, looping, filtering, etc.) defining the data processing to be done
•  Flow: association of a pipe (or set of pipes) with a data source and a data sink

Apache Pig

Apache Pig is a platform for analyzing large data sets

Key Properties

•  Ease of programming

•  Optimization opportunities

•  Extensibility



Implementing Serving Layer for Batch Views

Need a database that
•  Is batch-writable
•  Adds new information atomically
•  Has fast random reads
•  Is scalable
•  Is highly available
•  Can be optimized for storage

•  Information can be de-normalized

•  But no random writes required!

•  Can be a simple database

[Diagram: Batch Views computed from the Immutable data in the Batch Layer (precomputed information) are loaded as batch views into the Serving Layer]

SploutSQL

Full SQL => unlike NoSQL

For BigData => unlike RDBMS

Web latency & throughput => unlike Apache Hive, Apache Drill

Why does it scale?
•  Data is partitioned
•  Partitions are distributed across nodes
•  Adding more nodes increases capacity
•  Generation does not impact serving
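Partitioning and distribution can be illustrated generically. The following Python sketch is not how SploutSQL is implemented; the CRC32 hash, the round-robin placement and both function names are assumptions chosen to show why more nodes mean fewer partitions per node.

```python
import zlib


def partition_of(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, nodes):
    """Spread the partitions over the nodes round-robin."""
    placement = {node: [] for node in nodes}
    for p in range(num_partitions):
        placement[nodes[p % len(nodes)]].append(p)
    return placement
```

With 8 partitions, two nodes serve four partitions each and four nodes serve two each, so every node handles a smaller share of the data and of the read load.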



Source: Datasalt.


Implementing Speed Layer

Stream Processing

Continuous computation

Transactional

Storing a limited window of data
•  Compensating for the last few hours of data

All the complexity is isolated in the speed layer

•  If anything goes wrong, it's autocorrected by the next batch run

[Diagram: in the Speed Layer, Realtime Views are derived from the Data Stream (realtime increment) and exposed through the Serving Layer]

Apache Kafka

A high throughput distributed messaging system

Originated at LinkedIn

Sequential disk access



Twitter Storm – the “real-time Hadoop”


§  Storm is a distributed and fault-tolerant real-time computing platform
   §  Data flow model: data flows through a network of transformation entities

§  Key concepts
   §  Tuple: ordered list of elements
   §  Streams: unbounded sequence of tuples
   §  Spouts: source of streams
   §  Bolts: process tuples and create new streams
   §  Topologies: directed graph of Spouts and Bolts

§  Use Cases
   §  Stream Processing
   §  Continuous Computation
   §  Distributed RPC
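The key concepts can be mimicked with Python generators to make the data flow visible. This is a conceptual, single-process sketch and deliberately does not use the Storm API; all function names are hypothetical.

```python
def sentence_spout(sentences):
    """Spout: source of a stream; emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes tuples and emits a new stream of (word,) tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after each tuple."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: spout -> split bolt -> count bolt, wired as a simple chain."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))
```

Unlike a batch job, each tuple is processed as soon as it arrives, so the counts are updated continuously. In Storm the spouts and bolts would run in parallel across a cluster, and the stream would be unbounded rather than a finite list.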

[Diagram: a data source feeds a SPOUT, which passes the problem data through a chain of BOLTs ("MAP", "REDUCE", "PERSIST") in the Speed Layer; the solution data ends up in the Serving Layer]

Twitter Trident

Higher level abstraction over Storm

Trident State

Grouped Stream

Functions, Filters

Aggregators

Query

Similar to Pig and Cascading



Implementing Serving Layer for Real-Time Views

Incremental updates are made available as real-time views

Requires a database that supports random reads and random writes
•  Relational, NoSQL or NewSQL (in-memory) databases can be used
•  Here we are typically not in the Big Data range

Results are only needed until the data has made it through the batch layer

Complexity isolation

[Diagram: Realtime Views derived from the Data Stream (incremented information) are exposed as real time views in the Serving Layer]

Cassandra

Fully distributed, no single-point-of-failure

Linearly scalable

Fault tolerant

Performant

Durable

Integrated caching

Tunable consistency



Implementing Serving Layer: Merge of Batch and Realtime Views

An interesting feature of Storm / Trident is the ability to execute distributed RPC (DRPC) calls in parallel

This can be used to implement the merge functionality when a query is executed


[Diagram: in the Serving Layer, the Batch Views and the Realtime Views are merged to answer the query]

Summary – The lambda architecture


§  The Lambda Architecture
   §  Can discard batch views and real-time views and recreate everything from scratch
   §  Mistakes corrected via re-computation
   §  Data storage layer optimized independently from query resolution layer

§  Still in a very early …. But a very interesting idea!
   -  Today a zoo of technologies is needed => Operations won't like it

§  Different query language for batch and real time

§  An abstraction over batch and speed layer needed
   -  Cascading and Trident are already similar

§  Industry standards needed!


THANK YOU. Trivadis AG

Guido Schmutz & Albert Blarer

Europa-Strasse 5, CH-8095 Glattbrugg

[email protected] www.trivadis.com

