Big Data and Fast Data: big and fast combined, is it possible?
DESCRIPTION
Big Data (volume) and real-time information processing (velocity) are two important aspects of Big Data systems. At first sight, these two aspects seem to be incompatible. Are traditional software architectures still the right choice? Do we need new, revolutionary architectures to tackle the requirements of Big Data? This presentation discusses the idea of the so-called lambda architecture for Big Data, which assumes that data processing is split in two: in a batch phase, a temporally bounded, large dataset is processed through either traditional ETL or MapReduce. In parallel, real-time online processing continuously calculates the values for the new data arriving during the batch phase. Combining the two results, batch and online, gives a constantly up-to-date view. This talk presents how such an architecture can be implemented using Oracle products such as Oracle NoSQL, Hadoop and Oracle Event Processing.
TRANSCRIPT
2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
WELCOME Big Data and Fast Data big and fast combined – is it possible?
Guido Schmutz and Albert Blarer
24 April 2013
24. April 2013 Big Data und Fast Data
Guido Schmutz
• Working for Trivadis for more than 16 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of several books
• Consultant, trainer and software architect for Java, Oracle, SOA and EDA
• Member of the Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: [email protected]
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
Trivadis: the company
With more than 600 IT and domain experts on site with you.
• 11 Trivadis branches with more than 600 employees
• 200 service level agreements
• More than 4,000 training participants
• Research and development budget: CHF 5.0 million / EUR 4 million
• Financially independent and sustainably profitable
• Experience from more than 1,900 projects per year for over 800 customers
As of 12/2012
Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Stuttgart, Wien, Basel, Zürich, Bern, Lausanne
Credits
Nathan Marz
Author of "Big Data: Principles and best practices of scalable realtime data systems" (Manning)
Previously worked at BackType and Twitter
Creator of
• Storm
• Cascalog
• ElephantDB
Agenda
1. Big Data, what is it?
2. Motivation
3. The Lambda Architecture
4. Implementing the Lambda Architecture
5. Summary
Big Data Definition (Gartner et al)
Velocity
Tera-, Peta-, Exa-, Zetta-, Yottabytes and constantly growing
“Traditional” computing in RDBMS is not scalable enough.
We search for “linear scalability”
“Only … structured information is not enough”: “95% of produced data is unstructured”
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Veracity (IBM): information uncertainty
+ Time to action? Big Data + Event Processing = Fast Data
Big Data Emerging Technologies
• MapReduce (e.g. Apache Hadoop)
• Event Stream Processing & CEP (e.g. Storm or Esper)
• New messaging systems (e.g. Apache Kafka)
• Integration tools (e.g. Spring or Camus)
• New database paradigms (e.g. NoSQL or NewSQL)
• Data mining tools (e.g. Apache Mahout)
• Data extraction and detection tools (e.g. Apache Tika)
Volume Development
[Chart: Global Data Volume in Exabytes (axis 0 to 8000) and Aggregate Uncertainty % (axis 0 to 100) by Year, 2005 to 2015]
Sensors: “internet of things”
Social Media: video, audio, text
VoIP: Skype, MSN, ICQ, ...
Enterprise Data: data dictionary, ERD, ...
Velocity
Velocity requirement examples:
• Recommendation Engine
• Predictive Analytics
• Marketing Campaign Analysis
• Customer Retention and Churn Analysis
• Social Graph Analysis
• Capital Markets Analysis
• Risk Management
• Rogue Trading
• Fraud Detection
• Retail Banking
• Network Monitoring
• Research and Development
What is a data system?
• A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure and every human mistake ever made.
• A data system answers questions based on information that was acquired in the past
• Not all bits of information are equal
• Some information is derived from other information
Desired Properties of a (Big) Data System
Robust and fault-tolerant
Low latency reads and updates
Scalable
General
Extensible
Allows ad hoc queries
Minimal maintenance
Debuggable
Typical problem in today’s architecture/systems
Bugs will be deployed to production over the lifetime of a data system
Operational mistakes will be made
Humans are part of the overall system
• Just like hard disks, CPUs, memory, software
• Design for human error like you design for any other fault
Examples of human error
• Deploy a bug that increments counters by two instead of by one
• Accidentally delete data from the database
• Accidental DoS on an important internal service
Worst two consequences: data loss or data corruption
As long as an error doesn't lose or corrupt good data, you can fix what went wrong
Lack of Human Fault Tolerance
Mutability
The U and D in CRUD
A mutable system updates the current state of the world
Mutable systems inherently lack human fault-tolerance
Easy to corrupt or lose data
Capturing change traditionally

Name    City
Guido   Berne
Albert  Zurich

(after the update)
Name    City
Guido   Basel
Albert  Zurich
Immutability
An immutable system captures historical records of events
Each event happens at a particular time and is always true
Capturing change by storing events

Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988

(after a new event is appended)
Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988
Guido   Basel   1.4.2013
Immutability
Immutability greatly restricts the range of errors that can cause data loss or data corruption
Vastly more human fault-tolerant
Much easier to reason about systems based on immutability
Conclusion: Your source of truth should always be immutable
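A minimal Python sketch of the two ways of capturing change, using the Name/City example from the Mutability and Immutability slides. The names `mutable_table`, `events` and `current_city` are illustrative assumptions and not part of any product mentioned in this talk.

```python
from datetime import date

# Mutable approach: the update overwrites the old value, which is then lost.
mutable_table = {"Guido": "Berne", "Albert": "Zurich"}
mutable_table["Guido"] = "Basel"

# Immutable approach: every change is appended as a timestamped event.
events = [
    ("Guido", "Berne", date(1999, 8, 1)),
    ("Albert", "Zurich", date(1988, 5, 10)),
]
events.append(("Guido", "Basel", date(2013, 4, 1)))


def current_city(events, name, as_of=date.max):
    """Derive a person's city at a point in time from the event log."""
    matching = [(ts, city) for n, city, ts in events if n == name and ts <= as_of]
    return max(matching)[1] if matching else None
```

With the event log, `current_city(events, "Guido")` gives the latest city, while passing an earlier `as_of` date still recovers the previous one; the mutable table can only answer the first question.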
What about traditional/today’s architectures?
Source of Truth is mutable!
Rather than build systems like this ….
[Diagram: applications (mobile, web, RIA, rich client) query a mutable database (RDBMS, NoSQL, NewSQL), which is the source of truth]
A different kind of architecture with immutable source of truth
… why not build them like this?
[Diagram: applications (mobile, web, RIA, rich client) query views on data; the views are derived from immutable data (HDFS, NoSQL, NewSQL, RDBMS), which is the source of truth]
How to create the views on the Immutable data?
On the fly?
Materialized, i.e. pre-computed?
[Diagram: a query runs either against a view computed on the fly from the immutable data, or against pre-computed views derived from the immutable data]
Data = the most raw information
Data is information which is not derived from anywhere else
• The most raw form of information
• Data is the special information from which everything else is derived
Questions on data can be answered by running functions that take data as input
The most general purpose data system can answer questions by running functions that take the entire dataset as input
query = function (all data)
The lambda architecture provides a general purpose approach for implementing arbitrary functions on arbitrary datasets
Data = the most raw information
Favorite Product List Changes (raw information => data):
1.2.13 Add iPAD 64GB
10.3.13 Add Sony RX-100
11.3.13 Add Canon GX-10
11.3.13 Remove Sony RX-100
12.3.13 Add Nikon S-100
14.4.13 Add BoseQC-15
15.4.13 Add MacBook Pro 15
20.4.13 Remove Canon GX-10

derive => Current Favorite Product List (information => derived):
iPAD 64GB
Nikon S-100
BoseQC-15
MacBook Pro 15

derive => Current Product Count: 4
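The derivation on this slide can be sketched in Python as a function over the entire dataset. This is only an illustration under assumptions: the function names are made up, the dates are kept as plain strings, and the product spelling is normalized to "Canon GX-10".

```python
# (date, action, product) events: the raw, immutable data.
changes = [
    ("1.2.13", "Add", "iPAD 64GB"),
    ("10.3.13", "Add", "Sony RX-100"),
    ("11.3.13", "Add", "Canon GX-10"),
    ("11.3.13", "Remove", "Sony RX-100"),
    ("12.3.13", "Add", "Nikon S-100"),
    ("14.4.13", "Add", "BoseQC-15"),
    ("15.4.13", "Add", "MacBook Pro 15"),
    ("20.4.13", "Remove", "Canon GX-10"),
]


def current_favorites(all_data):
    """query = function(all data): replay every event in order."""
    favorites = []
    for _date, action, product in all_data:
        if action == "Add" and product not in favorites:
            favorites.append(product)
        elif action == "Remove" and product in favorites:
            favorites.remove(product)
    return favorites


def current_count(all_data):
    """A second derived view, computed from the same raw data."""
    return len(current_favorites(all_data))
```

Both views can be thrown away and recomputed at any time, because the list of changes is never modified.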
Big Data and Batch Processing
[Diagram: incoming data is appended to the immutable data; batch views are computed from it and answer queries]
How to compute the batch views?
How to compute queries from the views?
Big Data and Batch Processing
[Diagram: timeline up to now. Data up to the last full batch period is fully batch-processed; data that arrived during the time needed for the batch job is not yet processed]
• Using only batch processing always leaves you with a portion of non-processed data.
Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
But we are not done yet …
Adding Real-Time Processing
[Diagram: incoming data is appended to the immutable data, from which batch views are computed, and also flows as a data stream into real-time views; queries read both kinds of view]
How to compute queries from the views?
How to compute real-time views?
Adding Real-Time Processing
Favorite Product List Changes (immutable data):
1.2.13 Add iPAD 64GB
10.3.13 Add Sony RX-100
11.3.13 Add Canon GX-10
11.3.13 Remove Sony RX-100
12.3.13 Add Nikon S-100
14.4.13 Add BoseQC-15
15.4.13 Add MacBook Pro 15
20.4.13 Remove Canon GX-10

Stream of Favorite Product List Changes (data stream):
Now Add Canon Scanner

compute => Current Favorite Product List (views):
iPAD 64GB
Nikon S-100
BoseQC-15
MacBook Pro 15
Canon Scanner

compute => Current Product Count: 5
Big Data and Real Time Processing
[Diagram: timeline up to now. Batch processing (e.g. Hadoop) worked fine for the fully processed data up to the last full batch period; real-time processing works for the data arriving during the time needed for the batch job; the end user sees a blended view]
Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
Lambda Architecture
[Diagram: incoming data feeds both the batch layer (immutable data, batch view) and the speed layer (data stream, real-time view); the serving layer answers queries from both views. The parts are labeled A to G.]
Lambda Architecture
A. All data is sent to both the batch and speed layer
B. Master data set is an immutable, append-only set of data
C. Batch layer pre-computes query functions from scratch; the results are called batch views. The batch layer constantly re-computes the batch views.
D. Batch views are indexed and stored in a scalable database to get particular values very quickly. New batch views are swapped in when they are available.
E. Speed layer compensates for the high latency of updates to the Batch Views in the Serving layer.
F. Uses fast incremental algorithms and read/write databases to produce real-time views
G. Queries are resolved by getting results from both batch and real-time views
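Steps A to G can be sketched in a few lines of single-process Python. This is a toy model under stated assumptions, not an implementation with the products discussed later: all names are hypothetical, the views are simple event counts, and the batch run is treated as instantaneous, so the real-time view can be emptied as soon as the batch run finishes.

```python
from collections import Counter

master_dataset = []          # B: immutable, append-only master data set
batch_view = Counter()       # C/D: pre-computed view, only read by queries
realtime_view = Counter()    # F: incremental view kept by the speed layer


def ingest(event):
    """A: every incoming event goes to both the batch and the speed layer."""
    master_dataset.append(event)   # the batch layer only ever appends
    realtime_view[event] += 1      # the speed layer updates incrementally


def run_batch():
    """C: recompute the batch view from scratch, then swap it in (D)."""
    global batch_view, realtime_view
    batch_view = Counter(master_dataset)
    realtime_view = Counter()      # E: these events are now covered by the batch view


def query(key):
    """G: resolve a query by merging the batch and real-time results."""
    return batch_view[key] + realtime_view[key]
```

In a real system the batch run takes hours, so only the real-time results that the finished batch run actually covers may be discarded.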
Layered Architecture
Batch Layer
• Stores the immutable, constantly growing dataset
• Computes arbitrary views from this dataset using Big Data technologies (can take hours)
• Can always be recreated
Serving Layer
• Responsible for indexing and exposing the pre-computed batch views so that they can be queried
• Exposes the incremented real-time views
• Merges the batch and the real-time views into a consistent result
Speed Layer
• Computes the views from the constant stream of data it receives
• Needed to compensate for the high latency of the batch layer
• Incremental model and views are transient
Lambda Architecture
[Diagram: incoming data goes to the batch layer (all data, batch recompute, precomputed information) and to the speed layer (process stream, realtime increment, incremented information); the serving layer merges the batch views and real-time views to answer a query]
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
Lambda Architecture
[Diagram: the Lambda Architecture layers (batch, speed, serving) annotated with one possible product/framework mapping]
Implementing Batch Layer
Immutable Data
• Append only
• Normalized
• Stores master copy of all data
Pre-computed information
• Function that takes all data as input
query = function(all-data)
• High Latency, Batch processing
• Unrestrained computation
• Horizontally scalable
[Diagram: in the batch layer, batch views are computed from the immutable data (all data, batch recompute) and handed to the serving layer as precomputed information]
Apache Hadoop HDFS
HDFS = the Hadoop Distributed File System
A distributed file storage system
Redundant storage
Designed to reliably store data using commodity hardware
Designed to expect hardware failures
Intended for large files
Designed for batch inserts
Apache Hadoop Map Reduce
• Hadoop MapReduce is an open source implementation of the MapReduce framework.
• MapReduce is
  - a programming model, introduced by Google, for processing large data sets in a distributed environment
  - the de-facto standard to compute huge amounts of data
  - an execution framework for organizing and performing such computations
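The programming model can be illustrated with a word count in plain Python. This is a single-process sketch of the map, shuffle and reduce phases, not the Hadoop API; the function names are assumptions chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each `map_phase` call sees only one record and each `reduce_phase` call sees only one key, both phases can be spread over many worker nodes.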
[Diagram: a master node distributes the problem data in the MAP step across worker nodes 1 to 3; the REDUCE step combines their results into the solution data]
Hadoop MapReduce Flow
Source: Bill Graham, Twitter Inc.
Hadoop MapReduce
Cascading
Application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop
Adds an abstraction layer over the Hadoop API
Core concepts of the Cascading API:
• Pipe: a series of processing steps (parsing, looping, filtering, etc.) defining the data processing to be done
• Flow: association of a pipe (or set of pipes) with a data source and data sink
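The pipe and flow concepts can be mimicked in a few lines of Python. This is an analogy only: the classes `Pipe` and `Flow` below are hypothetical stand-ins written for this example and are not the Cascading Java API.

```python
class Pipe:
    """A series of processing steps applied to the records, in order."""

    def __init__(self, *steps):
        self.steps = steps

    def run(self, records):
        for step in self.steps:
            records = step(records)
        return records


class Flow:
    """Binds a pipe to a data source and a data sink."""

    def __init__(self, source, pipe, sink):
        self.source, self.pipe, self.sink = source, pipe, sink

    def complete(self):
        self.sink.extend(self.pipe.run(iter(self.source)))


def parse(records):
    """Hypothetical parsing step: turn text records into integers."""
    return (int(r) for r in records)


def keep_even(records):
    """Hypothetical filtering step: keep only even numbers."""
    return (n for n in records if n % 2 == 0)


sink = []
Flow(["1", "2", "3", "4"], Pipe(parse, keep_even), sink).complete()
```

The pipe describes what to do with the data without knowing where it comes from; only the flow ties it to a concrete source and sink.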
Cascading
Apache Pig
Apache Pig is a platform for analyzing large data sets
Key Properties
• Ease of programming
• Optimization opportunities
• Extensibility
Implementing Serving Layer for Batch Views
Need a database that
• Is batch-writable
• Adds new information atomically
• Has fast random reads
• Is scalable
• Is highly available
• Can be optimized for storage
• Information can be de-normalized
• But no random writes required!
• Can be a simple database
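The combination of batch writes, atomic publication and random reads can be sketched with an in-memory store in Python. `BatchViewStore` is a hypothetical class for illustration; it assumes a single process and says nothing about how a real serving-layer database achieves the same effect.

```python
class BatchViewStore:
    """Read-only store whose whole content is replaced in one step."""

    def __init__(self):
        self._view = {}

    def get(self, key, default=None):
        # Fast random read against the currently active view.
        return self._view.get(key, default)

    def swap_in(self, new_view):
        # Build a private copy first, then publish it with a single
        # reference assignment, so readers see either the old or the new view.
        fresh = dict(new_view)
        self._view = fresh
```

There is no method to update a single key: the only write operation replaces the complete view, which is why no random writes are required.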
[Diagram: the batch layer computes batch views from the immutable data; the serving layer stores this precomputed information as batch views]
SploutSQL
Full SQL => unlike NoSQL
For BigData => unlike RDBMS
Web latency & throughput => unlike Apache Hive, Apache Drill
Why does it scale
• Data is partitioned
• Partitions are distributed across nodes
• Adding more nodes increases capacity
• Generation does not impact serving
Source: Datasalt.
Implementing Speed Layer
Stream Processing
Continuous computation
Transactional
Storing a limited window of data
• Compensating for the last few hours of data
All the complexity is isolated in the speed layer
• If anything goes wrong, it's auto-corrected by the next batch run
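The limited window and the auto-correction can be sketched as follows in Python. `RealtimeView` is a hypothetical class; it assumes counts are bucketed by hour and that the batch layer reports the hour up to which it has absorbed the data.

```python
class RealtimeView:
    """Incremental counts, bucketed by hour so old buckets can be dropped."""

    def __init__(self):
        self._buckets = {}  # hour -> {key: count}

    def increment(self, hour, key):
        bucket = self._buckets.setdefault(hour, {})
        bucket[key] = bucket.get(key, 0) + 1

    def expire_before(self, hour):
        # Called once a batch run has absorbed all data older than `hour`;
        # any error in the dropped buckets disappears with them.
        self._buckets = {h: b for h, b in self._buckets.items() if h >= hour}

    def count(self, key):
        return sum(b.get(key, 0) for b in self._buckets.values())
```

A wrong increment in the speed layer therefore only lives until the batch layer has recomputed the affected period from the immutable data.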
[Diagram: the speed layer processes the data stream (realtime increment) and derives real-time views, which the serving layer exposes as incremented information]
Apache Kafka
A high throughput distributed messaging system
Originated at LinkedIn
Sequential disk access
Twitter Storm – the “real-time Hadoop”
• Storm is a distributed and fault-tolerant real-time computing platform
• Data flow model: data flows through a network of transformation entities
• Key concepts
  - Tuple: ordered list of elements
  - Streams: unbounded sequence of tuples
  - Spouts: source of streams
  - Bolts: process tuples and create new streams
  - Topologies: directed graph of spouts and bolts
• Use cases
  - Stream processing
  - Continuous computation
  - Distributed RPC
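The key concepts can be imitated with Python generators. This sketch is not the Storm API: it runs in a single process without distribution or fault tolerance, the stream is finite, and all function names are assumptions made for this example.

```python
def sentence_spout():
    """Spout: source of a stream; here a finite stand-in for an unbounded one."""
    for sentence in ["big data", "fast data"]:
        yield (sentence,)              # a tuple is an ordered list of elements


def split_bolt(stream):
    """Bolt: consumes tuples and emits a new stream of one-word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every tuple."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt (a linear directed graph).
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

Unlike a batch job, the count bolt emits an updated result after every incoming tuple instead of waiting for the end of the input.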
[Diagram: a topology in the speed layer. A SPOUT reads the problem data from the data source and feeds BOLTs that act as "map", "reduce" and "persist" steps; the solution data is written to the serving layer]
Twitter Storm
Twitter Trident
Higher level abstraction over Storm
Trident State
Grouped Stream
Functions, Filters
Aggregators
Query
Similar to Pig and Cascading
Implementing Serving Layer for Real-Time Views
Incremental updates are made available as real-time views
Requires a database that supports random reads and random writes
• Relational, NoSQL or NewSQL (in-memory) databases can be used
• Here we are typically not in the Big Data range
Results are only needed until the data has made it through the batch layer
Complexity isolation
[Diagram: the speed layer derives real-time views from the data stream; the serving layer exposes them as incremented information]
Cassandra
Fully distributed, no single-point-of-failure
Linearly scalable
Fault tolerant
Performant
Durable
Integrated caching
Tunable consistency
Implementing Serving Layer: Merge of Batch and Realtime Views
An interesting feature of Storm / Trident is the ability to execute distributed RPC (DRPC) calls in parallel
This can be used to implement the merge functionality when a query is executed
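The idea of querying both views in parallel and merging the partial results can be sketched in Python. This uses local threads from the standard library instead of Storm DRPC, and the two dictionaries are hypothetical stand-ins for the batch and real-time stores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two stores behind the serving layer.
batch_views = {"page-a": 100, "page-b": 40}
realtime_views = {"page-a": 3, "page-c": 1}


def query_batch(key):
    return batch_views.get(key, 0)


def query_realtime(key):
    return realtime_views.get(key, 0)


def merged_query(key):
    """Ask both views in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch = pool.submit(query_batch, key)
        realtime = pool.submit(query_realtime, key)
        return batch.result() + realtime.result()
```

Merging by addition works here because both views hold counts; other kinds of view need their own merge function.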
[Diagram: in the serving layer, a query is answered by merging the batch views and the real-time views]
Storm / Trident DRPC
Summary – The lambda architecture
The Lambda Architecture
• Can discard batch views and real-time views and recreate everything from scratch
• Mistakes corrected via re-computation
• Data storage layer optimized independently from query resolution layer
• Still in a very early …. But a very interesting idea!
  - Today a zoo of technologies is needed => Operations won't like it
• Different query languages for batch and real time
• An abstraction over batch and speed layer is needed
  - Cascading and Trident are already similar
• Industry standards needed!
THANK YOU. Trivadis AG
Guido Schmutz & Albert Blarer
Europa-Strasse 5, CH-8095 Glattbrugg
[email protected] www.trivadis.com