Implementing the Lambda Architecture with Spring XD
DESCRIPTION
Recorded at SpringOne2GX 2014. Speaker: Carlos Queiroz. Track: Big Data.
The lambda architecture has been proposed as a general-purpose data system that aims to solve the problem of computing arbitrary functions on an arbitrary dataset in (near) real time. In this talk we introduce the lambda architecture and show how it can be implemented with Spring XD, GemFire XD and Hadoop (HDFS and MapReduce) as the foundation of the architecture implementation. To validate the architecture we introduce a CDR (Call Detail Record) mining application as a use case of the lambda architecture. We finish by showing a demo of the CDR mining application.
TRANSCRIPT
![Page 1: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/1.jpg)
© 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission.
Implementing the Lambda Architecture with Spring XD
By Carlos Queiroz
![Page 2: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/2.jpg)
Me
Carlos Queiroz, Advanced Field Engineer
![Page 3: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/3.jpg)
Agenda
• Introduction to the Lambda Architecture
• Applying Lambda architecture to a business case
• Implementation details
![Page 4: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/4.jpg)
What is Lambda Architecture?
• Implementation of a set of desired properties on general-purpose big data systems.
• Generic architecture addressing common requirements in big data applications.
• Set of design patterns for dealing with historical and operational data.
![Page 5: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/5.jpg)
Approach is new, ideas not so…
[Figure: The Big Data Ecosystem. Traditional/relational data sources feed the traditional warehouse (data warehouse, analytics on structured data, analytics on data at rest). Non-traditional/non-relational data sources and feeds lead to streaming data (stream computing, analytics on data in motion) and Internet-scale data sets (Hadoop, MapReduce-like models).]
![Page 6: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/6.jpg)
Why Lambda Architecture?
To build a data system that can answer questions by running functions that take the entire dataset as input: a general-purpose data system.
![Page 7: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/7.jpg)
Desired Properties of such systems
Fault-tolerance
Generic
Scale linearly, horizontally.
Extensible
Be able to achieve low latency updates when necessary.
Be able to “ask” arbitrary questions of the system
![Page 8: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/8.jpg)
How to support those properties?
Decompose the problem into pieces
Speed Layer
Batch Layer
Serving Layer
![Page 9: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/9.jpg)
Batch Layer
[Diagram: master dataset → batch layer → batch views]
runBatchLayer() { while(true) recomputeFunctions()}
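The loop above can be sketched in Python as follows. This is a minimal illustration, not the talk's implementation: the function names, the `events_per_house` view, and the bounded `iterations` parameter (used instead of `while(true)` so the sketch terminates) are all assumptions made for the example.

```python
from collections import defaultdict


def recompute_batch_views(master_dataset):
    """Rebuild every batch view from scratch from the immutable master dataset."""
    events_per_house = defaultdict(int)
    for record in master_dataset:
        events_per_house[record["house_id"]] += 1
    # A single hypothetical view; a real system would compute several.
    return {"events_per_house": dict(events_per_house)}


def run_batch_layer(read_master_dataset, publish, iterations):
    """Bounded stand-in for the slide's endless recompute loop."""
    for _ in range(iterations):
        publish(recompute_batch_views(read_master_dataset()))
```

Each pass ignores the previous views entirely, which is what makes the batch layer simple to reason about: a bug in a function is fixed by redeploying it and waiting for the next recomputation.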
![Page 10: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/10.jpg)
The Master dataset
• Source of truth
• Write once, read many
• Protect
• Recompute
![Page 11: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/11.jpg)
Master dataset requirements
| Operation | Requirement | Comments |
| --- | --- | --- |
| Writes | Efficient appends of new data | Basically add new pieces of data. Easy to append new data. |
| Writes | Scalable storage | Need to handle possibly petabytes of data. |
| Reads | Support for parallel processing | Functions usually work on the entire dataset. Need to support handling large amounts of data. |
| Reads | Vertically partition data | Not necessary to look at all the data all the time. Some functions may need to look at only relevant data (e.g. 1 week of calls). |
| Writes/Reads | Costs for processing. Flexible storage | Storage costs money (a lot). Need flexibility on how to store and compress data. |
![Page 12: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/12.jpg)
The Master dataset
HDFS - Hadoop Distributed File System
![Page 13: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/13.jpg)
Serving Layer
Makes the batch views “queryable”
Batch updates and random reads
[Diagram: batch views loaded into a distributed database]
![Page 14: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/14.jpg)
Desired properties on the Serving layer
• High throughput
• Low latency
![Page 15: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/15.jpg)
Serving layer database requirements
Batch writable
Scalable
Random reads
Fault-tolerant
![Page 16: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/16.jpg)
Batch and serving layers
[Diagram: master dataset (immutable storage) → batch layer (analytical functions) → batch views → serving layer (pre-computed analytical functions, i.e. views)]
Only property missing: low-latency updates
![Page 17: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/17.jpg)
Speed layer
Allows arbitrary functions to be computed on arbitrary data in (near) real time.
[Diagram: multiple data feeds entering the speed layer]
![Page 18: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/18.jpg)
Speed layer
[Diagram: realtime functions producing realtime views]
Narrow but more up-to-date view
Depending on the complexity of the functions, an incremental computation approach is recommended
![Page 19: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/19.jpg)
Incremental computation
realtime view = function(new data, realtime view)
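As a hedged illustration of this formula, the sketch below keeps a running count and sum per plug so that an average can be read at any time without revisiting old events. The event shape (`plug_id`, `value`) and the function names are assumptions for the example, not taken from the talk's code.

```python
from functools import reduce


def update_realtime_view(view, event):
    """realtime view = function(new data, realtime view): fold one event in."""
    count, total = view.get(event["plug_id"], (0, 0.0))
    updated = dict(view)  # leave the previous view untouched
    updated[event["plug_id"]] = (count + 1, total + event["value"])
    return updated


def average_load(view, plug_id):
    """Read the running average for one plug from the view."""
    count, total = view[plug_id]
    return total / count


def build_view(events):
    """Apply the update function to a whole stream, starting from an empty view."""
    return reduce(update_realtime_view, events, {})
```

The work per event is constant, which is the point of the incremental approach: the cost of an update does not grow with the amount of data already seen.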
![Page 20: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/20.jpg)
Eventual Accuracy
• Some computations are harder to perform exactly.
• For such cases approximations are used: results approximate the correct answer.
• Sophisticated queries such as realtime machine learning are usually done with eventual accuracy, because they are too complex to be computed exactly.
![Page 21: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/21.jpg)
Approximation & Randomisation
• Approximation: find an answer that is correct within some factor (e.g. an answer within 10% of the correct result)
• Randomisation: allow a small probability of failure (e.g. 1 answer in 10,000 is wrong)
![Page 22: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/22.jpg)
Synopses structures
• Sampling
• Sketches
• Histograms
• Wavelets
Library implementations
• algebird (https://github.com/twitter/algebird)
• samoa (https://github.com/yahoo/samoa/wiki)
Reference: http://charuaggarwal.net/synopsis.pdf
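To make "sketches" concrete, here is a minimal Count-Min sketch written from scratch in Python. It is an illustration of one well-known sketch structure, not code from algebird or samoa, and the `width` and `depth` defaults are arbitrary assumptions. It trades exactness for bounded memory: estimates can overcount when items collide, but never undercount.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using depth rows of width counters."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One deterministic hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )
```

Memory use is fixed at `width * depth` counters regardless of how many distinct items are seen, which is what makes this kind of structure usable inside a realtime function.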
![Page 23: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/23.jpg)
Stream processing (real-time function)
Run the realtime functions to update the realtime views
[Diagram: data ingestion → stream processing (realtime function)]
![Page 24: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/24.jpg)
One-at-a-time
Divide your processing into worker processes, and put queues between the worker processes
Queues and workers paradigm
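A minimal sketch of the queues-and-workers idea, using only the Python standard library. The names `worker`, `run_pipeline` and the `STOP` sentinel are hypothetical; a production system would use a message broker and separate processes rather than in-process threads.

```python
import queue
import threading

STOP = object()  # sentinel that tells each worker to shut down


def worker(inbox, outbox, step):
    """Take one event at a time from inbox, apply step, pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        outbox.put(step(item))


def run_pipeline(events, steps):
    """Chain one worker per step, with a queue between consecutive workers."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=worker, args=(queues[i], queues[i + 1], step))
        for i, step in enumerate(steps)
    ]
    for thread in threads:
        thread.start()
    for event in events:
        queues[0].put(event)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

With a single worker per stage, as here, event order is preserved; adding more workers per stage increases throughput but gives up that ordering guarantee.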
![Page 25: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/25.jpg)
Generalised one-at-a-time approach
• Works on a higher level
• Stream computation defined as a graph (usually).
• Storm, InfoSphere Streams models
• Filters and pipes
• Spring XD
• At-least-once processing in case of failures.
cut -d" " -f1 < access.log | sort | uniq -c | sort -rn | less
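The same pipes-and-filters shape can be sketched with Python generators, where each stage consumes the output of the previous one. This is only an analogy to the shell pipeline above, not Spring XD code; the function names are made up for the example, and ties are broken by key, which `sort -rn` does not guarantee.

```python
from collections import Counter


def first_field(lines):
    """Filter stage: like cut -d" " -f1, keep the first space-separated field."""
    for line in lines:
        yield line.rstrip("\n").split(" ", 1)[0]


def count_descending(values):
    """Sink stage: like sort | uniq -c | sort -rn, most frequent first."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def top_clients(lines):
    """Compose the stages in the same order as the shell pipeline."""
    return count_descending(first_field(lines))
```

Calling `top_clients(open("access.log"))` would stream the file line by line through the stages, assuming an `access.log` with the client address as the first field.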
![Page 26: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/26.jpg)
Micro-batching
• Small batches of events, processed one batch at a time
• Exactly-once processing
• Implementations:
  • Spark Streaming
[Diagram: data ingestion split into batch #1, batch #2, ...]
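A minimal sketch of micro-batching in Python. One assumption to flag: batches here are cut by event count (`batch_size`), purely to keep the example deterministic, whereas systems such as Spark Streaming cut batches by time interval. The function names are hypothetical.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an incoming stream into small lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(events, batch_size, process_batch):
    """Apply process_batch to each micro-batch, one batch at a time."""
    return [process_batch(batch) for batch in micro_batches(events, batch_size)]
```

Because a batch is processed as a unit, a failed batch can be retried as a whole, which is what makes exactly-once semantics easier to provide than in the one-at-a-time model.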
![Page 27: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/27.jpg)
Lambda Architecture
[Diagram: data ingestion sends raw data to the speed layer (analytical functions) and to the batch layer (analytical functions, immutable storage); the serving layer holds the pre-computed analytical functions and answers ad-hoc queries]
![Page 28: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/28.jpg)
Principles of Lambda Architecture
• Store data in its rawest form
• Immutability and perpetuity
• Re-computation
• Query = function(all data)
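The last principle can be illustrated with a small Python sketch: a query answered from a batch view plus a realtime view should give the same result as the function applied to all the data. The counting query, the view layout (a dict of counts per `house_id`) and the function names are assumptions for the example.

```python
def query_over_all_data(house_id, all_events):
    """Query = function(all data): the definition, too slow to run on every request."""
    return sum(1 for event in all_events if event["house_id"] == house_id)


def count_by_house(events):
    """Build a view (batch or realtime) that maps house_id to an event count."""
    view = {}
    for event in events:
        view[event["house_id"]] = view.get(event["house_id"], 0) + 1
    return view


def query_from_views(house_id, batch_view, realtime_view):
    """Answer the same query by merging the pre-computed and the recent view."""
    return batch_view.get(house_id, 0) + realtime_view.get(house_id, 0)
```

Merging is trivial here because counts add up; for functions that do not combine this easily, the merge step is where most of the design effort goes.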
![Page 29: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/29.jpg)
Implementing the Lambda Architecture
The ACM DEBS 2014 Grand Challenge (http://www.cse.iitb.ac.in/debs2014/)
Goal: to demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high-volume sensor data
![Page 30: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/30.jpg)
ACM DEBS 2014 Challenge Application
Scenario: analysis of energy consumption measurements
• Short-term load forecasting: makes load forecasts based on current load and what was learned over historical data
• Load statistics for real-time demand management: finds outliers based on the energy consumption
[Diagram: smart plugs, identified by house_id, dev_id and plug_id]
![Page 31: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/31.jpg)
Data model
Data source

| Field | Comments |
| --- | --- |
| ID | UNIQUE IDENTIFIER |
| TIMESTAMP | TIMESTAMP OF MEASUREMENT |
| VALUE | MEASUREMENT |
| PROPERTY | TYPE: 0 - WORK, 1 - LOAD |
| PLUG_ID | UNIQUE IDENTIFIER OF A PLUG |
| HOUSEHOLD_ID | UNIQUE IDENTIFIER OF A HOUSEHOLD |
| HOUSE_ID | UNIQUE IDENTIFIER OF A HOUSE |

PL - House

| Field | Comments |
| --- | --- |
| TS | TIMESTAMP OF STARTING TIME PREDICTION |
| HOUSE_ID | HOUSE ID |
| PREDICTED_LOAD | PREDICTED LOAD |

Outliers

| Field | Comments |
| --- | --- |
| TS_START | TIMESTAMP OF START TIME WINDOW |
| TS_STOP | TIMESTAMP OF STOP TIME WINDOW |
| HOUSE_ID | HOUSE ID |
| PERCENTAGE | % PLUGS LOAD HIGHER THAN NORMAL |

PL - Plug

| Field | Comments |
| --- | --- |
| TS | TIMESTAMP OF STARTING TIME PREDICTION |
| HOUSE_ID | HOUSE ID |
| HOUSEHOLD_ID | |
| PLUG_ID | |
| PREDICTED_LOAD | PREDICTED LOAD |
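The data source record can be modelled as a small immutable type. The sketch below is an assumption-laden illustration: the class and function names are invented, `event_id` and `prop` stand in for the ID and PROPERTY fields, and the parser assumes one comma-separated line per measurement with the fields in the order listed in the data source table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorEvent:
    """One measurement from a plug; frozen because raw data is never updated."""

    event_id: int
    timestamp: int
    value: float
    prop: int  # 0 = work, 1 = load
    plug_id: int
    household_id: int
    house_id: int

    def is_load(self):
        return self.prop == 1


def parse_event(line):
    """Parse one comma-separated line (assumed format) into a SensorEvent."""
    event_id, timestamp, value, prop, plug_id, household_id, house_id = (
        line.strip().split(",")
    )
    return SensorEvent(
        int(event_id),
        int(timestamp),
        float(value),
        int(prop),
        int(plug_id),
        int(household_id),
        int(house_id),
    )
```

A plug is only unique within its household and house, so code that groups by plug should key on the triple `(house_id, household_id, plug_id)` rather than on `plug_id` alone.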
![Page 32: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/32.jpg)
Analytical models
Load prediction per house, per plug

$$L(s_{i+2}) = \frac{\mathrm{avgLoad}(s_i) + \mathrm{median}(\mathrm{avgLoad}(s_j))}{2}, \qquad s_j = s_{(i+2-n \cdot k)}$$

It is based on the average load of the current window and the median of the average loads of windows covering the same time of all past days.

Outliers

For each house, calculate the percentage of plugs which have a median load during the last hour greater than the median load of all plugs (in all households of all houses) during the last hour.
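Both models can be sketched in a few lines of Python. Several things here are assumptions rather than facts from the slides: `k` is taken to be the number of time slices per day and `n` to range over the past days (the slide does not define them), the fallback to the current average when no history exists is a choice made for the example, and the input shapes and function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def predict_load(avg_loads, i, k):
    """L(s_{i+2}) from average loads indexed by slice, for current slice i."""
    if k < 2:
        raise ValueError("k must be at least 2 so that history lies in the past")
    target = i + 2
    # s_j = s_(i+2-n*k): the same time of day on each past day.
    history = [avg_loads[target - n * k] for n in range(1, target // k + 1)]
    current = avg_loads[i]
    if not history:
        return current  # assumed fallback when no past day is available
    return (current + median(history)) / 2


def outlier_percentages(loads_by_plug):
    """Percentage of plugs per house whose median load exceeds the overall median.

    loads_by_plug maps (house_id, household_id, plug_id) to the load values
    observed for that plug during the window (e.g. the last hour).
    """
    all_values = [v for values in loads_by_plug.values() for v in values]
    overall_median = median(all_values)
    plugs_per_house = defaultdict(int)
    outliers_per_house = defaultdict(int)
    for (house_id, _household_id, _plug_id), values in loads_by_plug.items():
        plugs_per_house[house_id] += 1
        if median(values) > overall_median:
            outliers_per_house[house_id] += 1
    return {
        house: 100.0 * outliers_per_house[house] / count
        for house, count in plugs_per_house.items()
    }
```

Note that the median is not incrementally updatable the way a sum or count is, which is one reason the same model is not always easy to share between the speed and batch layers.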
![Page 33: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/33.jpg)
How others did it
Storm-based implementation
![Page 34: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/34.jpg)
How others did it
Oracle DB + Oracle Event Processing (OEP)
![Page 35: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/35.jpg)
How others did it
SEEP: Real-Time Stateful Big Data Processing
http://lsds.doc.ic.ac.uk/projects/seep/debs14-gc
![Page 36: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/36.jpg)
Implementation details
Fork from https://github.com/pivotalsoftware/gfxd-demo
Thanks to Jens Deppe and William Markito
![Page 37: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/37.jpg)
Our approach - ACM DEBS 2014 system architecture
[Diagram: sensor events → RabbitMQ → Spring XD (speed layer) → Hadoop (batch layer) and GemFire XD (serving layer) → Web App (Spring Boot)]
![Page 38: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/38.jpg)
Data ingestion
[Diagram: sensor events are published to RabbitMQ; a Spring XD stream app subscribes to them]
![Page 39: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/39.jpg)
Speed layer
Application flow
[Diagram: main flow: remove duplicates (filter) → fix missing values (transform) → store to GFXD (sink). Taps on the main flow: find outliers (tap) → outliers model (processor) → store to GFXD (sink) → RT views; load forecast (tap) → load prediction model (processor) → store to GFXD (sink) → RT views, once per house (H) and once per plug (P).]
![Page 40: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/40.jpg)
Stream actions
Duplicates filtering → enrichment (compute time slice) → compute model → produce RT view
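These four actions can be sketched as chained Python generators. This is an illustration only, not the Spring XD modules from the demo: the function names, the dict-based event shape, the `slice_seconds` parameter and the choice of "average load per house per slice" as the model are all assumptions.

```python
def remove_duplicates(events):
    """Duplicates filtering: drop events whose id has already been seen."""
    seen = set()
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            yield event


def enrich_with_time_slice(events, slice_seconds):
    """Enrichment: attach the index of the time slice the event falls into."""
    for event in events:
        yield {**event, "slice": event["timestamp"] // slice_seconds}


def compute_realtime_view(events):
    """Compute the model and produce the RT view: average load per house and slice."""
    totals = {}
    for event in events:
        key = (event["house_id"], event["slice"])
        count, total = totals.get(key, (0, 0.0))
        totals[key] = (count + 1, total + event["value"])
    return {key: total / count for key, (count, total) in totals.items()}


def run_stream(events, slice_seconds=60):
    """Wire the stages together in the order shown on the slide."""
    deduplicated = remove_duplicates(events)
    enriched = enrich_with_time_slice(deduplicated, slice_seconds)
    return compute_realtime_view(enriched)
```

Keeping every seen id in memory is only workable for a sketch; a long-running stream would need a bounded structure such as a time-windowed set.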
![Page 41: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/41.jpg)
Spring XD streams
5 streams:
• pumpin - sends all data to a master queue
• sensoreventenricher - consumes the data from the queue, then filters and transforms it before storing it on HDFS
• findoutliers - taps from the master_ds stream to compute the outlier model
• loadpred{h,p} - tap from the master_ds stream to compute load prediction for house and plug (2 streams)
![Page 42: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/42.jpg)
Batch Layer
Batch job that is started from Spring XD. Uses cron to run every X hours/minutes/days?
[Diagram: load prediction model job → Hadoop → batch views]
Hadoop-based system
• Stores an immutable and constantly growing master dataset (of all datasets)
• Computes arbitrary functions (models) on the existing datasets
• Essentially, runs the MR models
![Page 43: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/43.jpg)
Spring XD jobs
1 job
• A single job runs the MR model to compute the historical aspect of the models.
• MR job runs every “day/hour/min”
• Job is completely independent from the other streams
Job steps: load raw_sensor table → compute model → update avg, outliers table
![Page 44: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/44.jpg)
Serving layer (RT and Batch views)
GemFire XD
• Gets updates from MR and SP models
• Holds both batch and real-time views

PL - House

| Field | Comments |
| --- | --- |
| TS | TIMESTAMP OF STARTING TIME PREDICTION |
| HOUSE_ID | HOUSE ID |
| PREDICTED_LOAD | PREDICTED LOAD |

Outliers

| Field | Comments |
| --- | --- |
| TS_START | TIMESTAMP OF START TIME WINDOW |
| TS_STOP | TIMESTAMP OF STOP TIME WINDOW |
| HOUSE_ID | HOUSE ID |
| PERCENTAGE | % PLUGS LOAD HIGHER THAN NORMAL |

PL - Plug

| Field | Comments |
| --- | --- |
| TS | TIMESTAMP OF STARTING TIME PREDICTION |
| HOUSE_ID | HOUSE ID |
| HOUSEHOLD_ID | |
| PLUG_ID | |
| PREDICTED_LOAD | PREDICTED LOAD |
![Page 45: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/45.jpg)
Web UI
[Diagram: Web App (Spring Boot) reading the views]
![Page 46: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/46.jpg)
Why GemFire XD?
‣ Distributed database
‣ Tightly integrated with Pivotal Hadoop
‣ SQL support
‣ Fault-tolerant
‣ In-memory (fast access)
[Diagram: GemFire XD tables on top of Hadoop]
![Page 47: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/47.jpg)
Why Pivotal HD (PHD)?
‣ Hadoop based
‣ Tightly integrated with GemFire XD
[Diagram: GemFire XD tables on top of Hadoop]
![Page 48: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/48.jpg)
Why Spring XD?
‣ Data Ingestion
‣ Real-time Analytics
‣ Workflow Orchestration
‣ Integration
![Page 49: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/49.jpg)
Full Open source implementation
HBase??
Hadoop
Spring XD
![Page 50: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/50.jpg)
Is Lambda Architecture perfect?
• Suitable for specific use cases
• Doesn't contemplate reference data access.
• Not always possible to use the same model for real-time and batch
• Other ideas to improve the lambda architecture
  • Multiple batch layers
  • Incremental batch layers
![Page 51: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/51.jpg)
Source code
https://bitbucket.org/caxqueiroz/debs2014-challenge
![Page 52: Implementing the Lambda Architecture with Spring XD](https://reader034.vdocuments.site/reader034/viewer/2022052508/559444311a28ab0c308b474f/html5/thumbnails/52.jpg)
Thank you!!