Lambda Architecture @ Indix
DESCRIPTION
Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014. It covers our experience using Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
TRANSCRIPT
![Page 1: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/1.jpg)
Lambda Architecture
Analyzing large scale, unstructured, dynamic data
Rajesh Muppalla (@codingnirvana)
![Page 2: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/2.jpg)
Indix - Quick Overview
Am I priced higher or lower than my competitor on the Nikon D700?
Which product has the UPC 8745354434?
What are all the variants of the Apple MacBook Air 13”?
What is the average price change of all Nike shoes at Walmart in the last 3 months?
![Page 3: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/3.jpg)
Data Pipeline @ Indix
[Diagram: Crawling → Parsing → Classification (ML Model, categories C1 and C2) → Matching (ML Model) → Product & Price Catalog]
![Page 4: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/4.jpg)
Data Pipeline @ Indix
Analytics (Precomputes, Insights)
Search Index
Product & Price Catalog
Experiences
We released v1.0 of our API today: developer.indix.com
![Page 5: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/5.jpg)
Data is Dynamic
[Diagram: Crawling → Parsing → Classification → Matching, with items classified into C1 and C2 and the ML Model replaced by ML Model (new)]
![Page 6: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/6.jpg)
Data Scale
● 400 M product URLs
● 4 TB HTML data crawled daily
● 100 TB data processed daily
● 3000 categories
● 10 B price points
● 2000 sites
![Page 7: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/7.jpg)
Data Pipeline v1.0
![Page 8: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/8.jpg)
Batch using HBase & MapReduce
![Page 9: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/9.jpg)
Problem 1
Data Systems should be Human Fault Tolerant
Mutable State
![Page 10: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/10.jpg)
Problem 2
Compactions
Random Write databases are hard to manage at large scale
![Page 11: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/11.jpg)
Problem 3
16 hours
A latency of 16 hours is a lot. We wanted it to be a couple of hours.
![Page 12: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/12.jpg)
Three Problems
● No Human Fault Tolerance
  ○ Mutable State
● Operational Complexity
  ○ Random Writes (Compactions)
● High Latency
  ○ Batch system architectural tradeoff
![Page 13: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/13.jpg)
Rethink our data systems
![Page 14: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/14.jpg)
Lambda Architecture
![Page 15: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/15.jpg)
Lambda Architecture
● An approach to building big data systems
  ○ Architectural Components & Principles
  ○ Ties Batch & Real Time Systems
  ○ General Purpose - Domain Agnostic
● Coined by Nathan Marz
  ○ Ex-Twitter Engineer
  ○ Creator of Storm
![Page 16: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/16.jpg)
HBase
Data System - Traditional Approach
Application
Source of Truth
![Page 17: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/17.jpg)
Data System - New Approach
ImmutableRawData
ApplicationProcessed
View(s)
Source of Truth
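The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration, not the Indix implementation: the `PriceObserved` record and the `latest_price_view` function are hypothetical names chosen for the example. The raw log is only ever appended to, and the view is a pure function of that log, so it can be thrown away and rebuilt at any time.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PriceObserved:
    """An immutable fact: a price seen for a product at a point in time."""
    product_id: str
    price: float
    observed_at: int  # epoch seconds (illustrative)


def latest_price_view(log: Tuple[PriceObserved, ...]) -> Dict[str, float]:
    """Derive a processed view (latest price per product) from the raw log."""
    latest: Dict[str, PriceObserved] = {}
    for fact in log:
        current = latest.get(fact.product_id)
        if current is None or fact.observed_at >= current.observed_at:
            latest[fact.product_id] = fact
    return {product_id: fact.price for product_id, fact in latest.items()}


# The raw data is the source of truth; new facts are appended, never updated.
raw_log = (
    PriceObserved("nikon-d700", 1999.0, 100),
    PriceObserved("nikon-d700", 1899.0, 200),
)
raw_log = raw_log + (PriceObserved("macbook-air-13", 999.0, 150),)

view = latest_price_view(raw_log)
```

Because the older Nikon price is still in the log, a mistake in the view logic never destroys information; only the derived view needs to be rebuilt.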
![Page 18: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/18.jpg)
Let’s take an example
Find the count of unique products in any given category for the entire time range
![Page 19: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/19.jpg)
Two Requirements
● Recomputations
● Large Scale
![Page 20: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/20.jpg)
Batch Layer Implementation
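[Diagram: Products Master Data on HDFS (Vertical Partitioning), split into hourly partitions (9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm) with New Data appended. MR Job 1 builds an Intermediate view per category (C1 to C5); MR Job 2 builds the Batch View in HBase, which the Query reads.]
Batch View (category, count): C1 5, C2 7, C3 4, C4 7, C5 1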
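A rough sketch of how two batch jobs could produce the unique-product counts per category. The split of work between the two jobs is an assumption (the slide only names them), and plain Python functions stand in for MapReduce; the record layout and partition contents are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str]  # (category, product_id), a hypothetical layout


def job1_intermediate_view(
    partitions: Iterable[Iterable[Record]],
) -> Dict[str, Set[str]]:
    """Stage 1: collect the distinct product ids seen per category."""
    products_by_category: Dict[str, Set[str]] = defaultdict(set)
    for partition in partitions:
        for category, product_id in partition:
            products_by_category[category].add(product_id)
    return dict(products_by_category)


def job2_batch_view(intermediate: Dict[str, Set[str]]) -> Dict[str, int]:
    """Stage 2: reduce each category's product set to a count."""
    return {category: len(products) for category, products in intermediate.items()}


# Master data partitioned by hour; the same product can appear many times.
hourly_partitions = {
    "9 am": [("C1", "p1"), ("C2", "p2")],
    "10 am": [("C1", "p1"), ("C1", "p3")],
    "11 am": [("C2", "p4"), ("C2", "p2")],
}

intermediate = job1_intermediate_view(hourly_partitions.values())
batch_view = job2_batch_view(intermediate)
```

Keeping the intermediate view around means a new hourly partition only has to be folded into the per-category sets before the counts are regenerated.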
![Page 21: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/21.jpg)
Handling Recomputations
[Diagram: the same batch layer pipeline as on the previous slide (Products Master Data on HDFS, MR Job 1, Intermediate view, MR Job 2, Batch View in HBase, Query)]
![Page 22: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/22.jpg)
Handling Scale
● Hadoop HDFS, MapReduce, HBase
● Proven Linear Scalability
![Page 23: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/23.jpg)
Three Problems (Recap)
● No Human Fault Tolerance
  ○ Mutable State
● Operational Complexity
  ○ Random Writes (Compactions)
● High Latency
  ○ Batch system architectural tradeoff
![Page 24: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/24.jpg)
Human Fault Tolerance
● Bugs in the batch jobs
  ○ Discard views & Recompute
● Bugs in the master data jobs
  ○ Re-process the master data to hide the old data
● Bugs in the query
  ○ Re-deploy the query layer
● Traceability as a side effect
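The first case, a bug in a batch job, can be illustrated with a small sketch. The functions and data below are hypothetical; the point is only that the master data is untouched by the bug, so the bad view can be discarded and recomputed with the corrected job.

```python
from typing import Dict, Tuple

# Immutable master data: (category, product_id) records, made up for the example.
master_data: Tuple[Tuple[str, str], ...] = (
    ("C1", "p1"),
    ("C1", "p1"),
    ("C1", "p2"),
    ("C2", "p3"),
)


def buggy_unique_counts(records) -> Dict[str, int]:
    """Buggy job: counts every record, so duplicates inflate the result."""
    counts: Dict[str, int] = {}
    for category, _product_id in records:
        counts[category] = counts.get(category, 0) + 1
    return counts


def fixed_unique_counts(records) -> Dict[str, int]:
    """Corrected job: counts distinct product ids per category."""
    seen: Dict[str, set] = {}
    for category, product_id in records:
        seen.setdefault(category, set()).add(product_id)
    return {category: len(products) for category, products in seen.items()}


bad_view = buggy_unique_counts(master_data)

# Recovery: throw the view away and recompute from the same master data.
good_view = fixed_unique_counts(master_data)
```

With a mutable store updated in place, the inflated counts would have overwritten the only copy of the state; here the fix is a redeploy plus a recompute.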
![Page 25: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/25.jpg)
Operational Complexity
● No random writes in the batch layer
  ○ Bulk Updates to build the batch view
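As a loose analogy for bulk updates (this is not how HBase bulk loading is implemented), the sketch below writes a complete new view to a temporary file and then swaps it into place in one step, instead of updating individual entries. The file name and the helper functions are assumptions made for the example.

```python
import json
import os
import tempfile
from typing import Dict


def publish_batch_view(view: Dict[str, int], target_path: str) -> None:
    """Write the whole view to a temp file, then swap it in atomically."""
    directory = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(view, handle)
    # os.replace swaps the file in one step; readers see old or new, never a mix.
    os.replace(tmp_path, target_path)


def read_batch_view(target_path: str) -> Dict[str, int]:
    with open(target_path) as handle:
        return json.load(handle)


workdir = tempfile.mkdtemp()
view_path = os.path.join(workdir, "category_counts.json")

publish_batch_view({"C1": 5, "C2": 7}, view_path)   # first batch run
publish_batch_view({"C1": 6, "C2": 7}, view_path)   # next run replaces the view
current = read_batch_view(view_path)
```

The store only ever sees large sequential writes of finished views, which is the property that avoids the compaction pressure caused by many small random writes.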
![Page 26: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/26.jpg)
Great… What about Latency?
![Page 27: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/27.jpg)
Speed Layer
[Diagram: Recent Data → Queue (Kafka) → Real Time Processing (Storm) → Random Writes (Updates) of HyperLogLog sets into a Read-Write Data Store (Riak, HBase, Cassandra) → Query]
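To make the HyperLogLog sets on this slide concrete, here is a simplified, self-contained sketch of the standard algorithm. It is written for illustration only and is not the library used in the talk; the class name, the choice of SHA-1 as the hash, and the default precision are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch for approximate distinct counts."""

    def __init__(self, precision: int = 12):
        self.p = precision
        self.m = 1 << precision              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = x >> (64 - self.p)                   # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> None:
        """Union of two sketches: keep the larger value of each register."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


hour_1 = HyperLogLog()
hour_2 = HyperLogLog()
for i in range(5000):
    hour_1.add(f"product-{i}")
    hour_1.add(f"product-{i}")          # duplicates do not change the estimate
for i in range(2500, 7500):
    hour_2.add(f"product-{i}")

estimate_hour_1 = hour_1.count()
hour_1.merge(hour_2)                    # union covers product-0 .. product-7499
estimate_union = hour_1.count()
```

Each update touches a single small register, which is why the structure suits a store that accepts random writes, and `merge` lets sketches from different time windows be combined without double counting.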
![Page 28: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/28.jpg)
Speed Layer has mutation... But
● Speed layer deals with much smaller data
  ○ Batch Layer - Months/years of data
  ○ Speed Layer - Few hours or 1 day of data
● Easy to manage operationally
Complexity Isolation
![Page 29: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/29.jpg)
Final Step - Merging Results
[Diagram: Data flows into the Batch Layer and the Speed Layer; the Query merges both results]
Batch Layer: C1 - 50000
Speed Layer: C1 - 499 (approximate, with error 0.02%)
Merged Results: C1 - 50499
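The merge step for this example can be sketched as a simple addition of the two views. This assumes, as the slide's numbers do, that the speed layer only counts products not already covered by the batch view; the function and variable names are hypothetical.

```python
from typing import Dict


def merged_count(
    batch_view: Dict[str, int], speed_view: Dict[str, int], category: str
) -> int:
    """Answer a query by combining the batch view with the speed view."""
    return batch_view.get(category, 0) + speed_view.get(category, 0)


batch_view = {"C1": 50000}   # exact, but hours old
speed_view = {"C1": 499}     # approximate, covers only recent data

result = merged_count(batch_view, speed_view, "C1")  # 50499
```

If the recent data can overlap with what the batch layer has already seen, plain addition would double count, and merging set sketches such as HyperLogLog is the safer choice.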
![Page 30: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/30.jpg)
What about Accuracy?
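[Diagram: Data flows into the Batch Layer and the Speed Layer; the Query merges both results]
Speed Layer: C1 - 499 (approximate, with error 0.02%)
Batch Layer: C1 - 50000, later recomputed as C1’ - 50500
Merged Results: C1’ - 50500
Eventually Accurate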
![Page 31: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/31.jpg)
Lambda Architecture
![Page 32: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/32.jpg)
Lambda Architecture @ INDIX
![Page 33: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/33.jpg)
Lambda Architecture @ Indix
![Page 34: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/34.jpg)
Batch Layer @ Indix
● Pail
  ○ Vertical partitioning
  ○ Consolidation of small files
● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
  ○ Bulk updates to create views
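The two Pail features named above, vertical partitioning and consolidation of small files, can be illustrated with plain directories and files. This sketch does not use Pail or its API; the folder layout, file names, and helper functions are invented for the example.

```python
import os
import tempfile
from typing import List


def partition_path(root: str, record_type: str, hour: str) -> str:
    """Vertical partitioning: one folder per record type and hour."""
    return os.path.join(root, record_type, hour)


def write_part(root: str, record_type: str, hour: str,
               lines: List[str], file_name: str) -> None:
    folder = partition_path(root, record_type, hour)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, file_name), "w") as handle:
        handle.writelines(line + "\n" for line in lines)


def consolidate(folder: str, target_name: str = "consolidated.txt") -> str:
    """Merge all small files in a partition into a single larger file."""
    small_files = sorted(n for n in os.listdir(folder) if n != target_name)
    target = os.path.join(folder, target_name)
    with open(target, "a") as out:
        for name in small_files:
            path = os.path.join(folder, name)
            with open(path) as handle:
                out.write(handle.read())
            os.remove(path)
    return target


root = tempfile.mkdtemp()
write_part(root, "price", "09", ["p1,10.0"], "part-0.txt")
write_part(root, "price", "09", ["p2,12.5"], "part-1.txt")
write_part(root, "product", "09", ["p1,Nikon D700"], "part-0.txt")

price_folder = partition_path(root, "price", "09")
consolidated = consolidate(price_folder)
remaining = os.listdir(price_folder)
with open(consolidated) as handle:
    contents = handle.read()
```

A job that only needs price records reads the `price` folders and skips everything else, and consolidation keeps the number of files small, which matters on HDFS where many tiny files are costly.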
![Page 35: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/35.jpg)
Speed Layer @ Indix
● Still WIP
● To reduce latency
  ○ Micro batches for Speed layer
  ○ Use the last batch run + bulk update views
![Page 36: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/36.jpg)
Open Challenges
● Managing both Batch & Real Time is still painful
● Two broad directions
  ○ Abstractions
    ■ Summingbird (Twitter)
  ○ Unified Stack
    ■ Spark
    ■ Kafka + Samza/Storm (LinkedIn)
    ■ Cloud Dataflow (Google)
![Page 37: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/37.jpg)
In Conclusion...
● Lambda Architecture
  ○ A different approach to building data systems
  ○ Solid principles
  ○ Domain Agnostic
  ○ Tools not yet mature
![Page 38: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/38.jpg)
Resources
● Indix Engineering Blog - http://engineering.indix.com
● Runaway Complexity in Big Data Systems
● Lambda Architecture
● Big Data Book - Manning
● Scalding
● Spark
● Pail
● Summingbird
![Page 39: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/39.jpg)
Key Takeaways
- Human Fault Tolerance
- Complexity Isolation
- Higher Level Abstractions
![Page 40: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/40.jpg)
Thank You
![Page 41: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/41.jpg)
Batch vs Real Time Choices
![Page 42: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/42.jpg)
Tying it all together - Go-CD
![Page 43: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/43.jpg)
Extras
● Monoids
● LA is not new
  ○ Search Engines (fast, slow crawl)
  ○ Event Sourcing (immutable events to maintain state)
  ○ Patch, Audit, Bootstrap
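The slide only names monoids, so the following is one reading of why they matter here: if partial results form a monoid (an associative combine operation with an identity element), they can be aggregated in any grouping, so batch and speed results combine to the same answer as a single pass. The `combine` function and sample data are assumptions made for this sketch.

```python
from functools import reduce
from typing import Dict, FrozenSet

View = Dict[str, FrozenSet[str]]  # category -> distinct product ids

EMPTY: View = {}  # identity element: combining with it changes nothing


def combine(left: View, right: View) -> View:
    """Associative combine: per-category set union."""
    keys = left.keys() | right.keys()
    return {
        key: left.get(key, frozenset()) | right.get(key, frozenset())
        for key in keys
    }


hourly_views = [
    {"C1": frozenset({"p1"})},
    {"C1": frozenset({"p1", "p3"}), "C2": frozenset({"p2"})},
    {"C2": frozenset({"p4"})},
]

# One pass over everything ...
all_at_once = reduce(combine, hourly_views, EMPTY)

# ... equals combining an older "batch" part with a recent "speed" part.
batch_part = reduce(combine, hourly_views[:2], EMPTY)
speed_part = reduce(combine, hourly_views[2:], EMPTY)
two_step = combine(batch_part, speed_part)

unique_counts = {category: len(products) for category, products in two_step.items()}
```

HyperLogLog sketches have the same property under register-wise maximum, which is what allows approximate counts from separate time windows to be merged.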
![Page 44: Lambda architecture @ Indix](https://reader034.vdocuments.site/reader034/viewer/2022042508/547e5363b479598e508b4b72/html5/thumbnails/44.jpg)
Problem Statement - Optimization