JOSA TechTalks - Real-Time and Big Data
![Page 1: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/1.jpg)
REAL-TIME AND BIG DATA
Mahmoud M. Jalajel
![Page 2: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/2.jpg)
OUTLINE
• Intro: Real-time with Big Data
• The Lambda Architecture
• The Relay Model
![Page 3: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/3.jpg)
WHY SOLVE FOR REAL-TIME
• Real-time offers more business value
• Live Web Analytics
• Recommendations
• Real-time = (semi-)real-time
• Event to index ~ single digit minutes
• Query duration ~ single digit seconds
![Page 4: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/4.jpg)
REAL-TIME IMPLEMENTATION
• Incremental Implementation
• Stream processing / No full data context
• A real-time implementation is:
• Far more useful
• Faster
• Easily adaptable to batch mode
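A minimal sketch of what "incremental" can look like, assuming a simple running-average metric. The `RunningMean` class, `batch_mean` helper, and sample values are illustrative and not taken from the talk. Each event updates a small state without any access to the full dataset, and batch mode is the same update folded over stored data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningMean:
    """Keeps only a count and a sum, never the full data (hypothetical example)."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        # Stream mode: called once per incoming event.
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_mean(values: Iterable[float]) -> float:
    # Batch mode: reuse the same incremental update over stored data.
    state = RunningMean()
    for v in values:
        state.update(v)
    return state.mean


# Simulated stream of response times in milliseconds.
stream = RunningMean()
for response_ms in [120.0, 80.0, 100.0]:
    stream.update(response_ms)
```

Because the batch path reuses `update`, the two modes cannot drift apart, which is one way to read "easily adaptable to batch mode".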
![Page 5: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/5.jpg)
REAL-TIME IN HADOOP
| MongoDB Query Time (optimized, single-node) | Hive Query Time (5 nodes) |
| --- | --- |
| Hangs, crashes, starts begging for mercy then commits suicide and weepingly dies | A few hours |
| 2 Seconds | 15 Minutes |
![Page 7: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/7.jpg)
LAMBDA ARCHITECTURE
![Page 8: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/8.jpg)
BASIC ASSUMPTIONS
1. Query = Function(All Data)
2. Data are immutable timely facts
3. Append-Only (CRUD becomes CR)
4. Human Fault-Tolerance
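The assumptions above can be sketched in a few lines, assuming a hypothetical "where does this user live" dataset. The `Fact` schema, `record`, and `current_city` are made up for illustration. A change is stored as a new timestamped fact instead of an update, and the query is computed from all facts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """An immutable, timestamped fact (hypothetical schema)."""
    user: str
    city: str
    timestamp: int


# Append-only master dataset: facts are created and read, never updated or deleted.
master: List[Fact] = []


def record(fact: Fact) -> None:
    master.append(fact)


def current_city(facts: List[Fact], user: str) -> Optional[str]:
    # Query = Function(All Data): derive the answer from every fact we hold.
    mine = [f for f in facts if f.user == user]
    return max(mine, key=lambda f: f.timestamp).city if mine else None


record(Fact("sara", "Amman", 1))
record(Fact("sara", "Irbid", 2))  # a move is a new fact, not an update
```

Since nothing is overwritten, a fact written by a buggy job can be filtered out and the query recomputed from the rest, which is the idea behind human fault-tolerance.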
![Page 9: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/9.jpg)
THE BATCH LAYER
• Accepts stream of data
• Appends to master dataset
• Uses: HDFS
![Page 10: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/10.jpg)
THE SERVING LAYER
• Precomputes different views
• Works on full dataset
• Refreshes regularly offline
• Batch views are usually stored in a key-value store
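A rough sketch of precomputing a batch view, assuming page-view events and a plain `dict` standing in for the key-value store. The event shape and the names `build_batch_view` and `views_for` are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical master dataset: (url, timestamp) page-view events.
Event = Tuple[str, int]


def build_batch_view(master: List[Event]) -> Dict[str, int]:
    """Recompute the whole view from the full dataset (run offline, on a schedule)."""
    return dict(Counter(url for url, _ in master))


master_dataset: List[Event] = [("/home", 1), ("/docs", 2), ("/home", 3)]

# A dict stands in for the key-value store that holds the batch view.
batch_view = build_batch_view(master_dataset)


def views_for(url: str) -> int:
    # Query time is a single key lookup, with no scan over the raw events.
    return batch_view.get(url, 0)
```

The view is rebuilt from scratch on every refresh, so it is only as fresh as the last offline run.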
![Page 11: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/11.jpg)
![Page 12: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/12.jpg)
![Page 13: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/13.jpg)
CHECKPOINT
• Typical Hadoop Setup
• Slow, inefficient
• Outdated, usually lagging by hours or days
• Although accurate for surveyed data
• Costly to re-run. Real-time is not an option
![Page 14: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/14.jpg)
THE SPEED LAYER
• Works with recent data
• Complements results
• Incremental implementation
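One possible shape for a speed layer, assuming the same kind of page-view counting. The `SpeedLayer` class and its method names are hypothetical, and clearing the view on every batch refresh is a simplification of how real systems expire recent data.

```python
from collections import defaultdict
from typing import DefaultDict


class SpeedLayer:
    """Holds counts only for events newer than the last batch run."""

    def __init__(self) -> None:
        self.realtime_view: DefaultDict[str, int] = defaultdict(int)

    def on_event(self, url: str) -> None:
        # Incremental: one small update per event, no recomputation.
        self.realtime_view[url] += 1

    def on_batch_refresh(self) -> None:
        # Simplifying assumption: the fresh batch view now covers everything
        # seen so far, so the recent counts can be discarded.
        self.realtime_view.clear()


speed = SpeedLayer()
for url in ["/home", "/home", "/docs"]:
    speed.on_event(url)
```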
![Page 15: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/15.jpg)
![Page 16: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/16.jpg)
THE FULL PICTURE
Query Merging
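Query merging can be sketched as follows, assuming an additive metric (counts) and two small in-memory views. The `merged_count` function and the sample numbers are illustrative, not from the talk.

```python
from typing import Mapping


def merged_count(url: str,
                 batch_view: Mapping[str, int],
                 realtime_view: Mapping[str, int]) -> int:
    """Answer = batch result (older data) + speed-layer result (recent data)."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


batch_view = {"/home": 1000, "/docs": 250}   # as of the last batch run
realtime_view = {"/home": 7, "/pricing": 3}  # events since that run
```

Summing works here only because counts are additive. Metrics such as unique visitors need a different merge step, which is part of why query merging tends to be listed as a cost of this design.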
![Page 17: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/17.jpg)
EXAMPLE TECHNOLOGIES
![Page 18: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/18.jpg)
DRUID EXAMPLE
![Page 19: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/19.jpg)
REVIEWING LA (LAMBDA ARCHITECTURE)
PROs
• Modular
• Flexible
• Self-Auditing
• Proven components
CONs
• Complex
• Maintainability
• Query Merging
![Page 20: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/20.jpg)
THE RELAY MODEL
![Page 21: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/21.jpg)
RELAY MODEL
Query Merging
![Page 22: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/22.jpg)
THE WORKFLOW
![Page 23: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/23.jpg)
REVIEWING RM
PROs
• Coherent, Simpler than LA
• Extensible to full LA
• Cheaper
CONs
• Master Data Storage
• Query flexibility
![Page 24: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/24.jpg)
WHY NOT HADOOP NOW?
• Too much time, no capacity
• Too soon or too late
• Too expensive
• Hammer/nail problem
![Page 25: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/25.jpg)
CONCLUSIONS
• Think big data, now!
• No need to invest years of development to perfect a big data system.
• Start now! Gradually grow system requirements and engineering skill-set
• Select scalable components
![Page 26: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/26.jpg)
Mahmoud Jalajel – @mjalajel
Questions?
![Page 27: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/27.jpg)
APPENDIX
![Page 28: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/28.jpg)
Apache Kafka
![Page 29: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/29.jpg)
Apache Storm
![Page 30: JOSA TechTalks - Real-Time and Big Data](https://reader030.vdocuments.site/reader030/viewer/2022032421/55a85b411a28abba0b8b45c2/html5/thumbnails/30.jpg)
Apache Storm with external systems