Big Data & Hadoop Environment · 2018-01-31
TRANSCRIPT
Big Data & Hadoop Environment
These training materials were produced within the scope of the İstanbul Big Data Education and Research Center Project (No. TR10/16/YNY/0036), carried out under the İstanbul Development Agency's 2016 Innovative and Creative İstanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
What is big data?
Why do we need big data analytics?
How to setup Infrastructure for big data?
How much data?
• Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
• Processes 20 PB a day (2008)
• Crawls 20B web pages a day (2012)
• Search index is 100+ PB (5/2014)
• Bigtable serves 2+ EB, 600M QPS (5/2014)
• 300 PB data in Hive + 600 TB/day (4/2014)
• 400B pages, 10+ PB (2/2014)
• LHC: ~15 PB a year
• LSST: 6-10 PB a year (~2020)
• 150 PB on 50k+ servers running 15k apps (6/2011)
• S3: 2T objects, 1.1M requests/second (4/2013)
• SKA: 0.3 – 1.5 EB per year (~2020)
• Hadoop: 365 PB, 330K nodes (6/2014)
"640K ought to be enough for anybody."
What percentage of all the data in the world has been generated in the last two years?
Big Data
What are the key features of Big Data? The 4 Vs:
• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social media, sensors, throughput
• Veracity: unclean, imprecise, unclear
[Figure: a sequencer turns the subject genome into many overlapping short reads, e.g. AATGCTTACTATGCGGGCCCCTT, which must then be matched back against the genome]
• Human genome: 3 Gbp
• A few billion short reads (~100 GB of compressed data)
What to do with more data?
Answering factoid questions
• Pattern matching on the Web
• Works amazingly well
• Who shot Abraham Lincoln? → X shot Abraham Lincoln
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
Learning relations
• Start with seed instances
• Search for patterns on the Web
• Use the patterns to find more instances
• Birthday-of(Mozart, 1756); Birthday-of(Einstein, 1879)
• "Wolfgang Amadeus Mozart (1756 - 1791)" → PERSON (DATE -
• "Einstein was born in 1879" → PERSON was born in DATE
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)
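The bootstrapping loop above (seeds → patterns → new instances) can be sketched in a few lines. The toy corpus, the names, and the two regexes are invented for illustration; they mirror the two pattern shapes shown on the slide.

```python
import re

# Toy corpus standing in for web text (invented examples for illustration)
corpus = [
    "Wolfgang Amadeus Mozart (1756 - 1791) was a composer.",
    "Einstein was born in 1879 in Ulm.",
    "Ada Lovelace (1815 - 1852) wrote about the Analytical Engine.",
    "Alan Turing was born in 1912 in London.",
]

# Step 1: seed instances of the Birthday-of relation
seeds = {("Mozart", "1756"), ("Einstein", "1879")}

# Step 2: patterns learned from contexts where seeds co-occur; here
# hard-coded to the two shapes from the slide:
#   "PERSON (DATE -"  and  "PERSON was born in DATE"
patterns = [
    re.compile(r"([A-Z][\w .]+?) \((\d{4}) -"),
    re.compile(r"([A-Z][\w .]+?) was born in (\d{4})"),
]

# Step 3: apply the patterns to find more instances
found = set()
for sentence in corpus:
    for pat in patterns:
        for person, year in pat.findall(sentence):
            found.add((person, year))

print(sorted(found))
```

Running the patterns over the corpus recovers the seed facts and two new instances (Ada Lovelace, Alan Turing), which is exactly the point: more data plus simple patterns goes a long way.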
No data like more data!
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
s/knowledge/data/g;
Database Workloads
OLTP (online transaction processing)
Typical applications: e-commerce, banking, airline reservations
User facing: real-time, low latency, highly-concurrent
Tasks: relatively small set of “standard” transactional queries
Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
OLAP (online analytical processing)
Typical applications: business intelligence, data mining
Back-end processing: batch workloads, less concurrency
Tasks: complex analytical queries, often ad hoc
Data access pattern: table scans, large amounts of data per query
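The contrast between the two access patterns can be made concrete with an in-memory SQLite database (the schema and data are invented for illustration): the OLTP-style statements touch a single row by key, while the OLAP-style query scans the whole table.

```python
import sqlite3

# In-memory database with a tiny sales table (illustrative schema/data)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
db.executemany("INSERT INTO sales (customer, amount) VALUES (?, ?)",
               [("alice", 10.0), ("bob", 20.0), ("alice", 5.0)])

# OLTP-style: a point update and point read touching one row by key
db.execute("UPDATE sales SET amount = 12.0 WHERE id = 1")
row = db.execute("SELECT amount FROM sales WHERE id = 1").fetchone()

# OLAP-style: an ad hoc aggregate that scans the entire table
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

print(row, total)  # (12.0,) 37.0
```

At real scale the two workloads also differ in concurrency and latency requirements, which is why the next slide argues for separating them.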
One Database or Two?
Downsides of co-existing OLTP and OLAP workloads
Poor memory management
Conflicting data access patterns
Variable latency
Solution: separate databases
User-facing OLTP database for high-volume transactions
Data warehouse for OLAP workloads
How do we connect the two?
OLTP/OLAP Architecture
OLTP → ETL (Extract, Transform, and Load) → OLAP
Structure of Data Warehouses
SELECT P.Brand, S.Country,
SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.YEAR = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country;
Source: Wikipedia (Star Schema)
OLAP Cubes
[Figure: a data cube with product and store dimensions]
Common operations:
• slice and dice
• roll up / drill down
• pivot
Fast forward…
ETL Bottleneck
ETL is typically a nightly task:
What happens if processing 24 hours of data takes longer than 24 hours?
Hadoop is perfect:
Ingest is limited by speed of HDFS
Scales out with more nodes
Massively parallel
Ability to use any processing tool
Cheaper than parallel databases
ETL is a batch process anyway!
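The map/reduce style of batch processing that Hadoop parallelizes can be sketched on one machine. The snippet below is a toy word count over log lines, run sequentially; in a real cluster the map phase runs on many nodes and a shuffle sorts the pairs before the reduce phase (the data is illustrative).

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, 1) pairs, as a streaming mapper would for each input line."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum the counts per key, as a reducer would after the shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

logs = ["error disk full", "warn retry", "error timeout"]
counts = reduce_phase(map_phase(logs))
print(counts["error"])  # 2
```

Because both phases operate on independent records and key groups, the same program scales out by adding nodes, which is the property the slide is pointing at.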
What’s changed?
Dropping cost of disks
Cheaper to store everything than to figure out what to throw away
Types of data collected
From data that's obviously valuable to data whose value is less apparent
Rise of social media and user-generated content
Large increase in data volume
Growing maturity of data mining techniques
Demonstrates value of data analytics
Virtuous Product Cycle
a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → $ (hopefully)
OLTP/OLAP/Hadoop Architecture
OLTP → ETL (Extract, Transform, and Load) → OLAP, with Hadoop handling the ETL stage
ETL: Redux
Often, with noisy datasets, ETL is the analysis!
Note that ETL necessarily involves brute force data scans
L, then E and T? (i.e., ELT: load the raw data first, then extract and transform inside the cluster)
Big Data Ecosystem
Source: datafloq.com
Cloud Computing
Before clouds…
Grids
Connection Machine
Vector supercomputers
…
Cloud computing means many different things:
Big data
Rebranding of web 2.0
Utility computing
Everything as a service
Utility Computing
What?
Computing resources as a metered service (“pay as you go”)
Ability to dynamically provision virtual machines
Why?
Cost: capital vs. operating expenses
Scalability: “infinite” capacity
Elasticity: scale up or down on demand
Does it make sense?
Benefits to cloud users
Business case for cloud providers
"I think there is a world market for about five computers."
Enabling Technology: Virtualization
• Traditional stack: Hardware → Operating System → App / App / App
• Virtualized stack: Hardware → Hypervisor → OS / OS → App / App / App
Everything as a Service
Utility computing = Infrastructure as a Service (IaaS)
Why buy machines when you can rent cycles?
Examples: Amazon’s EC2, Rackspace
Platform as a Service (PaaS)
Give me a nice API and take care of the maintenance, upgrades, …
Example: Google App Engine
Software as a Service (SaaS)
Just run it for me!
Examples: Gmail, Salesforce
Who cares?
A source of problems…
Cloud-based services generate big data
Clouds make it easier to start companies that generate big data
As well as a solution…
Ability to provision analytics clusters on-demand in the cloud
Commoditization and democratization of big data capabilities
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
What’s the common theme of all of these problems?
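Several of these questions show up even in a single-process sketch. The snippet below splits the input into more work units than workers, hands them to a thread pool (which tracks assignment and completion), and aggregates the partial results; it is a toy example, not a distributed solution, and it deliberately sidesteps the hard parts (failures, shared state).

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker handles one work unit and returns a partial result
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 work units

# Assign work units to a pool of workers; the executor knows when all finish
with ThreadPoolExecutor(max_workers=2) as pool:  # more units than workers
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)  # aggregate the partial results
print(total)  # 4950
```

The common theme the slide asks about is coordination: everything above that the executor handled for free (assignment, completion, aggregation order) must be managed explicitly at datacenter scale.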
Where the rubber meets the road
Concurrency is difficult to reason about
Concurrency is even more difficult to reason about
At the scale of datacenters and across datacenters
In the presence of failures
In terms of multiple interacting services
Not to mention debugging…
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
The datacenter is the computer!
Source: Barroso and Hölzle (2009)
The datacenter is the computer
It’s all about the right level of abstraction
Moving beyond the von Neumann architecture
What’s the “instruction set” of the datacenter computer?
Hide system-level details from the developers
No more race conditions, lock contention, etc.
No need to explicitly worry about reliability, fault tolerance, etc.
Separating the what from the how
Developer specifies the computation that needs to be performed
Execution framework (“runtime”) handles actual execution
Scaling “up” vs. “out”
No single machine is large enough
Smaller cluster of large SMP machines vs. larger cluster of commodity machines (e.g., 16 128-core machines vs. 128 16-core machines)
Nodes need to talk to each other!
Intra-node latencies: ~100 ns
Inter-node latencies: ~100 μs
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
Source: analysis on this and subsequent slides from Barroso and Hölzle (2009)
Common Big Data Analytics Use Cases
• Batch
• Use case 1: ETL / batch query (single silo)
• Use case 2: distributed log aggregation
• Batch + real time
• Use case 3: real-time data store
• Use case 4: real-time data store + batch analytics
• Real time / streaming
• Use case 5: streaming
USE CASE 1: ETL AND BATCH ANALYTICS AT SCALE
• Data collected in various databases
• Data is scattered across multiple silos!
• Need a single silo to bring all the data together and analyze it
USE CASE 1
• Uses core Hadoop components
• No vendor lock-in (works on all Hadoop distributions)
• HDFS (Hadoop Distributed File System) for storage
• Data ingest with Sqoop
• Processing done by MapReduce and its cousins
• Results are exported back to the DB
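The ingest → process → export pipeline above can be sketched with SQLite standing in for both the source silo and the target database, and plain Python standing in for what Sqoop and MapReduce would do in a real cluster (all table names and data are illustrative).

```python
import sqlite3

# Source database (standing in for one of the scattered silos)
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (country TEXT, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [("TR", 10.0), ("TR", 15.0), ("DE", 7.0)])

# 1. Ingest: dump rows to flat records, as Sqoop would land them on HDFS
records = src.execute("SELECT country, amount FROM orders").fetchall()

# 2. Process: aggregate per country, as a MapReduce job would
totals = {}
for country, amount in records:
    totals[country] = totals.get(country, 0.0) + amount

# 3. Export: write results back to a database table
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE totals (country TEXT, total REAL)")
dst.executemany("INSERT INTO totals VALUES (?, ?)", totals.items())
print(sorted(dst.execute("SELECT * FROM totals").fetchall()))
```

The three numbered steps map one-to-one onto the bullets above: ingest, process, export.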
USE CASE 2: DATA COMING FROM MULTIPLE SOURCES
• Data coming in from multiple sources.
• Data is ‘streaming in’
• Capture data in Hadoop
• Do batch analytics
USE CASE 2
• Flume
• Brings in logs from multiple sources
• Distributed, reliable way to collect and move data
• If uplinks are disconnected, Flume agents will store and forward the data
• HDFS
• Flume can write data directly to HDFS
• Files are segmented or 'rolled' by size / time, e.g.
• Data-2015-01-01_10-00-00.log
• Data-2015-01-01_11-00-00.log
• Data-2015-01-01_12-00-00.log
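Time-based rolling can be sketched as follows: the toy function below groups timestamped events into hourly files named like the segments above. It is not Flume's actual implementation, just an illustration of the roll behavior.

```python
import os
import tempfile
from datetime import datetime

def roll_filename(ts):
    # File name shaped like the rolled segments shown above
    return ts.strftime("Data-%Y-%m-%d_%H-%M-%S.log")

def write_rolled(events, outdir):
    """Group (timestamp, line) events into hourly files (a toy sketch)."""
    buckets = {}
    for ts, line in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, []).append(line)
    written = []
    for hour, lines in sorted(buckets.items()):
        path = os.path.join(outdir, roll_filename(hour))
        with open(path, "w") as f:
            f.write("\n".join(lines))
        written.append(os.path.basename(path))
    return written

events = [
    (datetime(2015, 1, 1, 10, 5), "GET /index"),
    (datetime(2015, 1, 1, 10, 45), "GET /about"),
    (datetime(2015, 1, 1, 11, 2), "POST /login"),
]
with tempfile.TemporaryDirectory() as d:
    print(write_rolled(events, d))
# ['Data-2015-01-01_10-00-00.log', 'Data-2015-01-01_11-00-00.log']
```

Rolling by size works the same way, with a byte threshold instead of an hour bucket.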
USE CASE 2
• Analytics stack: Pig / Hive / Oozie / Spark (same as in Use Case 1)
• Oozie
• Workflow manager
• "run this workflow every hour"
• "run this workflow when data shows up in the input directory"
• Can manage complex workflows
• Sends alerts when processes fail
• etc.
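"Run this workflow when data shows up" can be approximated with a data-availability check. The sketch below fires only once the upstream job has dropped its `_SUCCESS` marker file (a common Hadoop convention for signaling a completed output directory); Oozie's real coordinator is far richer than this toy check.

```python
import os
import tempfile

def ready_to_run(input_dir, marker="_SUCCESS"):
    """Toy data-availability trigger: start the workflow only when the
    upstream job has written its completion marker into input_dir."""
    return os.path.exists(os.path.join(input_dir, marker))

with tempfile.TemporaryDirectory() as d:
    print(ready_to_run(d))                          # False: no data yet
    open(os.path.join(d, "_SUCCESS"), "w").close()  # upstream job finishes
    print(ready_to_run(d))                          # True: workflow can start
```

A scheduler would poll this predicate (or the time-based equivalent for "every hour") and launch the Pig/Hive/Spark jobs when it turns true.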
USE CASE 3: REAL-TIME DATA STORE
• Events are coming in
• Need to store the events
• Can be billions of events
• And query them in real time, e.g. the last 10 events by user
USE CASE 3
• HDFS is not ideal for updating data in real time or for random access
• A scalable real-time store, e.g. HBase or Cassandra, is needed to support real-time updates
• Data comes trickling in (as a stream)
• Saved data becomes queryable immediately
• Use HBase APIs (Java / REST) to build dashboards
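The "last 10 events by user" query depends on row-key design. A common HBase pattern is a key of the form `user#reversed_timestamp`, so that a user's newest events sort first in the store's key order. The dict-backed class below is an invented stand-in that illustrates only that key design, not HBase's API.

```python
from bisect import insort

class ToyEventStore:
    """Dict-backed stand-in for an HBase-style store: rows kept sorted by
    key, with keys 'user#reversed_timestamp' so newest events sort first."""
    MAX_TS = 10**10

    def __init__(self):
        self.keys = []   # sorted row keys
        self.rows = {}   # row key -> event payload

    def put(self, user, ts, event):
        key = f"{user}#{self.MAX_TS - ts:010d}"  # reverse the timestamp
        insort(self.keys, key)
        self.rows[key] = event

    def last_events(self, user, n=10):
        prefix = f"{user}#"
        hits = [self.rows[k] for k in self.keys if k.startswith(prefix)]
        return hits[:n]  # newest first, thanks to the key design

store = ToyEventStore()
for ts, ev in [(100, "login"), (200, "click"), (300, "logout")]:
    store.put("alice", ts, ev)
print(store.last_events("alice", 2))  # ['logout', 'click']
```

In real HBase the same effect comes from a prefix scan over the sorted rows, which is why the key design matters: it turns "last n events" into a short sequential read.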
USE CASE 4: REAL TIME + BATCH
• Building on use case 3
• Do extensive analysis on the data in HBase
• E.g. 'scoring user models', 'flagging credit card transactions'
USE CASE 4
• HBase is the real-time store
• Analytics is done via the MapReduce stack (Pig / Hive)
• Can we do both in a single stack?
• May not be a good idea
• Don't mix real-time and batch analytics
• Batch analytics will impede real-time performance
USE CASE 4
• How do we replicate the data?
• 1: periodic synchronization of data between clusters
• 2: data goes to both clusters at the same time
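Option 2 (writing to both clusters at the same time) can be sketched with in-memory stands-in for the two clusters. In production the two appends would be client calls to the real-time store and the batch store (e.g. an HBase put plus an HDFS append), and a real dual-write must also handle the case where one of the two writes fails, which this sketch omits.

```python
# Toy in-memory stand-ins for the real-time and batch clusters
realtime_store, batch_store = [], []

def dual_write(event):
    # Every incoming event is applied to both clusters in one step.
    # Failure handling (one write succeeding, the other failing) is
    # deliberately omitted from this sketch.
    realtime_store.append(event)
    batch_store.append(event)

for e in ["evt1", "evt2"]:
    dual_write(e)
print(realtime_store == batch_store)  # True: both clusters see every event
```

Option 1 trades this per-event cost for a periodic bulk copy, at the price of the batch cluster lagging behind between synchronizations.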
USE CASE 5: STREAMING
Decision time: batch (hours / days)
Use cases:
• Modeling
• ETL
• Reporting
MOVING TOWARDS FAST DATA
• Decision time: (near) real time — seconds (or milliseconds)
• Use cases
• Alerts (medical / security)
• Fraud detection
• Streaming is becoming more prevalent
• 'Connected Devices'
• 'Internet of Things'
• 'Beyond Batch'
• We need faster processing / analytics
STREAMING ARCHITECTURE – DATA BUCKET
• A 'data bucket'
• Captures incoming data
• Acts as a 'buffer' and smooths out bursts
• So even if our processing is offline, we won't lose data
• Data bucket choices
• Kafka
• MQ (RabbitMQ etc.)
• Amazon Kinesis
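The buffering behavior can be sketched with a plain queue: producers keep appending while the processor is offline, nothing is lost, and the processor drains the backlog in batches when it comes back. This is a conceptual toy, not a real broker (no persistence, no distribution).

```python
from collections import deque

# Toy 'data bucket': absorbs bursts while the processor is unavailable
bucket = deque()

def produce(event):
    bucket.append(event)  # producers never block or drop data

def drain(batch_size):
    """Consume up to batch_size buffered events, oldest first."""
    batch = []
    while bucket and len(batch) < batch_size:
        batch.append(bucket.popleft())
    return batch

for i in range(5):        # a burst arrives while processing is offline
    produce(f"evt{i}")
print(drain(3), drain(3))  # ['evt0', 'evt1', 'evt2'] ['evt3', 'evt4']
```

Real data buckets add what this sketch lacks: durable storage, replication, and multiple independent consumers.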
KAFKA ARCHITECTURE
• Producers write data to brokers
• Consumers read data from brokers
• All of this is distributed / parallel
• Failure tolerant
• Data is stored as topics, e.g.
• "sensor_data"
• "alerts"
• "emails"
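The producer/broker/consumer model with named topics can be mimicked by a tiny in-memory class (invented for illustration; real Kafka partitions and replicates each topic's log across brokers, and consumers track their own offsets).

```python
from collections import defaultdict

class ToyBroker:
    """Minimal stand-in for Kafka's model: producers append messages to
    named topics; consumers read a topic from a chosen offset."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered log

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # Reading does not remove messages; consumers replay by offset
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.produce("sensor_data", {"temp": 21})
broker.produce("alerts", {"level": "high"})
broker.produce("sensor_data", {"temp": 22})

print(broker.consume("sensor_data"))     # both readings, in produce order
print(broker.consume("sensor_data", 1))  # replay from offset 1
```

The key property this captures is that the broker is a durable, ordered log per topic: consuming does not delete, so multiple consumers (say, a dashboard and a batch archiver) can each read the same topic at their own pace.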
STREAMING ARCHITECTURE – PROCESSING ENGINE
• Need to process events with low latency
• So many to choose from!
• Choices
• Storm
• Spark
• NiFi
• Flink
STREAMING ARCHITECTURE – DATA STORE
• Where processed data ends up
• Need to absorb data in real time
• Usually a NoSQL storage
• HBase
• Cassandra
• Lots of NoSQL stores
LAMBDA ARCHITECTURE
• Each component is scalable
• Each component is fault tolerant
• Incorporates best practices
• All open source!
LAMBDA ARCHITECTURE
1. All new data is sent to both the batch layer and the speed layer
2. Batch layer
• Holds the master data set (immutable, append-only)
• Answers batch queries
3. Serving layer
• Updates batch views so they can be queried ad hoc
4. Speed layer
• Handles new data
• Facilitates fast / real-time queries
5. Query layer
• Answers queries using both batch and real-time views
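The five layers can be sketched end to end with toy counters (the variable and function names are illustrative): the batch layer owns an immutable master dataset, the serving layer precomputes a batch view over it, the speed layer covers data that arrived since the last batch run, and the query layer merges both views.

```python
master_data = []  # batch layer: immutable, append-only master dataset
batch_view = {}   # serving layer: precomputed counts over master_data
speed_view = {}   # speed layer: counts for data since the last batch run

def ingest(event):
    # 1. New data goes to BOTH the batch layer and the speed layer
    master_data.append(event)
    speed_view[event] = speed_view.get(event, 0) + 1

def run_batch():
    # Batch recompute over the full master dataset; the speed view
    # resets because its events are now covered by the batch view
    batch_view.clear()
    for event in master_data:
        batch_view[event] = batch_view.get(event, 0) + 1
    speed_view.clear()

def query(event):
    # Query layer: merge the batch view and the real-time view
    return batch_view.get(event, 0) + speed_view.get(event, 0)

ingest("click"); ingest("click")
run_batch()            # batch view now holds 2 clicks
ingest("click")        # arrives after the batch run; speed layer covers it
print(query("click"))  # 3 = 2 from the batch view + 1 from the speed view
```

The design choice the architecture encodes: the batch layer is allowed to be slow because it recomputes from scratch (and can therefore fix any past mistake), while the speed layer is allowed to be approximate because every batch run wipes it clean.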