Introduction to Hadoop 2.0 and How It Overcomes the Limitations of Hadoop 1.0
DESCRIPTION
This presentation explains the new Hadoop 2.0 features in detail and clarifies many prevalent doubts about Hadoop 2.0. The following are the four main improvements in Hadoop 2.0 over Hadoop 1.x:
• HDFS Federation – horizontal scalability of NameNode
• NameNode High Availability – NameNode is no longer a Single Point of Failure
• YARN – ability to process Terabytes and Petabytes of data available in HDFS using Non-MapReduce applications such as MPI, GIRAPH
• Resource Manager – splits up the two major functionalities of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons: a global Resource Manager and a per-application ApplicationMaster
There are additional features such as Capacity Scheduler (enables multi-tenancy support in Hadoop), Data Snapshot, Support for Windows, and NFS access, enabling increased Hadoop adoption in the industry to solve Big Data problems.
TRANSCRIPT
Introduction to Hadoop 2.0 Architecture
www.edureka.in/hadoop
Objectives of this Session
• Un
• The Big Data Problem
• How the Hadoop Ecosystem comes to the rescue
• Hadoop 1.0 Architecture and limitations
• How Hadoop 2.0 Architecture overcomes the challenges
• Quiz to reinforce your learning
For Further Queries and class recording: #askedureka
Follow us on Twitter @edurekaIN
Like us on Facebook /edurekaIN
Big Data Use Cases
Tweet Trend Analysis
Telecom – Service Usage Analysis
e-Governance – Social Welfare
Banks and Financial Services
Twitter @edurekaIN, Facebook /edurekaIN, use #askedureka for Questions
Growing Interest in Hadoop
Hadoop Eco-System
[Diagram: Apache Oozie (Workflow); Pig Latin (Data Analysis), Mahout (Machine Learning), Hive (DW System); MapReduce Framework; HBase; HDFS (Hadoop Distributed File System); Flume for unstructured or semi-structured data and Sqoop for import or export of structured data. Audiences shown: ETL/DW Professionals, Developers / Programmers, DBA / Administrators.]
Hadoop 1.0 – In Summary
[Diagram: a Client talks to the master daemons: the NameNode (HDFS), backed by a Secondary NameNode, and the Job Tracker (MapReduce). Each slave node runs a DataNode holding data blocks together with a Task Tracker running Map and Reduce tasks.]
Limitations of Hadoop 1.x
• No horizontal scalability of NameNode
• Does not support NameNode High Availability
• Overburdened JobTracker
• Not possible to run Non-MapReduce Big Data Applications on HDFS
• Does not support Multi-tenancy
Challenges with Horizontal Scale
Challenges:
• Metadata is stored in NameNode memory
• Bottleneck after ~4000 nodes
• Results in cascading failures
[Diagram: a Client and all DataNodes talk to a single NameNode, which holds both the namespace (NS) and block management.]
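The memory bottleneck can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block). That figure, the default of 1.5 blocks per file, and the function name `estimate_namenode_heap_gb` are assumptions for illustration, not numbers from this session.

```python
# Rough estimate of the NameNode heap needed to keep HDFS metadata in memory.
# ASSUMPTION: ~150 bytes per namespace object (file or block), a commonly
# quoted rule of thumb; real usage varies with path lengths and cluster setup.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_gb(num_files: int, blocks_per_file: float = 1.5) -> float:
    """Return the approximate heap (in GB) needed for the given number of files."""
    num_blocks = num_files * blocks_per_file
    total_objects = num_files + num_blocks
    return total_objects * BYTES_PER_OBJECT / 1e9


if __name__ == "__main__":
    # Metadata grows linearly with the number of files, but one NameNode
    # can only grow vertically (more RAM in a single machine).
    for files in (10_000_000, 100_000_000, 1_000_000_000):
        heap = estimate_namenode_heap_gb(files)
        print(f"{files:>13,} files -> ~{heap:.2f} GB of heap")
```

Because the estimate scales linearly with the file count, a single NameNode eventually runs out of practical heap size, which is the motivation for spreading the namespace over several NameNodes.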
NameNode – Single Point of Failure
[Diagram: the NameNode and the Secondary NameNode each hold a copy of the metadata.]
Secondary NameNode:
• “Not a hot standby” for the NameNode
• Connects to NameNode regularly
• Housekeeping, backup of NameNode metadata
• Saved metadata can be used to rebuild a failed NameNode
Job Tracker – Overburdened
• CPU: spends a very significant portion of time and effort managing the life cycle of applications
• Network: single listener thread to communicate with thousands of Map and Reduce jobs
[Diagram: one Job Tracker communicating with many Task Trackers.]
Unutilized Data in HDFS
Challenges:
• Only MapReduce processing can be achieved
• Alternate data storage is needed for other processing such as real-time or graph analysis
• Doesn’t support Multi-Tenancy
Introducing Hadoop 2.0
Most Important Features:
• HDFS Federation
• Support for NameNode High Availability
• YARN – Yet Another Resource Negotiator
• Better Processing Control
• Support for non-MapReduce types of processing
• Support for Multi-tenancy
Hadoop 2.0 Cluster Architecture - Federation
[Diagram: Hadoop 1.0 vs Hadoop 2.0. In Hadoop 1.0, a single NameNode holds the namespace (NS) and block management, with DataNodes providing block storage. In Hadoop 2.0, multiple NameNodes (NN-1 … NN-k … NN-n) each manage their own namespace (NS1 … NSk … NSn) and their own block pool (Pool 1 … Pool k … Pool n); DataNodes 1 … m provide common storage for all block pools.]
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
Hadoop 2.0 – In Summary
[Diagram: HDFS (NameNode High Availability) and YARN (Next Generation MapReduce). Masters: an Active NameNode and a Standby NameNode connected through shared edit logs (one side writes, the other reads), and a Resource Manager containing the Scheduler and the Applications Manager (AsM). Slaves: each node runs a DataNode and a Node Manager hosting Containers and App Masters. A Client connects to the cluster.]
Hadoop 2.0 – High Availability
[Diagram: the Hadoop 2.0 cluster again: Active and Standby NameNodes connected through shared edit logs (write and read), a Resource Manager with the Scheduler and the Applications Manager (AsM), slave nodes each running a DataNode and a Node Manager, and a Client.]
• The Active NameNode writes the edit logs; the Standby NameNode reads them and applies them to its own namespace
• All namespace edits are logged to shared NFS storage; only a single writer is allowed (fencing)
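The single-writer rule can be illustrated with a toy model. This is a minimal sketch and not Hadoop's implementation: the class names `SharedEditLog` and `NameNode`, the `grant_writer` method, and the in-memory list standing in for the shared NFS directory are all hypothetical.

```python
class FencedError(Exception):
    """Raised when a fenced-off NameNode tries to write to the shared log."""


class SharedEditLog:
    """Toy stand-in for the shared edit log; allows one writer at a time."""

    def __init__(self):
        self.edits = []      # ordered namespace edits
        self.writer = None   # name of the only NameNode allowed to write

    def grant_writer(self, name):
        # Fencing: granting the role to a new writer revokes it from the old one.
        self.writer = name

    def append(self, name, edit):
        if name != self.writer:
            raise FencedError(f"{name} is fenced and may not write")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()   # this node's own in-memory namespace
        self.applied = 0         # number of log edits already applied

    def mkdir(self, path):
        # Active role: log the edit first, then apply it locally.
        self.log.append(self.name, ("mkdir", path))
        self.catch_up()

    def catch_up(self):
        # Standby role: read new edits and apply them to its own namespace.
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)


if __name__ == "__main__":
    log = SharedEditLog()
    nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
    log.grant_writer("nn1")      # nn1 is Active
    nn1.mkdir("/data")
    nn2.catch_up()               # Standby tails the shared log
    log.grant_writer("nn2")      # failover: nn1 is fenced off
    try:
        nn1.mkdir("/stale")
    except FencedError as err:
        print(err)
```

The point of the sketch is that the Standby stays current only by replaying edits, and that a former Active cannot corrupt the log once the writer role has moved.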
Hadoop 2.0 – Resource Management
[Diagram: the Hadoop 2.0 cluster again: a Resource Manager with the Scheduler and the Applications Manager (AsM), slave nodes each running a DataNode and a Node Manager hosting Containers and App Masters, Active and Standby NameNodes with shared edit logs, and a Client.]
YARN – Moving beyond MapReduce
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search, Weave, …)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
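The split between a global Resource Manager and per-application masters is what lets different frameworks share one cluster. The following is a minimal sketch of that idea, assuming a simplified model where a container is just a slot on a node; the class and method names (`allocate`, `request`) are hypothetical and do not mirror the real YARN API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_containers: int


@dataclass
class ResourceManager:
    """Global daemon: only hands out containers; knows nothing about job logic."""
    nodes: list

    def allocate(self, wanted: int) -> list:
        granted = []
        for node in self.nodes:
            while node.free_containers > 0 and len(granted) < wanted:
                node.free_containers -= 1
                granted.append(node.name)
        return granted


@dataclass
class ApplicationMaster:
    """Per-application daemon: negotiates containers and tracks its own tasks."""
    app_name: str
    framework: str
    containers: list = field(default_factory=list)

    def request(self, rm: ResourceManager, wanted: int) -> None:
        self.containers.extend(rm.allocate(wanted))


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2)])
    batch = ApplicationMaster("wordcount", "MapReduce")
    graph = ApplicationMaster("pagerank", "Giraph")
    batch.request(rm, 3)
    graph.request(rm, 2)   # only one container is left in the cluster
    for app in (batch, graph):
        print(app.framework, app.app_name, app.containers)
```

Nothing in `ResourceManager` refers to Map or Reduce tasks, so a MapReduce job and a Giraph job are treated the same way when they ask for containers.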
Multi-Tenancy - Capacity Scheduler
Features:
• Different types of jobs are organized in different queues
• Queue shares as percentages of the cluster
• Each queue has an associated priority
• FIFO scheduling within each queue
• Security ensured between applications
[Diagram: example queues: Batch, Interactive, Streaming.]
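Two of the features above, percentage shares and FIFO ordering within a queue, can be sketched in a few lines. This is an illustrative model only: the queue percentages and cluster size are made-up values, queue priority and security are not modelled, and the names `CapacityQueue`, `queue_capacities`, and `next_job` are hypothetical rather than part of the real scheduler.

```python
from collections import deque


class CapacityQueue:
    def __init__(self, name, share_percent):
        self.name = name
        self.share_percent = share_percent
        self.jobs = deque()   # FIFO order within the queue

    def submit(self, job):
        self.jobs.append(job)


def queue_capacities(queues, cluster_containers):
    """Split the cluster's containers between queues by percentage share."""
    if sum(q.share_percent for q in queues) != 100:
        raise ValueError("queue shares must add up to 100%")
    return {q.name: cluster_containers * q.share_percent // 100 for q in queues}


def next_job(queue):
    """Return the oldest waiting job in the queue, or None if it is empty."""
    return queue.jobs.popleft() if queue.jobs else None


if __name__ == "__main__":
    # Hypothetical shares for the three example queues.
    batch = CapacityQueue("Batch", 50)
    interactive = CapacityQueue("Interactive", 30)
    streaming = CapacityQueue("Streaming", 20)
    print(queue_capacities([batch, interactive, streaming], 200))

    batch.submit("nightly-etl")
    batch.submit("weekly-report")
    print(next_job(batch))   # the job submitted first is scheduled first
```

Each tenant gets a guaranteed slice of the cluster through its queue, which is what makes sharing one Hadoop cluster between teams practical.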
Annie’s Question
NameNode HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a) Single Point of Failure of NameNode
b) To run classic MapReduce
c) Too much burden on Job Tracker
Annie’s Answer
Single Point of Failure of NameNode.
Annie’s Question
YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
a) Single Point of Failure of NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on Job Tracker
Annie’s Answer
Too much burden on Job Tracker, and to support Multi-Tenancy.
Annie’s Question
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.
Which of the following is (are) a significant disadvantage in Hadoop 1.0?
- ‘Single Point of Failure’ of NameNode
- It can run only one version in classic MapReduce
- Too much burden on Job Tracker
Annie’s Question
A Hadoop 1.x cluster can have multiple HDFS Namespaces.
- True
- False
Annie’s Answer
False. Not possible with Hadoop 1.x.
Annie’s Question
Can you use Hadoop 2.0 for Real-time processing?
- Yes
- No
Annie’s Answer
No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing.
Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
A) Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
B) Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
Annie’s Answer
(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.
Annie’s Question
You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory?
Annie’s Answer
The ‘put’ will fail. None of the namespaces will manage the file and you will get an IOException with a “No such file or directory” error.
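The behaviour in this answer can be modelled as a lookup from namespace prefixes to NameNodes. The sketch below is a toy model, not Hadoop's client code: `MountTable`, `put`, and the host names `nn-marketing` and `nn-finance` are hypothetical, and Python's `FileNotFoundError` (a subclass of `OSError`) stands in for the Java IOException.

```python
class MountTable:
    """Client-side map from namespace prefixes to the NameNode that owns them."""

    def __init__(self, mounts):
        self.mounts = dict(mounts)   # e.g. {"/marketing": "nn-marketing"}

    def resolve(self, path):
        for prefix, namenode in self.mounts.items():
            if path == prefix or path.startswith(prefix + "/"):
                return namenode
        # No NameNode manages this part of the namespace.
        raise FileNotFoundError(f"{path}: No such file or directory")


def put(table, path):
    """Pretend to upload a file; returns the NameNode that would handle it."""
    return table.resolve(path)


if __name__ == "__main__":
    table = MountTable({"/marketing": "nn-marketing", "/finance": "nn-finance"})
    print(put(table, "/finance/q1.csv"))
    try:
        put(table, "/accounting/ledger.csv")
    except OSError as err:
        print("put failed:", err)
```

Since the NameNodes are independent and do not coordinate, no NameNode picks up a path that falls outside every configured namespace.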
Annie’s Question
Can you use hundreds of Hadoop DataNodes for any processing other than MapReduce in Hadoop 1.x?
- Yes
- No
Annie’s Answer
No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload.
Thank You
See You in Next Class