Introduction to Hadoop 2.0 and How It Overcomes the Limitations of Hadoop 1.0

Introduction to Hadoop 2.0 Architecture

www.edureka.in/hadoop

Objectives of this Session


• The Big Data Problem

• How does the Hadoop Ecosystem come to the rescue?

• Hadoop 1.0 Architecture and limitations

• How does Hadoop 2.0 Architecture overcome the challenges?

• Quiz to reinforce your learning


For Further Queries and class recording: #askedureka
Follow us on Twitter @edurekaIN
Like us on Facebook /edurekaIN

Big Data Use Cases


Tweet Trend Analysis

Telecom – Service Usage Analysis

e-Governance – Social Welfare

Banks and Financial Services


Growing Interest in Hadoop



Hadoop Eco-System

• Apache Oozie (Workflow)

• Pig Latin (Data Analysis)

• Mahout (Machine Learning)

• Hive (DW System)

• HBase

• MapReduce Framework

• HDFS (Hadoop Distributed File System)

• Flume and Sqoop (Import or Export): Flume for unstructured or semi-structured data, Sqoop for structured data

Who uses it: ETL/DW Professionals, Developers / Programmers, DBA / Administrators


Hadoop 1.0 – In Summary

[Diagram: The client talks to two master services: the NameNode (HDFS), which is backed by a Secondary NameNode, and the Job Tracker (MapReduce). Each slave node runs a DataNode, which stores the data blocks, and a Task Tracker, which runs the Map and Reduce tasks.]


Limitations of Hadoop 1.x

• No horizontal scalability of NameNode

• Does not support NameNode High Availability

• Overburdened JobTracker

• Not possible to run Non-MapReduce Big Data Applications on HDFS

• Does not support Multi-tenancy


Challenges with Horizontal Scale

Challenges:

• Metadata is stored in NameNode memory

• Bottleneck after ~4000 nodes

• Results in cascading failures

[Diagram: a client and many DataNodes all depend on a single NameNode, which holds the namespace (NS) and performs block management.]
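A back-of-the-envelope sketch of why keeping all metadata in NameNode memory limits scale. The figure of roughly 150 bytes of heap per namespace object is a commonly quoted rule of thumb and an assumption here, not a number from this session; the file counts and the helper names are made up for illustration.

```python
# Rough estimate of the NameNode heap needed to hold namespace metadata.
# ASSUMPTION: ~150 bytes per namespace object (file or block). This is a
# commonly quoted rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Each file costs one object for the file itself plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical cluster sizes, chosen only to show the trend.
    for files in (10_000_000, 100_000_000, 1_000_000_000):
        heap = namenode_heap_bytes(files)
        print(f"{files:>13,} files -> ~{to_gib(heap):.1f} GiB of NameNode heap")
```

Because the whole namespace must fit in the memory of one machine, adding DataNodes raises storage capacity but does not raise the number of files the cluster can track.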


Name Node – Single Point of Failure

[Diagram: the NameNode holds the metadata, and the Secondary NameNode keeps a copy of it.]


Secondary NameNode:

• “Not a hot standby” for the NameNode

• Connects to NameNode regularly

• Housekeeping, backup of NameNode metadata

• Saved metadata can be used to rebuild a failed NameNode
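The housekeeping role described above can be sketched as a toy checkpoint: the Secondary NameNode periodically folds the NameNode's logged changes into a saved copy of the namespace. The classes and the dictionary-based "image" are hypothetical simplifications, not the real on-disk formats.

```python
# Toy model of a checkpoint: fold the edit log into the last namespace image.
# All names and data structures here are hypothetical simplifications.


def apply_edits(image, edits):
    """Return a new namespace image with every logged edit applied in order."""
    new_image = dict(image)
    for op, path, value in edits:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return new_image


class ToyNameNode:
    def __init__(self):
        self.image = {}   # last checkpointed namespace (path -> block list)
        self.edits = []   # changes made since the last checkpoint

    def create(self, path, blocks):
        self.edits.append(("create", path, blocks))

    def delete(self, path):
        self.edits.append(("delete", path, None))


class ToySecondaryNameNode:
    def __init__(self):
        self.saved_image = {}

    def checkpoint(self, namenode):
        """Periodic housekeeping: merge edits into the image and keep a copy."""
        merged = apply_edits(namenode.image, namenode.edits)
        namenode.image = merged
        namenode.edits = []
        self.saved_image = dict(merged)


if __name__ == "__main__":
    nn = ToyNameNode()
    snn = ToySecondaryNameNode()
    nn.create("/logs/a", ["blk_1"])
    nn.create("/logs/b", ["blk_2"])
    snn.checkpoint(nn)
    nn.delete("/logs/a")   # happens after the last checkpoint

    # The NameNode fails here; rebuild it from the saved metadata.
    rebuilt = ToyNameNode()
    rebuilt.image = dict(snn.saved_image)
    print(sorted(rebuilt.image))
```

In this sketch the rebuilt NameNode still lists /logs/a, because the delete happened after the last checkpoint. That gap is one reason the Secondary NameNode is not a hot standby.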


Job Tracker – Overburdened

• CPU: spends a very significant portion of time and effort managing the life cycle of applications

• Network: a single listener thread communicates with thousands of Map and Reduce jobs

[Diagram: one Job Tracker connected to many Task Trackers.]


Unutilized Data in HDFS

Challenges:

• Only MapReduce processing can be achieved

• Alternate data storage is needed for other processing such as real-time or graph analysis

• Doesn’t support Multi-Tenancy



Introducing Hadoop 2.0

Most Important Features:

• HDFS Federation

• Support for NameNode High Availability

• YARN – Yet Another Resource Negotiator

• Better Processing Control

• Support for non-MapReduce types of processing

• Support for Multi-tenancy


Hadoop 2.0 Cluster Architecture - Federation

[Diagram, Hadoop 1.0: a single NameNode holds the one namespace (NS) and performs block management; block storage is provided by the DataNodes.]

[Diagram, Hadoop 2.0: NameNodes NN-1 … NN-k … NN-n each manage their own namespace (NS1 … NSk … NSn) and their own block pool (Pool 1 … Pool k … Pool n). All block pools live on common storage provided by Datanode 1, Datanode 2, … Datanode m.]

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html




NameNode High Availability

Next Generation MapReduce

Hadoop 2.0 – In Summary

[Diagram: Masters: on the HDFS side, an Active NameNode and a Standby NameNode connected through shared edit logs; on the YARN side, a Resource Manager containing the Scheduler and the Applications Manager (AsM). Slaves: each node runs a DataNode and a Node Manager, which hosts Containers and App Masters. The client writes to and reads from the cluster.]





Hadoop 2.0 – High Availability

[Diagram: the Hadoop 2.0 cluster layout with the HDFS masters highlighted. The Active NameNode writes to the shared edit logs and the Standby NameNode reads from them. The Resource Manager, Node Managers and DataNodes are unchanged.]

• Read/Write logs apply to its own namespace

• All namespace edits are logged to shared NFS storage, with a single writer at any time (enforced by fencing)
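A minimal sketch of the single-writer idea: a shared edit log that accepts writes only from the NameNode that currently holds the writer role, while the other NameNode reads the log and applies it to its own copy of the namespace. Modelling fencing with an increasing epoch number is an assumption made for illustration, and every class name here is hypothetical.

```python
# Toy shared edit log that accepts writes from one NameNode at a time.
# ASSUMPTION: fencing is modelled with an increasing epoch number; the real
# mechanism depends on how the shared storage is set up.


class FencedError(Exception):
    """Raised when a NameNode that lost the writer role tries to write."""


class SharedEditLog:
    def __init__(self):
        self.entries = []
        self.current_epoch = 0

    def become_writer(self):
        """Grant the writer role; any earlier writer is fenced off."""
        self.current_epoch += 1
        return self.current_epoch

    def append(self, epoch, edit):
        if epoch != self.current_epoch:
            raise FencedError(
                f"epoch {epoch} is fenced; current epoch is {self.current_epoch}"
            )
        self.entries.append(edit)

    def read_from(self, index):
        """Return the edits logged at or after the given position."""
        return self.entries[index:]


class ToyNameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.epoch = None
        self.namespace = set()
        self.applied = 0   # number of log entries already applied locally

    def catch_up(self):
        """Standby behaviour: read new edits and apply them locally."""
        for op, path in self.log.read_from(self.applied):
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.entries)

    def activate(self):
        """Become the active NameNode after applying all logged edits."""
        self.catch_up()
        self.epoch = self.log.become_writer()

    def mkdir(self, path):
        self.log.append(self.epoch, ("mkdir", path))   # raises if fenced
        self.namespace.add(path)
        self.applied = len(self.log.entries)


if __name__ == "__main__":
    log = SharedEditLog()
    nn1, nn2 = ToyNameNode("nn1", log), ToyNameNode("nn2", log)
    nn1.activate()
    nn1.mkdir("/marketing")
    nn2.activate()   # failover: nn2 catches up and nn1 is fenced
    try:
        nn1.mkdir("/finance")
    except FencedError as err:
        print("rejected:", err)
    print(sorted(nn2.namespace))
```

The point of fencing in this sketch is that the old active NameNode cannot corrupt the log after a failover, even if it has not yet noticed that it lost its role.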





Hadoop 2.0 – Resource Management

[Diagram: the Hadoop 2.0 cluster layout with the YARN side highlighted. The Resource Manager (Scheduler and Applications Manager, AsM) is the master. Each slave node runs a Node Manager, which hosts Containers and App Masters alongside the DataNode.]
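The roles named on this slide (Resource Manager, Node Manager, Container, App Master) can be sketched as a toy allocation loop. This is an illustration under simplifying assumptions: memory is the only resource, placement is first-fit, and all class names, function names and sizes are hypothetical.

```python
# Toy model of YARN-style resource management. Only memory is tracked and
# placement is first-fit; both are simplifying assumptions.


class ToyNodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ToyResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 1

    def allocate(self, app, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                container = {
                    "id": self.next_id,
                    "app": app,
                    "node": node.name,
                    "memory_mb": memory_mb,
                }
                self.next_id += 1
                return container
        return None   # nothing fits right now


def run_application(rm, app, num_tasks, task_mb, master_mb=512):
    """The first container hosts the App Master, which then requests
    one container per task from the Resource Manager."""
    master = rm.allocate(app, master_mb)
    if master is None:
        return []
    containers = [master]
    for _ in range(num_tasks):
        container = rm.allocate(app, task_mb)
        if container is not None:
            containers.append(container)
    return containers


if __name__ == "__main__":
    nodes = [ToyNodeManager("node-1", 4096), ToyNodeManager("node-2", 4096)]
    rm = ToyResourceManager(nodes)
    for app, tasks, size in (("mapreduce-job", 3, 1024), ("graph-job", 2, 2048)):
        for c in run_application(rm, app, tasks, size):
            print(app, "-> container", c["id"], "on", c["node"])
```

Nothing in the toy Resource Manager knows what kind of application it is serving. Per-application life-cycle management moves into the App Master, which is what relieves the single overburdened Job Tracker of Hadoop 1.x.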



YARN – Moving beyond MapReduce

• BATCH (MapReduce)

• INTERACTIVE (Tez)

• ONLINE (HBase)

• STREAMING (Storm, S4, …)

• GRAPH (Giraph)

• IN-MEMORY (Spark)

• HPC MPI (OpenMPI)

• OTHER (Search, Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html





Multi-Tenancy - Capacity Scheduler

Features:

• Different types of jobs are organized in different queues

• Queue shares are expressed as percentages of the cluster

• Each queue has an associated priority

• FIFO scheduling within each queue

• Security is ensured between applications

[Diagram: example queues for Batch, Interactive and Streaming workloads.]
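Two of the features listed above, percentage shares per queue and FIFO ordering within a queue, can be sketched as a toy scheduler. This is not the real Capacity Scheduler: queue priorities, security and borrowing of idle capacity are left out, and the class names, queue names and container counts are hypothetical.

```python
from collections import deque

# Toy scheduler showing percentage shares per queue and FIFO within a queue.
# Priorities, security and capacity borrowing are deliberately omitted.


class ToyQueue:
    def __init__(self, name, share_percent):
        self.name = name
        self.share_percent = share_percent
        self.pending = deque()   # applications wait in FIFO order
        self.used = 0            # containers currently held by this queue


class ToyCapacityScheduler:
    def __init__(self, total_containers, shares):
        if sum(shares.values()) != 100:
            raise ValueError("queue shares must add up to 100%")
        self.total = total_containers
        self.queues = {name: ToyQueue(name, pct) for name, pct in shares.items()}

    def capacity(self, name):
        """Number of containers a queue is entitled to."""
        return self.total * self.queues[name].share_percent // 100

    def submit(self, queue_name, app, containers):
        self.queues[queue_name].pending.append((app, containers))

    def schedule(self):
        """Start apps in FIFO order while each queue stays within its share."""
        started = []
        for queue in self.queues.values():
            while queue.pending:
                app, need = queue.pending[0]
                if queue.used + need > self.capacity(queue.name):
                    break   # the head of the queue waits; later apps do not jump ahead
                queue.pending.popleft()
                queue.used += need
                started.append((queue.name, app))
        return started


if __name__ == "__main__":
    sched = ToyCapacityScheduler(
        100, {"batch": 50, "interactive": 30, "streaming": 20}
    )
    sched.submit("batch", "job-a", 40)
    sched.submit("batch", "job-b", 20)   # does not fit next to job-a
    sched.submit("batch", "job-c", 5)    # would fit, but waits behind job-b
    sched.submit("interactive", "query-1", 10)
    print(sched.schedule())
```

A full batch queue cannot starve the interactive queue in this sketch, because each queue is checked against its own share. That isolation is the core of multi-tenancy.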



Annie’s Question

NameNode HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a) Single Point Of Failure Of NameNode
b) To run classic MapReduce
c) Too much burden on Job Tracker



Annie’s Answer

Single Point of Failure of NameNode.



Annie’s Question

YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
a) Single Point Of Failure Of NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on Job Tracker



Annie’s Answer

Too much burden on Job Tracker. YARN was also developed to support Multi-Tenancy.



Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

Annie’s Question

Which of the following is (are) a significant disadvantage in Hadoop 1.0?
- ‘Single Point Of Failure’ of NameNode
- It can run only one version in classic MapReduce
- Too much burden on Job Tracker





Annie’s Question

A Hadoop 1.x cluster can have multiple HDFS Namespaces.
- True
- False

Annie’s Answer

False. Not possible with Hadoop 1.x.





Annie’s Question

Can you use Hadoop 2.0 for Real-time processing?
- Yes
- No

Annie’s Answer

No. Even though YARN in Hadoop 2.0 supports multiple frameworks for workloads other than batch, you need Storm or S4 for real-time processing.





Annie’s Question

How does HDFS Federation help HDFS scale horizontally?
A) It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
B) It provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.




Annie’s Answer

(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.



Annie’s Question

You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory?



Annie’s Answer

The ‘put’ will fail. None of the namespaces will manage the file and you will get an IOException with a “No such file or directory” error.
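The answer above can be sketched as a lookup in a client-side mount table that maps each top-level directory to the NameNode that manages it. The table contents, NameNode names and function name are hypothetical; the sketch only mirrors the behaviour described in the answer.

```python
# Toy client-side mount table for a federated namespace.
# The NameNode names and the table itself are hypothetical.


class NoSuchMountError(IOError):
    """Raised when no configured namespace manages the requested path."""


MOUNT_TABLE = {
    "/marketing": "nn-marketing",
    "/finance": "nn-finance",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for the path, or raise if there is none."""
    for prefix, namenode in mount_table.items():
        # Match the mount point itself or anything below it, but not a
        # sibling directory that merely shares the prefix text.
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise NoSuchMountError(f"No such file or directory: {path}")


if __name__ == "__main__":
    print(resolve("/finance/q1-report.csv"))
    try:
        resolve("/accounting/ledger.csv")
    except NoSuchMountError as err:
        print("put failed:", err)
```

Because the NameNodes are independent and do not coordinate, no NameNode steps in to claim /accounting; the path must be added to the configuration before files can be written under it.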






Annie’s Question

Can you use hundreds of Hadoop DataNodes for any processing other than MapReduce in Hadoop 1.x?
- Yes
- No

Annie’s Answer

No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload.


Thank You

See You in Next Class


