Introduction to Hadoop 2.0 and How It Overcomes the Limitations of Hadoop 1.0

Introduction to Hadoop 2.0 Architecture www.edureka.in/hadoop

Upload: edureka

Post on 27-Jan-2015


DESCRIPTION

This presentation explains the new Hadoop 2.0 features in detail and clarifies many prevalent doubts about Hadoop 2.0. Following are the four main improvements in Hadoop 2.0 over Hadoop 1.x:

• HDFS Federation – horizontal scalability of NameNode
• NameNode High Availability – NameNode is no longer a Single Point of Failure
• YARN – ability to process Terabytes and Petabytes of data available in HDFS using Non-MapReduce applications such as MPI, GIRAPH
• Resource Manager – splits up the two major functionalities of overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons: a global Resource Manager and per-application ApplicationMaster

There are additional features such as Capacity Scheduler (Enable Multi-tenancy support in Hadoop), Data Snapshot, Support for Windows, NFS access, enabling increased Hadoop adoption in the Industry to solve Big Data problems.

TRANSCRIPT

Page 1: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Introduction to Hadoop 2.0 Architecture

www.edureka.in/hadoop

Page 2: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Objectives of this Session

• Un

• The Big Data Problem
• How the Hadoop Ecosystem comes to the rescue?
• Hadoop 1.0 Architecture and limitations
• How Hadoop 2.0 Architecture overcomes the challenges?
• Quiz to reinforce your learning

For Further Queries and class recording: #askedureka
Follow us on Twitter @edurekaIN
Like us on Facebook /edurekaIN

Page 3: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Big Data Use Cases

• Tweet Trend Analysis
• Telecom – Service Usage Analysis
• e-Governance – Social Welfare
• Banks and Financial Services

Page 4: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Growing Interest in Hadoop


Page 5: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Hadoop Eco-System

[Diagram of the Hadoop Eco-System, with these components:]

• Apache Oozie (Workflow)
• Pig Latin (Data Analysis)
• Mahout (Machine Learning)
• Hive (DW System)
• HBase
• MapReduce Framework
• HDFS (Hadoop Distributed File System)
• Flume and Sqoop (Import or Export): Flume for unstructured or semi-structured data, Sqoop for structured data

Audience: ETL/DW Professionals, Developers / Programmers, DBA / Administrators

Page 6: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Hadoop 1.0 – In Summary

[Diagram: a Client talks to two layers, HDFS and MapReduce.]

• HDFS: NameNode, Secondary NameNode, and DataNodes holding the data blocks
• MapReduce: Job Tracker, plus a Task Tracker on each DataNode running Map and Reduce tasks

Page 7: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Limitations of Hadoop 1.x

• No horizontal scalability of NameNode

• Does not support NameNode High Availability

• Overburdened JobTracker

• Not possible to run Non-MapReduce Big Data Applications on HDFS

• Does not support Multi-tenancy


Page 8: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Challenges with Horizontal Scale

Challenges:

• Metadata is stored in NameNode memory
• Bottleneck after ~4000 nodes
• Results in cascading failures

[Diagram: a Client and many DataNodes all depend on a single NameNode, which holds both the namespace (NS) and Block Management.]
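To make the first challenge concrete, here is a rough back-of-the-envelope sketch of how NameNode heap grows with the namespace. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and is an assumption here, not a number from this deck; the function name and the sample workloads are also illustrative.

```python
# Rough estimate of the NameNode heap needed to hold namespace metadata.
# ASSUMPTION: ~150 bytes per namespace object (file entry or block entry).
# This is a commonly quoted rule of thumb; real usage varies with the
# Hadoop version, path lengths, and replication bookkeeping.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, blocks_per_file=1):
    """Return an approximate heap size in bytes for the given namespace."""
    num_blocks = num_files * blocks_per_file
    num_objects = num_files + num_blocks  # one entry per file, one per block
    return num_objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for files in (1_000_000, 100_000_000):
        gib = estimate_namenode_heap_bytes(files, blocks_per_file=2) / 2**30
        print(f"{files:>11,} files -> ~{gib:.1f} GiB of NameNode heap")
```

Because all of this lives in the memory of one machine, the namespace cannot grow past what a single NameNode can hold, which is the limit HDFS Federation addresses.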

Page 9: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Name Node – Single Point of Failure

[Diagram: the NameNode and the Secondary NameNode each hold a copy of the metadata.]

Secondary NameNode:

• “Not a hot standby” for the NameNode
• Connects to NameNode regularly
• Housekeeping, backup of NameNode metadata
• Saved metadata can be used to rebuild a failed NameNode

Page 10: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Job Tracker – Overburdened

• CPU: Spends a very significant portion of time and effort managing the life cycle of applications
• Network: Single Listener Thread to communicate with thousands of Map and Reduce Jobs

[Diagram: one Job Tracker communicating with many Task Trackers.]

Page 11: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Unutilized Data in HDFS

Challenges:

• Only MapReduce processing can be achieved
• Alternate Data Storage is needed for other processing such as Real-time or Graph analysis
• Doesn’t support Multi-Tenancy

Page 12: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Introducing Hadoop 2.0

Most Important Features:

• HDFS Federation

• Support for NameNode High Availability

• YARN – Yet Another Resource Negotiator

• Better Processing Control

• Support for non-MapReduce types of processing

• Support for Multi-tenancy


Page 13: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Hadoop 2.0 Cluster Architecture - Federation

[Diagram comparing Hadoop 1.0 and Hadoop 2.0:]

• Hadoop 1.0: a single Namenode holds the namespace (NS) and Block Management; Datanodes provide the block storage.
• Hadoop 2.0: multiple Namenodes (NN-1 … NN-k … NN-n), each with its own namespace (NS1 … NSk … NSn) and its own block pool (Pool 1 … Pool k … Pool n); Datanode 1 … Datanode m provide common storage for all block pools.

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html

Page 14: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Hadoop 2.0 – In Summary

[Diagram: a Client reads from and writes to the cluster, which has two layers, HDFS and YARN.]

• HDFS masters (NameNode High Availability): Active NameNode and Standby NameNode, connected through shared edit logs
• YARN master (Next Generation MapReduce): Resource Manager, containing the Scheduler and the Applications Manager (AsM)
• Slaves: each node runs a DataNode and a Node Manager, which hosts Containers and App Masters

Page 15: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Hadoop 2.0 – High Availability

[Diagram: a Client reads from and writes to the cluster. On the HDFS side, an Active NameNode and a Standby NameNode are connected through shared edit logs; on the YARN side, a Resource Manager (Scheduler and Applications Manager, AsM) coordinates Node Managers. Each slave node runs a DataNode and a Node Manager hosting Containers and App Masters.]

• Read/Write logs apply to its own namespace
• All namespace edits are logged to shared NFS storage; a single writer is enforced through fencing
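The two points on this slide (a shared edit log, and a single writer enforced by fencing) can be illustrated with a small in-memory sketch. This is a simplified model under stated assumptions, not Hadoop code: `SharedEditLog`, `NameNode`, `fence_and_claim`, and `catch_up` are hypothetical names, and a Python list stands in for the shared NFS storage.

```python
class SharedEditLog:
    """Hypothetical stand-in for the edit log kept on shared storage."""

    def __init__(self):
        self.edits = []
        self.writer = None  # name of the only NameNode allowed to write

    def fence_and_claim(self, namenode_name):
        # Fencing: the previous writer loses write access before the
        # new one is allowed to append.
        self.writer = namenode_name

    def append(self, namenode_name, edit):
        if namenode_name != self.writer:
            raise PermissionError(f"{namenode_name} is fenced and cannot write")
        self.edits.append(edit)


class NameNode:
    """Simplified NameNode that keeps its namespace as a set of paths."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()
        self.applied = 0  # number of edits already applied locally

    def mkdir(self, path):
        # Active role: log the edit to shared storage, then apply it.
        self.log.append(self.name, ("mkdir", path))
        self.catch_up()

    def catch_up(self):
        # Standby role: read new edits and apply them to its own namespace.
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)


if __name__ == "__main__":
    log = SharedEditLog()
    active, standby = NameNode("nn1", log), NameNode("nn2", log)
    log.fence_and_claim("nn1")
    active.mkdir("/data")
    standby.catch_up()
    print("standby namespace:", standby.namespace)

    log.fence_and_claim("nn2")  # failover: nn1 is fenced off
    try:
        active.mkdir("/late-write")
    except PermissionError as err:
        print("rejected:", err)
```

The design point is that the standby stays close to the active's state by replaying the same edits, while fencing prevents a stale NameNode from corrupting the log after a failover.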

Page 16: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Hadoop 2.0 – Resource Management

[Diagram: a Client reads from and writes to the cluster. On the YARN side (Next Generation MapReduce), the Resource Manager contains the Scheduler and the Applications Manager (AsM) and coordinates a Node Manager on every slave node; each Node Manager hosts Containers and App Masters alongside the node's DataNode. On the HDFS side, the Active NameNode and Standby NameNode share edit logs.]

Page 17: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

YARN – Moving beyond MapReduce

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search) (Weave..)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 18: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Multi-Tenancy - Capacity Scheduler

Features:

• Different types of jobs are organized in different queues
• Queue shares are expressed as percentages of the cluster
• Each queue has an associated priority
• FIFO scheduling within each queue
• Security ensured between applications

[Diagram: three example queues, Batch, Interactive and Streaming.]
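The queue features listed on this slide can be sketched as a toy scheduler. This is a deliberately simplified illustration, not the actual Capacity Scheduler: the class and function names, the container counts, and the rule that a queue never exceeds its share are all assumptions made for the example.

```python
from collections import deque


class Queue:
    """A tenant queue with a fixed percentage share of the cluster."""

    def __init__(self, name, share_pct, priority):
        self.name = name
        self.share_pct = share_pct  # share of the cluster, in percent
        self.priority = priority    # lower number is served first (assumption)
        self.jobs = deque()         # FIFO order within the queue
        self.used = 0               # containers currently in use

    def submit(self, job, containers):
        self.jobs.append((job, containers))


def schedule(queues, cluster_containers):
    """Launch jobs FIFO per queue while each queue stays within its share."""
    launched = []
    for q in sorted(queues, key=lambda q: q.priority):
        limit = cluster_containers * q.share_pct // 100
        # Only the head of the queue is considered, which keeps FIFO order.
        while q.jobs and q.used + q.jobs[0][1] <= limit:
            job, need = q.jobs.popleft()
            q.used += need
            launched.append((q.name, job))
    return launched


if __name__ == "__main__":
    batch = Queue("batch", share_pct=50, priority=2)
    interactive = Queue("interactive", share_pct=30, priority=1)
    streaming = Queue("streaming", share_pct=20, priority=3)
    batch.submit("b1", 30)
    batch.submit("b2", 30)
    interactive.submit("i1", 10)
    streaming.submit("s1", 25)
    print(schedule([batch, interactive, streaming], cluster_containers=100))
```

Each tenant is guaranteed its own slice of the cluster, so a large batch workload cannot starve the interactive queue.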

Page 19: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

NameNode HA was developed to overcome the following disadvantage in Hadoop 1.0?

a) Single Point Of Failure Of NameNode
b) To run classic MapReduce
c) Too much burden on Job Tracker

Page 20: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

Single Point of Failure of NameNode.

Page 21: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

YARN was developed to overcome the following disadvantage in Hadoop 1.0 MapReduce framework?

a) Single Point Of Failure Of NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on Job Tracker

Page 22: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

Too much burden on Job Tracker and to support Multi-Tenancy

Page 23: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

Which of the following is (are) a significant disadvantage in Hadoop 1.0?

- ‘Single Point Of Failure’ of NameNode
- It can run only one version in classic MapReduce
- Too much burden on Job Tracker

Page 24: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

A Hadoop 1.x cluster can have multiple HDFS Namespaces.

- True
- False

Page 25: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

False. Not possible with Hadoop 1.x.

Page 26: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

Can you use Hadoop 2.0 for Real-time processing?

- Yes
- No

Page 27: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing.

Page 28: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

How does HDFS Federation help HDFS Scale horizontally?

A) Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
B) Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.

Page 29: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.

Page 30: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

You have configured two NameNodes to manage the /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in the /accounting directory?

Page 31: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

The ‘put’ will fail. None of the namespaces will manage the file and you will get an IOException with a “No such file or directory” error.
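One way to picture why the ‘put’ fails is a client-side mount table that maps path prefixes to NameNodes. The sketch below is a plain-Python illustration of that lookup, not a Hadoop API: `MOUNT_TABLE`, `resolve`, and the NameNode addresses are hypothetical, and Python's `IOError` stands in for the Java IOException mentioned in the answer.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that namespace.
# The host names and ports are made up for illustration.
MOUNT_TABLE = {
    "/marketing": "hdfs://nn1.example.com:8020",
    "/finance": "hdfs://nn2.example.com:8020",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for `path`, or raise if none is."""
    for prefix, namenode in mount_table.items():
        # Match the prefix itself or anything beneath it, but not
        # look-alike siblings such as /marketing2.
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise IOError(f"No such file or directory: {path}")


if __name__ == "__main__":
    print(resolve("/finance/q1/report.csv"))
    try:
        resolve("/accounting/ledger.csv")
    except IOError as err:
        print("put failed:", err)
```

Since the NameNodes are independent and do not coordinate, no NameNode picks up a path that falls outside every configured namespace.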

Page 32: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Question

Can you use hundreds of Hadoop DataNodes for any processing other than MapReduce in Hadoop 1.x?

- Yes
- No

Page 33: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Annie’s Answer

No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload.

Page 34: Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0

Thank You

See You in Next Class