Hadoop Cluster With High Availability


Upload: edureka

Post on 07-Aug-2015


Category:

Technology



TRANSCRIPT

Page 1: Hadoop Cluster With High Availability

View Hadoop Administration Course at www.edureka.co/hadoop-admin

Achieve Hadoop High Availability

Page 2: Hadoop Cluster With High Availability

www.edureka.co/hadoop-admin Slide 2 Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions

Objectives

At the end of this module, you will be able to

» Hadoop cluster introduction
» Hadoop cluster running modes
» Hadoop configuration files
» Hadoop admin responsibilities
» Hadoop High Availability
» Demo on High Availability

Page 3: Hadoop Cluster With High Availability


Hadoop Core Components

Hadoop 2.x Core Components

» HDFS (storage): NameNode (master), DataNode (slave), Secondary NameNode
» YARN (processing): Resource Manager (master), Node Manager (slave)


Page 4: Hadoop Cluster With High Availability


Cluster Growth Based On Storage Capacity

Planning cluster growth around storage capacity is often a good approach.

» Data grows by approximately 5TB per week
» HDFS is set up to replicate each block three times
» Thus, 15TB of extra storage space is required per week
» Assuming machines with 5 x 3TB hard drives, this equates to roughly one new machine required each week
» Assume overheads of 30%
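The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, and applying the 30% overhead on top of the replicated data is an assumption, since the slide does not say how the overhead is counted.

```python
def weekly_storage_need_tb(data_growth_tb, replication=3, overhead=0.30):
    """Raw disk needed per week, in TB.

    Assumption: overhead is added on top of the replicated data.
    """
    replicated_tb = data_growth_tb * replication
    return replicated_tb * (1 + overhead)


def machines_per_week(need_tb, drives_per_machine=5, drive_size_tb=3):
    """Number of new machines per week (may be fractional)."""
    return need_tb / (drives_per_machine * drive_size_tb)


if __name__ == "__main__":
    replicated_only = 5 * 3                      # 5TB/week, three replicas
    with_overhead = weekly_storage_need_tb(5)    # adds the assumed 30%
    print("Replicated data per week (TB):", replicated_only)
    print("Machines per week, no overhead:", machines_per_week(replicated_only))
    print("With overhead (TB):", round(with_overhead, 1))
    print("Machines per week, with overhead:",
          round(machines_per_week(with_overhead), 2))
```

Without overhead the figures match the slide's "one machine per week"; under the overhead assumption the requirement rises somewhat above one machine per week, so the slide's figure is best read as a rough estimate.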

Page 5: Hadoop Cluster With High Availability


Slave Nodes: Recommended Configuration

Higher-performance vs. lower-performance components

Save the money, buy more nodes!

General "base" configuration for a slave node (depends on requirements):

» 4 x 1 TB or 2 TB hard drives, in a JBOD* configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32GB RAM
» Gigabit Ethernet

General Configuration

Multiples of (1 hard drive + 2 cores + 6-8GB RAM) generally work well for many types of applications

Special Configuration

Slave Nodes

“A cluster with more nodes performs better than one with fewer, slightly faster nodes”
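As a rough illustration of the "multiples of (1 hard drive + 2 cores + 6-8GB RAM)" rule, the sketch below scales that unit by a chosen factor. The function name and the returned field names are hypothetical and exist only for this example.

```python
def slave_node_spec(units, ram_per_unit_gb=(6, 8)):
    """Scale the (1 hard drive + 2 cores + 6-8GB RAM) unit by `units`."""
    ram_low, ram_high = ram_per_unit_gb
    return {
        "hard_drives": 1 * units,
        "cores": 2 * units,
        "ram_gb_min": ram_low * units,
        "ram_gb_max": ram_high * units,
    }


if __name__ == "__main__":
    # Four units give 4 drives, 8 cores and 24-32GB RAM.
    print(slave_node_spec(4))
```

Four units reproduce the base configuration listed on this slide: 4 hard drives, 2 x quad-core CPUs (8 cores) and 24-32GB RAM.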

Page 6: Hadoop Cluster With High Availability


Slave Nodes: More Details (RAM)

Slave Nodes (RAM)

Generally, each Map or Reduce task will take 1GB to 2GB of RAM

Slave nodes should not be using virtual memory

RULE OF THUMB! Total number of tasks = 1.5 x number of processor cores

Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons (NodeManager in Hadoop 2.x), plus the operating system
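The rule of thumb above can be turned into a small RAM estimate. This is a sketch under stated assumptions: the 2GB reserved for the daemons and the 2GB reserved for the operating system are placeholder values chosen for the example, not figures from the slides.

```python
import math


def max_tasks(cores, tasks_per_core=1.5):
    """Rule of thumb from the slide: total tasks = 1.5 x processor cores."""
    return math.floor(cores * tasks_per_core)


def required_ram_gb(cores, ram_per_task_gb=2.0, daemons_gb=2.0, os_gb=2.0):
    """RAM for all tasks plus daemons plus the OS.

    daemons_gb and os_gb are assumed placeholder values.
    """
    return max_tasks(cores) * ram_per_task_gb + daemons_gb + os_gb


if __name__ == "__main__":
    cores = 8  # e.g. 2 x quad-core CPUs
    print("Tasks:", max_tasks(cores))
    print("RAM at 1GB per task:", required_ram_gb(cores, ram_per_task_gb=1.0))
    print("RAM at 2GB per task:", required_ram_gb(cores, ram_per_task_gb=2.0))
```

If the estimate exceeds the physical RAM of the node, tasks would spill into virtual memory, which is what the slide warns against.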

Page 7: Hadoop Cluster With High Availability


Master Node Hardware Recommendations

The master node requires:

» Carrier-class hardware (not commodity hardware)
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAIDed hard drives
» At least 32GB of RAM

Page 8: Hadoop Cluster With High Availability


Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
» No daemons, everything runs in a single JVM
» Suitable for running MapReduce programs during development
» Has no DFS

Pseudo-Distributed Mode
» Hadoop daemons run on the local machine

Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines

Page 9: Hadoop Cluster With High Availability


Configuration Files

Configuration filename: Description

hadoop-env.sh, yarn-env.sh: Settings for the Hadoop daemons' process environment.

core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.

hdfs-site.xml: Configuration settings for the HDFS daemons, the NameNode and the DataNodes.

yarn-site.xml: Configuration settings for the Resource Manager and Node Manager.

mapred-site.xml: Configuration settings for MapReduce applications.

slaves: A list of machines (one per line) that each run a DataNode and a Node Manager.
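The *-site.xml files listed above share one layout: a configuration element containing property elements, each with a name and a value. The sketch below reads such a file into a dictionary using only the Python standard library. The sample content and the hostname namenode.example.com are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative core-site.xml content; the hostname is hypothetical.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
"""


def parse_site_xml(xml_text):
    """Return {property name: value} from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


if __name__ == "__main__":
    for name, value in parse_site_xml(SAMPLE_CORE_SITE).items():
        print(name, "=", value)
```

The same function could be pointed at hdfs-site.xml, yarn-site.xml or mapred-site.xml, since they use the same property layout.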

Page 10: Hadoop Cluster With High Availability


Hadoop 2.x Configuration Files – Apache Hadoop

» Core: core-site.xml
» HDFS: hdfs-site.xml
» YARN: yarn-site.xml
» MapReduce: mapred-site.xml


Page 11: Hadoop Cluster With High Availability


Hadoop Cluster: A Typical Use Case

Active NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant power supply

Secondary NameNode
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant power supply

Standby NameNode (optional)
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant power supply

DataNodes (each)
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS


Page 12: Hadoop Cluster With High Availability


NameNode – Single Point of Failure

Secondary NameNode:

» "Not a hot standby" for the NameNode
» Connects to the NameNode every hour*
» Housekeeping, backup of NameNode metadata
» Saved metadata can be used to rebuild a failed NameNode

Secondary NameNode to NameNode: "You give me metadata every hour, I will make it secure."

[Diagram: the NameNode, marked as the single point of failure, passes its metadata to the Secondary NameNode.]

Page 13: Hadoop Cluster With High Availability


NameNode Recovery Vs. Failover

NameNode recovery in Hadoop 1.0
» Nodes shown: NameNode, Secondary NameNode
» Manually recover using the Secondary NameNode

High Availability in Hadoop 2.0
» Nodes shown: Active NameNode, Standby NameNode, Secondary NameNode
» Automatic failover to the Standby NameNode

[Diagram labels: FSImage, edit logs, metadata]

Page 14: Hadoop Cluster With High Availability


Hadoop-2.X HA


Hadoop 2.x introduces a new feature called High Availability (HA).

The HDFS HA feature addresses the Hadoop 1.x problems by providing the option of running two NameNodes in the same cluster, in an Active/Passive configuration.

These are referred to as the Active NameNode and the Standby NameNode.

The Standby NameNode is a hot backup for the cluster.

This allows a fast failover to a new NameNode if a machine crashes, or a graceful administrator-initiated failover for planned maintenance.

We can set up HA in two ways:

» Quorum-based storage
» Shared storage using NFS
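For the quorum-based option, the HA setup is expressed as properties in hdfs-site.xml. The sketch below generates a minimal set of them. The property names and default ports (8020 for NameNode RPC, 8485 for JournalNodes) are given as recalled from the Apache Hadoop HDFS HA documentation and should be checked against the version in use; the nameservice name, NameNode IDs and hostnames are hypothetical. It is not a complete configuration: fencing settings and the ZooKeeper quorum in core-site.xml, for instance, are left out.

```python
import xml.etree.ElementTree as ET


def qjm_ha_properties(nameservice, namenodes, journalnodes):
    """Build a partial set of hdfs-site.xml properties for quorum-based HA.

    namenodes: dict of NameNode ID -> hostname (hypothetical values).
    journalnodes: list of JournalNode hostnames (hypothetical values).
    """
    journal_uri = ("qjournal://"
                   + ";".join(f"{host}:8485" for host in journalnodes)
                   + f"/{nameservice}")
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        "dfs.namenode.shared.edits.dir": journal_uri,
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
        "dfs.ha.automatic-failover.enabled": "true",
    }
    for nn_id, host in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = f"{host}:8020"
    return props


def to_site_xml(props):
    """Serialize properties in the *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    props = qjm_ha_properties(
        "mycluster",
        {"nn1": "nn1.example.com", "nn2": "nn2.example.com"},
        ["jn1.example.com", "jn2.example.com", "jn3.example.com"],
    )
    print(to_site_xml(props))
```

Generating the file from one function keeps the nameservice name consistent across the properties that embed it, which is an easy place to make a typo when editing the XML by hand.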

Page 15: Hadoop Cluster With High Availability


HA Architecture

» Active Node and Standby Node
» Journal Nodes (shared edits): the Active Node writes, the Standby Node reads
» Failover Controller (Active) and Failover Controller (Standby): monitor status and health, manage HA state
» Zookeeper Service
» Slave Nodes: send block reports and heartbeats
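In quorum-based storage, an edit written by the Active Node counts as durable once a majority of the Journal Nodes has acknowledged it. The toy model below shows only that majority arithmetic; it is not how Hadoop implements the journal protocol, and the function names are made up for the example.

```python
def quorum_size(journal_nodes):
    """Smallest majority of the Journal Nodes."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes):
    """How many Journal Nodes can fail while a majority remains."""
    return (journal_nodes - 1) // 2


def edit_is_durable(acks, journal_nodes):
    """True if enough Journal Nodes acknowledged the write."""
    return acks >= quorum_size(journal_nodes)


if __name__ == "__main__":
    for n in (3, 5):
        print(n, "Journal Nodes: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

This is why Journal Nodes are usually deployed in odd numbers: going from three to four nodes raises the quorum size without raising the number of failures that can be tolerated.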

Page 16: Hadoop Cluster With High Availability


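Daemons in HA Architecture

» First master machine: Active NameNode, ZKFC, Zookeeper, Journal Node
» Second master machine: Standby NameNode, ZKFC, Zookeeper, Journal Node
» Third machine: DataNode, Zookeeper, Journal Node

The DataNode sends block reports and heartbeats to the NameNodes.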

Page 17: Hadoop Cluster With High Availability


DEMO

Page 18: Hadoop Cluster With High Availability


Hadoop Admin Responsibilities

Responsible for implementation and administration of Hadoop infrastructure.

Testing HDFS, Hive, Pig and MapReduce access for Applications.

Cluster maintenance tasks like Backup, Recovery, Upgrade, Patching.

Performance tuning and Capacity planning for Clusters.

Monitor Hadoop cluster and deploy security.

Page 19: Hadoop Cluster With High Availability

How it Works?

» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module Wise Quiz
» Project Work
» Verifiable Certificate

Page 20: Hadoop Cluster With High Availability

Questions


Page 21: Hadoop Cluster With High Availability


Course Topics

Module 1 » Hadoop Cluster Administration

Module 2 » Hadoop Architecture and Cluster setup

Module 3 » Hadoop Cluster: Planning and Managing

Module 4 » Backup, Recovery and Maintenance

Module 5 » Hadoop 2.0 and High Availability

Module 6 » Advanced Topics: QJM, HDFS Federation and Security

Module 7 » Oozie, HCatalog/Hive and HBase Administration

Module 8 » Project: Hadoop Implementation
