Hadoop Cluster with High Availability
TRANSCRIPT
View Hadoop Administration Course at www.edureka.co/hadoop-admin
Achieve Hadoop High Availability
Slide 2
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Objectives
At the end of this module, you will be able to understand:
» Hadoop cluster introduction
» Hadoop cluster running modes
» Hadoop configuration files
» Hadoop Admin responsibilities
» Hadoop High Availability
» Demo on High Availability
Slide 3
Hadoop Core Components
Hadoop 2.x Core Components
HDFS (Storage): NameNode and Secondary NameNode (master), DataNode (slave)
YARN (Processing): Resource Manager (master), Node Manager (slave)
Slide 4
Planning cluster growth around storage capacity is often a good approach!
Cluster Growth Based On Storage Capacity
Data grows by approximately 5 TB per week
HDFS is set up to replicate each block three times
Thus, 15 TB of extra storage space is required per week
Assuming machines with 5 x 3 TB hard drives, this equates to a new machine required each week
Assume overheads to be 30%
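The estimate above can be worked through as a short calculation. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and treating the 30% overhead as a reduction in usable disk space per machine is an assumption, since the slide does not say how the overhead is applied.

```python
import math


def weekly_storage_need_tb(growth_tb_per_week, replication_factor):
    """Raw HDFS space consumed per week once every block is replicated."""
    return growth_tb_per_week * replication_factor


def machines_per_week(need_tb, drives_per_machine, drive_size_tb, overhead=0.0):
    """Machines to add per week; 'overhead' is the assumed unusable fraction of disk."""
    usable_tb = drives_per_machine * drive_size_tb * (1 - overhead)
    return need_tb / usable_tb


need = weekly_storage_need_tb(5, 3)            # 5 TB/week x 3 replicas = 15 TB/week
no_overhead = machines_per_week(need, 5, 3)    # 15 TB raw per machine
with_overhead = machines_per_week(need, 5, 3, overhead=0.30)

print(f"Extra storage per week: {need} TB")
print(f"Machines per week, ignoring overhead: {no_overhead:.2f}")
print(f"Machines per week, with 30% overhead: {with_overhead:.2f} "
      f"(round up to {math.ceil(with_overhead)})")
```

Under this reading, one machine per week covers the growth only if the overhead is ignored; once 30% of each machine's disk is set aside, slightly more than one machine per week is needed.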
Slide 5
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components
Save the money, buy more nodes!
General ‘base’ configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD* configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet
General Configuration
Multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications
Special Configuration
Slave Nodes
“A cluster with more nodes performs better than one with fewer, slightly faster nodes”
Slide 6
Slave Nodes: More Details (RAM)
Slave Nodes (RAM)
Generally each Map or Reduce task will take 1 GB to 2 GB of RAM
Slave nodes should not be using virtual memory
RULE OF THUMB! Total number of tasks = 1.5 x number of processor cores
Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons (NodeManager in Hadoop 2.x), plus the operating system
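The rule of thumb above can be turned into a quick sizing check. This is a rough sketch under stated assumptions: the function names are hypothetical, and the 2 GB reserved for the daemons and the 2 GB reserved for the operating system are illustrative values that do not come from the slides.

```python
def max_tasks(cores, factor=1.5):
    """Rule of thumb from the slide: total tasks = 1.5 x number of processor cores."""
    return int(cores * factor)


def required_ram_gb(cores, gb_per_task, daemons_gb=2, os_gb=2):
    """RAM for all tasks plus the node daemons plus the OS.

    daemons_gb and os_gb are assumed values, used only for illustration.
    """
    return max_tasks(cores) * gb_per_task + daemons_gb + os_gb


cores = 2 * 4  # the 'base' slave node: 2 x quad-core CPUs
print(f"Tasks on {cores} cores: {max_tasks(cores)}")
print(f"RAM needed at 1 GB per task: {required_ram_gb(cores, 1)} GB")
print(f"RAM needed at 2 GB per task: {required_ram_gb(cores, 2)} GB")
```

With these assumed reservations, an 8-core node needs between 16 GB and 28 GB, which sits within or below the 24-32 GB suggested for the base slave configuration.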
Slide 7
Master Node Hardware Recommendations
Carrier-class hardware (Not commodity hardware)
Dual power supplies
Dual Ethernet cards (bonded to provide failover)
Hard drives in a RAID configuration
At least 32GB of RAM
Master Node
Requires
Slide 8
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode: no daemons, everything runs in a single JVM; suitable for running MapReduce programs during development; has no DFS
Pseudo-Distributed Mode: Hadoop daemons run on the local machine
Fully-Distributed Mode: Hadoop daemons run on a cluster of machines
Slide 9
Configuration Files
Configuration filename: Description
hadoop-env.sh, yarn-env.sh: Settings for the Hadoop daemons' process environment.
core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml: Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml: Configuration settings for the Resource Manager and Node Manager.
mapred-site.xml: Configuration settings for MapReduce applications.
slaves: A list of machines (one per line) that each run a DataNode and a Node Manager.
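The *-site.xml files listed above share one layout: a configuration element containing property entries, each with a name and a value. The sketch below, which is not from the original slides, reads such a file into a dictionary. The helper name load_site_properties and the host name namenode.example.com are hypothetical; fs.defaultFS is a standard Hadoop 2.x property in core-site.xml.

```python
import xml.etree.ElementTree as ET

# Illustrative core-site.xml content; the host name is a placeholder.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
"""


def load_site_properties(xml_text):
    """Return a {name: value} dict for every <property> in a Hadoop *-site.xml."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = load_site_properties(SAMPLE_CORE_SITE)
for name, value in props.items():
    print(f"{name} = {value}")
```

The same helper works for hdfs-site.xml, yarn-site.xml and mapred-site.xml, because they all use this property layout.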
Slide 10
Core: core-site.xml
HDFS: hdfs-site.xml
YARN: yarn-site.xml
MapReduce: mapred-site.xml
Hadoop 2.x Configuration Files – Apache Hadoop
Slide 11
Hadoop Cluster: A Typical Use Case
Active NameNode: RAM: 64 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 GB/s; OS: 64-bit CentOS; Power: redundant power supply
Secondary NameNode: RAM: 32 GB; Hard disk: 1 TB; Processor: Xeon with 4 cores; Ethernet: 3 x 10 GB/s; OS: 64-bit CentOS; Power: redundant power supply
Standby NameNode (optional): RAM: 64 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 GB/s; OS: 64-bit CentOS; Power: redundant power supply
DataNodes (each): RAM: 16 GB; Hard disk: 6 x 2 TB; Processor: Xeon with 2 cores; Ethernet: 3 x 10 GB/s; OS: 64-bit CentOS
Slide 12
Secondary NameNode:
"Not a hot standby" for the NameNode
Connects to NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
NameNode (metadata): single point of failure
Secondary NameNode (metadata): "You give me metadata every hour, I will make it secure"
NameNode – Single Point of Failure
Slide 13
NameNode recovery in Hadoop 1.0: NameNode and Secondary NameNode; manual recovery using the Secondary NameNode FSImage
High Availability in Hadoop 2.0: Active NameNode, Standby NameNode and Secondary NameNode; automatic failover to the Standby NameNode
Diagram labels: Edit logs, Meta-Data
NameNode Recovery Vs. Failover
Slide 14
Hadoop-2.X HA
Hadoop 2.x introduces a new feature called High Availability (HA).
The HDFS HA feature addresses the Hadoop 1.x problems by providing the option of running two NameNodes in the same cluster, in an Active/Passive configuration.
These are referred to as the Active NameNode and the Standby NameNode.
The Standby NameNode is a hot backup for the cluster.
This allows a fast failover to a new NameNode if a machine crashes, or a graceful administrator-initiated failover for planned maintenance.
We can set up HA in two ways:
» Quorum-based storage
» Shared storage using NFS
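For the quorum-based option, the HA settings live in hdfs-site.xml. The sketch below is not from the original slides. It builds the main HA properties and prints them in the site-file layout. The property names follow the Apache HDFS HA documentation for the Quorum Journal Manager, while the nameservice ID mycluster, the NameNode IDs nn1 and nn2, all host names, and the helper names are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def ha_properties(nameservice, namenodes, journal_nodes, automatic_failover=True):
    """Build core hdfs-site.xml HA properties for quorum-based shared edits.

    namenodes: {namenode_id: "host:rpc_port"}; journal_nodes: ["host:port", ...].
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        "dfs.namenode.shared.edits.dir":
            "qjournal://" + ";".join(journal_nodes) + "/" + nameservice,
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
        "dfs.ha.automatic-failover.enabled": str(automatic_failover).lower(),
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    return props


def to_site_xml(props):
    """Serialize a {name: value} dict into the <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


example = ha_properties(
    "mycluster",                                   # hypothetical nameservice ID
    {"nn1": "master1.example.com:8020",            # hypothetical hosts
     "nn2": "master2.example.com:8020"},
    ["jn1.example.com:8485", "jn2.example.com:8485", "jn3.example.com:8485"],
)
print(to_site_xml(example))
```

A working setup needs more than this sketch covers, for example a fencing method (dfs.ha.fencing.methods) and, for automatic failover, the ZooKeeper quorum (ha.zookeeper.quorum in core-site.xml). For the NFS option, dfs.namenode.shared.edits.dir would point to a shared directory instead of a qjournal:// URI.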
Slide 15
HA Architecture
Active Node with Failover Controller (Active): monitors status and health, manages HA state
Standby Node with Failover Controller (Standby): monitors status and health, manages HA state
Zookeeper Service: used by both Failover Controllers
Journal Nodes (Shared Edits): written by the Active Node, read by the Standby Node
Slave Nodes: send block reports and heartbeats to the Active and Standby Nodes
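The Journal Nodes hold the shared edits on a quorum basis: a write by the Active Node counts as durable once a majority of Journal Nodes has acknowledged it. The toy sketch below, not part of the original slides, shows only the majority arithmetic. The function names are hypothetical and no real Hadoop API is involved.

```python
def quorum_size(journal_nodes):
    """Smallest majority of a group of Journal Nodes."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes):
    """How many Journal Nodes can be down while a majority is still reachable."""
    return (journal_nodes - 1) // 2


def write_committed(acks, journal_nodes):
    """An edit counts as committed once a majority has acknowledged it."""
    return acks >= quorum_size(journal_nodes)


for n in (3, 5):
    print(f"{n} Journal Nodes: quorum = {quorum_size(n)}, "
          f"tolerated failures = {tolerated_failures(n)}")
```

This is why Journal Nodes are usually deployed in odd numbers: going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any additional failure.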
Slide 16
Daemons in HA Architecture
Active NameNode host: ZKFC, Zookeeper, Journal Node
Standby NameNode host: ZKFC, Zookeeper, Journal Node
DataNode host: Zookeeper, Journal Node
DataNodes send block reports and heartbeats to the NameNodes
Slide 17
DEMO
Slide 18
Hadoop Admin Responsibilities
Responsible for implementation and administration of Hadoop infrastructure.
Testing HDFS, Hive, Pig and MapReduce access for Applications.
Cluster maintenance tasks like Backup, Recovery, Upgrade, Patching.
Performance tuning and Capacity planning for Clusters.
Monitor Hadoop cluster and deploy security.
LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work
Verifiable Certificate
Slide 19
How it Works?
Questions
Slide 20
Slide 21
Course Topics
Module 1 » Hadoop Cluster Administration
Module 2 » Hadoop Architecture and Cluster Setup
Module 3 » Hadoop Cluster: Planning and Managing
Module 4 » Backup, Recovery and Maintenance
Module 5 » Hadoop 2.0 and High Availability
Module 6 » Advanced Topics: QJM, HDFS Federation and Security
Module 7 » Oozie, HCatalog/Hive and HBase Administration
Module 8 » Project: Hadoop Implementation