hadoop admin
TRANSCRIPT
BalajiRajan
Meetup.com/DevOps-Bangalore
[email protected] / balajirajan.com
# 1k => 1000 bytes # 1kb => 1024 bytes# 1m => 1000000 bytes # 1mb => 1024*1024 bytes# 1g => 1000000000 bytes # 1gb => 1024*1024*1024 bytes# 1T => 1000000000000 bytes #Tb =>1024*1024*1024*1024 bytes# 1Petabytes , Exabytes, zettabytes... etc
Max data in memory (RAM): 64GBMax data per computer (disk): 24TBData processed by Google every month: 400PB in 2007Average job size: 180GBTime: 180GB of data would take to read sequentially off a single disk drive: approximately 45 minutes
Some Numbers......
Data Access Speed is the Bottleneck
We can process data very quickly, but we can only read/write it very slowlySolution: parallel reads 1 HDD = 75MB/sec 1,000 HDDs = 75GB/sec Far more acceptable
Moving to a Cluster of Machines
* In the late 1990s, Google decided to design its architecture using clusters of low-cost machines Rather than fewer, more powerful machines * Creating an architecture around low-cost, unreliable hardware presents a number of challenges
System Requirements
* System should support partial failure* System should support data recoverability* System should be consistent* System should be scalable
Hadoop's Origins
Google created an architecture which answers these (and other) requirements Released two White Papers1. 2003: Description of the Google File System (GFS) A method for storing data in a distributed, reliable fashion2. 2004: Description of distributed MapReduce A method for processing data in a parallel fashion
So
Hadoop was based on these White Papers
Hadoop Cluster
HDFS Features
* Operates on top of an existing filesystem * Files are stored as blocks Much larger than for most filesystems Default is 64MB * Provides reliability through replication Each block is replicated across multiple DataNodes Default replication factor is 3 * Single NameNode daemon stores metadata and co-ordinates access Provides simple, centralized management * Blocks are stored on slave nodes Running the DataNode daemon
HDFS: Block Diagram
The NameNode
The NameNode stores all metadata
Information about file locations in HDFS Information about file ownership and permissions Names of the individual blocks Locations of the blocksMetadata is stored on disk and read when the NameNode daemon starts up
Filename is fsimageWhen changes to the metadata are required, these are made in RAM
Changes are also written to a log file on disk called edits Full details later
The NameNode: Memory Allocation
When the NameNode is running, all meta data is held in RAM for fast response
Each item consumes 150-200 bytes of RAM
Items:
Filename, permissions, etc. Block information for each block
The NameNode: Memory Allocation
Why HDFS prefers fewer, larger files:
Consider 1GB of data, HDFS block size 128MB Stored as 1 x 1GB file Name: 1 item Blocks: 8 x 3 = 24 items Total items: 25
Stored as 1000 x 1MB files Names: 1000 items Blocks: 1000 x 3 = 3000 items Total items: 4000
The Slave Nodes
Actual contents of the files are stored as blocks on the slave nodes
Blocks are simply files on the slave nodes underlying filesystem
Named blk_xxxxxxx Nothing on the slave node provides information about what underlying file the block is a part of That information is only stored in the NameNodes metadata Each block is stored on multiple different nodes for redundancy
Default is three replicasEach slave node runs a DataNode daemon
Controls access to the blocks Communicates with the NameNode
Secondary Name Node
The Secondary NameNode:
The Secondary NameNode is not a failover NameNode!
It performs memory-intensive administrative functions for the NameNode NameNode keeps information about files and blocks (the metadata) in memory NameNode writes metadata changes to an editlog Secondary NameNode periodically combines a prior filesystem snapshot and editlog into a new snapshot New snapshot is transmitted back to the NameNodeSecondary NameNode should run on a separate machine in a large installation
It requires as much RAM as the NameNode
Writing Files to HDFS
Anatomy of a File Write
1. Client connects to the NameNode
2. NameNode places an entry for the file in its metadata, returns the block name and list of DataNodes to the client
3. Client connects to the first DataNode and starts sending data
4. As data is received by the first DataNode, it connects to the second and starts sending data
5. Second DataNode similarly connects to the third
6. ack packets from the pipeline are sent back to the client
7. Client reports to the NameNode when the block is written
Reading Files from HDFS
Anatomy of a File Read
Client connects to the NameNode
NameNode returns the name and locations of the first few blocks of the file
Block locations are returned closest-first.Client connects to the first of the DataNodes, and reads the block
If the DataNode fails during the read, the client will seamlessly connect to the next one in the list to read the block
The NameNode Is Not A Bottleneck
Note: the data never travels via the NameNode
For writes For reads During re-replication
Dealing With Data Corruption
As the DataNode is reading the block, it also calculates the checksum.
Live checksum is compared to the checksum created when the block was stored.
If they differ, the client reads from the next DataNode in the list
The NameNode is informed that a corrupted version of the block has been found. The NameNode will then re-replicate that block elsewhere.
The DataNode verifies the checksums for blocks on a regular basis to avoid bit rot
Default is every three weeks after the block was created
Data Reliability and Recovery
DataNodes send heartbeats to the NameNode
Every three seconds After a period without any heartbeats, a DataNode is assumed to be lost
NameNode determines which blocks were on the lost node. NameNode finds other DataNodes with copies of these blocks. These DataNodes are instructed to copy the blocks to other nodes. Three-fold replication is actively maintained.
Hadoop is Rack-aware
Hadoop is Rack-aware
Hadoop understands the concept of rack awareness
The idea of where nodes are located, relative to one another Helps the JT to assign tasks to nodes closest to the data Helps the NN determine the closest block to a client during reads In reality, this should perhaps be described as being switchawareHDFS replicates data blocks on nodes on different racks
Provides extra data security in case of catastrophic hardware failure Rack-awareness is determined by a user-defined script
topology.script.file.name/etc/hadoop/topology.sh
Script create a file which contains a server and rack informaton:============10.0.0.11 /rack110.0.0.12 /rack110.0.0.13 /rack110.0.0.15 /rack210.0.0.16 /rack210.0.0.17 /rack210.0.0.19 /rack310.0.0.20 /rack310.0.0.21 /rack3=============
Rack-aware Script
Datacenter
HDFS File Permissions
Files in HDFS have an owner, a group, and permissions
Very similar to Unix file permissionsHDFS permissions are designed to stop good people doing foolish things
What Is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two developer-created phases
Map Reduce In between Map and Reduce is the shuffle and sort
Sends data from the Mappers to the Reducers
What Is MapReduce?
MapReduce: Basic Concepts
Each Mapper processes a single input split from HDFS
Hadoop passes the developers Map code one record at a time
Each record has a key and a value
Intermediate data is written by the Mapper to local disk
During the shuffle and sort phase, all the values associated with the same intermediate key are transferred to the same Reducer
The developer specifies the number of Reducers Reducer is passed each key and a list of all its values Keys are passed in sorted order
Output from the Reducers is written to HDFS
MapReduce: A Simple Example
MapReduce: A Simple Example
MapReduce: A Simple Example
Some MapReduce Terminology
* A user runs a client program on a client computer* The client program submits a job to Hadoop The job consists of a mapper, a reducer, and a list of inputs* The job is sent to the JobTracker* Each Slave Node runs a process called the TaskTracker* The JobTracker instructs TaskTrackers to run and monitor tasks A Map or Reduce over a piece of data is a single task* A task attempt is an instance of a task running on a slave node Task attempts can fail, in which case they will be restarted (more later) There will be at least as many task attempts as there are tasks which need to be performed
Aside: The Job Submission Process
When a job is submitted, the following happens: The client requests and receives a new unique Job ID from the JobTracker (includes JobTracker start time and a sequence number) The client calculates the input splits for the job How the input data will be split up between Mappers The client turns the job configuration information into an XML file The client places the XML file and the job jar into a temporary directory in HDFS (the Job ID is included in the path) The client contacts the JobTracker with the location of the XML and jar files, and the list of input splits The JobTracker takes over the job from this point on
MapReduce: High Level
MapReduce Failure Recovery
Task processes send heartbeats to the TaskTrackerTaskTrackers send heartbeats to the JobTrackerAny task that fails to report in 10 minutes is assumed to have failed Its JVM is killed by the TaskTrackerAny task that throws an exception is said to have failedFailed tasks are reported to the JobTracker by the TaskTrackerThe JobTracker reschedules any failed tasks It tries to avoid rescheduling the task on the same TaskTracker where it previously failed If a task fails four times, the whole job fails
MapReduce Failure Recovery
Any TaskTracker that fails to report in 10 minutes is assumed to have crashed All tasks on the node are restarted elsewhere Any TaskTracker reporting a high number of failed tasks is blacklisted, to prevent the node from blocking the entire job There is also a global blacklist, for TaskTrackers which fail on multiple jobs.
The JobTracker manages the state of each job Partial results of failed tasks are ignored
The Apache Hadoop Project
Hadoop is a top-level Apache project Created and managed under the auspices of the Apache Software FoundationSeveral other projects exist that rely on some or all of Hadoop Typically either both HDFS and MapReduce, or just HDFSEcosystem projects are often also top-level Apache projects Some are Apache incubator projects Some are not managed by the Apache Software FoundationEcosystem projects include Hive, Pig, Sqoop, Flume, HBase,Oozie,
Hive
Hive is a high-level abstraction on top of MapReduce Initially created by a team at Facebook Avoids having to write Java MapReduce code Data in HDFS is queried using a language very similar to SQL Known as HiveQLHiveQL queries are turned into MapReduce jobs by the Hive interpreter Tables are just directories of files stored in HDFS A Hive Metastore contains information on how to map a file to a table structure
Planning Your Hadoop Cluster
* What issues to consider when planning your Hadoop cluster 1. What types of hardware are typically used for Hadoop nodes 2. How to optimally configure your network topology 3. How to select the right operating system and Hadoop distribution
Cluster Growth Based on Storage Capacity
Basing your cluster growth on storage capacity is often a good method to use
Example:
Data grows by approximately 1TB per week HDFS set up to replicate each block three times Therefore, 3TB of extra storage space required per week Plus some overhead say, 30% Assuming machines with 4 x 1TB hard drives, this equates to a new machine required each week Alternatively: Two years of data 100TB will require approximately 100 machines
Classifying Nodes
Nodes can be classified as either slave nodes or master nodes
Slave node runs DataNode plus TaskTracker daemons
Master node runs either a NameNode daemon, a Secondary NameNode Daemon, or a JobTracker daemon
On smaller clusters, NameNode and JobTracker are often run on the same machine Sometimes even Secondary NameNode is on the same machine as the NameNode and JobTracker Important that at least one copy of the NameNodes metadata is stored on a separate machine (see later)
Slave Nodes: Recommended Configuration
Typical base configuration for a slave Node
4 x 1TB or 2TB hard drives, in a JBOD* configuration Do not use RAID! (See later) 2 x Quad-core CPUs 24-32GB RAM Gigabit EthernetMultiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to work well for many types of applications
Especially those that are I/O bound
Slave Nodes: More Details (CPU)
Quad-core CPUs are now standard
Hex-core CPUs are becoming more prevalent
But are more expensiveHyper-threading should be enabled
Hadoop nodes are seldom CPU-bound
They are typically disk- and network-I/O bound Therefore, top-of-the-range CPUs are usually not necessary
Slave Nodes: More Details (RAM)
Slave node configuration specifies the maximum number of Map and Reduce tasks that can run simultaneously on that node
Each Map or Reduce task will take 1GB to 2GB of RAM
Slave nodes should not be using virtual memory
Ensure you have enough RAM to run all tasks, plus overhead for the DataNode and TaskTracker daemons, plus the operating system
Rule of thumb:
Total number of tasks = 1.5 x number of processor cores-- This is a starting point, and should not be taken as a definitive setting for all clusters
Slave Nodes: More Details (Disk)
In general, more spindles (disks) is better
In practice, we see anywhere from four to 12 disks per node
Use 3.5" disks
Faster, cheaper, higher capacity than 2.5" disks7,200 RPM SATA drives are fine
No need to buy 15,000 RPM drives8 x 1.5TB drives is likely to be better than 6 x 2TB drives
Different tasks are more likely to be accessing different disksA good practical maximum is 24TB per slave node
More than that will result in massive network traffic if a node dies and block re-replication must take place
Slave Nodes: Why Not RAID?
Slave Nodes do not benefit from using RAID* storage
HDFS provides built-in redundancy by replicating blocks across multiple nodes RAID striping (RAID 0) is actually slower than the JBOD configuration used by HDFS RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array Disk operations on JBOD are independent, so the average speed is greater than that of the slowest disk One test by Yahoo showed JBOD performing between 10% and 30% faster than RAID 0, depending on the operations being performed
What About Virtualization?
Virtualization is usually not worth considering
Multiple virtual nodes per machine hurts performance Hadoop runs optimally when it can use all the disks at once
What About Blade Servers?Blade servers are not recommended Failure of a blade chassis results in many nodes being unavailable Individual blades usually have very limited hard disk capacity Network interconnection between the chassis and top-of-rack switch can become a bottleneck
Master Nodes: Single Points of Failure
Slave nodes are expected to fail at some point
This is an assumption built into Hadoop NameNode will automatically re-replicate blocks that were on the failed node to other nodes in the cluster, retaining the 3x replication requirement JobTracker will automatically re-assign tasks that were running on failed nodes Master nodes are single points of failure
If the NameNode goes down, the cluster is inaccessible If the JobTracker goes down, no jobs can run on the cluster All currently running jobs will fail Spend more money on your master nodes!
Master Node Hardware Recommendations
Carrier-class hardware
Not commodity hardwareDual power supplies
Dual Ethernet cards
Bonded to provide failoverRAIDed hard drives
At least 32GB of RAM
General Network Considerations
Hadoop is very bandwidth-intensive!
Often, all nodes are communicating with each other at the same timeUse dedicated switches for your Hadoop cluster
Nodes are connected to a top-of-rack switch
Nodes should be connected at a minimum speed of 1Gb/sec
For clusters where large amounts of intermediate data is generated, consider 10Gb/sec connections
Expensive Alternative: bond two 1Gb/sec connections to each node
General Network Considerations (contd)
Racks are interconnected via core switches
Core switches should connect to top-of-rack switches at 10Gb/ sec or faster
Beware of over-subscription in top-of-rack and core switches
Consider bonded Ethernet to mitigate against failure
Consider redundant top-of-rack and core switches
Operating System Recommendations
Choose an OS youre comfortable administering
CentOS: geared towards servers rather than individual workstations
Conservative about package versions Very widely used in productionRedHat Enterprise Linux (RHEL): RedHat-supported analog to CentOS
Includes support contracts, for a priceIn production, we often see a mixture of RHEL and CentOS machines
Often RHEL on master nodes, CentOS on slaves
Configuring The System
Do not use Linuxs LVM (Logical Volume Manager) to make all your disks appear as a single volume
As with RAID 0, this limits speed to that of the slowest disk
Check the machines BIOS* settings
BIOS settings may not be configured for optimal performance For example, if you have SATA drives make sure IDE emulation is not enabledTest disk I/O speed with hdparm -t
Example:hdparm -t /dev/sda1
You should see speeds of 70MB/sec or more Anything less is an indication of possible problems
Configuring The System
Hadoop has no specific disk partitioning requirements Use whatever partitioning system makes sense to youMount disks with the noatime optionCommon directory structure for data mount points:/data//dfs/nn/data//dfs/dn/data//dfs/snn/data//mapred/localReduce the swappiness of the system Set vm.swappiness to 0 or 5 in /etc/sysctl.conf
Filesystem Considerations
Cloudera recommends the ext3 and ext4 filesystems
ext4 is now becoming more commonly usedXFS provides some performance benefit during kickstart
It formats in 0 seconds, vs several minutes for each disk with ext3XFS has some performance issues
Slow deletes in some versions Some performance improvements are available; see e.g.,http://everything2.com/index.pl?node_id=1479435 Some versions had problems when a machine runs out of memory
Operating System Parameters
Increase the nofile ulimit for the mapred and hdfs users to at least 32K
Setting is in /etc/security/limits.confDisable IPv6
Disable SELinux
Install and configure the ntp daemon
Ensures the time on all nodes is synchronized Important for HBase Useful when using logs to debug problems
Java Virtual Machine (JVM) Recommendations
Always use the official Oracle JDK (http://java.com/)
Hadoop is complex software, and often exposes bugs in other JDK implementationsVersion 1.6 is required
Avoid 1.6.0u18 This version had significant bugsHadoop is not yet production-tested with Java 7 (1.7)
Recommendation: dont upgrade to a new version as soon as it is released
Wait until it has been tested for some time
Cloudara Manager
For easy installation
Cloudera has released Cloudera Manager (CM), a tool for easy deployment and configuration of Hadoop clusters
The free version, Cloudera Manager Free Edition, can manage up to 50 nodes
The version supplied with Cloudera Enterprise supports an unlimited number of nodes
Using Cloudera Manager Free Edition
Typical Configuration Parameters
Hadoop's Configuration Files
Each machine in the Hadoop cluster has its own set of configuration files
Configuration files all reside in Hadoops conf directory
Typically /etc/hadoop/confPrimary configuration files are written in XML
Sample Configuration File
Sample configuration file (mapred-site.xml)
mapred.job.trackerlocalhost:8021
Core-site.xml
hdfs-site.xml
The single most important configuration value on your entire cluster, set on the NameNode:
* Loss of the NameNodes metadata will result in the effective loss of all the data on the cluster Although the blocks will remain, there is no way of reconstructing the original files without the metadata* This must be at least two disks (or a RAID volume) on the NameNode, plus an NFS mount elsewhere on the network Failure to set this correctly will result in eventual loss of your clusters data
Mapred-site.xml
Additional Configuration Files
There are several more configuration files in /etc/hadoop/conf
hadoop-env.sh: environment variables for Hadoop daemons HDFS and MapReduce include/exclude files* Controls who can connect to the NameNode and JobTracker masters, slaves: hostname lists for ssh control hadoop-policy.xml: Access control policies log4j.properties: logging (covered later in the course) fair-scheduler.xml: Scheduler (covered later in the course) hadoop-metrics.properties: Monitoring (covered later inthe course)
Environment Setup: hadoop-env.sh
HADOOP_HEAPSIZE Controls the heap size for Hadoop daemons Default 1GB Comment this out, and set the heap for individual daemons HADOOP_NAMENODE_OPTS Java options for the NameNode At least 4GB: -Xmx4g HADOOP_JOBTRACKER_OPTS Java options for the JobTracker At least 4GB: -Xmx4g HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS Set to 1GB each: -Xmx1g
Host 'include' and 'exclude' Files
Optionally, specify dfs.hosts in hdfs-site.xml to point to a file listing hosts which are allowed to connect to the NameNode and act as DataNodes
Similarly, mapred.hosts points to a file which lists hosts allowedto connect as TaskTrackers Both files are optional
If omitted, any host may connect and act as a DataNode/ TaskTracker This is a possible security/data integrity issueNameNode can be forced to reread the dfs.hosts file with
hadoop dfsadmin -refreshNodes No such command for the JobTracker, which has to be restarted to re-read the mapred.hosts file, so many System Administrators only create a dfs.hosts file
Managing and Scheduling Jobs
Displaying Running Jobs
To view all jobs running on the cluster, use
# hadoop job list
Displaying All Jobs
To display all jobs including completed jobs, use
# hadoop job -list all
Killing a Job
It is important to note that once a user has submitted a job, they can not stop it just by hitting CTRL-C on their terminal
This stops job output appearing on the users console The job is still running on the cluster!
Killing a Job
To kill a job use hadoop job -kill
Demo!!!
Reference:
1. Cloudera.com
2. Bradhedlund.com
???
Click to edit the title text formatClick to edit Master title style
14/12/13
Click to edit the outline text formatSecond Outline LevelThird Outline LevelFourth Outline LevelFifth Outline LevelSixth Outline Level
Seventh Outline LevelClick to edit Master text styles
Second level
Third level
Fourth level
Fifth level
Click to edit the title text formatClick to edit Master title style
14/12/13