hadoop admin

BalajiRajan
Meetup.com/DevOps-Bangalore

[email protected] / balajirajan.com

# 1k => 1000 bytes
# 1kb => 1024 bytes
# 1m => 1000000 bytes
# 1mb => 1024*1024 bytes
# 1g => 1000000000 bytes
# 1gb => 1024*1024*1024 bytes
# 1t => 1000000000000 bytes
# 1tb => 1024*1024*1024*1024 bytes
# Beyond that: petabytes, exabytes, zettabytes, etc.

Max data in memory (RAM): 64GB
Max data per computer (disk): 24TB
Data processed by Google every month: 400PB in 2007
Average job size: 180GB
Time to read 180GB of data sequentially off a single disk drive: approximately 45 minutes

Some Numbers......

Data Access Speed is the Bottleneck

We can process data very quickly, but we can only read/write it very slowly
Solution: parallel reads
  1 HDD = 75MB/sec
  1,000 HDDs = 75GB/sec
  Far more acceptable
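The arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope calculation, assuming decimal units (1GB = 1000MB) and data spread evenly across all drives; the function name is illustrative. With those assumptions, 180GB on one 75MB/sec drive comes out at about 40 minutes, in the same range as the roughly 45 minutes quoted above.

```python
# Back-of-the-envelope read times using the figures quoted on the slides.
MB_PER_GB = 1000  # assumption: decimal units, to keep the arithmetic simple


def read_time_seconds(data_gb, disks, mb_per_sec_per_disk=75.0):
    """Time to read data_gb when it is spread evenly over `disks` drives read in parallel."""
    total_mb = data_gb * MB_PER_GB
    return total_mb / (disks * mb_per_sec_per_disk)


single = read_time_seconds(180, disks=1)
parallel = read_time_seconds(180, disks=1000)
print(f"1 disk:      {single / 60:.0f} minutes")
print(f"1,000 disks: {parallel:.1f} seconds")
```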

Moving to a Cluster of Machines

* In the late 1990s, Google decided to design its architecture using clusters of low-cost machines
  Rather than fewer, more powerful machines
* Creating an architecture around low-cost, unreliable hardware presents a number of challenges

System Requirements

* System should support partial failure
* System should support data recoverability
* System should be consistent
* System should be scalable

Hadoop's Origins

Google created an architecture which answers these (and other) requirements
Released two White Papers
1. 2003: Description of the Google File System (GFS)
   A method for storing data in a distributed, reliable fashion
2. 2004: Description of distributed MapReduce
   A method for processing data in a parallel fashion

So

Hadoop was based on these White Papers

Hadoop Cluster

HDFS Features

* Operates on top of an existing filesystem
* Files are stored as blocks
  Much larger than for most filesystems
  Default is 64MB
* Provides reliability through replication
  Each block is replicated across multiple DataNodes
  Default replication factor is 3
* Single NameNode daemon stores metadata and co-ordinates access
  Provides simple, centralized management
* Blocks are stored on slave nodes
  Running the DataNode daemon

HDFS: Block Diagram

The NameNode

The NameNode stores all metadata

  Information about file locations in HDFS
  Information about file ownership and permissions
  Names of the individual blocks
  Locations of the blocks

Metadata is stored on disk and read when the NameNode daemon starts up

  Filename is fsimage

When changes to the metadata are required, these are made in RAM

  Changes are also written to a log file on disk called edits
  Full details later

The NameNode: Memory Allocation

When the NameNode is running, all metadata is held in RAM for fast response

Each item consumes 150-200 bytes of RAM

Items:

  Filename, permissions, etc.
  Block information for each block

The NameNode: Memory Allocation

Why HDFS prefers fewer, larger files:

Consider 1GB of data, HDFS block size 128MB

Stored as 1 x 1GB file
  Name: 1 item
  Blocks: 8 x 3 = 24 items
  Total items: 25

Stored as 1000 x 1MB files
  Names: 1000 items
  Blocks: 1000 x 3 = 3000 items
  Total items: 4000
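The item counts above can be reproduced with a short calculation. This is a sketch that counts items the same way the slide does (one per file name, one per block replica); the function name and the use of the upper figure of 200 bytes per item are choices made for illustration.

```python
import math


def namenode_items(total_mb, file_mb, block_mb=128, replication=3):
    """Count metadata items: one per file name plus one per block replica."""
    files = math.ceil(total_mb / file_mb)
    blocks_per_file = math.ceil(file_mb / block_mb)
    return files + files * blocks_per_file * replication


one_big = namenode_items(total_mb=1024, file_mb=1024)  # 1 x 1GB file
many_small = namenode_items(total_mb=1000, file_mb=1)  # 1000 x 1MB files

BYTES_PER_ITEM = 200  # upper end of the 150-200 byte range quoted earlier
for label, items in (("1 x 1GB file", one_big), ("1000 x 1MB files", many_small)):
    print(f"{label}: {items} items, about {items * BYTES_PER_ITEM} bytes of NameNode RAM")
```

The same amount of data costs 160 times more NameNode memory when it is stored as small files.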

The Slave Nodes

Actual contents of the files are stored as blocks on the slave nodes

Blocks are simply files on the slave node's underlying filesystem
  Named blk_xxxxxxx
  Nothing on the slave node provides information about what underlying file the block is a part of
  That information is only stored in the NameNode's metadata
Each block is stored on multiple different nodes for redundancy
  Default is three replicas
Each slave node runs a DataNode daemon
  Controls access to the blocks
  Communicates with the NameNode

Secondary Name Node

The Secondary NameNode:

The Secondary NameNode is not a failover NameNode!

It performs memory-intensive administrative functions for the NameNode
  NameNode keeps information about files and blocks (the metadata) in memory
  NameNode writes metadata changes to an editlog
  Secondary NameNode periodically combines a prior filesystem snapshot and editlog into a new snapshot
  New snapshot is transmitted back to the NameNode
Secondary NameNode should run on a separate machine in a large installation

It requires as much RAM as the NameNode

Writing Files to HDFS

Anatomy of a File Write

1. Client connects to the NameNode

2. NameNode places an entry for the file in its metadata, returns the block name and list of DataNodes to the client

3. Client connects to the first DataNode and starts sending data

4. As data is received by the first DataNode, it connects to the second and starts sending data

5. Second DataNode similarly connects to the third

6. ack packets from the pipeline are sent back to the client

7. Client reports to the NameNode when the block is written

Reading Files from HDFS

Anatomy of a File Read

Client connects to the NameNode

NameNode returns the name and locations of the first few blocks of the file

Block locations are returned closest-first.

Client connects to the first of the DataNodes, and reads the block

If the DataNode fails during the read, the client will seamlessly connect to the next one in the list to read the block

The NameNode Is Not A Bottleneck

Note: the data never travels via the NameNode

  For writes
  For reads
  During re-replication

Dealing With Data Corruption

As the DataNode is reading the block, it also calculates the checksum.

Live checksum is compared to the checksum created when the block was stored.

If they differ, the client reads from the next DataNode in the list

The NameNode is informed that a corrupted version of the block has been found. The NameNode will then re-replicate that block elsewhere.

The DataNode verifies the checksums for blocks on a regular basis to avoid bit rot

Default is every three weeks after the block was created
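The read-side check described above can be sketched as follows. This is a simplified model, not HDFS code: it assumes one CRC32 checksum per block (computed here with Python's zlib), represents each replica as a byte string, and uses a hypothetical function name. It returns the first replica whose live checksum matches the stored one, plus the positions of the corrupt replicas that would be reported to the NameNode.

```python
import zlib
from typing import List, Optional, Tuple


def read_block(replicas: List[bytes], stored_checksum: int) -> Tuple[Optional[bytes], List[int]]:
    """Try replicas in order; return the first good copy and the indexes of corrupt ones."""
    corrupt = []
    for index, data in enumerate(replicas):
        if zlib.crc32(data) == stored_checksum:
            return data, corrupt
        # Checksum mismatch: remember this replica and move to the next DataNode.
        corrupt.append(index)
    return None, corrupt


good = b"contents of one block"
stored = zlib.crc32(good)  # checksum recorded when the block was written
data, to_report = read_block([b"contents of one blocK", good, good], stored)
print(data == good, to_report)
```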

Data Reliability and Recovery

DataNodes send heartbeats to the NameNode

  Every three seconds
After a period without any heartbeats, a DataNode is assumed to be lost
  NameNode determines which blocks were on the lost node
  NameNode finds other DataNodes with copies of these blocks
  These DataNodes are instructed to copy the blocks to other nodes
  Three-fold replication is actively maintained

Hadoop is Rack-aware


Hadoop understands the concept of rack awareness

  The idea of where nodes are located, relative to one another
  Helps the JobTracker (JT) to assign tasks to nodes closest to the data
  Helps the NameNode (NN) determine the closest block to a client during reads
  In reality, this should perhaps be described as being switch-aware
HDFS replicates data blocks on nodes on different racks
  Provides extra data security in case of catastrophic hardware failure
Rack-awareness is determined by a user-defined script

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>

The script creates a file which contains server and rack information:
============
10.0.0.11 /rack1
10.0.0.12 /rack1
10.0.0.13 /rack1
10.0.0.15 /rack2
10.0.0.16 /rack2
10.0.0.17 /rack2
10.0.0.19 /rack3
10.0.0.20 /rack3
10.0.0.21 /rack3
=============
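A topology script receives one or more host names or IP addresses as arguments and prints one rack path per argument. The following is a minimal sketch of such a script, written in Python rather than shell and with the mapping above hard-coded in a dictionary; a real script would more likely read the mapping from a data file. The `/default-rack` fallback for unknown hosts is an assumption made here.

```python
#!/usr/bin/env python3
import sys

# Server-to-rack mapping taken from the listing above.
RACKS = {
    "10.0.0.11": "/rack1", "10.0.0.12": "/rack1", "10.0.0.13": "/rack1",
    "10.0.0.15": "/rack2", "10.0.0.16": "/rack2", "10.0.0.17": "/rack2",
    "10.0.0.19": "/rack3", "10.0.0.20": "/rack3", "10.0.0.21": "/rack3",
}
DEFAULT_RACK = "/default-rack"  # assumption: fallback for hosts not in the mapping


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hosts arrive as command-line arguments; rack paths go to standard output.
    print(" ".join(resolve(sys.argv[1:])))
```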

Rack-aware Script

Datacenter

HDFS File Permissions

Files in HDFS have an owner, a group, and permissions

  Very similar to Unix file permissions
HDFS permissions are designed to stop good people doing foolish things

What Is MapReduce?

MapReduce is a method for distributing a task across multiple nodes

Each node processes data stored on that node

Consists of two developer-created phases

  Map
  Reduce
In between Map and Reduce is the shuffle and sort

Sends data from the Mappers to the Reducers


MapReduce: Basic Concepts

Each Mapper processes a single input split from HDFS

Hadoop passes the developer's Map code one record at a time

Each record has a key and a value

Intermediate data is written by the Mapper to local disk

During the shuffle and sort phase, all the values associated with the same intermediate key are transferred to the same Reducer

The developer specifies the number of Reducers
Reducer is passed each key and a list of all its values
  Keys are passed in sorted order

Output from the Reducers is written to HDFS

MapReduce: A Simple Example
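A minimal sketch of the three phases, assuming a word-count job. It runs in a single Python process and only imitates the data flow: the function names are illustrative and this is not the Hadoop API. The Mapper receives one record (key and value) at a time, the shuffle and sort groups values by intermediate key, and the Reducer receives each key with the list of all its values, in sorted key order.

```python
from collections import defaultdict


def map_fn(key, line):
    """Mapper: key is the record position (unused), value is one line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer: receives one key and the list of all values for that key."""
    yield word, sum(counts)


def run_job(lines):
    # Map phase: one record at a time.
    intermediate = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            # Shuffle: all values for the same key end up together.
            intermediate[key].append(value)
    # Sort: keys are handed to the reducer in sorted order.
    output = []
    for key in sorted(intermediate):
        output.extend(reduce_fn(key, intermediate[key]))
    return output


result = run_job(["the cat sat", "the dog sat"])
print(result)
```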



Some MapReduce Terminology

* A user runs a client program on a client computer
* The client program submits a job to Hadoop
  The job consists of a mapper, a reducer, and a list of inputs
* The job is sent to the JobTracker
* Each Slave Node runs a process called the TaskTracker
* The JobTracker instructs TaskTrackers to run and monitor tasks
  A Map or Reduce over a piece of data is a single task
* A task attempt is an instance of a task running on a slave node
  Task attempts can fail, in which case they will be restarted (more later)
  There will be at least as many task attempts as there are tasks which need to be performed

Aside: The Job Submission Process

When a job is submitted, the following happens:
1. The client requests and receives a new unique Job ID from the JobTracker (includes JobTracker start time and a sequence number)
2. The client calculates the input splits for the job
   How the input data will be split up between Mappers
3. The client turns the job configuration information into an XML file
4. The client places the XML file and the job jar into a temporary directory in HDFS (the Job ID is included in the path)
5. The client contacts the JobTracker with the location of the XML and jar files, and the list of input splits
6. The JobTracker takes over the job from this point on

MapReduce: High Level

MapReduce Failure Recovery

Task processes send heartbeats to the TaskTracker
TaskTrackers send heartbeats to the JobTracker
Any task that fails to report in 10 minutes is assumed to have failed
  Its JVM is killed by the TaskTracker
Any task that throws an exception is said to have failed
Failed tasks are reported to the JobTracker by the TaskTracker
The JobTracker reschedules any failed tasks
  It tries to avoid rescheduling the task on the same TaskTracker where it previously failed
If a task fails four times, the whole job fails

MapReduce Failure Recovery

Any TaskTracker that fails to report in 10 minutes is assumed to have crashed
  All tasks on the node are restarted elsewhere
Any TaskTracker reporting a high number of failed tasks is blacklisted, to prevent the node from blocking the entire job
  There is also a global blacklist, for TaskTrackers which fail on multiple jobs

The JobTracker manages the state of each job
  Partial results of failed tasks are ignored
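The rescheduling rule described above (avoid the TaskTracker where the task already failed, give up after four failures) can be sketched as a small loop. This is a toy model under stated assumptions, not JobTracker code: `attempt_fn` is a hypothetical callback that returns True when an attempt succeeds, and the first eligible TaskTracker is always chosen.

```python
MAX_ATTEMPTS = 4  # "If a task fails four times, the whole job fails"


def run_with_retries(task, trackers, attempt_fn):
    """Return (succeeded, trackers_where_the_task_failed)."""
    failed_on = []
    for _ in range(MAX_ATTEMPTS):
        # Prefer TaskTrackers where this task has not failed yet;
        # fall back to the full list if it has failed everywhere.
        candidates = [t for t in trackers if t not in failed_on] or trackers
        tracker = candidates[0]
        if attempt_fn(task, tracker):
            return True, failed_on
        failed_on.append(tracker)
    return False, failed_on


# Example: the task only succeeds on the third TaskTracker.
ok, failures = run_with_retries("map_0001", ["tt1", "tt2", "tt3"],
                                lambda task, tracker: tracker == "tt3")
print(ok, failures)
```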

The Apache Hadoop Project

Hadoop is a top-level Apache project
  Created and managed under the auspices of the Apache Software Foundation
Several other projects exist that rely on some or all of Hadoop
  Typically either both HDFS and MapReduce, or just HDFS
Ecosystem projects are often also top-level Apache projects
  Some are Apache incubator projects
  Some are not managed by the Apache Software Foundation
Ecosystem projects include Hive, Pig, Sqoop, Flume, HBase, and Oozie

Hive

Hive is a high-level abstraction on top of MapReduce
  Initially created by a team at Facebook
  Avoids having to write Java MapReduce code
Data in HDFS is queried using a language very similar to SQL
  Known as HiveQL
HiveQL queries are turned into MapReduce jobs by the Hive interpreter
  Tables are just directories of files stored in HDFS
  A Hive Metastore contains information on how to map a file to a table structure

Planning Your Hadoop Cluster

* What issues to consider when planning your Hadoop cluster
  1. What types of hardware are typically used for Hadoop nodes
  2. How to optimally configure your network topology
  3. How to select the right operating system and Hadoop distribution

Cluster Growth Based on Storage Capacity

Basing your cluster growth on storage capacity is often a good method to use

Example:

Data grows by approximately 1TB per week
HDFS set up to replicate each block three times
Therefore, 3TB of extra storage space required per week
  Plus some overhead, say 30%
Assuming machines with 4 x 1TB hard drives, this equates to a new machine required each week
Alternatively: two years of data (100TB) will require approximately 100 machines
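The sizing example works out as follows. This is a sketch using the figures above (3x replication, 30% overhead, 4 x 1TB drives per machine); the function name and parameter names are illustrative.

```python
import math


def machines_needed(raw_tb, replication=3, overhead=0.30, disks_per_node=4, tb_per_disk=1.0):
    """Machines required to hold raw_tb of data after replication and overhead."""
    required_tb = raw_tb * replication * (1 + overhead)
    return math.ceil(required_tb / (disks_per_node * tb_per_disk))


weekly = machines_needed(1)       # 1TB x 3 x 1.3 = 3.9TB, which fits on one 4TB machine
two_years = machines_needed(100)  # 100TB x 3 x 1.3 = 390TB, close to 100 machines
print(weekly, two_years)
```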

Classifying Nodes

Nodes can be classified as either slave nodes or master nodes

Slave node runs DataNode plus TaskTracker daemons

Master node runs either a NameNode daemon, a Secondary NameNode Daemon, or a JobTracker daemon

On smaller clusters, NameNode and JobTracker are often run on the same machine
  Sometimes even Secondary NameNode is on the same machine as the NameNode and JobTracker
  Important that at least one copy of the NameNode's metadata is stored on a separate machine (see later)

Slave Nodes: Recommended Configuration

Typical base configuration for a slave node

  4 x 1TB or 2TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration
    Do not use RAID! (See later)
  2 x Quad-core CPUs
  24-32GB RAM
  Gigabit Ethernet

Multiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to work well for many types of applications

Especially those that are I/O bound

Slave Nodes: More Details (CPU)

Quad-core CPUs are now standard

Hex-core CPUs are becoming more prevalent
  But are more expensive
Hyper-threading should be enabled

Hadoop nodes are seldom CPU-bound

  They are typically disk- and network-I/O bound
  Therefore, top-of-the-range CPUs are usually not necessary

Slave Nodes: More Details (RAM)

Slave node configuration specifies the maximum number of Map and Reduce tasks that can run simultaneously on that node

Each Map or Reduce task will take 1GB to 2GB of RAM

Slave nodes should not be using virtual memory

Ensure you have enough RAM to run all tasks, plus overhead for the DataNode and TaskTracker daemons, plus the operating system

Rule of thumb:

Total number of tasks = 1.5 x number of processor cores
  This is a starting point, and should not be taken as a definitive setting for all clusters
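Applying the rule of thumb to the base configuration gives a rough RAM budget. This is a sketch: the 1GB to 2GB per task comes from the text above, while the allowances of 2GB for the DataNode and TaskTracker daemons and 2GB for the operating system are assumptions chosen for illustration.

```python
def slave_node_plan(cores, gb_per_task=(1, 2), daemon_gb=2, os_gb=2):
    """Return (task slots, minimum RAM in GB, maximum RAM in GB) for one slave node."""
    tasks = int(1.5 * cores)  # rule of thumb: 1.5 tasks per core
    low = tasks * gb_per_task[0] + daemon_gb + os_gb
    high = tasks * gb_per_task[1] + daemon_gb + os_gb
    return tasks, low, high


tasks, low, high = slave_node_plan(cores=8)  # 2 x quad-core CPUs
print(f"{tasks} task slots, {low}-{high}GB RAM")
```

For two quad-core CPUs this gives 12 task slots and 16 to 28GB of RAM, which lines up with the 24-32GB base configuration recommended earlier.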

Slave Nodes: More Details (Disk)

In general, more spindles (disks) is better

In practice, we see anywhere from four to 12 disks per node

Use 3.5" disks
  Faster, cheaper, higher capacity than 2.5" disks
7,200 RPM SATA drives are fine
  No need to buy 15,000 RPM drives
8 x 1.5TB drives is likely to be better than 6 x 2TB drives
  Different tasks are more likely to be accessing different disks
A good practical maximum is 24TB per slave node
  More than that will result in massive network traffic if a node dies and block re-replication must take place

Slave Nodes: Why Not RAID?

Slave nodes do not benefit from using RAID (Redundant Array of Inexpensive Disks) storage

  HDFS provides built-in redundancy by replicating blocks across multiple nodes
  RAID striping (RAID 0) is actually slower than the JBOD configuration used by HDFS
    RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array
    Disk operations on JBOD are independent, so the average speed is greater than that of the slowest disk
    One test by Yahoo showed JBOD performing between 10% and 30% faster than RAID 0, depending on the operations being performed

What About Virtualization?

Virtualization is usually not worth considering

  Multiple virtual nodes per machine hurts performance
  Hadoop runs optimally when it can use all the disks at once

What About Blade Servers?

Blade servers are not recommended
  Failure of a blade chassis results in many nodes being unavailable
  Individual blades usually have very limited hard disk capacity
  Network interconnection between the chassis and top-of-rack switch can become a bottleneck

Master Nodes: Single Points of Failure

Slave nodes are expected to fail at some point

  This is an assumption built into Hadoop
  NameNode will automatically re-replicate blocks that were on the failed node to other nodes in the cluster, retaining the 3x replication requirement
  JobTracker will automatically re-assign tasks that were running on failed nodes

Master nodes are single points of failure

  If the NameNode goes down, the cluster is inaccessible
  If the JobTracker goes down, no jobs can run on the cluster
    All currently running jobs will fail

Spend more money on your master nodes!

Master Node Hardware Recommendations

Carrier-class hardware
  Not commodity hardware
Dual power supplies
Dual Ethernet cards
  Bonded to provide failover
RAIDed hard drives
At least 32GB of RAM

General Network Considerations

Hadoop is very bandwidth-intensive!

  Often, all nodes are communicating with each other at the same time
Use dedicated switches for your Hadoop cluster

Nodes are connected to a top-of-rack switch

Nodes should be connected at a minimum speed of 1Gb/sec

For clusters where large amounts of intermediate data are generated, consider 10Gb/sec connections
  Expensive
  Alternative: bond two 1Gb/sec connections to each node

General Network Considerations (cont'd)

Racks are interconnected via core switches

Core switches should connect to top-of-rack switches at 10Gb/sec or faster

Beware of over-subscription in top-of-rack and core switches

Consider bonded Ethernet to mitigate against failure

Consider redundant top-of-rack and core switches

Operating System Recommendations

Choose an OS you're comfortable administering

CentOS: geared towards servers rather than individual workstations

  Conservative about package versions
  Very widely used in production
Red Hat Enterprise Linux (RHEL): Red Hat-supported analog to CentOS
  Includes support contracts, for a price
In production, we often see a mixture of RHEL and CentOS machines
  Often RHEL on master nodes, CentOS on slaves

Configuring The System

Do not use Linux's LVM (Logical Volume Manager) to make all your disks appear as a single volume
  As with RAID 0, this limits speed to that of the slowest disk
Check the machine's BIOS settings
  BIOS settings may not be configured for optimal performance
  For example, if you have SATA drives, make sure IDE emulation is not enabled
Test disk I/O speed with hdparm -t
  Example: hdparm -t /dev/sda1
  You should see speeds of 70MB/sec or more
  Anything less is an indication of possible problems
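When many disks need checking, the hdparm result can be compared against the 70MB/sec threshold automatically. This is a sketch that parses the "= N MB/sec" figure from hdparm's output; the sample line and its numbers are illustrative rather than taken from a real run, and the function names are hypothetical.

```python
import re
from typing import Optional

# Illustrative output line; the figures are made up for this example.
SAMPLE = " Timing buffered disk reads: 228 MB in  3.02 seconds =  75.50 MB/sec"


def parse_hdparm_speed(output: str) -> Optional[float]:
    """Extract the MB/sec figure from hdparm -t output, or None if it is absent."""
    match = re.search(r"=\s*([\d.]+)\s*MB/sec", output)
    return float(match.group(1)) if match else None


def disk_ok(output: str, minimum_mb_per_sec: float = 70.0) -> bool:
    """True when the measured speed meets the recommended minimum."""
    speed = parse_hdparm_speed(output)
    return speed is not None and speed >= minimum_mb_per_sec


print(parse_hdparm_speed(SAMPLE), disk_ok(SAMPLE))
```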

Configuring The System

Hadoop has no specific disk partitioning requirements
  Use whatever partitioning system makes sense to you
Mount disks with the noatime option
Common directory structure for data mount points:
  /data//dfs/nn
  /data//dfs/dn
  /data//dfs/snn
  /data//mapred/local
Reduce the swappiness of the system
  Set vm.swappiness to 0 or 5 in /etc/sysctl.conf

Filesystem Considerations

Cloudera recommends the ext3 and ext4 filesystems
  ext4 is now becoming more commonly used
XFS provides some performance benefit during kickstart
  It formats in 0 seconds, vs several minutes for each disk with ext3
XFS has some performance issues
  Slow deletes in some versions
  Some performance improvements are available; see e.g., http://everything2.com/index.pl?node_id=1479435
  Some versions had problems when a machine runs out of memory

Operating System Parameters

Increase the nofile ulimit for the mapred and hdfs users to at least 32K
  Setting is in /etc/security/limits.conf
Disable IPv6
Disable SELinux
Install and configure the ntp daemon
  Ensures the time on all nodes is synchronized
  Important for HBase
  Useful when using logs to debug problems

Java Virtual Machine (JVM) Recommendations

Always use the official Oracle JDK (http://java.com/)
  Hadoop is complex software, and often exposes bugs in other JDK implementations
Version 1.6 is required
  Avoid 1.6.0u18
  This version had significant bugs
Hadoop is not yet production-tested with Java 7 (1.7)
Recommendation: don't upgrade to a new version as soon as it is released
  Wait until it has been tested for some time

Cloudera Manager

For easy installation

Cloudera has released Cloudera Manager (CM), a tool for easy deployment and configuration of Hadoop clusters

The free version, Cloudera Manager Free Edition, can manage up to 50 nodes

The version supplied with Cloudera Enterprise supports an unlimited number of nodes

Using Cloudera Manager Free Edition

Typical Configuration Parameters

Hadoop's Configuration Files

Each machine in the Hadoop cluster has its own set of configuration files

Configuration files all reside in Hadoop's conf directory
  Typically /etc/hadoop/conf
Primary configuration files are written in XML

Sample Configuration File

Sample configuration file (mapred-site.xml)

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

core-site.xml

hdfs-site.xml

The single most important configuration value on your entire cluster, set on the NameNode:

* Loss of the NameNode's metadata will result in the effective loss of all the data on the cluster
  Although the blocks will remain, there is no way of reconstructing the original files without the metadata
* This must be at least two disks (or a RAID volume) on the NameNode, plus an NFS mount elsewhere on the network
  Failure to set this correctly will result in eventual loss of your cluster's data

mapred-site.xml

Additional Configuration Files

There are several more configuration files in /etc/hadoop/conf

hadoop-env.sh: environment variables for Hadoop daemons
HDFS and MapReduce include/exclude files
  Controls who can connect to the NameNode and JobTracker
masters, slaves: hostname lists for ssh control
hadoop-policy.xml: Access control policies
log4j.properties: logging (covered later in the course)
fair-scheduler.xml: Scheduler (covered later in the course)
hadoop-metrics.properties: Monitoring (covered later in the course)

Environment Setup: hadoop-env.sh

HADOOP_HEAPSIZE
  Controls the heap size for Hadoop daemons
  Default 1GB
  Comment this out, and set the heap for individual daemons
HADOOP_NAMENODE_OPTS
  Java options for the NameNode
  At least 4GB: -Xmx4g
HADOOP_JOBTRACKER_OPTS
  Java options for the JobTracker
  At least 4GB: -Xmx4g
HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
  Set to 1GB each: -Xmx1g

Host 'include' and 'exclude' Files

Optionally, specify dfs.hosts in hdfs-site.xml to point to a file listing hosts which are allowed to connect to the NameNode and act as DataNodes
  Similarly, mapred.hosts points to a file which lists hosts allowed to connect as TaskTrackers
Both files are optional
  If omitted, any host may connect and act as a DataNode/TaskTracker
  This is a possible security/data integrity issue
NameNode can be forced to reread the dfs.hosts file with
  hadoop dfsadmin -refreshNodes
  No such command for the JobTracker, which has to be restarted to re-read the mapred.hosts file, so many System Administrators only create a dfs.hosts file

Managing and Scheduling Jobs

Displaying Running Jobs

To view all jobs running on the cluster, use

# hadoop job -list

Displaying All Jobs

To display all jobs including completed jobs, use

# hadoop job -list all

Killing a Job

It is important to note that once a user has submitted a job, they cannot stop it just by hitting CTRL-C on their terminal

  This stops job output appearing on the user's console
  The job is still running on the cluster!

Killing a Job

To kill a job, use hadoop job -kill followed by the ID of the job

Demo!!!

Reference:
1. Cloudera.com
2. Bradhedlund.com


14/12/13
