Hadoop MapReduce
DESCRIPTION
[Study Report] Study Material: http://shop.oreilly.com/product/0636920021773.do
TRANSCRIPT
Author: Tom White
Apache Hadoop committee, Cloudera
Reported by: Tzu-Li Tai
NCKU, HPDS Lab
Hadoop: The Definitive Guide, 3rd Edition
By Tom White
Published by O’Reilly Media, 2012
Referenced Chapters:
Chapter 2 – MapReduce
Chapter 6 – How MapReduce Works
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. What is MapReduce?
B. An Example: NCDC Weather DataSet
C. Without Hadoop: Analyzing with Unix
D. With Hadoop: Java MapReduce
A computation framework for distributed data processing on top of HDFS.
Consists of two phases: a Map phase and a Reduce phase.
Inherently parallel, and therefore works on very large-scale data inputs as well as small inputs (for performance testing).
Data locality optimization.
• Loops through all year files and uses awk to extract the “temperature” and “quality” fields for processing.
• A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.
• Straightforward(?): run parts of the program in parallel.
• Dividing the work into appropriately sized pieces isn’t easy.
• Coordinating multiple machines (distributed computing) is troublesome.
• This is where Hadoop and MapReduce come in!
[Diagram] (key, value) pairs → MAPPER function → Shuffle and Sort → (key, value) pairs → REDUCER function
Mapper Function in Java
Reducer Function in Java
Running the MapReduce Job in Java
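The Java source shown on these slides is not captured in this transcript. The following is a minimal sketch of the book's max-temperature example, on which these slides are based; the NCDC field offsets, class names, and new-API style follow the book's sample code and may differ from what the slides actually showed.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Mapper: extracts (year, temperature) pairs from NCDC fixed-width records.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {            // signed temperature field
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reducer: picks the maximum temperature for each year.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

// Driver: configures and runs the MapReduce job.
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();   // the book's 1.x-era code uses new Job()
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}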
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. Terminology
B. Data Flow
C. Combiner Functions
job – A unit of work that the client wants performed. It consists of the input data, the MapReduce program, and configuration information.
task – The job is run by dividing it into two types of tasks: map tasks and reduce tasks.
Two types of nodes that control the job execution process:
jobtracker –
Coordinates jobs
Schedules tasks
Keeps record of progress
tasktrackers –
Run tasks.
Send progress reports to jobtracker.
The input to a job is divided into input splits, or splits.
Each split contains several records.
The output of a reduce task is called a part.
[Diagram] The input is divided into input splits, each containing several records; each input split feeds one map task.
Deciding split size (how many splits?): a trade-off between load balancing and overhead.
A good split tends to be the size of an HDFS block (64 MB by default).
Data locality optimization – it is best to run the map task on a node where the input data resides in HDFS; this saves cluster bandwidth.
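As a small illustration of the split-size discussion above (an assumption, not something shown on the slides), the new-API FileInputFormat lets a job bound its split sizes; by default the split size works out to one HDFS block, which preserves data locality.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
  public static void configure(Job job) {
    // Cap each split at one 64 MB HDFS block so a map task can read its
    // whole split from a local replica (data locality).
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    // Avoid very small splits, which add per-task scheduling overhead.
    FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
  }
}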
Data flow for a single reduce task
The default MapReduce job comes with a single reducer; the number of reducers is set with setNumReduceTasks() on Job.
For multiple reducers, map tasks partition their output, creating one partition for each reduce task.
(key, value) records for any given key are all in a single partition.
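A minimal sketch (an assumption, not from the slides) of the partitioning behaviour just described: with several reducers configured, the default HashPartitioner sends every record with a given key to the same partition, and hence to the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitioningSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setNumReduceTasks(4);  // four partitions, one per reduce task

    // The default partitioner hashes the key; all (key, value) records
    // with the same key end up in the same partition.
    HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
    int partition = partitioner.getPartition(
        new Text("1949"), new IntWritable(111), 4);
    System.out.println("key 1949 -> partition " + partition);
  }
}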
Data flow with multiple reduce tasks
Data flow with no reduce tasks
Jobs are limited by the bandwidth available on the cluster.
The data transferred between map and reduce tasks should therefore be minimized.
A combiner function can help cut down the amount of data shuffled between the map and reduce tasks (see the sketch after the diagrams below).
[Diagram] Without a combiner function: all map output crosses the shuffle and sort (off-node data transfer, costs bandwidth) to the reduce task and then to HDFS. Higher bandwidth consumption.
[Diagram] Using a combiner function: map output is reduced locally before the shuffle and sort, so less data is transferred off-node. Lower bandwidth consumption.
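A minimal sketch of wiring a combiner into the max-temperature driver from section I (an assumption, not reproduced from the slides): because max() is commutative and associative, the reducer class can double as the combiner and pre-aggregate map output on each node before the shuffle.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature with combiner");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    // The combiner runs on map output locally, cutting shuffle traffic.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}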
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. Job submission
B. Job initialization
C. Task assignment
D. Task execution
E. Progress and status updates
F. Job completion
The client: submits the MapReduce job.
The jobtracker: coordinates the job run (class JobTracker).
The tasktrackers: run the map/reduce tasks (class TaskTracker).
The distributed filesystem, HDFS
1. Run job
waitForCompletion()
calls the submit() method on Job
creates a JobSubmitter instance
calls submitJobInternal()
2. Get new job ID
JobSubmitter asks the jobtracker for a new job ID
(by calling getNewJobId() on JobTracker)
input/output verification
Checks output specification
Computes input splits
3. Copy job resources
Job JAR file
Configuration file
Computed splits
Copied to the jobtracker’s filesystem, in a directory named after the job ID
4. Submit job
JobSubmitter tells the jobtracker that the job is ready
(by calling submitJob() on JobTracker)
5. Initialize job
the job is placed into an internal queue
the job scheduler picks it up and initializes it
an object is created to represent the job
6. Retrieve input splits
Create the list of tasks:
retrieve the computed splits
one map task for each split
create reduce tasks, the number given by setNumReduceTasks()
plus job setup and cleanup tasks
7. Heartbeat (returns task)
the TaskTracker confirms it is operational and
ready for a new task
the JobTracker assigns a new task
8. Retrieve job resources
localize job JAR
create local working directory
9. Launch and 10. Run
the TaskTracker creates a TaskRunner instance
TaskRunner launches a child JVM
the child process runs the task
Terminology
Status of a job and its tasks:
state of the job or task
progress of maps and reduces
values of the job’s counters
status message set by the user.
Progress: the proportion of the task completed.
Half of the input processed for a map task: progress = 50%.
Half of the input processed for a reduce task (the reduce side is counted as three equal phases: copy, sort, reduce): progress =
1/3 (copy phase) + 1/3 (sort phase) + 1/2 × 1/3 (half of the reduce input processed) = 5/6
Updating Hierarchy
Updating the TaskTracker:
the child sets a flag when the task is complete
every 3 s, the TaskTracker checks the flag
Updating the JobTracker:
every 5 s, the status of all tasks on the TaskTracker is sent to the JobTracker
Status update for the client
The client polls the JobTracker every second for job status.
Calling getStatus() on Job returns a JobStatus instance.
On completion of the job cleanup task, the JobTracker changes the job status to “successful”.
Job learns that the job has completed, prints a message, and returns from waitForCompletion().
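A minimal sketch (an assumption, not from the slides) of the client-side polling loop described above, using the new MapReduce API:

import org.apache.hadoop.mapreduce.Job;

public class JobMonitorSketch {
  public static void monitor(Job job) throws Exception {
    while (!job.isComplete()) {
      // Poll the framework for map/reduce progress roughly every second.
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(1000);
    }
    System.out.println(job.isSuccessful() ? "Job successful" : "Job failed");
  }
}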
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. What is YARN?
B. YARN Architecture
C. Improvement of SPOF using YARN
The next generation MapReduce: YARN – Yet Another Resource Negotiator
The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master.
[Diagram] The application master asks the resource manager for resources (1. Ask for resource); the resource manager allocates a “container” on a node manager (2. Allocate “container”).
More general than MapReduce.
Higher manageability and cluster utilization.
It is even possible to run different versions of MapReduce on the same cluster, which makes the MapReduce upgrade process more manageable.
Entities of YARN MapReduce
The client: submits the job.
The YARN resource manager: coordinates the allocation of cluster resources (class ResourceManager).
The YARN node manager(s): launch and monitor containers (class NodeManager).
The MapReduce application master: coordinates the tasks running the MapReduce job (class MRAppMaster).
The distributed filesystem, HDFS
The client gets a new application ID from the ResourceManager and submits the job by calling submitApplication().
5a. Start the container; 5b. launch the MRAppMaster.
Decide: run as an uber task? (see the sketch below)
Small job:
< 10 mappers, 1 reducer
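A sketch of the uber-task knobs matching the “small job” criteria above (an assumption; the property names are from Hadoop 2.x and the defaults can vary between releases):

import org.apache.hadoop.conf.Configuration;

public class UberTaskSketch {
  public static Configuration uberConf() {
    Configuration conf = new Configuration();
    // Let sufficiently small jobs run in the same JVM as the application master.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);     // "small": fewer than 10 mappers
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);  // and at most one reducer
    return conf;
  }
}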
Allocate containers for tasks (8)
Memory requirements are specified by the job (unlike classic MapReduce); see the sketch after this list.
Min. allocation: 1024 MB = 1 GB
Max. allocation: 10240 MB = 10 GB
The container is started by calling the NodeManager (9a),
which launches the child JVM, YarnChild (9b)
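A sketch (an assumption; property names from Hadoop 2.x) of a job stating its own memory requirements, which YARN uses when sizing the allocated containers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryRequestSketch {
  public static Job newJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);     // memory requested per map container
    conf.setInt("mapreduce.reduce.memory.mb", 4096);  // memory requested per reduce container
    return Job.getInstance(conf, "memory-request-sketch");
  }
}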
YARN: Task (in child JVM, YarnChild) → MRAppMaster
Classic MapReduce: Task (in child JVM) → TaskTracker → JobTracker
The ResourceManager is designed with a checkpoint mechanism to save its state.
The state consists of the node managers in the system as well as the running applications.
The amount of state to be stored is much smaller (and therefore more manageable) than in classic MapReduce.
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
MapReduce is inherently long-running and batch-oriented.
Hive and Pig translate queries into MapReduce jobs; they are therefore not suited to ad hoc queries and have high latency.
Google Dremel does not use the MapReduce framework and supports ad hoc queries. (Note: do not confuse this with real-time streaming engines such as “Storm”.)
Future of Hive/Pig? Apache Drill.