Hadoop MapReduce


DESCRIPTION

[Study Report] Study Material: http://shop.oreilly.com/product/0636920021773.do

TRANSCRIPT

Page 1: Hadoop MapReduce

Author: Tom White

Apache Hadoop committee, Cloudera

Reported by: Tzu-Li Tai

NCKU, HPDS Lab

Hadoop: The Definitive Guide 3rd Edition

Page 2: Hadoop MapReduce

Hadoop:

The Definitive Guide, 3rd Edition

By Tom White

Published by O’Reilly Media, 2012

Referenced Chapters:

Chapter 2 – MapReduce

Chapter 6 – How MapReduce Works

Page 3: Hadoop MapReduce

I. An Introduction: Weather Dataset

II. Scaling Out: MapReduce for Large Inputs

III. Anatomy of a MapReduce Job

IV. MapReduce 2: YARN

V. Interesting Topics

Page 4: Hadoop MapReduce

I. An Introduction: Weather Dataset

II. Scaling Out: MapReduce for Large Inputs

III. Anatomy of a MapReduce Job

IV. MapReduce 2: YARN

V. Interesting Topics

Page 5: Hadoop MapReduce

A. What is MapReduce?

B. An Example: NCDC Weather DataSet

C. Without Hadoop: Analyzing with Unix

D. With Hadoop: Java MapReduce

Page 6: Hadoop MapReduce

A computation framework for distributed data processing on top of HDFS.

Consists of two phases: Map phase and Reduce phase.

Inherently parallel, so it works on very large-scale data inputs as well as small inputs (for performance testing).

Data locality optimization.

Page 7: Hadoop MapReduce
Page 8: Hadoop MapReduce

• Loops through all the year files and uses awk to extract the “temperature” and “quality” fields to manipulate.

• A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.

Page 9: Hadoop MapReduce

• Straightforward(?): run parts of the program in parallel.

• Dividing the work into appropriate pieces isn’t easy.

• Working with multiple machines (distributed computing) is troublesome.

• This is where Hadoop and MapReduce come in!

Page 10: Hadoop MapReduce
Page 11: Hadoop MapReduce

[Diagram: input (key, value) pairs are fed to the MAPPER function.]

Page 12: Hadoop MapReduce

[Diagram: shuffle and sort delivers (key, value) pairs, grouped by key, to the REDUCER function.]

Page 13: Hadoop MapReduce
Page 14: Hadoop MapReduce

Mapper Function

in Java
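(The transcript does not reproduce the code shown on this slide. A minimal sketch, following the book’s MaxTemperatureMapper listing for the new MapReduce API: it pulls the year, air temperature, and quality code out of each fixed-width NCDC record and emits a (year, temperature) pair for valid readings.)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19); // year field of the NCDC record
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Emit only readings that are present and of acceptable quality.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}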

Page 15: Hadoop MapReduce

Reducer Function

in Java
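(Again a sketch following the book’s MaxTemperatureReducer listing: for each year it scans all the temperatures the mappers emitted and keeps the maximum.)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get()); // running maximum per year
    }
    context.write(key, new IntWritable(maxValue));
  }
}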

Page 16: Hadoop MapReduce

Running the

MapReduce Job

in Java
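(A sketch of the driver, following the book’s MaxTemperature listing: it wires the mapper and reducer into a Job, sets the input/output paths and output types, and blocks on waitForCompletion().)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}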

Page 17: Hadoop MapReduce

I. An Introduction: Weather Dataset

II. Scaling Out: MapReduce for Large Inputs

III. Anatomy of a MapReduce Job

IV. MapReduce 2: YARN

V. Interesting Topics

Page 18: Hadoop MapReduce

A. Terminology

B. Data Flow

C. Combiner Functions

Page 19: Hadoop MapReduce

job – A unit of work that a client wants performed. It consists of input data, a MapReduce program, and configuration information.

task – The job is run by dividing it into two types of tasks: map tasks and reduce tasks.

Page 20: Hadoop MapReduce

Two types of nodes that control the job execution process:

jobtracker –

Coordinates jobs

Schedules tasks

Keeps record of progress

tasktrackers –

Run tasks.

Send progress reports to jobtracker.

Page 21: Hadoop MapReduce
Page 22: Hadoop MapReduce
Page 23: Hadoop MapReduce

The input to a job is divided into input splits, or simply splits.

Each split contains several records.

The output of a reduce task is called a part.

Page 24: Hadoop MapReduce

[Diagram: three map tasks, each reading one input split; each split consists of records.]

Page 25: Hadoop MapReduce

Deciding the split size (how many splits?) is a trade-off between load balancing and overhead: more, smaller splits balance work across the cluster better, but each split adds management overhead. For example, a 1 GB input cut into 64 MB splits yields 16 map tasks.

A good split tends to be the size of an HDFS block (64 MB by default).

Page 26: Hadoop MapReduce

Data locality optimization – it is best to run a map task on a node where its input data resides in HDFS; this saves cluster bandwidth.

Page 27: Hadoop MapReduce

Data flow

for a single

reduce task

Page 28: Hadoop MapReduce

The default MapReduce job comes with a single reducer; the number of reducers is set with setNumReduceTasks() on Job.

For multiple reducers, map tasks partition their output, creating one partition for each reduce task.

(key, value) records for any given key are all in a single partition.
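(This per-key guarantee comes from how map output is partitioned. The sketch below mirrors the logic of Hadoop’s default HashPartitioner; the class name here is illustrative. The key’s hash, taken modulo the number of reduce tasks, picks the partition, so identical keys always land together.)

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {

  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the index is non-negative, then map the key's
    // hash onto one of numReduceTasks partitions. Equal keys hash equally,
    // so every record for a given key ends up in the same partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}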

Page 29: Hadoop MapReduce

Data flow

with multiple

reduce tasks

Page 30: Hadoop MapReduce

Data flow

with no

reduce tasks

Page 31: Hadoop MapReduce

Jobs are limited by the bandwidth available on the cluster.

Should minimize the data transferred between map and reduce tasks.

A combiner function can help cut down the amount of data shuffled between the map and reduce tasks.
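(Because taking a maximum is commutative and associative, the max-temperature reducer can double as the combiner. A driver sketch following the book’s MaxTemperatureWithCombiner listing, reusing the mapper and reducer sketched earlier; the only change from the previous driver is the setCombinerClass() call.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    // Pre-aggregate each map task's output locally before the shuffle,
    // cutting the data transferred across the network.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}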

Page 32: Hadoop MapReduce

[Diagram: two map tasks feed a reduce task through the shuffle and sort (off-node data transfer; costs bandwidth); output is written to HDFS. Without a combiner function: higher bandwidth consumption.]

Page 33: Hadoop MapReduce

[Diagram: the same data flow, but each map task runs a combiner before the shuffle and sort (off-node data transfer; costs bandwidth). Using a combiner function: lower bandwidth consumption.]

Page 34: Hadoop MapReduce

I. An Introduction: Weather Dataset

II. Scaling Out: MapReduce for Large Inputs

III. Anatomy of a MapReduce Job

IV. MapReduce 2: YARN

V. Interesting Topics

Page 35: Hadoop MapReduce

A. Job submission

B. Job initialization

C. Task assignment

D. Task execution

E. Progress and status updates

F. Job completion

Page 36: Hadoop MapReduce

The client: submits the MapReduce job.

The jobtracker: coordinates the job run (JobTracker).

The tasktrackers: run the map/reduce tasks (TaskTracker).

The distributed filesystem, HDFS

Page 37: Hadoop MapReduce

1. Run job

waitForCompletion()

calls the submit() method on Job,

which creates a JobSubmitter instance

and calls submitJobInternal().

Page 38: Hadoop MapReduce

2. Get new job ID

JobSubmitter asks the jobtracker for a new job ID

(calls getNewJobId() on JobTracker).

Page 39: Hadoop MapReduce

Input/output verification

Checks output specification

Computes input splits

Page 40: Hadoop MapReduce

3. Copy job resources

Job JAR file

Configuration file

Computed splits

Copied to the jobtracker’s filesystem, in a directory named after the job ID.

Page 41: Hadoop MapReduce

4. Submit job

JobSubmitter tells the jobtracker that the job is ready for execution

(calls submitJob() on JobTracker).

Page 42: Hadoop MapReduce

5. Initialize job

The job is placed into an internal queue.

The job scheduler picks it up and initializes it,

creating an object to represent the job.

Page 43: Hadoop MapReduce

6. Retrieve input splits

Create the list of tasks:

retrieve computed splits

one map task for each split

create reduce tasks; their number is known from setNumReduceTasks()

job setup and cleanup task

Page 44: Hadoop MapReduce

7. Heartbeat (returns task)

Via heartbeat, a TaskTracker confirms it is operational

and indicates it is ready for a new task.

The JobTracker then assigns it a new task.

Page 45: Hadoop MapReduce

8. Retrieve job resources

localize job JAR

create local working directory

Page 46: Hadoop MapReduce

9. Launch and 10. Run

The TaskTracker creates a TaskRunner instance.

TaskRunner launches a child JVM.

The child process runs the task.

Page 47: Hadoop MapReduce

Terminology

Status of a job and its tasks:

state of the job or task

progress of maps and reduces

values of the job’s counters

status message set by the user.

Progress: the proportion of the task completed.

Half of the input processed for a map task: progress = 50%.

Half of the input processed for a reduce task: progress =

1/3 (copy phase) + 1/3 (sort phase) + 1/2 × 1/3 (half of the reduce input) = 5/6.
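(A tiny worked example of the slide’s arithmetic, in plain Java rather than Hadoop’s internal code: a reduce task’s progress gives equal weight to its copy, sort, and reduce phases.)

public class ReduceProgress {

  // Each phase contributes up to 1/3 of the overall progress.
  static double progress(double copyDone, double sortDone, double reduceDone) {
    return (copyDone + sortDone + reduceDone) / 3.0;
  }

  public static void main(String[] args) {
    // Copy and sort phases finished, half of the reduce input processed:
    // 1/3 + 1/3 + (1/2)(1/3) = 5/6, roughly 0.833
    System.out.println(progress(1.0, 1.0, 0.5));
  }
}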

Page 48: Hadoop MapReduce

Updating Hierarchy

Updating TaskTracker:

The child sets a flag when its progress changes.

Every 3 s, the flag is checked.

Updating JobTracker:

Every 5 s, the status of all tasks on the TaskTracker is sent to the JobTracker.

Page 49: Hadoop MapReduce

Status update for client

The client polls the JobTracker every second for the job status.

getStatus() on Job returns a JobStatus instance.

Page 50: Hadoop MapReduce

On completion of the job cleanup task, the JobTracker changes the job status to “successful”.

Job learns that the job has completed, prints a message, and returns from waitForCompletion().

Page 51: Hadoop MapReduce

I. An Introduction: Weather Dataset

II. Scaling Out: MapReduce for Large Inputs

III. Anatomy of a MapReduce Job

IV. MapReduce 2: YARN

V. Interesting Topics

Page 52: Hadoop MapReduce

A. What is YARN?

B. YARN Architecture

C. Improvement of SPOF using YARN

Page 53: Hadoop MapReduce

The next generation MapReduce: YARN – Yet Another Resource Negotiator

The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master.

[Diagram: the application master (1. ask for resource) requests resources from the resource manager; the resource manager (2. allocate “container”) allocates a container, which is launched and monitored by a node manager.]

Page 54: Hadoop MapReduce

More general than MapReduce.

Higher manageability and cluster utilization.

It is even possible to run different versions of MapReduce on the same cluster, which makes the MapReduce upgrade process more manageable.

Page 55: Hadoop MapReduce

Entities of YARN MapReduce

The client: submits the job.

The YARN ResourceManager:

coordinates the allocation of cluster resources (ResourceManager).

The YARN NodeManager(s):

launch and monitor containers (NodeManager).

The MapReduce application master:

coordinates the tasks running the MapReduce job (MRAppMaster).

The distributed filesystem, HDFS

Page 56: Hadoop MapReduce

The client asks the ResourceManager for a new application ID,

then submits the job by calling submitApplication().

Page 57: Hadoop MapReduce

5a. Start container and

5b. MRAppMaster launch

Decide: run as an uber task?

Small job:

< 10 mappers, 1 reducer, and an input smaller than one HDFS block
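(For reference, a sketch of how MRv2 exposes the uber-task decision through configuration. The property names exist in Hadoop 2.x; the values shown are the usual defaults, matching the slide’s thresholds.)

import org.apache.hadoop.conf.Configuration;

public class UberTaskConfig {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Let small jobs run inside the MRAppMaster's own JVM ("uber" mode)
    // instead of requesting a fresh container for every task.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    // What counts as "small" (Hadoop 2.x defaults):
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);    // fewer than 10 mappers
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1); // at most one reducer
    System.out.println("uber mode: "
        + conf.getBoolean("mapreduce.job.ubertask.enable", false));
  }
}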

Page 58: Hadoop MapReduce

Containers are allocated for the tasks (step 8).

Memory requirements are specified per task (unlike classic MapReduce):

Min. allocation: 1024 MB (1 GB)

Max. allocation: 10240 MB (10 GB)
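(A sketch of declaring per-task memory requirements; the property names are from Hadoop 2.x / MRv2 and the values are illustrative. Each request must fall between the cluster’s minimum and maximum allocation sizes.)

import org.apache.hadoop.conf.Configuration;

public class TaskMemoryConfig {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Request 2 GB per map task and 3 GB per reduce task; the scheduler
    // grants containers sized within [min. allocation, max. allocation].
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 3072);
    System.out.println("map container MB: "
        + conf.getInt("mapreduce.map.memory.mb", 1024));
  }
}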

Page 59: Hadoop MapReduce

The container is started by calling the NodeManager (9a),

which launches a child JVM, YarnChild (9b).

Page 60: Hadoop MapReduce

Progress and status update paths:

YARN: Task → MRAppMaster

Classic MapReduce: Child JVM → TaskTracker → JobTracker

Page 61: Hadoop MapReduce

The ResourceManager is designed with a checkpoint mechanism to save its state.

State: consists of node managers in the system as well as the running applications.

The amount of state to be stored is much smaller (more manageable) than in classic MapReduce.

Page 62: Hadoop MapReduce

I. An Introduction: Weather Dataset

II. Scaling Out: MapReduce for Large Inputs

III. Anatomy of a MapReduce Job

IV. MapReduce 2: YARN

V. Interesting Topics

Page 63: Hadoop MapReduce

MapReduce is inherently long-running and batch-oriented.

Hive and Pig translate queries into MapReduce jobs, and are therefore non-ad hoc and high-latency.

Google Dremel does not use the MapReduce framework and supports ad hoc queries. (Note: do not confuse it with real-time streaming engines such as Storm.)

Future of Hive/Pig? Apache Drill

Page 64: Hadoop MapReduce