Hadoop MapReduce
DESCRIPTION
[Study Report] Study Material: http://shop.oreilly.com/product/0636920021773.do
TRANSCRIPT
Author: Tom White
Apache Hadoop committee, Cloudera
Reported by: Tzu-Li Tai
NCKU, HPDS Lab
Hadoop: The Definitive Guide, 3rd Edition
By Tom White
Published by O’Reilly Media, 2012
Referenced Chapters:
Chapter 2 – MapReduce
Chapter 6 – How MapReduce Works
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. What is MapReduce?
B. An Example: NCDC Weather DataSet
C. Without Hadoop: Analyzing with Unix
D. With Hadoop: Java MapReduce
A computation framework for distributed data processing on top of HDFS.
Consists of two phases: a Map phase and a Reduce phase.
Inherently parallel, and therefore works on very large-scale data inputs as well as small inputs (for performance testing).
Data locality optimization.
• Loops through all year files and uses awk to extract the “temperature” and “quality” fields for processing.
• A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.
• Straightforward(?): run parts of the program in parallel.
• Dividing the work into appropriately sized pieces isn’t easy.
• Coordinating multiple machines (distributed computing) is troublesome.
• This is where Hadoop and MapReduce come in!
[Diagram] (key, value) pairs → MAPPER function → Shuffle and Sort → (key, value) pairs → REDUCER function
Mapper Function in Java
Reducer Function in Java
Running the MapReduce Job in Java
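The Java source shown on these slides is not captured in this transcript. The following is a minimal sketch of the book's max-temperature example, on which these slides are based; the NCDC field offsets, class names, and new-API style follow the book's sample code and may differ from what the slides actually showed.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Mapper: extracts (year, temperature) pairs from NCDC fixed-width records.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {            // signed temperature field
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reducer: picks the maximum temperature for each year.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

// Driver: configures and runs the MapReduce job.
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();   // the book's 1.x-era code uses new Job()
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}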
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. Terminology
B. Data Flow
C. Combiner Functions
job – A unit of work that the client wants performed. It consists of the input data, the MapReduce program, and configuration information.
task – The job is run by dividing it into two types of tasks: map tasks and reduce tasks.
Two types of nodes that control the job execution process:
jobtracker –
Coordinates jobs
Schedules tasks
Keeps record of progress
tasktrackers –
Run tasks.
Send progress reports to jobtracker.
The input to a job is divided into input splits, or splits.
Each split contains several records.
The output of a reduce task is called a part.
[Diagram] The input is divided into input splits, each containing several records; each input split feeds one map task.
Deciding split size (how many splits?): a trade-off between load balancing and overhead.
A good split tends to be the size of an HDFS block (64 MB by default).
Data locality optimization – it is best to run the map task on a node where the input data resides in HDFS; this saves cluster bandwidth.
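As a small illustration of the split-size discussion above (an assumption, not something shown on the slides), the new-API FileInputFormat lets a job bound its split sizes; by default the split size works out to one HDFS block, which preserves data locality.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
  public static void configure(Job job) {
    // Cap each split at one 64 MB HDFS block so a map task can read its
    // whole split from a local replica (data locality).
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    // Avoid very small splits, which add per-task scheduling overhead.
    FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
  }
}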
Data flow for a single reduce task
The default MapReduce job comes with a single reducer; the number of reducers is set with setNumReduceTasks() on Job.
For multiple reducers, map tasks partition their output, creating one partition for each reduce task.
(key, value) records for any given key are all in a single partition.
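A minimal sketch (an assumption, not from the slides) of the partitioning behaviour just described: with several reducers configured, the default HashPartitioner sends every record with a given key to the same partition, and hence to the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitioningSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setNumReduceTasks(4);  // four partitions, one per reduce task

    // The default partitioner hashes the key; all (key, value) records
    // with the same key end up in the same partition.
    HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
    int partition = partitioner.getPartition(
        new Text("1949"), new IntWritable(111), 4);
    System.out.println("key 1949 -> partition " + partition);
  }
}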
Data flow with multiple reduce tasks
Data flow with no reduce tasks
Jobs are limited by the bandwidth available on the cluster.
The data transferred between map and reduce tasks should therefore be minimized.
A combiner function can help cut down the amount of data shuffled between the map and reduce tasks (see the sketch after the diagrams below).
[Diagram] Without a combiner function: all map output crosses the shuffle and sort (off-node data transfer, costs bandwidth) to the reduce task and then to HDFS. Higher bandwidth consumption.
[Diagram] Using a combiner function: map output is reduced locally before the shuffle and sort, so less data is transferred off-node. Lower bandwidth consumption.
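A minimal sketch of wiring a combiner into the max-temperature driver from section I (an assumption, not reproduced from the slides): because max() is commutative and associative, the reducer class can double as the combiner and pre-aggregate map output on each node before the shuffle.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature with combiner");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    // The combiner runs on map output locally, cutting shuffle traffic.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}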
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. Job submission
B. Job initialization
C. Task assignment
D. Task execution
E. Progress and status updates
F. Job completion
The client: submits the MapReduce job.
The jobtracker: coordinates the job run (class JobTracker).
The tasktrackers: run the map/reduce tasks (class TaskTracker).
The distributed filesystem, HDFS
1. Run job
waitForCompletion()
calls the submit() method on Job
creates a JobSubmitter instance
calls submitJobInternal()
2. Get new job ID
JobSubmitter asks the jobtracker for a new job ID
(by calling getNewJobId() on JobTracker)
input/output verification
Checks output specification
Computes input splits
3. Copy job resources
Job JAR file
Configuration file
Computed splits
Copied to the jobtracker’s filesystem, in a directory named after the job ID
4. Submit job
JobSubmitter tells the jobtracker that the job is ready
(by calling submitJob() on JobTracker)
5. Initialize job
the job is placed into an internal queue
the job scheduler picks it up and initializes it
an object is created to represent the job
6. Retrieve input splits
Create the list of tasks:
retrieve the computed splits
one map task for each split
create reduce tasks, the number given by setNumReduceTasks()
plus job setup and cleanup tasks
7. Heartbeat (returns task)
the TaskTracker confirms it is operational and
ready for a new task
the JobTracker assigns a new task
8. Retrieve job resources
localize job JAR
create local working directory
9. Launch and 10. Run
the TaskTracker creates a TaskRunner instance
TaskRunner launches a child JVM
the child process runs the task
Terminology
Status of a job and its tasks:
state of the job or task
progress of maps and reduces
values of the job’s counters
status message set by the user.
Progress: the proportion of the task completed.
Half of the input processed for a map task: progress = 50%.
Half of the input processed for a reduce task (the reduce side is counted as three equal phases: copy, sort, reduce): progress =
1/3 (copy phase) + 1/3 (sort phase) + 1/2 × 1/3 (half of the reduce input processed) = 5/6
Updating Hierarchy
Updating the TaskTracker:
the child sets a flag when the task is complete
every 3 s, the TaskTracker checks the flag
Updating the JobTracker:
every 5 s, the status of all tasks on the TaskTracker is sent to the JobTracker
Status update for the client
The client polls the JobTracker every second for job status.
Calling getStatus() on Job returns a JobStatus instance.
On completion of the job cleanup task, the JobTracker changes the job status to “successful”.
Job learns that the job has completed, prints a message, and returns from waitForCompletion().
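A minimal sketch (an assumption, not from the slides) of the client-side polling loop described above, using the new MapReduce API:

import org.apache.hadoop.mapreduce.Job;

public class JobMonitorSketch {
  public static void monitor(Job job) throws Exception {
    while (!job.isComplete()) {
      // Poll the framework for map/reduce progress roughly every second.
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(1000);
    }
    System.out.println(job.isSuccessful() ? "Job successful" : "Job failed");
  }
}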
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
A. What is YARN?
B. YARN Architecture
C. Improvement of SPOF using YARN
The next generation MapReduce: YARN – Yet Another Resource Negotiator
The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master.
[Diagram] The application master asks the resource manager for resources (1. Ask for resource); the resource manager allocates a “container” on a node manager (2. Allocate “container”).
More general than MapReduce.
Higher manageability and cluster utilization.
It is even possible to run different versions of MapReduce on the same cluster, which makes the MapReduce upgrade process more manageable.
Entities of YARN MapReduce
The client: submits the job.
The YARN resource manager: coordinates the allocation of cluster resources (class ResourceManager).
The YARN node manager(s): launch and monitor containers (class NodeManager).
The MapReduce application master: coordinates the tasks running the MapReduce job (class MRAppMaster).
The distributed filesystem, HDFS
The client gets a new application ID from the ResourceManager and submits the job by calling submitApplication().
5a. Start the container; 5b. launch the MRAppMaster.
Decide: run as an uber task? (see the sketch below)
Small job:
< 10 mappers, 1 reducer
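A sketch of the uber-task knobs matching the “small job” criteria above (an assumption; the property names are from Hadoop 2.x and the defaults can vary between releases):

import org.apache.hadoop.conf.Configuration;

public class UberTaskSketch {
  public static Configuration uberConf() {
    Configuration conf = new Configuration();
    // Let sufficiently small jobs run in the same JVM as the application master.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);     // "small": fewer than 10 mappers
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);  // and at most one reducer
    return conf;
  }
}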
Allocate containers for tasks (8)
Memory requirements are specified by the job (unlike classic MapReduce); see the sketch after this list.
Min. allocation: 1024 MB = 1 GB
Max. allocation: 10240 MB = 10 GB
The container is started by calling the NodeManager (9a),
which launches the child JVM, YarnChild (9b)
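A sketch (an assumption; property names from Hadoop 2.x) of a job stating its own memory requirements, which YARN uses when sizing the allocated containers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryRequestSketch {
  public static Job newJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);     // memory requested per map container
    conf.setInt("mapreduce.reduce.memory.mb", 4096);  // memory requested per reduce container
    return Job.getInstance(conf, "memory-request-sketch");
  }
}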
YARN: Task (in child JVM, YarnChild) → MRAppMaster
Classic MapReduce: Task (in child JVM) → TaskTracker → JobTracker
The ResourceManager is designed with a checkpoint mechanism to save its state.
The state consists of the node managers in the system as well as the running applications.
The amount of state to be stored is much smaller (and therefore more manageable) than in classic MapReduce.
I. An Introduction: Weather Dataset
II. Scaling Out: MapReduce for Large Inputs
III. Anatomy of a MapReduce Job
IV. MapReduce 2: YARN
V. Interesting Topics
MapReduce is inherently long-running and batch-oriented.
Hive and Pig translate queries into MapReduce jobs; they are therefore not suited to ad hoc queries and have high latency.
Google Dremel does not use the MapReduce framework and supports ad hoc queries. (Note: do not confuse this with real-time streaming engines such as “Storm”.)
Future of Hive/Pig? Apache Drill.