TRANSCRIPT
Performance Optimization for Short MapReduce Job Execution in Hadoop
Student: Hunter Ingle
1
Paper
• 2012 Second International Conference on Cloud and Green Computing
• Nanjing University, China
• Focuses on optimizing execution times in Hadoop's MapReduce framework, targeting the JobTracker/TaskTracker architecture rather than YARN
2
Hadoop
• Distributed system framework
• Uses a master (NameNode) – slave (DataNode) relationship
• Great for large-scale data processing and parallel processing
• MapReduce jobs
• Hadoop Distributed File System (HDFS)
• Heartbeat method for fault detection
3
MapReduce Execution
• NameNode
  § JobTracker
    o Schedules and monitors tasks for MapReduce jobs
• DataNode
  § Multiple TaskTrackers
    o Execute map and reduce functions
• For execution, input data is split among the DataNodes
  § Each map task processes one data split
  § Tasks run in parallel
  § Upon completion, map output is sorted and processed by reduce tasks
4
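The split/map/sort/reduce flow above can be sketched in a few lines. This is illustrative Python (the `run_mapreduce` helper is hypothetical), not Hadoop's actual Java API:

```python
# A minimal word-count sketch of the split/map/sort/reduce flow on this
# slide. Illustrative only; Hadoop's real implementation is Java and
# distributed, with each split processed on a different DataNode.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_splits=3):
    # Split input data among "DataNodes" (here, simple sublists).
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Each map task processes exactly one split (conceptually in parallel).
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # Sort/shuffle: group the intermediate key-value pairs by key.
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)

    # Reduce tasks combine each key's values into a final result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

counts = run_mapreduce(
    ["a b", "b c", "a"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```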
MapReduce Execution – Job State Transitions
• PREP
  § Initialized
    o Processes info from the data splits
    o Creates map and reduce tasks on the JobTracker
    o Setup is scheduled with a TaskTracker
5
MapReduce Execution – Job State Transitions Cont.
• RUNNING
  § Waits for tasks to be scheduled to TaskTrackers
  § Executes tasks
  § Cleans up the job environment upon completion
• FINISHED
  § SUCCEEDED after cleanup
• KILLED/FAILED
  § The client can kill a job
  § Some failures can occur
6
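The PREP → RUNNING → FINISHED/KILLED/FAILED transitions above can be modeled as a small state machine. The state names follow the slides, but the `Job` class and its transition table are a hypothetical sketch, not Hadoop's real job-status code:

```python
# Hypothetical sketch of the job state transitions described on these
# slides (not Hadoop's actual JobStatus implementation).
LEGAL_TRANSITIONS = {
    "PREP": {"RUNNING", "KILLED", "FAILED"},
    "RUNNING": {"SUCCEEDED", "KILLED", "FAILED"},
    "SUCCEEDED": set(),   # terminal: cleanup done, job finished
    "KILLED": set(),      # terminal: the client killed the job
    "FAILED": set(),      # terminal: some failure occurred
}

class Job:
    def __init__(self):
        self.state = "PREP"   # initialized: splits processed, tasks created

    def transition(self, new_state):
        if new_state not in LEGAL_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

job = Job()
job.transition("RUNNING")    # tasks scheduled to TaskTrackers and executed
job.transition("SUCCEEDED")  # cleanup complete
print(job.state)  # SUCCEEDED
```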
Sequential Process of a Task
1. JobTracker creates a JobInProgress for each job, and map/reduce tasks are created (tasks are unassigned)
2. TaskTrackers send heartbeats to the JobTracker
  § The JobTracker then sends corresponding tasks to the TaskTracker
3. TaskTracker.TaskInProgress
  § Runs a child thread to execute the task (task is also RUNNING)
7
Sequential Process of a Task Cont.
4. TaskTracker reports information to the JobTracker (task thread is now RUNNING)
5. Child thread completes
  § Task is now COMMIT_PENDING
6. TaskTracker reports again in a heartbeat
  § JobTracker allows the TaskTracker to commit results
8
Sequential Process of a Task Cont.
7. TaskTracker submits the task results
  § Task is now SUCCEEDED
8. TaskTracker reports success in a heartbeat
  § Task is now COMPLETED
9
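The eight steps above trace one task through a linear sequence of states, which can be sketched as follows (state names follow the slides; the `advance` helper is hypothetical):

```python
# Sketch of the eight-step task lifecycle as a linear state sequence.
# Illustrative only; names follow the slides, not Hadoop source.
TASK_STATES = ["UNASSIGNED", "RUNNING", "COMMIT_PENDING", "SUCCEEDED", "COMPLETED"]

def advance(state):
    """Move a task to its next lifecycle state."""
    i = TASK_STATES.index(state)
    if i == len(TASK_STATES) - 1:
        raise ValueError("task already COMPLETED")
    return TASK_STATES[i + 1]

state = "UNASSIGNED"     # step 1: JobTracker creates the task
state = advance(state)   # steps 2-4: heartbeat assigns it; child thread runs
state = advance(state)   # step 5: child thread done -> COMMIT_PENDING
state = advance(state)   # steps 6-7: JobTracker allows commit -> SUCCEEDED
state = advance(state)   # step 8: success reported in heartbeat -> COMPLETED
print(state)  # COMPLETED
```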
MapReduce Job Setup and Cleanup
• Setup
  § In a heartbeat exchange, the JobTracker finds a TaskTracker with available resources and schedules the setup task on it
  § The TaskTracker processes the task and reports back to the JobTracker
  § The process takes at least 2 heartbeats
• Cleanup
  § Takes another 2 heartbeats
10
MapReduce and the Heartbeat
• Pull model for requests:
  § A TaskTracker sends resource info to the JobTracker, which responds with tasks
• The default heartbeat interval is 3 seconds in Hadoop (it can be extended for systems with more DataNodes)
• The pull model comes with a costly execution time
  § The JobTracker must wait for TaskTrackers to request work
  § Task state changes take longer to report
11
Do Short MapReduce Jobs Mean Short Execution Times?
• Due to the high cost of the pull model, each task takes at least four heartbeats, no matter how short its execution is
• By default, that is 12 seconds for each task
• If a request is small, it might even be more efficient to have one machine complete the execution
12
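A quick model of this latency floor, assuming the default 3-second heartbeat and the four heartbeats the slides attribute to each task (the function and its parameters are illustrative, not from the paper):

```python
# Back-of-the-envelope model of the pull-model latency floor: even a
# near-instant task must wait through its heartbeat round trips.
# Assumes a 3-second heartbeat and four required heartbeats per task.
def minimum_task_latency(compute_seconds, heartbeat=3.0, required_heartbeats=4):
    heartbeat_floor = heartbeat * required_heartbeats
    return max(compute_seconds, heartbeat_floor)

print(minimum_task_latency(0.5))   # 12.0 -> a half-second task still takes ~12 s
print(minimum_task_latency(60.0))  # 60.0 -> long tasks are unaffected
```

This also shows why the overhead only matters for short jobs: once compute time dominates the heartbeat floor, the pull model's cost disappears into the noise.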
How Can Hadoop Executions Be Optimized?
• Dismiss job setup and cleanup tasks
• Change from a pull model to a push model
• Separate the job and task control messages from the heartbeat
13
Setup and Cleanup Dismissal
• Setup
  § Creates a temporary directory for output
• Cleanup
  § Deletes this directory
• Instead of sending these tasks to TaskTrackers via heartbeat messages, the JobTracker executes them immediately when needed
  § Setup at initialization
  § Cleanup at completion
14
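The optimization can be sketched as a JobTracker that performs setup and cleanup itself, with a real local directory standing in for HDFS (the `JobTracker` class here is a hypothetical simplification):

```python
# Sketch of setup/cleanup dismissal: the JobTracker creates and deletes
# the job's temporary output directory itself, instead of scheduling
# setup/cleanup tasks on TaskTrackers over heartbeats. A local directory
# stands in for HDFS; the class is a hypothetical simplification.
import os
import shutil
import tempfile

class JobTracker:
    def initialize_job(self, job_id):
        # Setup, done immediately at job initialization.
        self.temp_dir = os.path.join(tempfile.gettempdir(), f"job_{job_id}_tmp")
        os.makedirs(self.temp_dir, exist_ok=True)

    def complete_job(self):
        # Cleanup, done immediately at job completion.
        shutil.rmtree(self.temp_dir)

jt = JobTracker()
jt.initialize_job("0001")
assert os.path.isdir(jt.temp_dir)   # setup ran with no heartbeat round trip
jt.complete_job()
assert not os.path.exists(jt.temp_dir)
```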
Setup and Cleanup Optimization Cont.
• The new state transition essentially removes the KILLED and FAILED fields
• The setup state is assigned to PREP.INITIALIZED
• The cleanup state is assigned to RUNNING.SUC_WAIT
• Methods are added to JobInProgress
15
Changing the Task Assignment from Pull to Push
• The pull method never has the JobTracker actively communicate with TaskTrackers; instead, TaskTrackers contact it to request work
• With the new push method, the JobTracker initializes the job and then contacts TaskTrackers directly to begin task assignments
16
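A toy simulation of the difference, assuming a single TaskTracker that picks up one task per 3-second heartbeat under the pull model (both functions are illustrative, not from the paper):

```python
# Toy comparison of pull vs. push task assignment. Assumes one
# TaskTracker receiving one task per heartbeat under the pull model;
# real clusters overlap these waits across many TaskTrackers.
HEARTBEAT = 3.0  # assumed default heartbeat interval, in seconds

def pull_assignment(num_tasks):
    """JobTracker waits for a TaskTracker heartbeat before handing out each task."""
    clock = 0.0
    for _ in range(num_tasks):
        clock += HEARTBEAT   # wait for the next heartbeat request
    return clock

def push_assignment(num_tasks):
    """JobTracker contacts TaskTrackers itself, right after job initialization."""
    return 0.0               # no heartbeat wait before assignment begins

print(pull_assignment(4))  # 12.0 seconds of waiting before all tasks start
print(push_assignment(4))  # 0.0
```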
Separation of Job and Task Control Messages from Heartbeats
• Heartbeat messages contain resource information (via containers) and block reports
  § RAM, CPU usage, available disk space
• They also include data on the TaskTracker, task state, and other fields
• Information on jobs and tasks is now removed from the heartbeat and sent to the JobTracker immediately
  § Containers and block reports are still sent in heartbeat messages
17
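The separation can be sketched as splitting one combined status report into a periodic heartbeat payload and an immediately sent control message (all field names here are illustrative):

```python
# Sketch of splitting the combined heartbeat into two channels: resource
# reports stay on the periodic heartbeat, while job/task state changes
# are sent to the JobTracker immediately. Field names are illustrative.
def split_status(status):
    heartbeat_fields = {"containers", "block_report", "ram", "cpu", "disk"}
    heartbeat_msg = {k: v for k, v in status.items() if k in heartbeat_fields}
    immediate_msg = {k: v for k, v in status.items() if k not in heartbeat_fields}
    return heartbeat_msg, immediate_msg

status = {
    "ram": "4GB free", "cpu": "30%", "disk": "120GB free",
    "task_state": "COMMIT_PENDING", "job_id": "0001",
}
periodic, urgent = split_status(status)
print(urgent)  # {'task_state': 'COMMIT_PENDING', 'job_id': '0001'} -> sent at once
```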
The Environment for Results
• 19-node cluster on the same network with Gigabit Ethernet
• A 32-gigabyte database was used for the data
18
Evaluation
19
Evaluation Cont.
20
Evaluation Cont.
21
Conclusion
• Significant optimization is possible for short query or analysis jobs on Hadoop
  § Immediate job setup and cleanup tasks
  § Push instead of pull task assignment
  § Control messages separated from the heartbeat
• Up to 23% performance improvement in the evaluations
• No changes to Hadoop APIs or crucial features
22