TRANSCRIPT
Performance Optimization for Short MapReduce Job Execution in Hadoop
Student: Hunter Ingle
1
Paper
• 2012 Second International Conference on Cloud and Green Computing
• Nanjing University, China
• Focuses on optimizing execution times in Hadoop's MapReduce framework, targeting the JobTracker/TaskTracker architecture rather than YARN
2
Hadoop
• Distributed system framework
• Uses a master (NameNode) – slave (DataNode) relationship
• Great for large-scale data processing and parallel processing
• MapReduce jobs
• Hadoop Distributed File System (HDFS)
• Heartbeat method for fault detection
3
MapReduce Execution
• NameNode
  § JobTracker
    o Schedules and monitors tasks for MapReduce jobs
• DataNode
  § Multiple TaskTrackers
    o Execute map and reduce functions
• For execution, input data is split among the DataNodes
  § Each map task processes one data split
  § Tasks run in parallel
  § Upon completion, map output is sorted and processed by reduce tasks
4
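The split/map/sort/reduce flow above can be sketched in a few lines. This is illustrative Python (the `run_mapreduce` helper is hypothetical), not Hadoop's actual Java API:

```python
# A minimal word-count sketch of the split/map/sort/reduce flow on this
# slide. Illustrative only; Hadoop's real implementation is Java and
# distributed, with each split processed on a different DataNode.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_splits=3):
    # Split input data among "DataNodes" (here, simple sublists).
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Each map task processes exactly one split (conceptually in parallel).
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # Sort/shuffle: group the intermediate key-value pairs by key.
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)

    # Reduce tasks combine each key's values into a final result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

counts = run_mapreduce(
    ["a b", "b c", "a"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```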
MapReduce Execution – Job State Transitions
• PREP
  § Initialized
    o Processes info from the data splits
    o Creates map and reduce tasks on the JobTracker
    o Setup is scheduled with a TaskTracker
5
MapReduce Execution – Job State Transitions Cont.
• RUNNING
  § Waits for tasks to be scheduled to TaskTrackers
  § Executes tasks
  § Cleans up the job environment upon completion
• FINISHED
  § SUCCEEDED after cleanup
• KILLED/FAILED
  § The client can kill a job
  § Some failures can occur
6
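The PREP → RUNNING → FINISHED/KILLED/FAILED transitions above can be modeled as a small state machine. The state names follow the slides, but the `Job` class and its transition table are a hypothetical sketch, not Hadoop's real job-status code:

```python
# Hypothetical sketch of the job state transitions described on these
# slides (not Hadoop's actual JobStatus implementation).
LEGAL_TRANSITIONS = {
    "PREP": {"RUNNING", "KILLED", "FAILED"},
    "RUNNING": {"SUCCEEDED", "KILLED", "FAILED"},
    "SUCCEEDED": set(),   # terminal: cleanup done, job finished
    "KILLED": set(),      # terminal: the client killed the job
    "FAILED": set(),      # terminal: some failure occurred
}

class Job:
    def __init__(self):
        self.state = "PREP"   # initialized: splits processed, tasks created

    def transition(self, new_state):
        if new_state not in LEGAL_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

job = Job()
job.transition("RUNNING")    # tasks scheduled to TaskTrackers and executed
job.transition("SUCCEEDED")  # cleanup complete
print(job.state)  # SUCCEEDED
```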
Sequential Process of a Task
1. JobTracker creates a JobInProgress for each job, and map/reduce tasks are created (tasks are unassigned)
2. TaskTrackers send heartbeats to the JobTracker
  § The JobTracker then sends corresponding tasks to the TaskTracker
3. TaskTracker.TaskInProgress
  § Runs a child thread to execute the task (task is also RUNNING)
7
Sequential Process of a Task Cont.
4. TaskTracker reports information to the JobTracker (task thread is now RUNNING)
5. Child thread completes
  § Task is now COMMIT_PENDING
6. TaskTracker reports again in a heartbeat
  § JobTracker allows the TaskTracker to commit results
8
Sequential Process of a Task Cont.
7. TaskTracker submits the task results
  § Task is now SUCCEEDED
8. TaskTracker reports success in a heartbeat
  § Task is now COMPLETED
9
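The eight steps above trace one task through a linear sequence of states, which can be sketched as follows (state names follow the slides; the `advance` helper is hypothetical):

```python
# Sketch of the eight-step task lifecycle as a linear state sequence.
# Illustrative only; names follow the slides, not Hadoop source.
TASK_STATES = ["UNASSIGNED", "RUNNING", "COMMIT_PENDING", "SUCCEEDED", "COMPLETED"]

def advance(state):
    """Move a task to its next lifecycle state."""
    i = TASK_STATES.index(state)
    if i == len(TASK_STATES) - 1:
        raise ValueError("task already COMPLETED")
    return TASK_STATES[i + 1]

state = "UNASSIGNED"     # step 1: JobTracker creates the task
state = advance(state)   # steps 2-4: heartbeat assigns it; child thread runs
state = advance(state)   # step 5: child thread done -> COMMIT_PENDING
state = advance(state)   # steps 6-7: JobTracker allows commit -> SUCCEEDED
state = advance(state)   # step 8: success reported in heartbeat -> COMPLETED
print(state)  # COMPLETED
```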
MapReduce Job Setup and Cleanup
• Setup
  § In a heartbeat exchange, the JobTracker finds a TaskTracker with available resources and schedules the setup task on it
  § The TaskTracker processes the task and reports back to the JobTracker
  § The process takes at least 2 heartbeats
• Cleanup
  § Takes another 2 heartbeats
10
MapReduce and the Heartbeat
• Pull model for requests:
  § A TaskTracker sends resource info to the JobTracker, which responds with tasks
• The default heartbeat interval is 3 seconds in Hadoop (it can be extended for systems with more DataNodes)
• The pull model comes with a costly execution time
  § The JobTracker must wait for TaskTrackers to request work
  § Task state changes take longer to report
11
Do Short MapReduce Jobs Mean Short Execution Times?
• Due to the high cost of the pull model, each task takes at least four heartbeats, no matter how short its execution is
• By default, that is 12 seconds for each task
• If a request is small, it might even be more efficient to have one machine complete the execution
12
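A quick model of this latency floor, assuming the default 3-second heartbeat and the four heartbeats the slides attribute to each task (the function and its parameters are illustrative, not from the paper):

```python
# Back-of-the-envelope model of the pull-model latency floor: even a
# near-instant task must wait through its heartbeat round trips.
# Assumes a 3-second heartbeat and four required heartbeats per task.
def minimum_task_latency(compute_seconds, heartbeat=3.0, required_heartbeats=4):
    heartbeat_floor = heartbeat * required_heartbeats
    return max(compute_seconds, heartbeat_floor)

print(minimum_task_latency(0.5))   # 12.0 -> a half-second task still takes ~12 s
print(minimum_task_latency(60.0))  # 60.0 -> long tasks are unaffected
```

This also shows why the overhead only matters for short jobs: once compute time dominates the heartbeat floor, the pull model's cost disappears into the noise.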
How Can Hadoop Executions Be Optimized?
• Dismiss job setup and cleanup tasks
• Change from a pull model to a push model
• Separate the job and task control messages from the heartbeat
13
Setup and Cleanup Dismissal
• Setup
  § Creates a temporary directory for output
• Cleanup
  § Deletes this directory
• Instead of sending these tasks to TaskTrackers via heartbeat messages, the JobTracker executes them immediately when needed
  § Setup at initialization
  § Cleanup at completion
14
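The optimization can be sketched as a JobTracker that performs setup and cleanup itself, with a real local directory standing in for HDFS (the `JobTracker` class here is a hypothetical simplification):

```python
# Sketch of setup/cleanup dismissal: the JobTracker creates and deletes
# the job's temporary output directory itself, instead of scheduling
# setup/cleanup tasks on TaskTrackers over heartbeats. A local directory
# stands in for HDFS; the class is a hypothetical simplification.
import os
import shutil
import tempfile

class JobTracker:
    def initialize_job(self, job_id):
        # Setup, done immediately at job initialization.
        self.temp_dir = os.path.join(tempfile.gettempdir(), f"job_{job_id}_tmp")
        os.makedirs(self.temp_dir, exist_ok=True)

    def complete_job(self):
        # Cleanup, done immediately at job completion.
        shutil.rmtree(self.temp_dir)

jt = JobTracker()
jt.initialize_job("0001")
assert os.path.isdir(jt.temp_dir)   # setup ran with no heartbeat round trip
jt.complete_job()
assert not os.path.exists(jt.temp_dir)
```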
Setup and Cleanup Optimization Cont.
• The new state transition essentially removes the KILLED and FAILED fields
• The setup state is assigned to PREP.INITIALIZED
• The cleanup state is assigned to RUNNING.SUC_WAIT
• Methods are added to JobInProgress
15
Changing the Task Assignment from Pull to Push
• The pull method never has the JobTracker actively communicate with TaskTrackers; instead, TaskTrackers contact it to request work
• With the new push method, the JobTracker initializes the job and then contacts TaskTrackers directly to begin task assignments
16
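A toy simulation of the difference, assuming a single TaskTracker that picks up one task per 3-second heartbeat under the pull model (both functions are illustrative, not from the paper):

```python
# Toy comparison of pull vs. push task assignment. Assumes one
# TaskTracker receiving one task per heartbeat under the pull model;
# real clusters overlap these waits across many TaskTrackers.
HEARTBEAT = 3.0  # assumed default heartbeat interval, in seconds

def pull_assignment(num_tasks):
    """JobTracker waits for a TaskTracker heartbeat before handing out each task."""
    clock = 0.0
    for _ in range(num_tasks):
        clock += HEARTBEAT   # wait for the next heartbeat request
    return clock

def push_assignment(num_tasks):
    """JobTracker contacts TaskTrackers itself, right after job initialization."""
    return 0.0               # no heartbeat wait before assignment begins

print(pull_assignment(4))  # 12.0 seconds of waiting before all tasks start
print(push_assignment(4))  # 0.0
```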
Separation of Job and Task Control Messages from Heartbeats
• Heartbeat messages contain resource information (via containers) and block reports
  § RAM, CPU usage, available disk space
• They also include data on the TaskTracker, task state, and other fields
• Information on jobs and tasks is now removed from the heartbeat and sent to the JobTracker immediately
  § Containers and block reports are still sent in heartbeat messages
17
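The separation can be sketched as splitting one combined status report into a periodic heartbeat payload and an immediately sent control message (all field names here are illustrative):

```python
# Sketch of splitting the combined heartbeat into two channels: resource
# reports stay on the periodic heartbeat, while job/task state changes
# are sent to the JobTracker immediately. Field names are illustrative.
def split_status(status):
    heartbeat_fields = {"containers", "block_report", "ram", "cpu", "disk"}
    heartbeat_msg = {k: v for k, v in status.items() if k in heartbeat_fields}
    immediate_msg = {k: v for k, v in status.items() if k not in heartbeat_fields}
    return heartbeat_msg, immediate_msg

status = {
    "ram": "4GB free", "cpu": "30%", "disk": "120GB free",
    "task_state": "COMMIT_PENDING", "job_id": "0001",
}
periodic, urgent = split_status(status)
print(urgent)  # {'task_state': 'COMMIT_PENDING', 'job_id': '0001'} -> sent at once
```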
The Environment for Results
• 19-node cluster on the same network with Gigabit Ethernet
• A 32-gigabyte database was used for the data
18
Evaluation
19
Evaluation Cont.
20
Evaluation Cont.
21
Conclusion
• Significant optimization is possible for short query or analysis jobs on Hadoop
  § Immediate job setup and cleanup tasks
  § Push instead of pull task assignment
  § Control messages separated from the heartbeat
• Up to 23% performance improvement in the evaluations
• No changes to Hadoop APIs or crucial features
22