MapReduce: Simplified Data Processing on Large Clusters
Presented by Cleverence Kombe
By Jeffrey Dean and Sanjay Ghemawat
OUTLINE
1. Introduction
2. Programming Model
3. Implementation
4. Refinements
5. Performance
6. Experience and Conclusion
1. INTRODUCTION
o Many tasks in large-scale data processing consist of computations that process large amounts of raw data and produce large amounts of derived data.
o Because the input data is massive, the computation is distributed across hundreds or thousands of machines so that the tasks complete in a reasonable period of time.
o Google has written many special-purpose computations over raw data such as crawled documents and web request logs; the code needed to parallelize the computation, distribute the data, and handle failures makes these programs very complex.
o Jeffrey Dean and Sanjay Ghemawat came up with the MapReduce concept, which simplifies data processing by hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
o What is MapReduce?
  • A programming model (approach) for processing large data sets
  • Contains Map and Reduce functions
  • Runs on a large cluster of commodity machines
  • Many real-world tasks are expressible in this model
o MapReduce provides:
  • User-defined functions
  • Automatic parallelization and distribution
  • Fault tolerance
  • I/O scheduling
  • Status and monitoring
1. INTRODUCTION CONT…
o Input & output are sets of key/value pairs
o The programmer specifies two functions:
  1. map(in_key, in_value) -> list(out_key, intermediate_value)
     • Processes an input key/value pair
     • Produces a set of intermediate pairs
  2. reduce(out_key, list(intermediate_value)) -> list(out_value)
     • Combines all intermediate values for a particular key
     • Produces a set of merged output values (in most cases just one)
2. PROGRAMMING MODEL
2. PROGRAMMING MODEL …
o Word Count Example
[Figure: word-count data flow — input files are split line by line, each line is passed to an individual mapper instance, the map step emits key/value pairs, a sort-and-shuffle step groups them by key, the reduce step merges each group, and the final output is written to an output file]
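The word-count flow above can be simulated in a few lines. This is a single-process Python sketch, not Google's C++ library; `map_fn` and `reduce_fn` follow the signatures from the programming model, and `map_reduce` plays the roles of splitting, shuffling, and reducing:

```python
from collections import defaultdict

def map_fn(_, line):
    # map(in_key, in_value) -> list(out_key, intermediate_value)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce(out_key, list(intermediate_value)) -> list(out_value)
    return [sum(counts)]

def map_reduce(inputs, map_fn, reduce_fn):
    # "Sort and shuffle": group intermediate values by key
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, inter in map_fn(key, value):
            groups[out_key].append(inter)
    # Reduce each group of intermediate values
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

result = map_reduce([(0, "the quick fox"), (1, "the lazy dog")],
                    map_fn, reduce_fn)
print(result)  # {'dog': [1], 'fox': [1], 'lazy': [1], 'quick': [1], 'the': [2]}
```

In the real system the groups would be spread across many reduce workers; here everything runs in one process to show the data flow.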
2. PROGRAMMING MODEL … More Examples
• Distributed Grep: the map function emits a line if it matches a supplied pattern.
• Count of URL Access Frequency: the map function processes logs of web page requests and outputs <URL, 1>.
• Reverse Web-Link Graph: the map function outputs <target, source> pairs for each link to a target URL found in a page named source.
• Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs.
• Inverted Index: the map function parses each document and emits a sequence of (word, document ID) pairs.
• Distributed Sort: the map function extracts the key from each record and emits a (key, record) pair.
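As one instance of the patterns above, the inverted index can be sketched in the same single-process style (hypothetical helper names, not the paper's code):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit a (word, document ID) pair for every distinct word in the document
    return [(word, doc_id) for word in set(text.split())]

def build_inverted_index(docs):
    # Shuffle: group document IDs by word, then sort each posting list
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_fn(doc_id, text):
            index[word].append(d)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index({1: "big data", 2: "big clusters"})
print(index["big"])  # [1, 2]
```

The reduce step here is just the per-word sort that produces each posting list.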
3. IMPLEMENTATION
Many different implementations are possible; the right choice depends on the environment. Typical cluster (in wide use at Google: large clusters of PCs connected via switched networks):
• Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of memory per machine
• Connected with commodity networking hardware; limited bisection bandwidth
• Storage on inexpensive local IDE disks
• GFS: a distributed file system that manages the data
• A scheduling system through which users submit tasks (a job = a set of tasks mapped by the scheduler to the set of available machines within the cluster)
Implemented as a C++ library linked into user programs.
Execution Overview
Map
• Divide the input into M equal-sized splits
• Each split is 16-64 MB large
Reduce
• Partition the intermediate key space into R pieces
• hash(intermediate_key) mod R
Typical setting:
• 2,000 machines
• M = 200,000
• R = 5,000
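The reduce-side partitioning above can be sketched directly. This is a Python sketch of the default hash(intermediate_key) mod R scheme; it uses a stable hash (MD5) because Python's built-in `hash()` is randomized across processes for strings:

```python
import hashlib

def partition(intermediate_key, R):
    # Assign an intermediate key to one of R reduce tasks.
    h = int(hashlib.md5(intermediate_key.encode()).hexdigest(), 16)
    return h % R

R = 5000
# Every partition index falls in [0, R):
assert 0 <= partition("some_key", R) < R
# All intermediate values for the same key go to the same reduce task:
assert partition("some_key", R) == partition("some_key", R)
```

Since the function is deterministic, every map worker sends all values for a given key to the same reduce region, which is what lets each reducer see the complete value list for its keys.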
3. IMPLEMENTATION...
[Figure: execution overview — the user program calls (0) mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; map workers write intermediate data into R regions using the partitioning function hash(intermediate_key) mod R; each reduce worker reads all intermediate data for its partition and sorts it by intermediate keys]
3. IMPLEMENTATION…
Fault Tolerance
Worker failure: handled through re-execution
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Why re-execute even completed map tasks? Their output is stored on the local disks of the failed machine and is therefore inaccessible
• Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
• Task completion is committed through the master
Master failure:
• Could be handled, but the implementation doesn't yet (master failure is unlikely)
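A minimal sketch of the re-execution rule above (hypothetical data shapes; the real master tracks far more state):

```python
def find_failed_workers(last_heartbeat, now, timeout=10.0):
    # The master marks a worker failed if it has sent no heartbeat
    # within `timeout` seconds.
    return [w for w, t in last_heartbeat.items() if now - t > timeout]

def tasks_to_reexecute(tasks, failed_workers):
    # Completed map output lives on the failed worker's local disk, so
    # completed AND in-progress map tasks are rescheduled; completed
    # reduce output is in the global file system, so only in-progress
    # reduce tasks are rescheduled.
    redo = []
    for t in tasks:
        if t["worker"] in failed_workers:
            if t["type"] == "map" or t["state"] == "in_progress":
                redo.append(t["id"])
    return redo

hb = {"w1": 100.0, "w2": 95.0}
failed = find_failed_workers(hb, now=106.0)  # w2 missed its heartbeat
tasks = [
    {"id": 1, "type": "map", "state": "completed", "worker": "w2"},
    {"id": 2, "type": "reduce", "state": "completed", "worker": "w2"},
    {"id": 3, "type": "reduce", "state": "in_progress", "worker": "w2"},
]
print(tasks_to_reexecute(tasks, failed))  # [1, 3]
```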
3. IMPLEMENTATION…
Locality
Master scheduling policy:
• Asks GFS for the locations of the replicas of the input file blocks
• Map task inputs are typically split into 64 MB pieces (the GFS block size)
• Map tasks are scheduled so that a GFS replica of the input block is on the same machine or the same rack
As a result:
• Most tasks' input data is read locally and consumes no network bandwidth
3. IMPLEMENTATION…
Backup Tasks
• A common cause that lengthens the total time taken by a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few tasks.
• Mechanism to alleviate the problem of stragglers: when the operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
• This significantly reduces the time to complete large MapReduce operations (the sort benchmark takes 44% longer with the mechanism disabled).
4. REFINEMENTS
• Different partitioning functions
  • Users specify the number of reduce tasks/output files they desire (R) and may supply a custom partitioning function
• Combiner function
  • Does partial merging of intermediate data on the map worker; useful for saving network bandwidth
• Different input/output types
• Skipping bad records
  • The master tells the next worker assigned the task to skip the bad record
• Local execution
  • An alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine
• Status info
  • Progress of the computation & more
• Counters
  • Count occurrences of various events (e.g., total number of words processed)
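The combiner refinement is easiest to see with word count: partial sums on the map side shrink the data sent across the network. A sketch (plain Python, not the library's API):

```python
from collections import Counter

def map_with_combiner(line):
    # Without a combiner, the mapper would emit ("the", 1) once per
    # occurrence; the combiner pre-aggregates counts locally so each
    # word is sent at most once per map task.
    return list(Counter(line.split()).items())

pairs = map_with_combiner("the cat saw the dog chase the cat")
# 8 words collapse to 5 intermediate pairs before hitting the network
print(sorted(pairs))  # [('cat', 2), ('chase', 1), ('dog', 1), ('saw', 1), ('the', 3)]
```

A combiner is valid here because the reduce function (summation) is commutative and associative, so partial sums can be merged in any order.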
5. PERFORMANCE
Measure the performance of MapReduce on two computations running on a large cluster of machines:
• Grep: searches through approximately one terabyte of data looking for a particular pattern
• Sort: sorts approximately one terabyte of data
5. PERFORMANCE… Cluster Configuration
Cluster: 1,800 machines
Memory: 4 GB per machine
Processors: dual-processor 2 GHz Xeons with Hyper-Threading
Hard disk: dual 160 GB IDE disks per machine
Network: Gigabit Ethernet per machine; aggregate bisection bandwidth approximately 100 Gbps
5. PERFORMANCE… Grep Computation
• Scans 10 billion 100-byte records, searching for a rare 3-character pattern (occurs in 92,337 records)
• Input is split into approximately 64 MB pieces (M = 15,000); the entire output is placed in one file (R = 1)
• Startup overhead is significant for short jobs
[Figure: data transfer rate over time]
5. PERFORMANCE… Sort Computation
• Backup tasks improve completion time considerably
• The system handles machine failures relatively quickly
[Figure: data transfer rates over time for different executions of the sort program — with backup tasks disabled the run takes 44% longer; with 200 tasks killed it takes only 5% longer]
6. EXPERIENCE & CONCLUSIONS
• MapReduce has proven to be a useful abstraction
• Greatly simplifies large-scale computations at Google
• Fun to use: focus on the problem, let the library deal with the messy details
• Little need for parallelization knowledge (relieves users from dealing with low-level parallelization details)
Thank you!