MapReduce: Simplified Data Processing on Large Clusters
By Jeffrey Dean and Sanjay Ghemawat
Presented by Cleverence Kombe


TRANSCRIPT

Page 1: Map reduce - simplified data processing on large clusters

MapReduce: Simplified Data Processing on Large Clusters

Presented by Cleverence Kombe

By Jeffrey Dean and Sanjay Ghemawat

Page 2: Map reduce - simplified data processing on large clusters

OUTLINE
1. Introduction
2. Programming Model
3. Implementation
4. Refinements
5. Performance
6. Experience and Conclusion

Page 3: Map reduce - simplified data processing on large clusters

1. INTRODUCTION

o Many tasks in large-scale data processing consist of computations that process large amounts of raw data, such as crawled documents and web request logs, and produce large amounts of derived data.
o Because the input data is so massive, the computation must be distributed across hundreds or thousands of machines to complete in a reasonable amount of time.
o Google wrote many special-purpose programs to parallelize such computations, distribute the data, and handle failures, but this code was very complex.
o Jeffrey Dean and Sanjay Ghemawat came up with the MapReduce concept, which simplifies data processing by hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

Page 4: Map reduce - simplified data processing on large clusters

1. INTRODUCTION CONT…

o What is MapReduce?
  • A programming model and approach for processing large data sets
  • Contains Map and Reduce functions
  • Runs on a large cluster of commodity machines
  • Many real-world tasks are expressible in this model

o MapReduce provides:
  • User-defined functions
  • Automatic parallelization and distribution
  • Fault tolerance
  • I/O scheduling
  • Status and monitoring

Page 5: Map reduce - simplified data processing on large clusters

2. PROGRAMMING MODEL

o Input & output are sets of key/value pairs.
o The programmer specifies two functions:

  1. map(in_key, in_value) -> list(out_key, intermediate_value)
     • Processes an input key/value pair
     • Produces a set of intermediate pairs

  2. reduce(out_key, list(intermediate_value)) -> list(out_value)
     • Combines all intermediate values for a particular key
     • Produces a set of merged output values (in most cases just one)
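To make the two functions concrete, here is a minimal Python sketch of the word-count example from the paper; the emit-style interface is modeled with generators, which is an illustrative choice rather than the library's actual C++ API.

    # map: (in_key, in_value) -> list(out_key, intermediate_value)
    def word_count_map(in_key, in_value):
        # in_key: document name (unused); in_value: document contents.
        for word in in_value.split():
            yield (word, 1)

    # reduce: (out_key, list(intermediate_value)) -> list(out_value)
    def word_count_reduce(out_key, intermediate_values):
        # Combine all counts emitted for one word into a single total.
        yield sum(intermediate_values)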

Page 6: Map reduce - simplified data processing on large clusters

2. PROGRAMMING MODEL …

o Word Count Example

[Diagram: input files are split and each line is passed to an individual mapper instance; the Map step emits key/value pairs, which are sorted and shuffled by key; the Reduce step merges the values for each key and writes the final output file.]
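The diagram's stages can be simulated in a few lines of single-process Python, reusing word_count_map and word_count_reduce from the sketch above; a real MapReduce run distributes these same stages across many machines.

    from collections import defaultdict

    def run_word_count(input_files):
        # Splitting: each line is passed to an individual mapper instance.
        intermediate = []
        for name, contents in input_files.items():
            for line in contents.splitlines():
                intermediate.extend(word_count_map(name, line))
        # Sort and shuffle: group intermediate values by key.
        groups = defaultdict(list)
        for key, value in sorted(intermediate):
            groups[key].append(value)
        # Reduce: one call per distinct key produces the final output.
        return {key: next(word_count_reduce(key, values))
                for key, values in groups.items()}

    print(run_word_count({"file1": "the quick fox", "file2": "the lazy dog"}))
    # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}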

Page 7: Map reduce - simplified data processing on large clusters

2. PROGRAMMING MODEL … More Examples

o Distributed Grep: the map function emits a line if it matches a supplied pattern.
o Count of URL Access Frequency: the map function processes logs of web page requests and outputs <URL, 1>; the reduce function adds the values for the same URL and emits a <URL, total count> pair.
o Reverse Web-Link Graph: the map function outputs <target, source> pairs for each link to a target URL found in a page named source.
o Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs.
o Inverted Index: the map function parses each document and emits a sequence of <word, document ID> pairs.
o Distributed Sort: the map function extracts the key from each record and emits a <key, record> pair.
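The inverted index entry, for instance, needs only a map and a reduce. A brief Python sketch (the deduplication with set() is a simplification of my own, not from the paper):

    def inverted_index_map(doc_id, contents):
        # Emit a (word, document ID) pair for each distinct word.
        for word in set(contents.split()):
            yield (word, doc_id)

    def inverted_index_reduce(word, doc_ids):
        # Sort the document IDs for this word into a posting list.
        yield (word, sorted(doc_ids))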

Page 8: Map reduce - simplified data processing on large clusters

3. IMPLEMENTATION

o Many different implementations are possible; the right choice depends on the environment.
o Typical cluster (in wide use at Google: large clusters of PCs connected via switched networks):
  • Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of memory per machine
  • Commodity networking hardware with limited bisection bandwidth
  • Storage on inexpensive local IDE disks
  • GFS, a distributed file system, manages the data
  • A scheduling system lets users submit tasks (a job is a set of tasks mapped by the scheduler to the available machines within the cluster)
o Implemented as a C++ library linked into user programs.

Page 9: Map reduce - simplified data processing on large clusters

3. IMPLEMENTATION…
Execution Overview

Map
• Divide the input into M equal-sized splits
• Each split is 16-64 MB large

Reduce
• Partition the intermediate key space into R pieces
• hash(intermediate_key) mod R

Typical setting:
• 2,000 machines
• M = 200,000
• R = 5,000
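The partitioning rule above is one line of code. A sketch, with Python's built-in hash() standing in for the library's hash function (note that Python randomizes string hashes per process, so a real implementation would use a stable hash):

    R = 5000  # number of reduce tasks, as in the typical setting above

    def partition(intermediate_key, R):
        # Map an intermediate key to one of R reduce partitions.
        return hash(intermediate_key) % R

    print(partition("the", R))  # some partition number in [0, 5000)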

Page 10: Map reduce - simplified data processing on large clusters

3. IMPLEMENTATION… Execution Overview…

[Figure: execution overview. The user program calls (0) mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; map workers partition their output into R regions with the partitioning function hash(intermediate_key) mod R; each reduce worker reads all intermediate data for its region and sorts it by intermediate keys.]
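The reduce side of the figure ("read all intermediate data, sort it by intermediate keys") can be sketched like this; itertools.groupby performs the grouping once the pairs are sorted:

    from itertools import groupby
    from operator import itemgetter

    def reduce_worker(intermediate_pairs, reduce_fn):
        # Sort all (key, value) pairs by key, then feed each key and its
        # grouped values to the user's reduce function.
        output = []
        for key, group in groupby(sorted(intermediate_pairs, key=itemgetter(0)),
                                  key=itemgetter(0)):
            output.extend(reduce_fn(key, [value for _, value in group]))
        return output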

Page 11: Map reduce - simplified data processing on large clusters

3. IMPLEMENTATION…
Fault Tolerance

Worker failure: handled through re-execution
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
  - Why re-execute even the completed map tasks? Their output is stored on the local disk of the failed machine and is therefore inaccessible.
• Re-execute in-progress reduce tasks
• Task completion is committed through the master

Master failure:
• Could be handled, but currently isn't (master failure is unlikely)
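A highly simplified sketch of the re-execution rule, assuming illustrative worker/task records (the master's real data structures are not shown in the slides):

    import time

    HEARTBEAT_TIMEOUT = 10.0  # seconds; an illustrative value

    def handle_failures(workers, tasks):
        # Detect failure via missed heartbeats, then reset tasks to idle
        # so the master will re-execute them on another worker.
        now = time.time()
        for worker in workers:
            if now - worker["last_heartbeat"] <= HEARTBEAT_TIMEOUT:
                continue  # heartbeat seen recently: worker is alive
            for task in (t for t in tasks if t["worker"] == worker["id"]):
                # Map tasks are redone even if completed (their output is
                # on the dead machine's local disk); reduce tasks are
                # redone only if in progress (their output is in GFS).
                if task["kind"] == "map" or task["state"] == "in_progress":
                    task["state"], task["worker"] = "idle", None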

Page 12: Map reduce - simplified data processing on large clusters

3. IMPLEMENTATION…
Locality

Master scheduling policy:
• Asks GFS for the locations of the replicas of the input file blocks
• Map tasks are typically split into 64 MB pieces (the GFS block size)
• Map tasks are scheduled so that a replica of the input block is on the same machine or the same rack

As a result:
• Most tasks' input data is read locally and consumes no network bandwidth
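The scheduling preference can be sketched as a three-tier choice; the replica and rack fields below are illustrative stand-ins for the metadata the master gets from GFS:

    def pick_worker_for_map_task(task, workers):
        # Prefer a machine holding a replica of the input block, then a
        # machine on the same rack as a replica, then any idle machine.
        idle = [w for w in workers if w["idle"]]
        for w in idle:
            if w["host"] in task["replica_hosts"]:
                return w  # data is local: no network bandwidth consumed
        for w in idle:
            if w["rack"] in task["replica_racks"]:
                return w  # same rack: traffic stays off the core network
        return idle[0] if idle else None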

Page 13: Map reduce - simplified data processing on large clusters

3. IMPLEMENTATION…
Backup Tasks

• A common cause that lengthens the total time taken for a MapReduce operation is a straggler: a machine that takes an unusually long time to complete one of the last few tasks.
• Backup tasks are a mechanism to alleviate the problem of stragglers.
• When the operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
• This significantly reduces the time to complete large MapReduce operations (by up to 40%).
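In pseudocode form the mechanism is small; the 1% threshold and the run_backup method below are illustrative assumptions, not values from the slides:

    def schedule_backup_tasks(tasks, idle_workers):
        # When the operation is close to completion, launch a backup
        # copy of each remaining in-progress task; the task is marked
        # completed when either the primary or the backup finishes.
        in_progress = [t for t in tasks if t["state"] == "in_progress"]
        close_to_done = len(in_progress) <= max(1, len(tasks) // 100)
        if close_to_done:
            for task, worker in zip(in_progress, idle_workers):
                worker.run_backup(task)  # hypothetical worker method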

Page 14: Map reduce - simplified data processing on large clusters

4. REFINEMENTS

• Different partitioning functions
  - Users specify the number of reduce tasks/output files they desire (R)
• Combiner function
  - Partially merges intermediate data on the map worker; useful for saving network bandwidth
• Different input/output types
• Skipping bad records
  - The master tells the next worker to skip a record that repeatedly causes failures
• Local execution
  - An alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine
• Status info
  - Progress of the computation & more info…
• Counters
  - Count occurrences of various events (e.g., total number of words processed)
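The combiner is easiest to see with word count: thousands of <the, 1> pairs collapse into one partial count on the map worker before crossing the network. A minimal Python sketch:

    from collections import defaultdict

    def combine(map_output):
        # Partially merge (word, count) pairs on the map worker; the
        # reduce phase later merges these partial counts across workers.
        partial = defaultdict(int)
        for word, count in map_output:
            partial[word] += count
        return list(partial.items())

    print(combine([("the", 1), ("the", 1), ("the", 1)]))  # [('the', 3)]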

Page 15: Map reduce - simplified data processing on large clusters

5. PERFORMANCE

Measure the performance of MapReduce on two computations running on a large cluster of machines:

Grep
• Searches through approximately one terabyte of data looking for a particular pattern

Sort
• Sorts approximately one terabyte of data

Page 16: Map reduce - simplified data processing on large clusters

5. PERFORMANCE…
Cluster Configuration

Specifications:
  Cluster      1,800 machines
  Memory       4 GB per machine
  Processors   Dual-processor 2 GHz Xeons with Hyper-Threading
  Hard disk    Dual 160 GB IDE disks
  Network      Gigabit Ethernet per machine
  Bandwidth    Approximately 100 Gbps aggregate

Page 17: Map reduce - simplified data processing on large clusters

5. PERFORMANCE…
Grep Computation

• Scans 10 billion 100-byte records, searching for a rare 3-character pattern (the pattern occurs in 92,337 records).
• The input is split into approximately 64 MB pieces (M = 15,000); the entire output is placed in one file (R = 1).
• Startup overhead is significant for short jobs.

[Figure: data transfer rate over time]
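Expressed in the model, grep needs only a pattern-matching map and an identity reduce. A brief sketch, with "abc" as a stand-in for the rare 3-character pattern:

    import re

    PATTERN = re.compile("abc")  # stand-in for the rare 3-character pattern

    def grep_map(offset, line):
        # Emit the line (keyed by its file offset) only if it matches.
        if PATTERN.search(line):
            yield (offset, line)

    def grep_reduce(offset, lines):
        # Identity reduce: copy the matching lines to the output.
        yield from lines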

Page 18: Map reduce - simplified data processing on large clusters

5. PERFORMANCE…
Sort Computation

• Backup tasks improve completion time considerably: with backup tasks disabled, the sort takes 44% longer.
• The system manages machine failures relatively quickly: a run with killed tasks takes only 5% longer.

[Figure: data transfer rates over time for different executions of the sort program]

Page 19: Map reduce - simplified data processing on large clusters

6. EXPERIENCE & CONCLUSIONS

• MapReduce has proven to be a useful abstraction.
• It greatly simplifies large-scale computations at Google.
• It is fun to use: programmers focus on the problem and let the library deal with the messy details.
• No deep parallelization knowledge is needed (the library relieves the user from dealing with low-level parallelization details).

Page 20: Map reduce - simplified data processing on large clusters

Thank you!