
MapReduce: Simplified Data Processing on Large Clusters

http://net.pku.edu.cn/~course/cs402/2009

Hongfei Yan, School of EECS, Peking University

7/9/2009

What's MapReduce?

Parallel/Distributed Computing Programming Model

Input → split → shuffle → output

Typical problem solved by MapReduce

Read the input: records formatted as key/value pairs.

Map: extract something from each record.
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes one input key/value pair and emits intermediate key/value pairs.

Shuffle: exchange and regroup the data so that all intermediate results with the same key are gathered on the same node.

Reduce: aggregate, summarize, filter, etc.
reduce (out_key, list(intermediate_value)) -> list(out_value)
Merges all values for a given key, computes over them, and emits the combined result (usually just one value).

Write the output.
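As a concrete instance of this pattern, here is a minimal word-count sketch in plain Python. It is not the paper's implementation; the in-memory shuffle() below merely stands in for the framework's distributed exchange, and the record keys are arbitrary.

from collections import defaultdict

# Map: extract something from each record -- here, emit (word, 1) per word.
def map_fn(in_key, in_value):
    for word in in_value.split():
        yield (word, 1)

# Shuffle: gather all intermediate values with the same key together.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: merge all values for one key into (usually) a single output value.
def reduce_fn(out_key, intermediate_values):
    yield sum(intermediate_values)

records = {1: "the quick brown fox", 2: "the lazy brown dog"}
intermediate = [kv for k, v in records.items() for kv in map_fn(k, v)]
for word, values in shuffle(intermediate).items():
    print(word, list(reduce_fn(word, values))[0])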

MapReduce Framework

[Figure: data flow through the framework. Input key/value pairs read from data store 1 ... data store n are fed to map tasks, which emit intermediate pairs (key 1, values...), (key 2, values...), (key 3, values...). A barrier then aggregates intermediate values by output key, and reduce tasks turn each key's intermediate values into the final values for key 1, key 2, key 3, ...]

Shuffle Implementation

Two steps: partition and group (sort).

Partition function: hash(key) % (number of reducers)
Group function: sort by key
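A toy sketch of these two steps, assuming the hash(key) % (number of reducers) rule from the slide. Real frameworks use a stable, user-overridable hash; Python's built-in hash() is only consistent within one process, so this is for illustration only.

# Partition: decide which reducer receives each intermediate key.
def partition(key, num_reducers):
    return hash(key) % num_reducers

# Group: sort a reducer's pairs by key so equal keys arrive contiguously.
def group(pairs):
    return sorted(pairs, key=lambda kv: kv[0])

pairs = [("dog", 1), ("cat", 1), ("dog", 1), ("ant", 1)]
num_reducers = 2
partitions = {r: [] for r in range(num_reducers)}
for key, value in pairs:
    partitions[partition(key, num_reducers)].append((key, value))
for r in range(num_reducers):
    print("reducer", r, group(partitions[r]))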

Model Is Widely Applicable: MapReduce Programs in the Google Source Tree

Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction (sketched below)
document clustering
machine learning
statistical machine translation
...
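For instance, inverted index construction from the list above maps directly onto the model. A rough sketch follows; the documents and identifiers are made up, and the in-memory grouping stands in for the shuffle that a real job would distribute over many splits.

from collections import defaultdict

# Map: for each (doc_id, text) record, emit (word, doc_id) once per word.
def map_fn(doc_id, text):
    for word in set(text.split()):
        yield (word, doc_id)

# Reduce: for each word, emit the sorted list of documents containing it.
def reduce_fn(word, doc_ids):
    yield (word, sorted(doc_ids))

docs = {"d1": "mapreduce simplifies large data processing",
        "d2": "large clusters process large data"}
postings = defaultdict(list)            # shuffle: group doc ids by word
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        postings[word].append(d)
for word in sorted(postings):
    print(next(reduce_fn(word, postings[word])))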

Algorithms That Fit in MapReduce

Algorithms reported as implemented in the literature:
K-Means, EM, SVM, PCA, Linear Regression, Naïve Bayes, Logistic Regression, Neural Networks
PageRank
Word co-occurrence matrices, pairwise document similarity
Monte Carlo simulation
...

MapReduce Runtime System

Google MapReduce Architecture

Single master node

Many worker bees

MapReduce Operation

Initial data split into 64 MB blocks

Map tasks compute; results stored locally

Master informed of result locations

Master sends data locations to the reduce workers

Final output written

Fault Tolerance

Fault tolerance via re-execution: periodic heartbeats detect failures.

Re-execute both the completed and the in-progress map tasks of a failed node. Why?
Re-execute only the in-progress reduce tasks of a failed node.
Task completion is committed through the master.

Robust: once lost 1600 of 1800 machines and the job still finished OK.

What about master failure?

Refinement: Redundant Execution

Slow workers significantly delay completion time:
other jobs consuming resources on the machine
bad disks with soft errors transfer data slowly

Solution: near the end of a phase, spawn backup tasks; whichever one finishes first "wins".

Dramatically shortens job completion time.

Refinement: Locality Optimization

Master scheduling policy: ask GFS for the locations of the replicas of the input file blocks.
Map tasks typically work on 64 MB splits (the GFS block size).
Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack.

Effect: thousands of machines read input at local-disk speed.

Without this, rack switches limit the read rate.

Refinement: Skipping Bad Records

Map/Reduce functions sometimes fail for particular inputs. The best solution is to debug and fix, but that is not always possible (e.g., third-party source libraries).

On a segmentation fault, the worker's signal handler sends a UDP packet to the master, including the sequence number of the record being processed. If the master sees two failures for the same record, the next worker is told to skip that record.
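A rough sketch of the master-side bookkeeping this implies. The two-failure threshold and the record sequence number come from the slide; the class, method names, and in-memory data structures are hypothetical, and the UDP reporting itself is omitted.

from collections import defaultdict

class SkipTracker:
    # Tracks failure reports per record and builds the skip list
    # handed to the next worker that re-executes the task.
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = defaultdict(int)   # record sequence number -> failure count
        self.skip = set()

    def report_failure(self, record_seqno):
        # Called when a worker's signal handler reports a crash on this record.
        self.failures[record_seqno] += 1
        if self.failures[record_seqno] >= self.threshold:
            self.skip.add(record_seqno)

    def records_to_skip(self):
        return set(self.skip)

tracker = SkipTracker()
tracker.report_failure(42)
tracker.report_failure(42)
print(tracker.records_to_skip())   # {42}: the next worker skips record 42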

Other Refinements

Compression of intermediate data
Combiner: "combiner" functions run on the same machine as a mapper, causing a mini-reduce phase over the map output before the real reduce phase, to save bandwidth (see the sketch below)
Local execution for debugging/testing
User-defined counters
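For word count, the combiner can be the same aggregation the reducer performs, applied locally to one mapper's output before anything is shuffled. A minimal sketch, illustrative only and not Hadoop's or Google's combiner API:

from collections import Counter

# One mapper's raw output: a (word, 1) pair for every occurrence.
mapper_output = [("the", 1), ("dog", 1), ("the", 1), ("the", 1)]

# Combiner: a mini-reduce on the mapper's machine, shrinking what is shuffled.
def combine(pairs):
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

print(combine(mapper_output))   # [('dog', 1), ('the', 3)]: 2 pairs cross the network instead of 4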

Hadoop MapReduce Architecture

[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; each slave node runs a TaskTracker, which launches task instances.]

Master/worker model; load balancing by a polling mechanism
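To make the picture concrete, here is a hedged sketch of a word-count job written for Hadoop Streaming, which lets the TaskTrackers run any executable that reads and writes plain text on stdin/stdout. The file name, the input/output paths, and the location of the streaming jar are illustrative and vary by installation.

#!/usr/bin/env python
# wordcount_streaming.py -- acts as either the mapper or the reducer.
# Illustrative submission (the streaming jar path depends on the install):
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -input /data/in -output /data/out \
#       -mapper "wordcount_streaming.py map" \
#       -reducer "wordcount_streaming.py reduce" \
#       -file wordcount_streaming.py
import sys

def run_map():
    # Emit one "word<TAB>1" line per word; the framework shuffles by key.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def run_reduce():
    # Streaming delivers input sorted by key, so counts for one word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    run_map() if sys.argv[1:] == ["map"] else run_reduce()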

History of Hadoop

2004 - Initial versions of what is now the Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting and Mike Cafarella.

December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.

January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team.
May 2006 - Yahoo! sets up a Hadoop research cluster - 300 nodes.
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours.
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
October 2006 - Research cluster reaches 600 nodes.
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs.
January 2007 - Research cluster reaches 900 nodes.
April 2007 - Research clusters - two clusters of 1,000 nodes.
September 2008 - Scaling Hadoop to 4,000 nodes at Yahoo!
April 2009 - Release 0.20.0: many improvements, new features, bug fixes, and optimizations.

Hadoop 0.18 Highlights

Apache Hadoop 0.18 was released on 8/22.
266 patches were committed; 20% of the patches came from contributors outside of Yahoo!.
The gridmix benchmark runs in ~45% of the time taken by Hadoop 0.15.

New in MapReduce: intermediate compression that just works, (single) reduce optimizations, and an archive tool.

Summary

MapReduce is a simple, easy-to-use parallel programming model; it greatly simplifies the implementation of large-scale data processing problems.

References and Resources

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.

[2] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," UC Berkeley Technical Report UCB/EECS-2006-183, 2006.

[3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," SIGOPS Oper. Syst. Rev., vol. 41, pp. 59-72, 2007.

[4] The Apache Hadoop Project, http://hadoop.apache.org/, 2009.