TRANSCRIPT
MapReduce: Simplified Data Processing on Large Clusters
http://net.pku.edu.cn/~course/cs402/2009
Hongfei Yan, School of EECS, Peking University
7/9/2009
Typical problem solved by MapReduce
Read in the data: records formatted as key/value pairs.
Map: extract something from each record.
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair and emits intermediate key/value pairs.
Shuffle: exchange and re-sort the data so that all intermediate results with the same key are gathered on the same node.
Reduce: aggregate, summarize, filter, etc.
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  Merges all values for a given key, performs the computation, and emits the combined result (usually just one value).
Write out the results.
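The map/shuffle/reduce flow above can be sketched as a minimal single-process MapReduce, with word count as the canonical example (an illustrative toy only; `run_mapreduce` and the helper names are my own, not the paper's API):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory MapReduce: map, shuffle by key, then reduce."""
    # Map: emit intermediate key/value pairs from each input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle: gather all values sharing the same intermediate key.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)
    # Reduce: merge each key's values into a final output.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return sum(counts)

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```

In the real system the three stages run on different machines and the shuffle moves data over the network; here they are just three loops in one process.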
MapReduce Framework
[Figure: map tasks read input key/value pairs from data stores 1..n and emit intermediate (key, values...) pairs; a barrier aggregates intermediate values by output key; reduce tasks then turn each key's intermediate values into final values.]
Example uses:
  distributed grep
  distributed sort
  web link-graph reversal
  term-vector per host
  web access log stats
  inverted index construction
  document clustering
  machine learning
  statistical machine translation
  ...
Model is Widely Applicable
[Figure: MapReduce programs in the Google source tree]
Algorithms Fit in MapReduce
Algorithms with MapReduce implementations reported in the literature:
K-Means, EM, SVM, PCA, Linear Regression, Naïve Bayes, Logistic Regression, Neural Network, PageRank, Word Co-occurrence Matrices, Pairwise Document Similarity, Monte Carlo simulation, ...
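As one concrete case, a word co-occurrence matrix maps naturally onto the model with the "pairs" approach: the mapper emits ((w1, w2), 1) for nearby words and the reducer sums counts per pair (a sketch with my own function names, simulating the shuffle in memory):

```python
from collections import defaultdict

def cooccur_map(doc_id, words, window=1):
    """Emit ((w1, w2), 1) for each word w2 within `window` positions after w1."""
    pairs = []
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            pairs.append(((w, words[j]), 1))
    return pairs

def cooccur_reduce(pair, counts):
    return sum(counts)

# Simulated shuffle + reduce over one tiny document.
groups = defaultdict(list)
for k, v in cooccur_map("d1", ["a", "b", "a", "b"]):
    groups[k].append(v)
matrix = {k: cooccur_reduce(k, vs) for k, vs in groups.items()}
# matrix == {("a", "b"): 2, ("b", "a"): 1}
```

The same shape (emit a composite key, sum in reduce) covers many of the algorithms listed above.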
MapReduce Operation
Initial data split into 64MB blocks
Computed; results locally stored
Master informed of result locations
M sends data locations to R workers
Final output written
Fault Tolerance
Fault tolerance is achieved via re-execution.
Periodic heartbeats detect failures.
Re-execute both the completed and the in-progress map tasks of a failed node. Why????
Re-execute only the in-progress reduce tasks of a failed node.
Task completion is committed through the master.
Robust: lost 1600/1800 machines once, yet finished OK.
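The asymmetry above (redo completed map tasks, but not completed reduce tasks) exists because map output sits on the failed worker's local disk and is lost with it, while completed reduce output is already in the global file system. A sketch of the master's rescheduling rule, with a hypothetical task-dict layout of my own:

```python
def tasks_to_reexecute(tasks, failed_worker):
    """Return the task ids the master must reschedule after a worker fails.

    Map tasks: re-run in-progress AND completed ones, because map output
    lives on the failed worker's local disk and is now unreachable.
    Reduce tasks: re-run only in-progress ones; completed reduce output
    is already stored in the global file system.
    """
    redo = []
    for t in tasks:
        if t["worker"] != failed_worker:
            continue
        if t["kind"] == "map" and t["state"] in ("in_progress", "completed"):
            redo.append(t["id"])
        elif t["kind"] == "reduce" and t["state"] == "in_progress":
            redo.append(t["id"])
    return redo

tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "m2", "kind": "map", "state": "in_progress", "worker": "w1"},
    {"id": "r1", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "r2", "kind": "reduce", "state": "in_progress", "worker": "w1"},
    {"id": "m3", "kind": "map", "state": "completed", "worker": "w2"},
]
# tasks_to_reexecute(tasks, "w1") == ["m1", "m2", "r2"]
```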
Master Failure?
Refinement: Redundant Execution
Slow workers significantly delay completion time:
  Other jobs consuming resources on the machine
  Bad disks with soft errors transfer data slowly
Solution: near the end of a phase, spawn backup tasks; whichever copy finishes first "wins".
Dramatically shortens job completion time.
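The backup-task idea can be sketched with Python's `concurrent.futures` (a toy, assuming tasks are idempotent so running two copies is safe; `run_with_backup` and the 0.05s straggler threshold are my own choices, not the paper's):

```python
import concurrent.futures
import random
import time

def run_with_backup(task, executor, backup_after=0.05):
    """Run `task`; if it has not finished within `backup_after` seconds,
    spawn a backup copy and return whichever result arrives first."""
    primary = executor.submit(task)
    done, _ = concurrent.futures.wait({primary}, timeout=backup_after)
    if done:
        return primary.result()
    # Straggler suspected: launch a redundant backup task.
    backup = executor.submit(task)
    done, _ = concurrent.futures.wait(
        {primary, backup}, return_when=concurrent.futures.FIRST_COMPLETED)
    return done.pop().result()

def maybe_slow_task():
    # Simulates a worker that is sometimes a straggler.
    time.sleep(random.choice([0.01, 0.2]))
    return "done"

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
    result = run_with_backup(maybe_slow_task, ex)
# result == "done"
```

The key property mirrored here is that the job's latency becomes the minimum of the two copies rather than being held hostage by one slow machine.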
Refinement: Locality Optimization
Master scheduling policy:
  Asks GFS for locations of replicas of input file blocks
  Map tasks typically split into 64MB chunks (GFS block size)
  Map tasks scheduled so a GFS input block replica is on the same machine or the same rack
Effect:
  Thousands of machines read input at local disk speed
  Without this, rack switches limit the read rate
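The scheduling preference above (machine-local, then rack-local, then anywhere) can be sketched as a small placement function (illustrative; the data layout and `pick_worker` name are my own assumptions):

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Prefer an idle worker holding a replica of the input block;
    otherwise one on the same rack as a replica; otherwise any idle worker."""
    for w in idle_workers:
        if w in replica_hosts:
            return w  # data-local: read at local disk speed
    replica_racks = {rack_of[h] for h in replica_hosts}
    for w in idle_workers:
        if rack_of[w] in replica_racks:
            return w  # rack-local: avoid cross-rack switch traffic
    return idle_workers[0] if idle_workers else None

rack_of = {"h1": "r1", "h2": "r1", "h3": "r2"}
# Block replicated on h1; h1 busy, so the rack-mate h2 is chosen:
# pick_worker(["h1"], ["h2", "h3"], rack_of) == "h2"
```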
Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs.
Best solution is to debug & fix; not always possible (e.g., third-party source libraries).
On segmentation fault:
  Send UDP packet to master from signal handler
  Include sequence number of the record being processed
If master sees two failures for the same record, the next worker is told to skip that record.
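The two-strikes protocol can be sketched in Python, with an ordinary exception standing in for the segmentation fault and a `Master` class of my own invention tracking failure reports:

```python
from collections import Counter

class Master:
    """Tracks per-record failure reports; after two failures on the same
    record sequence number, later workers are told to skip it."""
    def __init__(self):
        self.failures = Counter()

    def report_failure(self, record_seq):
        self.failures[record_seq] += 1

    def should_skip(self, record_seq):
        return self.failures[record_seq] >= 2

def run_map(master, records, map_fn):
    out = []
    for seq, rec in enumerate(records):
        if master.should_skip(seq):
            continue  # master saw two crashes on this record: skip it
        try:
            out.extend(map_fn(seq, rec))
        except Exception:
            # In the real system this is a UDP packet from a signal handler.
            master.report_failure(seq)
            raise
    return out

def bad_map(seq, rec):
    if rec == "poison":
        raise ValueError("crash")
    return [(rec, 1)]

m = Master()
records = ["ok", "poison", "also ok"]
for attempt in range(3):
    try:
        result = run_map(m, records, bad_map)
        break
    except ValueError:
        pass  # worker "died"; master restarts the task
# result == [("ok", 1), ("also ok", 1)]  -- the bad record was skipped
```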
Other Refinements
Compression of intermediate data
Combiner: "combiner" functions can run on the same machine as a mapper, causing a mini-reduce phase before the real reduce phase, to save bandwidth
Local execution for debugging/testing
User-defined counters
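For an associative, commutative reduce like word count, the combiner is just the reduce function applied locally to a mapper's own output. A sketch of the bandwidth saving (function names are mine):

```python
from collections import Counter

def map_words(text):
    # Mapper output: one (word, 1) pair per occurrence.
    return [(w, 1) for w in text.split()]

def combine(pairs):
    """Combiner: a mini-reduce on the mapper's local output before the
    shuffle, shrinking the data that must cross the network."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

raw = map_words("to be or not to be")  # 6 pairs would be shipped
combined = combine(raw)                # only 4 pairs after combining
# dict(combined) == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Combining is only safe when pre-aggregating partial values does not change the final reduce result (e.g., summation), which is why it is an optional user-supplied function.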
Hadoop MapReduce Architecture
[Figure: a client computer submits a MapReduce job to the JobTracker on the master node; a TaskTracker on each slave node runs the task instances.]
Master/Worker model; load-balancing by a polling mechanism.
History of Hadoop
2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS
March 2006 - Formation of the Yahoo! Hadoop team
May 2006 - Yahoo! sets up a Hadoop research cluster - 300 nodes
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
October 2006 - Research cluster reaches 600 nodes
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January 2007 - Research cluster reaches 900 nodes
April 2007 - Research clusters - 2 clusters of 1000 nodes
Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
April 2009 - Release 0.20.0: many improvements, new features, bug fixes and optimizations
Hadoop 0.18 Highlights
Apache Hadoop 0.18 was released on 8/22.
266 patches committed; 20% of patches from contributors outside Yahoo!
Grid mix benchmark runs in ~45% of the time taken by Hadoop 0.15.
New stuff in MapReduce:
  Intermediate compression that just works
  (Single) reduce optimizations
  Archive tool