Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang 1, Ali Dasdan 1
Ruey-Lung Hsiao 2, D. Stott Parker 2
Yahoo! 1
Computer Science Department, UCLA 2
SIGMOD 2007, Beijing, China
Presented by Jongheum Yeon, 2009. 08. 13.
Copyright 2009 by CEBT
Outline
Introduction
Map-Reduce
Map-Reduce-Merge
Conclusions
Introduction
New data-processing systems should consider alternatives to using big, traditional databases
Map-Reduce does a good job, in a limited context, with extraordinary simplicity
Map-Reduce-Merge tries to extend the applicability of Map-Reduce without giving up too much simplicity
Introduction (cont’d)
Comparable data-processing stacks (layers shown on the slide: Application, Language, Execution, Storage)

Execution            Storage                      Language
Parallel Databases   (none listed)                SQL
Map-Reduce           GFS, BigTable                Sawzall
Dryad                Cosmos, Azure, SQL Server    DryadLINQ, Scope (LINQ, SQL)
Hadoop               HDFS, S3                     Pig, Hive (≈SQL)
Map-Reduce : Motivation
Many special purpose tasks that operate on and produce large amounts of data
Crawled documents, web requests, etc
Inverted indices, summaries, other kinds of derived data
The computation must be distributed across a large number of machines to finish in a reasonable time
Parallelize the computation
Distribute data
These extra concerns obscure the original computation
Map-Reduce : Benefits
Automatic parallelization and distribution
User code complexity and size reduced
Transparent fault-tolerance
I/O scheduling
Fine-grained partitioning of tasks
Dynamically scheduled on available workers
Status and monitoring
Map-Reduce : Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list (out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list (out_value)
– Produces a set of merged output values (usually just one)
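The two signatures above can be sketched as a tiny single-process runner. This is only an illustration under stated assumptions: `run_map_reduce`, `count_map`, `count_reduce`, and the sample records are hypothetical names and data, and a real system would distribute the grouping step across machines.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group by out_key, then apply reduce_fn."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # Grouping by key stands in for the distributed shuffle of a real system.
    return {key: list(reduce_fn(key, values)) for key, values in sorted(groups.items())}


def count_map(doc_id, text):
    # map (in_key, in_value) -> list (out_key, intermediate_value)
    for word in text.split():
        yield word, 1


def count_reduce(word, counts):
    # reduce (out_key, list(intermediate_value)) -> list (out_value)
    yield sum(counts)


# Hypothetical input: docID -> text
result = run_map_reduce([(1, "a b a"), (2, "b c")], count_map, count_reduce)
print(result)
```

Here each reduce call returns a single merged value per key, which matches the "usually just one" remark above.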
Map-Reduce : Data Flow
[Figure: Data blocks feed Map tasks, whose outputs feed Reduce tasks]
Map-Reduce : Data Flow (cont’d)
Map: generates a new key and its value
Reduce: combines the values that share the same key
[Figure: Map tasks read (Key1, Value1) inputs and emit (KeyA, ValueX), (KeyB, ValueY), and (KeyB, ValueZ); Reduce tasks output A=X and B=Y,Z]
Map-Reduce : Architecture
[Figure: a Master coordinates four Workers that run the Map and Reduce tasks; data is stored in GFS]
Map-Reduce : Architecture (cont’d)
Master
Assigns each map/reduce task and maintains its state
Propagates the locations of intermediate files to reduce tasks
Worker
Executes Map or Reduce tasks at the request of the Master
Map-Reduce : Distributed Processing
[Figure: the input file is split into Input 1 … Input M, one per Map task; each Map task writes an intermediate file partitioned into regions 1 … R; each of the R Reduce tasks shuffles in its region from every Map task and writes one of Output 1 … Output R to the output file]
Map-Reduce : Example
Inverted Index
Documents
DocID=1: IDS lab's page
DocID=2: IDB lab's page
Word dictionary
Word    wordID
lab     101
's      201
page    203
IDS     301
IDB     302
Inverted index
wordID  docID  Location
101     1      1
101     2      1
201     1      2
201     2      2
203     1      3
203     2      3
301     1      0
302     2      0
Map-Reduce : Example (cont’d)
Input data to Map
Key(docID)  Value(Text)
1           IDS lab's page
2           IDB lab's page
Output of Map (document 1)
Key(wordID)  Value(docID:Location)
301          1:0
101          1:1
201          1:2
203          1:3
Output of Map (document 2)
Key(wordID)  Value(docID:Location)
302          2:0
101          2:1
201          2:2
203          2:3
Map-Reduce : Example (cont’d)
Shuffle
Collects the values that share a key and delivers them to Reduce
Reduce writes the final result
Input to Reduce (after Shuffle)
Key(wordID)  Value(docID:Location)
101          1:1 2:1
201          1:2 2:2
203          1:3 2:3
301          1:0
302          2:0
Output of Reduce
101=1:1, 2:1
201=1:2, 2:2
203=1:3, 2:3
301=1:0
302=2:0
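The inverted-index example can be reproduced with a short sketch. It assumes the documents arrive already split into tokens and uses English glosses for the words; `WORD_IDS`, `index_map`, and `index_reduce` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Word dictionary from the example (word -> wordID); English glosses are an assumption.
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}


def index_map(doc_id, tokens):
    # Emit (wordID, "docID:Location") for every token position.
    for location, token in enumerate(tokens):
        yield WORD_IDS[token], f"{doc_id}:{location}"


def index_reduce(word_id, postings):
    # Combine all postings that share a wordID into one sorted list.
    return sorted(postings)


docs = [(1, ["IDS", "lab", "'s", "page"]), (2, ["IDB", "lab", "'s", "page"])]

# Shuffle: collect the map outputs that share a key.
shuffled = defaultdict(list)
for doc_id, tokens in docs:
    for word_id, posting in index_map(doc_id, tokens):
        shuffled[word_id].append(posting)

index = {word_id: index_reduce(word_id, postings)
         for word_id, postings in sorted(shuffled.items())}

for word_id, postings in index.items():
    print(f"{word_id}={', '.join(postings)}")
```

The printed lines have the same `wordID=docID:Location, ...` shape as the Reduce output in the example.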
Map-Reduce : Example (cont’d)
Other Examples
Distributed Grep
Count URL Access Frequency
– Map emits <URL, 1>
– Reduce emits <URL, total count>
Reverse Web-Link Graph
– Map emits <target, source>
– Reduce emits <target, list(source)>
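Two of these examples can be sketched in a few lines each. The access log, the link list, and the helper `group_by_key` are hypothetical; grouping in one process stands in for the shuffle.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect values that share a key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


# Count URL access frequency: map emits <URL, 1>, reduce emits <URL, total count>.
access_log = ["/a", "/b", "/a"]  # hypothetical request log
url_counts = {url: sum(ones)
              for url, ones in group_by_key((url, 1) for url in access_log).items()}

# Reverse web-link graph: map emits <target, source>, reduce emits <target, list(source)>.
links = [("p1", "p2"), ("p3", "p2"), ("p1", "p3")]  # hypothetical (source, target) pairs
reverse_links = {target: sorted(sources)
                 for target, sources in group_by_key(
                     (target, source) for source, target in links).items()}

print(url_counts)
print(reverse_links)
```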
Map-Reduce-Merge
Map-Reduce is an extremely simple model, but it applies only in a limited context
Map-Reduce handles mainly homogeneous datasets
Relational operators are hard to implement with Map-Reduce (especially join operations)
Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete
Map-Reduce-Merge
Adds a merge phase to the Map-Reduce algorithm
Allows processing of multiple heterogeneous datasets
Like Map and Reduce, the Merge phase is implemented by the developer
Example:
Two datasets: department and employee
Goal: compute each employee’s bonus based on individual rewards and the department’s bonus adjustment
Map-Reduce-Merge
Example
Match keys on dept_id across the employee and department tables
Map-Reduce-Merge: Extending Map-Reduce
Changes: the reduce phase now keeps its key, and a merge phase is added
Phases
1. Map: (k1, v1) → [(k2, v2)]
2. Reduce: (k2, [v2]) → [v3]
becomes:
1. Map: (k1, v1) → [(k2, v2)]
2. Reduce: (k2, [v2]) → (k2, [v3])
3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
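The three signatures can be illustrated with the employee and department bonus example. This is a single-process sketch under stated assumptions: the records, the field layout, and every function name below are hypothetical, and the nested loop that pairs reducer outputs is a naive stand-in for a real merge phase.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Map, group by k2, then reduce; the output keeps k2 so merge can match on it."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [(k2, reduce_fn(k2, v2s)) for k2, v2s in sorted(groups.items())]


# Hypothetical data: emp_id -> (dept_id, reward) and dept_id -> bonus adjustment.
employees = [(1, ("B", 100)), (2, ("A", 50)), (1, ("B", 20)), (3, ("A", 30))]
departments = [("A", 1.5), ("B", 2.0)]


def employee_map(emp_id, row):
    dept_id, reward = row
    yield (dept_id, emp_id), reward


def employee_reduce(key, rewards):
    return [sum(rewards)]  # total reward per employee


def department_map(dept_id, adjustment):
    yield dept_id, adjustment


def department_reduce(dept_id, adjustments):
    return adjustments


def merge(employee_entry, department_entry):
    # Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]), matching on dept_id.
    (dept_id, emp_id), totals = employee_entry
    other_dept_id, adjustments = department_entry
    if dept_id != other_dept_id:
        return None
    return emp_id, [total * adjustments[0] for total in totals]


reduced_employees = map_reduce(employees, employee_map, employee_reduce)
reduced_departments = map_reduce(departments, department_map, department_reduce)

bonuses = {}
for employee_entry in reduced_employees:
    for department_entry in reduced_departments:  # naive pairing, fine for a sketch
        merged = merge(employee_entry, department_entry)
        if merged is not None:
            emp_id, values = merged
            bonuses[emp_id] = values

print(bonuses)
```

Keeping `k2` in the reducer output is what lets the merge step compare keys from two different lineages.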
Map-Reduce-Merge
Additional user-definable operations
Merger: user-defined logic for the merge operation, analogous to the map and reduce functions
Processor: processes data from a single source
Partition selector: determines which reduced data goes to which merger
Configurable iterator: controls how the merger steps through each of the two lists as it merges
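One way to picture how these four pieces could fit together is the sketch below. It is an assumption-heavy illustration, not the paper's API: `run_merger`, `select_same_partition`, `nested_loop_iterator`, and the sample partitions are all hypothetical.

```python
def select_same_partition(merger_id, left_count, right_count):
    """Partition selector: merger i reads reducer partition i from both sources."""
    return [merger_id], [merger_id]


def nested_loop_iterator(left, right):
    """Configurable iterator: visit every (left, right) pair."""
    for left_record in left:
        for right_record in right:
            yield left_record, right_record


def identity(record):
    """Processor: per-source processing; here it passes records through unchanged."""
    return record


def sum_on_equal_key(left_record, right_record):
    """Merger: combine two records when their keys match."""
    if left_record[0] != right_record[0]:
        return None
    return left_record[0], left_record[1] + right_record[1]


def run_merger(merger_id, left_parts, right_parts,
               select, process_left, process_right, iterate, merge):
    left_ids, right_ids = select(merger_id, len(left_parts), len(right_parts))
    left = [process_left(rec) for i in left_ids for rec in left_parts[i]]
    right = [process_right(rec) for i in right_ids for rec in right_parts[i]]
    merged = (merge(l, r) for l, r in iterate(left, right))
    return [m for m in merged if m is not None]


# Hypothetical reducer outputs, one list per partition.
left_parts = [[("a", 1)], [("b", 2)]]
right_parts = [[("a", 10)], [("b", 20)]]

outputs = [run_merger(i, left_parts, right_parts, select_same_partition,
                      identity, identity, nested_loop_iterator, sum_on_equal_key)
           for i in range(2)]
print(outputs)
```

Swapping `nested_loop_iterator` for a sorted two-pointer walk would change only the iteration strategy, leaving the merger logic untouched.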
Map-Reduce-Merge : Relational Data Processing
Relational operators can be implemented using the Map-Reduce-Merge model. This includes:
Projection
Aggregation
Generalized selection
Joins
Set union
Set intersection
Set difference
Etc…
Map-Reduce-Merge : Example, Set Union
Each of the two Map-Reduce jobs emits a sorted list of unique elements
The Merge phase merges the two lists by iterating in the following way:
Store the smaller of the two current values and advance its iterator by one
If they are equal, store one of them and advance both iterators
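The iteration rule above can be sketched as a two-pointer merge. The function name `merge_union` and the sample lists are hypothetical; the sketch also copies whatever remains once one list is exhausted, a step the slide leaves implicit.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a holds the smaller value: store it, advance a
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])  # b holds the smaller value: store it, advance b
            j += 1
        else:
            out.append(a[i])  # equal: store one copy, advance both
            i += 1
            j += 1
    out.extend(a[i:])  # leftovers from whichever list is not exhausted
    out.extend(b[j:])
    return out


print(merge_union([1, 3, 5], [2, 3, 6]))
```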
Map-Reduce-Merge : Example, Set Difference
Given two sets, A and B, we want to compute A-B
Each of the two Map-Reduce jobs emits a sorted list of unique elements
The Merge phase iterates simultaneously over the two lists:
If A’s value is smaller than B’s, store A’s value and advance A’s iterator
If B’s value is smaller, advance B’s iterator
If the two are equal, advance both iterators
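The same two-pointer walk gives the set difference. `merge_difference` and the sample lists are hypothetical names and data; the sketch assumes that any values left in A after B is exhausted belong to the result.

```python
def merge_difference(a, b):
    """Compute A - B for two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # value only in A so far: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1            # value only in B: skip it
        else:
            i += 1            # value in both: drop it
            j += 1
    out.extend(a[i:])  # B is exhausted, so the rest of A survives
    return out


print(merge_difference([1, 2, 3, 5], [2, 4, 5, 6]))
```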
Map-Reduce-Merge : Example, Sort-Merge Join
Map: partitions records into mutually exclusive key-range buckets, each assigned to a reducer
Reduce: merges the data in each bucket into a sorted set (that is, sorts the data)
Merge: the merger joins the sorted data for each key range
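The three steps can be sketched in one process as follows. The range boundaries, the sample records, and all function names are hypothetical; each bucket stands in for the work of one reducer and one merger.

```python
from bisect import bisect_right

BOUNDARIES = [10, 20]  # hypothetical key ranges: key < 10, 10 <= key < 20, key >= 20


def partition(records):
    """Map step: route each (key, value) record to its key-range bucket."""
    buckets = [[] for _ in range(len(BOUNDARIES) + 1)]
    for key, value in records:
        buckets[bisect_right(BOUNDARIES, key)].append((key, value))
    return buckets


def sort_bucket(bucket):
    """Reduce step: sort one bucket by key."""
    return sorted(bucket, key=lambda record: record[0])


def merge_join(left, right):
    """Merge step: join two key-sorted lists from the same key range."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif right_key < left_key:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with the left run.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out


def sort_merge_join(left_records, right_records):
    left_sorted = [sort_bucket(b) for b in partition(left_records)]
    right_sorted = [sort_bucket(b) for b in partition(right_records)]
    joined = []
    for left_bucket, right_bucket in zip(left_sorted, right_sorted):
        joined.extend(merge_join(left_bucket, right_bucket))
    return joined


left_table = [(25, "x"), (3, "a"), (12, "m"), (3, "b")]
right_table = [(3, "P"), (12, "Q"), (30, "R"), (12, "S")]
joined = sort_merge_join(left_table, right_table)
print(joined)
```

Because the buckets cover disjoint key ranges, each pair of matching buckets can be joined independently of the others.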
Map-Reduce-Merge : Optimizations
Map-Reduce already optimizes using locality and backup tasks
Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)
Combine two phases into one (example: ReduceMerge)
Conclusions
Map-Reduce-Merge allows us to work on heterogeneous datasets
Map-Reduce-Merge supports joins, which Map-Reduce does not directly support
Next step: develop an SQL-like interface and an optimizer to simplify the development of Map-Reduce-Merge workflows