Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Hung-chih Yang 1, Ali Dasdan 1

Ruey-Lung Hsiao 2, D. Stott Parker 2

Yahoo! 1

Computer Science Department, UCLA 2

SIGMOD 2007, Beijing, China

Presented by Jongheum Yeon, 2009. 08. 13.

Outline

Introduction

Map-Reduce

Map-Reduce-Merge

Conclusions

Introduction

New data-processing systems should consider alternatives to using big, traditional databases

Map-Reduce does a good job, in a limited context, with extraordinary simplicity

Map-Reduce-Merge will try to extend the applicability without giving up too much simplicity

Introduction (cont’d)

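[Figure: data-processing stacks compared across four layers: Application, Language, Execution, and Storage. Execution engines shown: Parallel Databases, Map-Reduce, Hadoop, and Dryad. Storage: GFS and BigTable (Map-Reduce); HDFS and S3 (Hadoop); Cosmos, Azure, and SQL Server (Dryad). Languages: Sawzall (Map-Reduce); Pig and Hive (Hadoop); DryadLINQ and Scope (Dryad). Application level: SQL, ≈SQL, and LINQ.]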

Map-Reduce : Motivation

Many special purpose tasks that operate on and produce large amounts of data

Crawled documents, web requests, etc.

Inverted indices, summaries, other kinds of derived data

The computation must be distributed across a large number of machines to finish in a reasonable time

Parallelize the computation

Distribute data

These extra concerns obscure the original computation

Map-Reduce : Benefits

Automatic parallelization and distribution

User code complexity and size reduced

Transparent fault-tolerance

I/O scheduling

Fine-grained partitioning of tasks

Dynamically scheduled on available workers

Status and monitoring

Map-Reduce : Programming Model

Input & Output: each a set of key/value pairs

Programmer specifies two functions:

map (in_key, in_value) -> list (out_key, intermediate_value)

– Processes input key/value pair

– Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list (out_value)

– Produces a set of merged output values (usually just one)
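The two signatures above can be sketched in a few lines of Python. This is a minimal, sequential stand-in for the framework, not the real distributed runtime; the word-count task and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative assumptions.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Emit an intermediate (word, 1) pair for every word in the text."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Merge all counts collected for one word into a single total."""
    return [sum(intermediate_values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

counts = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
print(counts)
```

Only `map_fn` and `reduce_fn` are user code; everything in `run_mapreduce` is what the framework is expected to provide.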

Map-Reduce : Data Flow

Map : Generate new Key and its value

Reduce : Integrate values of same key

[Figure: example pairs (Key1, Value1), (KeyA, ValueX), (KeyB, ValueY), and (KeyB, ValueZ) flowing through Map and Reduce.]

Map-Reduce : Architecture

[Figure: Master, Worker, and GFS components.]

Master

Assigns each map/reduce task and maintains its state

Propagates the locations of intermediate files to reduce tasks

Worker

Executes Map or Reduce tasks at the Master's request

Map-Reduce : Distributed Processing

[Figure: inputs 1 to M are read from the input file by M Map tasks. Each Map task writes intermediate partitions 1 to R to the intermediate file. A Shuffle step feeds each of the R Reduce tasks, which write Output 1 to Output R to the output file.]
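The routing of intermediate pairs to the R reduce tasks can be sketched as follows. It assumes the common default of hashing the key modulo R; the choice of `zlib.crc32` as the hash and the function names are assumptions made for this illustration.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reduce task for a key with a deterministic hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every intermediate pair to the bucket of its reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:  # one list of pairs per map task
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets

map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
buckets = shuffle(map_outputs, num_reducers=3)
```

Because the partition depends only on the key, all pairs with the same key land at the same reduce task, regardless of which map task produced them.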

Map-Reduce : Example

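Inverted Index

Documents:

DocID=1: IDS lab's page

DocID=2: IDB lab's page

Each document is tokenized into four words: IDS or IDB (location 0), lab (1), 's (2), page (3).

Dictionary:

Word   wordID
lab    101
's     201
page   203
IDS    301
IDB    302

Inverted index (excerpt):

wordID  docID  Location
101     1      1
201     1      2
203     1      3
301     1      0
302     2      0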

Map-Reduce : Example (cont’d)

Input data to Map

Key(docID)  Value(Text)
1           IDS lab's page
2           IDB lab's page

Output of Map

For DocID=1:

Key(wordID)  Value(docID:Location)
301          1:0
101          1:1
201          1:2
203          1:3

For DocID=2:

Key(wordID)  Value(docID:Location)
302          2:0
101          2:1
201          2:2
203          2:3

Shuffle

Collect same keys and convey them to Reduce

Key(wordID)  Value(docID:Location)
101          1:1 2:1
201          1:2 2:2
203          1:3 2:3
301          1:0
302          2:0

Reduce

Reduce writes the final result

101=1:1, 2:1

201=1:2, 2:2

203=1:3, 2:3

301=1:0

302=2:0
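The inverted-index walk-through can be reproduced with a short sketch. It assumes the documents are already tokenized into the four words of the example (written here in English as IDS or IDB, lab, 's, page); the `WORD_IDS` dictionary and the function names are hypothetical and exist only to mirror the wordIDs on the slides.

```python
from collections import defaultdict

# Hypothetical dictionary mirroring the wordIDs used in the example.
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}

def map_fn(doc_id, tokens):
    """Emit (wordID, (docID, location)) for each token position."""
    return [(WORD_IDS[tok], (doc_id, loc)) for loc, tok in enumerate(tokens)]

def reduce_fn(word_id, postings):
    """Order the postings of one word by docID, then location."""
    return sorted(postings)

def build_index(docs):
    groups = defaultdict(list)  # shuffle: group values by wordID
    for doc_id, tokens in docs:
        for word_id, posting in map_fn(doc_id, tokens):
            groups[word_id].append(posting)
    return {w: reduce_fn(w, p) for w, p in sorted(groups.items())}

docs = [(1, ["IDS", "lab", "'s", "page"]), (2, ["IDB", "lab", "'s", "page"])]
index = build_index(docs)
for word_id, postings in index.items():
    print(f"{word_id}=" + ", ".join(f"{d}:{l}" for d, l in postings))
```

The printed lines use the same `wordID=docID:Location` layout as the final Reduce output in the example.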

Other Examples

Distributed Grep

Count URL Access Frequency

– Map emits <URL, 1>

– Reduce emits <URL, total count>

Reverse Web-Link Graph

– Map emits <target, source>

– Reduce emits <target, list(source)>
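As an illustration of the reverse web-link graph, the sketch below inverts a small link table. The page names and the in-memory grouping loop are assumptions for the example; a real job would read crawled pages and let the framework do the grouping.

```python
from collections import defaultdict

def map_fn(source, targets):
    """Emit <target, source> for every link found on the source page."""
    return [(target, source) for target in targets]

def reduce_fn(target, sources):
    """Emit <target, list(source)> with the sources in sorted order."""
    return sorted(sources)

# Hypothetical crawl result: page -> pages it links to.
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}

groups = defaultdict(list)  # stand-in for the shuffle step
for source, targets in pages.items():
    for target, src in map_fn(source, targets):
        groups[target].append(src)

reverse_graph = {t: reduce_fn(t, s) for t, s in groups.items()}
print(reverse_graph)
```

Counting URL accesses follows the same shape, with Map emitting `(url, 1)` and Reduce summing the values.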

Map-Reduce-Merge

Map-Reduce is an extremely simple model, but it applies only in a limited context

Map-Reduce handles mainly homogeneous datasets

Relational operators are hard to implement with Map-Reduce (especially join operations)

Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete

Map-Reduce-Merge

Adds a merge phase to the Map-Reduce algorithm

Allows processing of multiple heterogeneous datasets

Like Map and Reduce, the Merge phase is implemented by the developer

Example:

Two datasets: department and employee

Goal: compute each employee’s bonus based on individual rewards and the department bonus adjustment

Map-Reduce-Merge

Example

Match keys on dept_id across the two tables

Map-Reduce-Merge: Extending Map-Reduce

Changes: a modified reduce phase (its output keeps the key) and a new merge phase

Phases

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → [v3]

becomes:

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → (k2, [v3])

3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
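The three phases can be sketched on the employee and department example. Everything here is an illustrative assumption: the record layouts, the sample values, and the `map_reduce` and `merge` helpers. The merge uses a dictionary lookup on dept_id for brevity, where a real merger could instead iterate over both sorted inputs.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Map, group by k2, reduce; the output keeps its key: (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return sorted((k2, reduce_fn(k2, vs)) for k2, vs in groups.items())

# Employee lineage: key by (dept_id, emp_id) and sum each employee's rewards.
emp_out = map_reduce(
    [(1, ("d1", "alice", 100)), (2, ("d1", "alice", 50)), (3, ("d2", "bob", 80))],
    lambda _, rec: [((rec[0], rec[1]), rec[2])],
    lambda _, rewards: [sum(rewards)],
)

# Department lineage: key by dept_id and keep the bonus adjustment.
dept_out = map_reduce(
    [(1, ("d1", 1.5)), (2, ("d2", 2.0))],
    lambda _, rec: [(rec[0], rec[1])],
    lambda _, adjustments: adjustments,
)

def merge(left, right):
    """Join the two reduced outputs on dept_id and emit (emp_id, [bonus])."""
    adjustment = {dept_id: adj[0] for dept_id, adj in right}
    return [(emp_id, [total[0] * adjustment[dept_id]])
            for (dept_id, emp_id), total in left if dept_id in adjustment]

bonuses = merge(emp_out, dept_out)
print(bonuses)
```

Keeping the key in the reduce output is what makes the join possible: without `dept_id` on the employee side, the merger would have nothing to match on.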

Map-Reduce-Merge

Additional user-definable operations

Merger: same principle as map and reduce

– analogous to the map and reduce definitions, define logic to do the merge operation

Processor: processes data from one source

– process data on an individual source

Partition selector: selects the data that should go to the merger

– which data should go to which merger?

Configurable iterator: how to iterate through each list as the merging is done

– how to step through each of the lists as you merge

Map-Reduce-Merge : Relational Data Processing

Relational operators can be implemented using the Map-Reduce-Merge model. This includes:

Projection

Aggregation

Generalized selection

Set union

Set intersection

Set difference

Etc…

Map-Reduce-Merge : Example, Set Union

Each of the two Map-Reduce jobs emits a sorted list of unique elements

The Merge phase merges the two lists by iterating in the following way:

Store the smaller of the two current values and advance its iterator by one

If they are equal, store one of them and advance both iterators
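The union iteration can be written as a two-pointer loop. This is a minimal sketch on in-memory lists; the function name and the sample inputs are assumptions, and the final two `extend` calls cover the case, not spelled out on the slide, where one list runs out first.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements using two iterators."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a holds the smaller value
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])  # b holds the smaller value
            j += 1
        else:
            out.append(a[i])  # equal: keep one copy, advance both
            i += 1
            j += 1
    out.extend(a[i:])  # whatever remains in either list belongs to the union
    out.extend(b[j:])
    return out

union = merge_union([1, 3, 5], [2, 3, 6])
print(union)
```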

Map-Reduce-Merge : Example, Set Difference

Given two sets, A and B, we want to compute A − B

Each of the two Map-Reduce jobs emits a sorted list of unique elements

The merge iterates simultaneously over the two lists:

If A’s value is smaller than B’s, store A’s value and advance A’s iterator

If B’s value is smaller, advance B’s iterator

If the two are equal, advance both iterators
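The set-difference rules translate directly into a two-pointer loop. This sketch assumes in-memory sorted lists; the function name and inputs are illustrative. Once B is exhausted, every remaining element of A belongs to the result.

```python
def merge_difference(a, b):
    """Compute A - B for two sorted lists of unique elements."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a[i] cannot appear later in b, so keep it
            i += 1
        elif b[j] < a[i]:
            j += 1            # skip values that only occur in b
        else:
            i += 1            # present in both: drop it
            j += 1
    out.extend(a[i:])         # b is exhausted, the rest of a survives
    return out

difference = merge_difference([1, 2, 4, 7], [2, 3, 7, 9])
print(difference)
```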

Map-Reduce-Merge : Example, Sort-Merge Join

Map: partitions records into mutually exclusive key-range buckets; each key range is assigned to a reducer

Reduce: merges the data in each bucket into a sorted set (that is, sorts the data)

Merge: the merger joins the sorted data for each key range
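The three steps of the sort-merge join can be sketched on in-memory lists. The range boundaries, record layout, and function names are assumptions for illustration; in a cluster each bucket would be handled by its own reducer and merger.

```python
import bisect

def range_partition(records, boundaries):
    """Map step: place each (key, value) record in its key-range bucket."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets

def merge_join(left, right):
    """Merge step: join two lists that are already sorted by key."""
    i, j, out = 0, 0, []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif rk < lk:
            j += 1
        else:
            j_end = j  # find the run of equal keys on the right side
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def sort_merge_join(left, right, boundaries):
    result = []
    pairs = zip(range_partition(left, boundaries), range_partition(right, boundaries))
    for lb, rb in pairs:
        # Reduce step: sort each bucket by key before merging.
        lb, rb = sorted(lb, key=lambda r: r[0]), sorted(rb, key=lambda r: r[0])
        result.extend(merge_join(lb, rb))
    return result

left = [(3, "c"), (1, "a"), (7, "g")]
right = [(7, "G"), (3, "C"), (3, "CC"), (2, "B")]
joined = sort_merge_join(left, right, boundaries=[4])
print(joined)
```

Using the same boundaries for both inputs guarantees that matching keys always meet in the same bucket, so the buckets can be joined independently.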

Map-Reduce-Merge : Optimizations

Map-Reduce already optimizes using locality and backup tasks

Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)

Combine two phases into one (example: ReduceMerge)

Conclusions

Map-Reduce-Merge allows us to work on heterogeneous datasets

Map-Reduce-Merge supports joins, which Map-Reduce does not directly support

Next step: develop an SQL-like interface and an optimizer that simplify the development of a Map-Reduce-Merge workflow
