Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Hung-chih Yang 1, Ali Dasdan 1

Ruey-Lung Hsiao 2, D. Stott Parker 2

Yahoo! 1

Computer Science Department, UCLA 2

SIGMOD 2007, Beijing, China

Presented by Jongheum Yeon, 2009. 08. 13.

Copyright 2009 by CEBT

Outline

Introduction

Map-Reduce

Map-Reduce-Merge

Conclusions



Introduction

New data-processing systems should consider alternatives to using big, traditional databases

Map-Reduce does a good job, in a limited context, with extraordinary simplicity

Map-Reduce-Merge will try to extend the applicability without giving up too much simplicity



Introduction (cont’d)

Software stacks compared by layer (figure):

Execution     Parallel Databases   Map-Reduce      Hadoop      Dryad
Application                        Sawzall         Pig, Hive   DryadLINQ, Scope
Storage                            GFS, BigTable   HDFS, S3    Cosmos, Azure, SQL Server
Language      SQL                  Sawzall         ≈SQL        LINQ, SQL


Map-Reduce : Motivation

Many special-purpose tasks operate on and produce large amounts of data

Input: crawled documents, web requests, etc.

Output: inverted indices, summaries, other kinds of derived data

The computation must be distributed across a large number of machines to finish in a reasonable time

Parallelize the computation

Distribute the data

These extra concerns obscure the original computation


Map-Reduce : Benefits

Automatic parallelization and distribution

User code complexity and size reduced

Transparent fault-tolerance

I/O scheduling

Fine-grained partitioning of tasks

Tasks are dynamically scheduled on available workers

Status and monitoring


Map-Reduce : Programming Model

Input & Output: each a set of key/value pairs

Programmer specifies two functions:

map (in_key, in_value) -> list (out_key, intermediate_value)

– Processes input key/value pair

– Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list (out_value)

– Produces a set of merged output values (usually just one)
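
The two signatures above can be tried out on a single machine. The sketch below is a minimal illustration, not the actual distributed runtime: it uses word counting as the task, and the driver `run_map_reduce` and the sample inputs are assumptions made up for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def map_fn(in_key: str, in_value: str) -> List[Tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value)."""
    # Emit one <word, 1> pair per word in the input text.
    return [(word, 1) for word in in_value.split()]


def reduce_fn(out_key: str, intermediate_values: List[int]) -> List[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    # Merge all values of one key into a single output value.
    return [sum(intermediate_values)]


def run_map_reduce(
    inputs: List[Tuple[str, str]],
    mapper: Callable[[str, str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


counts = run_map_reduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
print(counts)
```

The user only writes `map_fn` and `reduce_fn`; everything inside the driver is what the framework would otherwise parallelize and distribute.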



Map-Reduce : Data Flow

[Figure: three Data splits feed three Map tasks, whose outputs feed two Reduce tasks]

Map-Reduce : Data Flow

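Map: generates a new key and value from each input pair

Reduce: combines the values that share a key

[Figure: two Map tasks turn input pairs (Key1, Value1) into intermediate pairs (KeyA, ValueX), (KeyB, ValueY), (KeyB, ValueZ); two Reduce tasks then produce A=X and B=Y,Z]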


Map-Reduce : Architecture

[Figure: a Master coordinates four Workers that run the Map and Reduce tasks; input is read from GFS and output is written to GFS]


Map-Reduce : Architecture

Master

Assigns map/reduce tasks and tracks the state of each one

Propagates the locations of intermediate files to reduce tasks

Worker

Executes Map or Reduce tasks at the request of the Master


Map-Reduce : Distributed Processing

[Figure: the input file is split into Input 1 … Input M, one per Map task; each Map task writes an intermediate file partitioned into regions 1 … R; each of the R Reduce tasks shuffles its region from every Map task and writes one part of the output file, Output 1 … Output R]
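
How a Map task splits its intermediate pairs into R regions can be sketched as follows. The slide does not say which partitioning function is used; hashing the key modulo R is an assumption here, and the helper names `partition`, `split_into_regions`, and `shuffle` are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def partition(key: str, num_reducers: int) -> int:
    """Pick a region in 0 .. R-1 for a key (hash partitioning is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_into_regions(pairs: List[Pair], num_reducers: int) -> List[List[Pair]]:
    """One Map task's intermediate file, divided into R regions."""
    regions: List[List[Pair]] = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return regions


def shuffle(intermediate: List[List[List[Pair]]], region: int) -> Dict[str, List[str]]:
    """Reduce task `region` fetches its region from every Map task and groups by key."""
    groups: Dict[str, List[str]] = {}
    for regions in intermediate:
        for key, value in regions[region]:
            groups.setdefault(key, []).append(value)
    return groups


R = 2
map_outputs = [[("A", "X"), ("B", "Y")], [("B", "Z")]]  # two Map tasks
intermediate = [split_into_regions(pairs, R) for pairs in map_outputs]
reduce_inputs = [shuffle(intermediate, r) for r in range(R)]
print(reduce_inputs)
```

Because the region depends only on the key, all values of one key end up at the same Reduce task no matter which Map task produced them.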


Map-Reduce : Example

Inverted Index

Documents

DocID=1: IDS lab's page

DocID=2: IDB lab's page

Lexicon

Word    wordID
lab     101
's      201
page    203
IDS     301
IDB     302

Inverted Index

wordID  docID  Location
101     1      1
101     2      1
201     1      2
201     2      2
203     1      3
203     2      3
301     1      0
302     2      0


Map-Reduce : Example (cont’d)

Input data to Map

Key(docID)  Value(Text)
1           IDS lab's page
2           IDB lab's page

Output of Map

Key(wordID)  Value(docID:Location)
301          1:0
101          1:1
201          1:2
203          1:3

Key(wordID)  Value(docID:Location)
302          2:0
101          2:1
201          2:2
203          2:3



Shuffle

Shuffle collects the values that share a key and passes them to Reduce

Key(wordID)  Value(docID:Location)
101          1:1 2:1
201          1:2 2:2
203          1:3 2:3
301          1:0
302          2:0

Reduce writes the final result

101=1:1, 2:1

201=1:2, 2:2

203=1:3, 2:3

301=1:0

302=2:0
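
The whole inverted-index example can be reproduced in a few lines. This is a single-process sketch under stated assumptions: the documents are given pre-tokenized, in English, and the function names are hypothetical; only the word IDs and the output format come from the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Lexicon from the example: word -> wordID (English tokens are stand-ins).
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}


def map_fn(doc_id: int, tokens: List[str]) -> List[Tuple[int, str]]:
    """Emit <wordID, "docID:location"> for every token of one document."""
    return [(WORD_IDS[token], f"{doc_id}:{location}")
            for location, token in enumerate(tokens)]


def shuffle(pairs: List[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Collect the values that share a key, ordered by key."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for word_id, posting in pairs:
        groups[word_id].append(posting)
    return dict(sorted(groups.items()))


def reduce_fn(word_id: int, postings: List[str]) -> str:
    """Write one line of the final result."""
    return f"{word_id}={', '.join(postings)}"


documents = [(1, ["IDS", "lab", "'s", "page"]), (2, ["IDB", "lab", "'s", "page"])]
intermediate = [pair for doc_id, tokens in documents for pair in map_fn(doc_id, tokens)]
index_lines = [reduce_fn(word_id, postings)
               for word_id, postings in shuffle(intermediate).items()]
print("\n".join(index_lines))
```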



Other Examples

Distributed Grep

Count URL Access Frequency

– Map emits <URL, 1>

– Reduce emits <URL, total count>

Reverse Web-Link Graph

– Map emits <target, source>

– Reduce emits <target, list(source)>
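
As a small illustration of the reverse web-link graph, the sketch below emits <target, source> in the map step and <target, list(source)> in the reduce step. The page names and link lists are made-up sample data, and the grouping loop stands in for the framework's shuffle.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_fn(source: str, targets: List[str]) -> List[Tuple[str, str]]:
    """Emit <target, source> for every link found on the source page."""
    return [(target, source) for target in targets]


def reduce_fn(target: str, sources: List[str]) -> Tuple[str, List[str]]:
    """Emit <target, list(source)>."""
    return (target, sorted(sources))


# Hypothetical crawl result: page -> pages it links to.
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}

groups: Dict[str, List[str]] = defaultdict(list)
for source, targets in pages.items():
    for target, src in map_fn(source, targets):
        groups[target].append(src)

reverse_graph = dict(reduce_fn(target, sources)
                     for target, sources in sorted(groups.items()))
print(reverse_graph)
```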



Map-Reduce-Merge

Map-Reduce is an extremely simple model, but it applies only in a limited context

Map-Reduce mainly handles homogeneous datasets

Relational operators are hard to implement with Map-Reduce (especially join operations)

Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete



Map-Reduce-Merge

Adds a merge phase to the Map-Reduce algorithm

Allows processing of multiple heterogeneous datasets

Like Map and Reduce, the Merge phase is implemented by the developer

Example:

Two datasets: department and employee

Goal: compute each employee’s bonus based on individual rewards and the department’s bonus adjustment



Map-Reduce-Merge

Example

Match keys on dept_id in tables
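
A sketch of how the bonus example might run end to end: one reduce step sums the rewards per employee, another prepares the department adjustments, and the merge step matches the two outputs on dept_id. Only `dept_id` comes from the slides; the record layout, the sample values, and the rule "bonus = total rewards × adjustment" are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input records.
employee_rewards = [  # (dept_id, emp_id, reward)
    ("d1", "e1", 100.0), ("d1", "e1", 50.0), ("d1", "e2", 200.0), ("d2", "e3", 80.0),
]
department_adjustments = [("d2", 2.0), ("d1", 1.5)]  # (dept_id, bonus_adjustment)


def reduce_employee(records: List[Tuple[str, str, float]]) -> List[Tuple[Tuple[str, str], float]]:
    """Sum rewards per (dept_id, emp_id); output is sorted by dept_id."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for dept_id, emp_id, reward in records:
        totals[(dept_id, emp_id)] += reward
    return sorted(totals.items())


def reduce_department(records: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Sort the adjustments by dept_id (one adjustment per department assumed)."""
    return sorted(records)


def merge(employees: List[Tuple[Tuple[str, str], float]],
          departments: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Step through both sorted lists and match keys on dept_id."""
    bonuses = []
    j = 0
    for (dept_id, emp_id), total in employees:
        while j < len(departments) and departments[j][0] < dept_id:
            j += 1  # skip departments that sort before this employee's
        if j < len(departments) and departments[j][0] == dept_id:
            bonuses.append((emp_id, total * departments[j][1]))
    return bonuses


bonuses = merge(reduce_employee(employee_rewards),
                reduce_department(department_adjustments))
print(bonuses)
```

The two inputs have different schemas, which is exactly the heterogeneous case that a single Map-Reduce job handles poorly.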



Map-Reduce-Merge: Extending Map-Reduce

Changes: Reduce now keeps its key in the output, and a Merge phase is added

Phases

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → [v3]

becomes:

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → (k2, [v3])

3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])



Map-Reduce-Merge

Additional user-definable operations

Merger: user-defined logic for the merge operation, analogous to the map and reduce definitions

Processor: processes data from one individual source

Partition selector: decides which data should go to which merger

Configurable iterator: controls how to step through each of the lists as the merge proceeds
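
How the four user-definable pieces could fit together is sketched below. None of these signatures come from the slides: `select_partitions`, `processor`, `advance`, `merger`, and the driver `run_merge` are hypothetical names, and the sketch assumes that keys are unique within each source and that merger i reads partition i from both sides.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, str]
Joined = Tuple[int, Tuple[str, str]]


def select_partitions(merger_id: int, num_left: int, num_right: int) -> Tuple[List[int], List[int]]:
    """Partition selector: merger i reads partition i from each source (assumption)."""
    return [merger_id], [merger_id]


def processor(pairs: Sequence[Pair]) -> List[Pair]:
    """Processor: prepares data from one source; here it only sorts by key."""
    return sorted(pairs)


def advance(left: Pair, right: Pair) -> str:
    """Configurable iterator: decide which list to step through next."""
    if left[0] < right[0]:
        return "left"
    if left[0] > right[0]:
        return "right"
    return "both"


def merger(left: Pair, right: Pair) -> List[Joined]:
    """Merger: emit a joined record when the two keys match."""
    return [(left[0], (left[1], right[1]))] if left[0] == right[0] else []


def run_merge(merger_id: int, left_parts: List[List[Pair]], right_parts: List[List[Pair]]) -> List[Joined]:
    """Hypothetical driver that wires the four pieces together."""
    left_ids, right_ids = select_partitions(merger_id, len(left_parts), len(right_parts))
    left = processor([p for i in left_ids for p in left_parts[i]])
    right = processor([p for i in right_ids for p in right_parts[i]])
    out: List[Joined] = []
    i = j = 0
    while i < len(left) and j < len(right):
        out.extend(merger(left[i], right[j]))
        step = advance(left[i], right[j])
        if step in ("left", "both"):
            i += 1
        if step in ("right", "both"):
            j += 1
    return out


left_parts = [[(2, "b"), (1, "a")], [(7, "g")]]
right_parts = [[(2, "B"), (3, "C")], [(7, "G"), (8, "H")]]
results = [run_merge(m, left_parts, right_parts) for m in range(2)]
print(results)
```

Swapping in a different `advance` or `merger` changes the operator without touching the driver, which is the point of making these pieces user-definable.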





Map-Reduce-Merge : Relational Data Processing

Relational operators can be implemented using the Map-Reduce-Merge model. This includes:

Projection

Aggregation

Generalized selection

Joins

Set union

Set intersection

Set difference

Etc…



Map-Reduce-Merge : Example, Set Union

Each of the two Map-Reduce jobs emits a sorted list of unique elements

The Merge combines the two lists by iterating in the following way:

Store the smaller of the two current values and advance its iterator by one

If they are equal, store one of them and advance both iterators

When one list is exhausted, store the rest of the other
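
The iteration rule for set union can be written as a short two-pointer loop. This is a minimal sketch that assumes both inputs are already sorted lists of unique elements, as the reduce outputs are; the function name `merge_union` is made up for the example.

```python
from typing import List


def merge_union(a: List[int], b: List[int]) -> List[int]:
    """Union of two sorted lists of unique elements."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a holds the smaller value: store it, advance a
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])  # b holds the smaller value: store it, advance b
            j += 1
        else:
            out.append(a[i])  # equal: store one copy, advance both
            i += 1
            j += 1
    # One list is exhausted; whatever is left of the other belongs to the union.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


union = merge_union([1, 3, 5], [2, 3, 6, 7])
print(union)
```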



Map-Reduce-Merge : Example, Set Difference

Given two sets A and B, we want to compute A − B

Each of the two Map-Reduce jobs emits a sorted list of unique elements

The merge iterates simultaneously over the two lists:

If A’s value is less than B’s, store A’s value and advance A’s iterator

If B’s value is smaller, advance B’s iterator

If the two are equal, advance both iterators

When B is exhausted, store the rest of A
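
The same two-pointer pattern gives set difference. The sketch assumes both inputs are sorted lists of unique elements; `merge_difference` is a hypothetical name.

```python
from typing import List


def merge_difference(a: List[int], b: List[int]) -> List[int]:
    """Elements of sorted list a that do not occur in sorted list b."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a's value cannot appear later in b: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1            # b's value is irrelevant to the result: skip it
        else:
            i += 1            # present in both sets: drop it
            j += 1
    out.extend(a[i:])         # b is exhausted, so the rest of a survives
    return out


difference = merge_difference([1, 3, 5, 8], [3, 4, 8, 9])
print(difference)
```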



Map-Reduce-Merge : Example, Sort-Merge Join

Map: partitions records into mutually exclusive key ranges (buckets); each key range is assigned to a reducer

Reduce: merges the data of its key range into a sorted set, i.e., sorts the data

Merge: the merger joins the sorted data for each key range
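
The three phases of the sort-merge join can be sketched in one process as follows. The range boundaries, the sample records, and all function names are assumptions for illustration; a real deployment would run each bucket on a different reducer and merger.

```python
from bisect import bisect_right
from itertools import groupby
from operator import itemgetter
from typing import List, Tuple

Record = Tuple[int, str]


def map_partition(records: List[Record], boundaries: List[int]) -> List[List[Record]]:
    """Map: place each record in the bucket of its key range."""
    buckets: List[List[Record]] = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect_right(boundaries, key)].append((key, value))
    return buckets


def reduce_sort(bucket: List[Record]) -> List[Record]:
    """Reduce: sort one bucket by key (stable, so input order breaks ties)."""
    return sorted(bucket, key=itemgetter(0))


def merge_join(left: List[Record], right: List[Record]) -> List[Tuple[int, str, str]]:
    """Merge: join two sorted buckets of the same key range on equal keys."""
    lgroups = [(k, [v for _, v in g]) for k, g in groupby(left, key=itemgetter(0))]
    rgroups = [(k, [v for _, v in g]) for k, g in groupby(right, key=itemgetter(0))]
    out = []
    i = j = 0
    while i < len(lgroups) and j < len(rgroups):
        (lk, lvs), (rk, rvs) = lgroups[i], rgroups[j]
        if lk < rk:
            i += 1
        elif rk < lk:
            j += 1
        else:
            # Equal keys: emit every left/right combination for this key.
            out.extend((lk, lv, rv) for lv in lvs for rv in rvs)
            i += 1
            j += 1
    return out


def sort_merge_join(left: List[Record], right: List[Record], boundaries: List[int]):
    joined = []
    for lb, rb in zip(map_partition(left, boundaries), map_partition(right, boundaries)):
        joined.extend(merge_join(reduce_sort(lb), reduce_sort(rb)))
    return joined


left = [(25, "x"), (5, "a"), (12, "m"), (5, "b")]
right = [(5, "A"), (30, "Z"), (12, "M"), (25, "X")]
joined = sort_merge_join(left, right, boundaries=[10, 20])
print(joined)
```

Because both sources use the same boundaries, a merger only ever needs the two buckets of its own key range.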



Map-Reduce-Merge : Optimizations

Map-Reduce already optimizes using locality and backup tasks

Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)

Combine two phases into one (example: ReduceMerge)



Conclusions

Map-Reduce-Merge allows us to work on heterogeneous datasets

Map-Reduce-Merge supports joins, which Map-Reduce does not directly support

Next step: develop an SQL-like interface and an optimizer that simplify the development of a Map-Reduce-Merge workflow

