
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)

SIGMOD 2007 (Industrial)

Presented by Kisung Kim

2010. 7. 14

Contents
– Introduction
– Map-Reduce
– Map-Reduce-Merge
– Applications to Relational Data Processing
– Case Study
– Conclusion

Introduction
New challenges of data processing
– A vast amount of data collected from the entire WWW
Solutions of search engine companies
– Customized parallel data processing systems
– Use large clusters of shared-nothing commodity nodes
– Ex) Google's GFS, BigTable, MapReduce; Ask.com's Neptune; Microsoft's Dryad; Yahoo!'s Hadoop

Introduction
Properties of data-intensive systems
– Simple: adopt only a selected subset of database principles
– Sufficiently generic and effective
– Parallel data processing system deployed on large clusters of shared-nothing commodity nodes
– Refactoring of data processing into two primitives: the Map function and the Reduce function

Map-Reduce allows users not to worry about the nuisance details of:
– Coordinating parallel sub-tasks
– Maintaining distributed file storage
This abstraction can greatly increase user productivity

Introduction
The Map-Reduce framework is best at handling homogeneous datasets
– Ex) Joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework
Extending Map-Reduce to process heterogeneous datasets simultaneously
– Processing data relationships is ubiquitous
– A join-enabled Map-Reduce system can provide a highly parallel yet cost-effective alternative
– Include relational algebra in the subset of database principles
Relational operators can be modeled using various combinations of the three primitives: Map, Reduce, and Merge

Map-Reduce
Input dataset is stored in GFS
Mapper
– Read splits of the input dataset
– Apply the map function to the input records
– Produce intermediate key/value sets
– Partition the intermediate sets into as many sets as there are reducers
Reducer
– Read their part of the intermediate sets from the mappers
– Apply the reduce function to the values of a same key
– Output final results

Signatures of the Map and Reduce functions:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [v3]

[Figure: dataflow from input splits through mappers (intermediate sets) to reducers (final results)]
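The map/reduce signatures above can be exercised with a toy in-memory simulation (word count; all names here are illustrative sketches, not part of the paper):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map: (k1, v1) -> [(k2, v2)] — emit (word, 1) for every word
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce: (k2, [v2]) -> [v3] — sum the partial counts for one key
    return [sum(counts)]

def run_map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k1, v1 in inputs:                 # mapper phase
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)   # shuffle: group values by k2
    return {k2: reduce_fn(k2, v2s) for k2, v2s in intermediate.items()}

result = run_map_reduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)
# result == {"a": [2], "b": [2], "c": [1]}
```

This collapses the distributed split/partition machinery into one process, but the two function signatures match the slide.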

Join using Map-Reduce: use a homogenization procedure
– Apply one map/reduce task on each dataset
– Insert a data-source tag into every value
– Extract a key attribute common to all heterogeneous datasets
– Transformed datasets now have two common attributes: key and data-source
Problems
– Takes lots of extra disk space and incurs excessive map-reduce communications
– Limited only to queries that can be rendered as equi-joins

Join using Map-Reduce: Homogenization

Dataset 1 (after tagging):
Key | Value
10  | 1, "Value1"
85  | 1, "Value2"
320 | 1, "Value3"

Dataset 2 (after tagging):
Key | Value
10  | 2, "Value4"
54  | 2, "Value5"
320 | 2, "Value6"

[Figure: each dataset flows through its own map/reduce pass; a final map/reduce pass collects records with the same key]
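The homogenization tables above can be replayed as a small in-memory sketch (function names are illustrative):

```python
from collections import defaultdict

# Dataset contents taken from the homogenization tables above
dataset1 = {10: "Value1", 85: "Value2", 320: "Value3"}
dataset2 = {10: "Value4", 54: "Value5", 320: "Value6"}

def homogenize(dataset, source_tag):
    # map phase: insert a data-source tag into every value
    return [(key, (source_tag, value)) for key, value in dataset.items()]

grouped = defaultdict(list)
for key, tagged in homogenize(dataset1, 1) + homogenize(dataset2, 2):
    grouped[key].append(tagged)              # shuffle: collect by key

def join_reduce(tagged_values):
    # reduce phase: pair every source-1 value with every source-2 value
    left = [v for tag, v in tagged_values if tag == 1]
    right = [v for tag, v in tagged_values if tag == 2]
    return [(lv, rv) for lv in left for rv in right]

joined = {}
for key, tagged_values in grouped.items():
    pairs = join_reduce(tagged_values)
    if pairs:                                # keep keys present in both sources
        joined[key] = pairs
# joined == {10: [("Value1", "Value4")], 320: [("Value3", "Value6")]}
```

Note how keys 85 and 54 are shuffled and reduced even though they can never join; this is the extra communication cost the "Problems" bullet refers to.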

Map-Reduce-Merge
Signatures
– α, β, γ represent dataset lineages
– The reduce function produces a key/value list instead of just values
– The merge function reads data from both lineages
These three primitives can be used to implement parallel versions of several join algorithms

For comparison, Map-Reduce:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [v3]
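The bullets above can be made concrete with a toy merge function. The three-primitive signatures in the comment are reconstructed from the slide's description (reduce now emits a key/value list; merge reads both lineages), and the equi-join logic is an illustrative assumption, not the paper's only use of merge:

```python
# Reconstructed three-primitive signatures (α, β, γ are lineages):
#   map:    (k1, v1)α            -> [(k2, v2)]α
#   reduce: (k2, [v2])α          -> (k2, [v3])α
#   merge:  ((k2, [v3])α, (k3, [v4])β) -> [(k4, v5)]γ

def merge_fn(key_a, vals_a, key_b, vals_b):
    # A hypothetical merge implementing an equi-join: combine the two
    # lineages' value lists only when their keys match.
    if key_a == key_b:
        return [(key_a, (va, vb)) for va in vals_a for vb in vals_b]
    return []

print(merge_fn(10, ["Value1"], 10, ["Value4"]))
# prints [(10, ('Value1', 'Value4'))]
```

Different merge bodies (match, difference, cross product) yield different relational operators from the same framework.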

Merge Modules
Merge function
– Processes two pairs of key/values
Processor function
– Processes data from one source only
– Users can define two processor functions
Partition selector
– Determines from which reducers this merger retrieves its input data, based on the merger number
Configurable iterator
– A merger has two logical iterators
– Controls their relative movement against each other

Merge Modules

[Figure: merger internals — a partition selector chooses among the reducer outputs of the 1st and 2nd datasets; two processors with coordinated iterators feed the merge function]

Applications to Relational Data Processing
Map-Reduce-Merge can be used to implement primitive and some derived relational operators:
– Projection
– Aggregation
– Generalized selection
– Joins
– Set union
– Set intersection
– Set difference
– Cartesian product
– Rename
Map-Reduce-Merge is relationally complete, while being load-balanced, scalable, and parallel
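As one illustration of a derived operator from the list, a minimal set-difference sketch (R − S): assuming the map/reduce passes have already sorted and deduplicated each dataset's keys, a merger only needs to emit keys found in the first lineage but not the second. This is a simplified sketch, not the paper's exact construction:

```python
def set_difference(r_keys, s_keys):
    """r_keys/s_keys: sorted, deduplicated keys from the two lineages'
    reducers; returns keys in R but not in S."""
    s = set(s_keys)
    return [k for k in r_keys if k not in s]

print(set_difference([10, 54, 85], [54, 320]))
# prints [10, 85]
```

Union and intersection follow the same pattern with different emit conditions in the merge step.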

Example: Hash Join
– Mappers use a hash partitioner
– Each reducer reads from every mapper for one designated partition
– Each merger reads from two sets of reducer outputs that share the same hashing buckets
– One set is used as the build set and the other as the probe set

[Figure: two datasets each flow through splits → mappers → reducers; mergers combine the matching reducer outputs of the two pipelines]
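The build/probe step a single merger performs on its two matching hash buckets can be sketched as follows (dataset values reuse the earlier homogenization example; names are illustrative):

```python
def hash_join(build, probe):
    """build/probe: lists of (key, value) pairs from the same hash bucket."""
    table = {}
    for key, value in build:            # build phase: hash the build set
        table.setdefault(key, []).append(value)
    out = []
    for key, value in probe:            # probe phase: look up each probe key
        for build_value in table.get(key, []):
            out.append((key, (build_value, value)))
    return out

rows = hash_join([(10, "Value1"), (320, "Value3")],
                 [(10, "Value4"), (54, "Value5"), (320, "Value6")])
# rows == [(10, ("Value1", "Value4")), (320, ("Value3", "Value6"))]
```

Choosing the smaller reducer output as the build set keeps the in-memory hash table small, which is the usual reason for the build/probe asymmetry.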

Case Study: TPC-H Query 2
Involves 5 tables, 1 nested query, 1 aggregate with a group-by clause, and 1 order-by clause

Case Study: TPC-H Query 2
Map-Reduce-Merge workflow
– 13 passes of Map-Reduce-Merge: 10 mappers, 10 reducers, and 4 mergers
– After combining phases: 6 passes of Map-Reduce-Merge: 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper, and 1 reducer

Conclusion
Map-Reduce-Merge programming model
– Retains many of Map-Reduce's great features
– Adds relational algebra to the list of database principles it upholds
– Contains several configurable components that enable many data-processing patterns
Next step
– Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow
– This work can readily reuse well-studied RDBMS techniques