map-reduce-merge: simplified relational data processing on … · 2013-10-11 · map-reduce-merge:...
![Page 1: Map-Reduce-Merge: Simplified Relational Data Processing on … · 2013-10-11 · Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung%chih)Yang,)Ali)Dasdan)](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8b93a5815fa03e223409aa/html5/thumbnails/1.jpg)
Map-Reduce-Merge: Simplified Relational Data
Processing on Large Clusters
Hung-chih Yang, Ali Dasdan (Yahoo!)
Ruey-Lung Hsiao, D. Stott Parker (UCLA)
Outline
1. Introduction
2. Map-Reduce
3. Map-Reduce-Merge: extending Map-Reduce
  • Implementation
4. Applications to Relational Data Processing
5. Optimizations & Enhancements
6. Case Studies
7. Conclusion
Introduction
• New challenges of data processing
  • Vast amounts of data collected from the World Wide Web
  • Cannot rely on generic DBMS to reduce costs and improve efficiency
• Solutions of search engine companies
  • Use customized parallel data processing systems
  • Use large clusters of shared-nothing commodity nodes
  • Examples:
    • Google’s GFS and Map-Reduce
    • Microsoft’s Dryad
    • Yahoo!’s Hadoop (open-source)
Introduction
• Properties of data-intensive systems
  • Simple
    • Adopt only a selected subset of database principles
  • Generic and cost-effective
    • Deployed on large clusters of shared-nothing commodity nodes
  • Refactoring of data processing into two primitives:
    • Map function
    • Reduce function
• Map-Reduce lets users ignore the nuisance details
  • Coordinating parallel sub-tasks
  • Maintaining distributed file storage
Motivation
• The Map-Reduce framework is best at handling homogeneous datasets
• Joining multiple heterogeneous datasets is not efficient in Map-Reduce
• Extending Map-Reduce to process heterogeneous datasets simultaneously
• Join-enabled Map-Reduce systems can provide a parallel and cost-effective alternative
• Can include relational algebra in the subset of database principles
Map-Reduce
• Input dataset stored in GFS
• Mapper
  • Reads splits of the input dataset
  • Applies the map function to input records
  • Produces intermediate key/value sets
  • Partitions the intermediate sets into as many sets as there are reducers
• Reducer
  • Reads its part of the intermediate sets from the mappers
  • Applies the reduce function to the values of the same key
  • Outputs final results

Signature of the Map-Reduce functions:
  Map: (k1, v1) → [(k2, v2)]
  Reduce: (k2, [v2]) → [v3]
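
The mapper/reducer steps and the two signatures above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: the word-count task, the function names, and the in-memory "partitions" are hypothetical stand-ins, not part of the original slides or of any real framework API.

```python
from collections import defaultdict

def map_fn(k1, v1):
    # Map: (k1, v1) -> [(k2, v2)]; k1 is a document id, v1 its text (assumed inputs).
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce: (k2, [v2]) -> [v3]; here v3 is the total count for one word.
    return [sum(values)]

def run_map_reduce(records, map_fn, reduce_fn, num_reducers=2):
    # One intermediate set per reducer, as in the "partition" step of the mapper.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            partitions[hash(k2) % num_reducers][k2].append(v2)
    # Each reducer applies reduce_fn to all values that share a key.
    output = {}
    for part in partitions:
        for k2 in sorted(part):
            output[k2] = reduce_fn(k2, part[k2])
    return output

counts = run_map_reduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)
```

All values for one key land in the same partition, which is what lets each reducer work independently.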
Join Using Map-Reduce
• Use a homogenization procedure
  • Apply one map/reduce task to each dataset
  • Insert a data-source tag into every value
  • Extract a key attribute common to all heterogeneous datasets
  • Transformed datasets now have two common attributes
    • Key and data-source
• Problems
  • Takes a lot of extra disk space and incurs excessive map-reduce communication
  • Limited to queries that can be rendered as equi-joins
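
A minimal sketch of the homogenization idea, assuming two small in-memory datasets. The dataset names ("emp", "dept"), the field names, and the helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def tag_map(source, key_field):
    # Homogenize one dataset: emit (common key, (data-source tag, record)).
    def map_fn(record):
        return (record[key_field], (source, record))
    return map_fn

def join_reduce(key, tagged_values):
    # All records with the same key arrive together; split them by source tag
    # and pair them up. This only works for equality on the key (an equi-join).
    left = [rec for src, rec in tagged_values if src == "emp"]
    right = [rec for src, rec in tagged_values if src == "dept"]
    return [{**l, **r} for l in left for r in right]

def homogenized_join(employees, departments):
    groups = defaultdict(list)
    for source, dataset in (("emp", employees), ("dept", departments)):
        map_fn = tag_map(source, "dept_id")
        for record in dataset:
            key, tagged = map_fn(record)
            groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reduce(key, groups[key]))
    return joined

employees = [{"name": "ann", "dept_id": 1}, {"name": "bob", "dept_id": 2}]
departments = [{"dept_id": 1, "dept": "db"}]
result = homogenized_join(employees, departments)
```

Every record is copied with an extra tag and shuffled by key, which hints at where the extra disk space and communication listed under "Problems" come from.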
![Page 8: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, Hung-chih Yang, Ali Dasdan](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8b93a5815fa03e223409aa/html5/thumbnails/8.jpg)
Join using Map-Reduce: Homogenization
Why go to Map-Reduce-Merge?
• Map-Reduce cannot support relational algebra efficiently without sacrificing its existing generality and simplicity.
• Need to process heterogeneous datasets simultaneously
  • The existing join technique takes a lot of extra disk space, incurs excessive map-reduce communication, and is limited to equi-join queries.
• By adding a merge phase to this process, a variety of hierarchical workflows for data processing can be achieved
• Programming logic can be embedded into each phase.
![Page 10: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, Hung-chih Yang, Ali Dasdan](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8b93a5815fa03e223409aa/html5/thumbnails/10.jpg)
Map-Reduce-Merge
![Page 11: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, Hung-chih Yang, Ali Dasdan](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8b93a5815fa03e223409aa/html5/thumbnails/11.jpg)
Example
Implementation of Merge Modules
• Partition selector
  • Determines from which reducers this merger retrieves its input data, based on the merger number
• Processor function
  • Processes data from one source only
  • Users can define two processor functions
• Merger function
  • Processes two pairs of key/values
• Configurable iterator
  • A merger has two logical iterators
  • Controls their movement relative to each other
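
The four modules above might fit together roughly as follows. This is a hedged sketch, not the authors' implementation: every function name and the "same reducer index on both sides" selection policy are assumptions, and the iterator logic assumes each input is sorted by key with at most one entry per key (as a reducer output would be).

```python
def partition_selector(merger_id, num_reducers):
    # Assumed policy: merger i reads reducer output i from both lineages.
    index = merger_id % num_reducers
    return [index], [index]

def left_processor(key, value):
    # Processes data from the first source only (identity here).
    return key, value

def right_processor(key, value):
    # Processes data from the second source only (identity here).
    return key, value

def merger(left_kv, right_kv):
    # Processes two key/value pairs, one from each source.
    (lk, lv), (rk, rv) = left_kv, right_kv
    return [(lk, (lv, rv))] if lk == rk else []

def move_iterators(left_kv, right_kv):
    # Configurable iterator logic: decide which logical iterator advances.
    lk, rk = left_kv[0], right_kv[0]
    if lk < rk:
        return True, False
    if lk > rk:
        return False, True
    return True, True

def run_merge(left, right):
    left = [left_processor(k, v) for k, v in left]
    right = [right_processor(k, v) for k, v in right]
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        out.extend(merger(left[i], right[j]))
        advance_left, advance_right = move_iterators(left[i], right[j])
        i += 1 if advance_left else 0
        j += 1 if advance_right else 0
    return out

merged = run_merge([(1, "a"), (2, "b"), (4, "d")], [(2, "x"), (3, "y"), (4, "z")])
```

Swapping in a different `move_iterators` (for example, one that rewinds the right iterator) is the kind of change that would turn this sort-merge style pass into a nested-loop style one.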
Applications to Relational Data Processing
Map-Reduce-Merge can be used to implement primitive and derived relational operators:
1. Projection
2. Aggregation
3. Selection
4. Set Operations: Union, Intersection, Difference
5. Cartesian Product
6. Rename
7. Join
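
As a small illustration of how some of these operators map onto the phases, the sketch below does selection and projection in a map function and aggregation in a reduce function. The table, its field names, and the `amount > 10` predicate are hypothetical; where each operator is best placed in a real workflow is a design decision this sketch does not settle.

```python
from collections import defaultdict

def select_project_map(row):
    # Selection: keep rows with amount > 10 (assumed predicate).
    # Projection: emit only the region and amount attributes.
    if row["amount"] > 10:
        return [(row["region"], row["amount"])]
    return []

def sum_reduce(region, amounts):
    # Aggregation: SUM(amount) GROUP BY region.
    return (region, sum(amounts))

def run_query(rows):
    groups = defaultdict(list)
    for row in rows:
        for key, value in select_project_map(row):
            groups[key].append(value)
    return [sum_reduce(key, groups[key]) for key in sorted(groups)]

rows = [
    {"id": 1, "region": "east", "amount": 20},
    {"id": 2, "region": "east", "amount": 5},
    {"id": 3, "region": "west", "amount": 30},
    {"id": 4, "region": "east", "amount": 15},
]
totals = run_query(rows)
```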
Map-Reduce-Merge Implementations of Relational Join Algorithms
• Sort-Merge Join
• Hash Join
• Block Nested-Loop Join

E.g.: Hash Join
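
A hash join in this style could be sketched as below, assuming both inputs are partitioned with the same hash function so that matching keys end up in buckets with the same index, and each merger then builds an in-memory table from one side and probes it with the other. Function names and the bucket count are illustrative assumptions.

```python
from collections import defaultdict

def hash_partition(records, key_fn, num_buckets):
    # Map/reduce side: the same hash function is applied to both datasets.
    buckets = [[] for _ in range(num_buckets)]
    for record in records:
        buckets[hash(key_fn(record)) % num_buckets].append(record)
    return buckets

def merge_bucket(build_side, probe_side, build_key, probe_key):
    # Merge side: build a hash table on one input, probe it with the other.
    table = defaultdict(list)
    for record in build_side:
        table[build_key(record)].append(record)
    return [(b, p) for p in probe_side for b in table.get(probe_key(p), [])]

def hash_join(left, right, left_key, right_key, num_buckets=4):
    left_buckets = hash_partition(left, left_key, num_buckets)
    right_buckets = hash_partition(right, right_key, num_buckets)
    joined = []
    # Bucket i of the left input can only match bucket i of the right input.
    for build_side, probe_side in zip(left_buckets, right_buckets):
        joined.extend(merge_bucket(build_side, probe_side, left_key, right_key))
    return joined

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z"), (5, "w")]
pairs = sorted(hash_join(left, right, lambda r: r[0], lambda r: r[0]))
```

Because only same-index buckets are compared, each merger touches a fraction of the data, at the cost of requiring an equality predicate on the join key.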
Optimizations and Enhancements
• Optimal Reduce-Merge connections
• Combining phases
  • Reduce-Map, Merge-Map
  • Reduce-Merge
  • Reduce-Merge-Map

Enhancements:
• Map-Reduce-Merge Library
• Map-Reduce-Merge Workflow

E.g.: Map-Reduce-Merge Workflow
Case Studies
• Join Web-Graph
• Map-Reduce-Merge Workflow for TPC-H Query 2
  • Involves 5 tables, 1 nested query, 1 aggregate and group-by clause, and 1 order-by operation
![Page 17: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, Hung-chih Yang, Ali Dasdan](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8b93a5815fa03e223409aa/html5/thumbnails/17.jpg)
Case Study: Map-Reduce-Merge Workflow for TPC-H Query 2
Conclusion
• MapReduce & GFS represent a paradigm shift in data processing: use a simplified interface instead of an overly general DBMS
• Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries
• Next steps:
  • Develop a SQL-like interface
  • Develop a query optimizer