on large clusters simplified relational data...
TRANSCRIPT
![Page 1: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/1.jpg)
Map-Reduce-MergeSimplified Relational Data Processing
on Large Clusters
![Page 2: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/2.jpg)
Contents1. Introduction2. Map-Reduce3. Map-Reduce-Merge4. Application to relational data processing5. Optimization6. Enhancements7. Case studies8. Conlusions
![Page 3: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/3.jpg)
IntroductionChallenge:
process and manage a vast amount of data collected from the entire World Wide Web.
Current Solutions:Customized parallel data processing systems Use large clusters of shared-nothing commodity nodes Google’s GFS, BigTable, MapReduce
Ask.com’s Neptune Microsoft’s Dryad Yahoo!’s Hadoop
![Page 4: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/4.jpg)
IntroductionHadoop: open source
refactor of data processing into two primitives:map + reduce
don't need to worry about the nuisance details of coordinating parallel sub-tasks and managing distributed file storage => increase productivity
![Page 5: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/5.jpg)
IntroductionMR is best at handling homogeneous datasets
Ex. joins --> calls for extra MR steps
Map-Reduce-Mergesimplified designrelational complete
![Page 6: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/6.jpg)
Map-Reduce
![Page 7: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/7.jpg)
Features and Principles
Low-cost unreliable commodity hardwareextremely scalable RAIN clusterfault-tolerant yet easy to administersimplified and restricted yet powerfulhighly parallel yet abstractedhigh throughputhigh performance by the largefunctional programming primitives......
![Page 8: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/8.jpg)
Map-Reduce
Homogenization: for equi-join
Transform each dataset into (join key, data-source tag + payload)Then apply map-reduce to merge entries from different datasets
Problem: only equi-joins may take lots of extra disk space, incur excessive communications
![Page 9: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/9.jpg)
Mape-Reduce-Merge
α, β, γ represent dataset lineagesReduce function produces a key/value list instead of just valuesMerge function reads data from both lineages
![Page 10: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/10.jpg)
Mape-Reduce-Merge
![Page 11: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/11.jpg)
Example
![Page 12: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/12.jpg)
![Page 13: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/13.jpg)
Merge Phase
![Page 14: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/14.jpg)
![Page 15: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/15.jpg)
Merge Phase
Partition Selector: Determine from which reducers this merger retrieves its input data based on the merger numberProcessors: 1.Process data from one source only 2.Users can define two processor functions Merger: Process two pairs of key/valuesConfigurable Iterators: 1. A merger has two logical iterators 2.Control their relative movement against each others
![Page 16: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/16.jpg)
Merger
![Page 17: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/17.jpg)
Configurable Iterators
example 1.
![Page 18: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/18.jpg)
Configurable Iterators
example 2.
![Page 19: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/19.jpg)
Configurable Iterators
example 3.
![Page 20: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/20.jpg)
Application to relational data processingrelational completeprojection: mapaggregation: map + reducegeneralized selection: map --> where, reduce-->having, merger--> filtering condition involving more than one relationsjoins: to be discussed... set union/set intersection/set difference: easily handle it in mergercartersian product: nested looprename: trivial
![Page 21: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/21.jpg)
Sort-Merge Join
Map: use range partitioner => records are partitioned into ordered buckets, each mutually exclusive
Reduce: sort data
Merge: reads from two sets of reducer outputs that cover the same key range
![Page 22: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/22.jpg)
Hash Join
Map: use a common partitioner => records are partitioned into hashed buckets
Reduce: reads from every mapper for one designated partition, use the same hash function, records from these partitions can be grouped and aggregated using a hash table
Merge: reads from two sets of reducer outputs that share the same hashing buckets build/probe
![Page 23: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/23.jpg)
![Page 24: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/24.jpg)
Block Nested-Loop Join
Map: same as the one for the hash join
Reduce: same as the one for the hash join
Merge: almost the same as hash join, except for a nested-loop join is used instead
![Page 25: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/25.jpg)
Optimizations
Optimal Reduce-Merge Connections Results of Reduce: partitioned and sorted
The selector of Merge can choose pertinent part of data
![Page 26: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/26.jpg)
Optimizations
Combining PhasesReduceMap, MergeMap
Directly send output to new mappers
Reduce MergeCombine merger to reducer
ReduceMergeMapCombination of above two
![Page 27: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/27.jpg)
Enhancements
Map-Reduce-Merge LibraryA library that contains commonly used
merger configurations like all kinds of joins
![Page 28: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/28.jpg)
Enhancements
Map-Reduce-Merge WorkflowThe regular Map-Reduce workflow is very
strict Adding a new phase creates many workflow
combinations
![Page 29: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/29.jpg)
Enhancements
Map-Reduce-Merge Workflow
![Page 30: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/30.jpg)
Case Studies
Join Webgraphs| URL | inlinks | outlinks |Each column in a separate file
Goal: compute the intersection of inlinks and outlinks for each URL
![Page 31: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/31.jpg)
Case Studies
Join WebgraphsReading all three columns into one Map-
Reduce can overflow buffer
Safer approach: 1) each URL as a row-id2) replicate row-id to each inlink and outlink3) produce <row-id, inoutlink> 4) then natural join <row-id, URL> with <row-
id, inoutlink>
![Page 32: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/32.jpg)
Case Studies
Map-Reduce-Merge Workflow for TPC-H Q2
![Page 33: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/33.jpg)
Case Studies
Map-Reduce-Merge Workflow for TPC-H Q2SQL: 5-way joins with aggregate and group by cluses
M-P-M: four 2-way joins then order by and sortings
![Page 34: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/34.jpg)
Case Studies
![Page 35: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing](https://reader034.vdocuments.site/reader034/viewer/2022042307/5ed373fa87321e22e0482823/html5/thumbnails/35.jpg)
Conclusions
Map-Reduce-Merge supports joins of heterogeneous datasets
Thus, it can be used to implement many relational operators, particularly joins