presentation by dr. greg speegle csi 5335, november 18, 2011
TRANSCRIPT
![Page 1: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/1.jpg)
PROCESSING THETA-JOINS USING MAPREDUCE
AUTHORS: OKCAN, RIEDEWALDSIGMOD 2011
Presentation by Dr. Greg SpeegleCSI 5335, November 18, 2011
![Page 2: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/2.jpg)
MapReduce
Automatic parallelization technique Map function
Reads input file in parallel Outputs <key,value> pairs
Reduce function Input: All pairs with same key Output: Results
Information Week: Hadoop skills in demand
![Page 3: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/3.jpg)
Joins
Theta-join Join on non-equality predicate Example: Select qid, hid From Heroes h,
Quests q where q.level <= h.level Nested Block Loop
For every block of r read all of s Always applicable “Computes” cross-product
Hash Join Only examines tuples to join Cannot always be used (e.g., theta join)
![Page 4: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/4.jpg)
1-Bucket Theta
MapReduce Algorithm “Computes” cross-product Goals:
Tuples matched at exactly one reducer Minimal input to a reducer Minimal output from each reducer
“1-Bucket” refers to no statistics about data distribution
![Page 5: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/5.jpg)
Algorithm : Precomputation
Precompute regions of cross-product SxT Use size of S (|S|) and T (|T|) Regions are disjoint Union of regions covers cross-product Each region assigned to single reducer
![Page 6: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/6.jpg)
Example
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
|S|=8; |T|=8; #reducers =4Rows are tuples in s; columns are tuples in tValue is region for the <s,t> pair
![Page 7: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/7.jpg)
Algorithm: Mapper
Each row in S Randomly assign value (x) from 1 to size(S) Output <region, row + ‘S’> for each region
containing x Example: Assume x=3. Output <1,row+’S’>
and <2,row+’S’> Each row in T
Same, except output <region, row+’T’> ExampleL Assume x=3. Output <1, row+’T’>
and <3,row+’T’>
![Page 8: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/8.jpg)
Algorithm: Reducer
Joins all S rows with all T rows Can use any join algorithm appropriate
for join value Output cross-product, theta join or equi-
join
![Page 9: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/9.jpg)
Algorithm: Correctness
Random assignment of tuples Since actual row number unknown, any row
number works Some reducer will compare tuple to any tuple
in other table Therefore, every pair compared (as in
nested block loop join) in only one reducer
![Page 10: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/10.jpg)
Optimal Partitioning
Basis for minimal input and minimal output
Let |S| be size of table S; r number of reducers
Optimal output |S||T|/r Optimal input sqrt(|S||T|/r) from each
table Special case:
|S| = s*sqrt(|S||T|/r); |T| = t* s*sqrt(|S||T|/r) Optimal: s*t squares with side length sqrt(|S||
T|/r)
![Page 11: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/11.jpg)
Example
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
|S|=8; |T|=8; r=4; sqrt(|S||T|/r) =4; s=t=2
![Page 12: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/12.jpg)
Near Optimal Partitioning
Optimal case is rare General case
t=floor(|T|/ sqrt(|S||T|/r)) Side length: floor((1+1/min(s,t)) *
sqrt(|S||T|/r)) Note floor function omitted from paper
Example: |S|=|T|=8; r=9 s=t=floor(8/sqrt(64/9))=3 Side length = floor((1+1/3)*sqrt(64/9))=3
![Page 13: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/13.jpg)
Example: Near-Optimal Partitioning
1 1 1 2 2 2 5 5
1 1 1 2 2 2 5 5
1 1 1 2 2 2 5 5
3 3 3 4 4 4 6 6
3 3 3 4 4 4 6 6
3 3 3 4 4 4 6 6
7 7 7 8 8 8 9 9
7 7 7 8 8 8 9 9
Assumed partitioningNote: 64/9=7.111 . . .Eight partitions with 7 and one with 8 is better
![Page 14: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/14.jpg)
Alternative for Equi-join
Map Each row in S output <join values, S> Each row in T output <join values, T>
Reducer Join all matching rows (same as 1-Bucket)
Cannot be used for arbitrary theta joins Subject to skew Great for foreign key join w/uniform
distribution
![Page 15: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/15.jpg)
Experiments
Cloud data set Information about cloud cover 382 million records 28.8 GB Cloud-5-i is 5 million record subset
SELECT S.date, S.longitude, S.latitude FROM Cloud S, Cloud T WHERE s.date = t.date and S.longitude = T. longitude and ABS(S.latitude-T.latitude) <= 10
SELECT S.latitude, T.latitude FROM Cloud-5-1 S, Cloud-5-2 T WHERE ABS(S.latitude-T.latitude) < 2
![Page 16: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/16.jpg)
Experimental Results
Figure 6: Input imbalance for Figure 7: Max-reducer-input for 1- Figure 8: MapReduce time for 1- 1-Bucket-Theta (#buckets=1) and Bucket-Theta and M-Bucket-I on Bucket-Theta and M-Bucket-I on M-Bucket-I on Cloud Cloud Cloud
Figure 9: Output imbalance for Figure 10: Max-reducer-output for Figure 11: MapReduce time for 1- 1-Bucket-Theta (#buckets=1) and 1-Bucket-Theta and M-Bucket-O Bucket-Theta and M-Bucket-O on M-Bucket-O on Cloud-5 on Cloud-5 Cloud-5
![Page 17: Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649ca75503460f9496996a/html5/thumbnails/17.jpg)
Conclusion
MapReduce algorithm for arbitrary joins Always applicable Effective for large-scale data analysis Additional statistics provide better
performance