Fine-grained Partitioning for
Aggressive Data Skipping
Calvin, 2015-06-03
SIGMOD 2014
UC Berkeley
Contents
• Background
• Contribution
• Overview
• Algorithm
• Data skipping
• Experiment
Background
• How can we get insights from enormous datasets interactively?
• How can we shorten query response time on huge datasets?
• Block / Partition
• Oracle / HBase / Hive / LogBase
  Prune data blocks (partitions) according to metadata
• Drawbacks
  1. Coarse-grained blocks (partitions)
  2. Not balanced
  3. The remaining blocks (partitions) still contain many tuples
  4. Blocks do not match the workload skew
  5. Data and query filter correlation
Goals
• Workload-driven blocking technique
  Fine-grained
  Balanced block sizes
  Offline
  Re-executable
  Coexists with the original partitioning techniques
Example
Extract features (how to choose?)
Vectorization
1. Split blocks (how to split?)
2. Storage

| Condition | Skip |
|---|---|
| F3 | P1, P3 |
| F1 ^ F2 | P2, P3 |
Contribution
• Feature selection
  Identify representative filters
  Modeled as frequent itemset mining
• Optimal partitioning
  Balanced-Max-Skip partitioning problem: NP-hard
  A bottom-up framework for an approximate solution
Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count by vector to reduce the partitioner input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Update the union block feature vectors in the catalog
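Steps (2) to (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three `features`, the dict-based tuples, and `toy_blocking_map` (which simply gives every distinct vector its own block) are hypothetical stand-ins, not the partitioner used in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical features: each one is a predicate over a tuple (a dict here).
features = [
    lambda t: t["product"] == "shoes",
    lambda t: t["product"] in ("shoes", "shirts"),
    lambda t: t["revenue"] > 21,
]

def vectorize(t):
    """Step 2: bit j is 1 when tuple t satisfies feature j."""
    return tuple(int(f(t)) for f in features)

def count_vectors(table):
    """Step 3: collapse the table to (vector, count) pairs."""
    return Counter(vectorize(t) for t in table)

def toy_blocking_map(counts):
    """Step 4 stand-in: one block per distinct vector.
    A real partitioner would merge vectors into balanced blocks."""
    return {v: block_id for block_id, v in enumerate(sorted(counts))}

def route(table, blocking_map):
    """Step 5: send every tuple to the block its vector maps to."""
    blocks = defaultdict(list)
    for t in table:
        blocks[blocking_map[vectorize(t)]].append(t)
    return dict(blocks)

def union_vectors(blocks):
    """Step 6: per-block bitwise OR of the member vectors, for the catalog."""
    return {
        b: tuple(max(bits) for bits in zip(*(vectorize(t) for t in rows)))
        for b, rows in blocks.items()
    }

table = [
    {"product": "shoes", "revenue": 10},
    {"product": "shirts", "revenue": 30},
    {"product": "hats", "revenue": 50},
    {"product": "shoes", "revenue": 12},
]
counts = count_vectors(table)
blocking_map = toy_blocking_map(counts)
blocks = route(table, blocking_map)
catalog = union_vectors(blocks)
```

Counting by vector in step (3) matters because the partitioner then works on distinct vectors (three here) instead of every tuple in the table.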
Workload Assumptions
• Filters in the workload's queries have commonality and stability
  Scheduled or reporting queries
  Template queries with different value ranges
Workload Modeling
• Q = {Q1, Q2, …, Qm}
• Examples:
  Q1: product = 'shoes'
  Q2: product in ('shoes', 'shirts'), revenue > 32
  Q3: product = 'shirts', revenue > 21
• F: all predicates in Q
• Fi: Qi's predicates
• fij: each item in Fi
  product in ('shoes', 'shirts') vs product = 'shoes'
  product in ('shoes', 'shirts') vs revenue > 21
Filter augmentation
• Examples (original queries):
  Q1: product = 'shoes'
  Q2: product in ('shoes', 'shirts'), revenue > 32
  Q3: product = 'shirts', revenue > 21
• Examples (after augmentation):
  Q1: product = 'shoes', product in ('shoes', 'shirts')
  Q2: product in ('shoes', 'shirts'), revenue > 32, revenue > 21
  Q3: product = 'shirts', revenue > 21, product in ('shoes', 'shirts')
Frequent itemset mining with threshold T (= 2), numFeat
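The augmentation in the example can be reproduced with a small sketch: each query gains every workload filter that subsumes one of its own filters, and filter sets are then counted against the threshold T = 2. The tuple encoding of predicates, `subsumes`, `augment`, and the brute-force `frequent_itemsets` are assumptions made for illustration; a real system would use a proper frequent itemset mining algorithm and a more general subsumption test.

```python
from collections import Counter
from itertools import combinations

# Hypothetical encoding: a filter is (column, operator, value).
IN_BOTH = ("product", "in", frozenset({"shoes", "shirts"}))
EQ_SHOES = ("product", "=", "shoes")
EQ_SHIRTS = ("product", "=", "shirts")
GT_32 = ("revenue", ">", 32)
GT_21 = ("revenue", ">", 21)

workload = [
    {EQ_SHOES},           # Q1
    {IN_BOTH, GT_32},     # Q2
    {EQ_SHIRTS, GT_21},   # Q3
]

def subsumes(f, g):
    """True when every tuple satisfying g also satisfies f.
    Handles only the predicate shapes that appear in the example."""
    (col_f, op_f, val_f), (col_g, op_g, val_g) = f, g
    if col_f != col_g:
        return False
    if op_f == "in" and op_g == "=":
        return val_g in val_f
    if op_f == "in" and op_g == "in":
        return val_g <= val_f
    if op_f == ">" and op_g == ">":
        return val_g >= val_f
    return f == g

def augment(query, all_filters):
    """Add every workload filter that subsumes some filter of the query."""
    return set(query) | {
        f for f in all_filters if any(subsumes(f, g) for g in query)
    }

def frequent_itemsets(transactions, threshold):
    """Brute-force count of every filter subset; fine for tiny inputs only."""
    counts = Counter()
    for items in transactions:
        for size in range(1, len(items) + 1):
            for combo in combinations(list(items), size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= threshold}

all_filters = set().union(*workload)
augmented = [augment(q, all_filters) for q in workload]
frequent = frequent_itemsets(augmented, threshold=2)
```

With this input, the filter sets that reach the threshold are `product in ('shoes', 'shirts')` (3 queries), `revenue > 21` (2 queries), and the two together (2 queries).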
Partitioning problem modeling
• F = {F1, F2, …, Fm} as features, with weights wi
• V = {v1, v2, …, vn} as transformed tuples
  vij indicates whether vi satisfies Fj
• P = {P1, P2, P3} as a partition
  $v(P_i) = \underset{v_j \in P_i}{\mathrm{OR}}\; v_j$
• Cost function C(Pi): the number of tuples that Pi can skip, summed over all queries in the workload
• Max(C(P)): NP-hard
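The cost function can be written out from the definitions on this slide. The formula below is a reconstruction from the prose description, not a copy of the slide's own equation; it assumes that $w_j$ counts the workload queries that feature $F_j$ helps to skip, that $|P_i|$ is the number of tuples in block $P_i$, and that $v(P_i)_j$ is bit $j$ of the block's union vector.

```latex
C(P_i) = |P_i| \sum_{j=1}^{m} w_j \,\bigl(1 - v(P_i)_j\bigr),
\qquad
C(P) = \sum_{P_i \in P} C(P_i)
```

A block contributes to the sum only through the features whose union bit is 0: when no tuple in $P_i$ satisfies $F_j$, each of the $w_j$ queries covered by $F_j$ can skip all $|P_i|$ tuples.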
The bottom up framework
Ward’s method: Hierarchical grouping to optimize an objective function
n² log(n)
R: {vector -> blockId, …}
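A greedy bottom-up merge in the spirit of Ward's method might look like the sketch below. It is not the authors' algorithm: it rescans all pairs on every round instead of reaching the n² log(n) bound, and the rule that blocks already at `min_size` merge only with under-sized blocks is an assumption. The names `vector_counts`, `weights`, and `min_size` are illustrative, and the cost used is block size times the total weight of the block's 0 bits.

```python
from itertools import combinations

def union(u, v):
    """Bitwise OR of two feature vectors."""
    return tuple(a | b for a, b in zip(u, v))

def cost(vec, size, weights):
    """Tuples skipped by a block: size times the weight of its 0 bits."""
    return size * sum(w for w, bit in zip(weights, vec) if bit == 0)

def bottom_up_blocking(vector_counts, weights, min_size):
    """Greedy merge; returns R, a map from feature vector to block id."""
    # Each block is (union vector, tuple count, member vectors).
    blocks = [(v, n, [v]) for v, n in vector_counts.items()]
    while len(blocks) > 1 and any(n < min_size for _, n, _ in blocks):
        best = None
        for i, j in combinations(range(len(blocks)), 2):
            (u, nu, _), (v, nv, _) = blocks[i], blocks[j]
            if nu >= min_size and nv >= min_size:
                continue  # both blocks are already large enough
            # Skipping opportunity lost by merging the two blocks.
            loss = (cost(u, nu, weights) + cost(v, nv, weights)
                    - cost(union(u, v), nu + nv, weights))
            if best is None or loss < best[0]:
                best = (loss, i, j)
        _, i, j = best
        (u, nu, mu), (v, nv, mv) = blocks[i], blocks[j]
        merged = (union(u, v), nu + nv, mu + mv)
        blocks = [b for k, b in enumerate(blocks) if k not in (i, j)]
        blocks.append(merged)
    return {
        v: block_id
        for block_id, (_, _, members) in enumerate(blocks)
        for v in members
    }

# Three distinct vectors with their tuple counts; every block needs >= 3 tuples.
vector_counts = {(1, 1, 0): 2, (1, 0, 0): 1, (0, 0, 1): 3}
R = bottom_up_blocking(vector_counts, weights=[1, 1, 1], min_size=3)
```

In this toy input, merging (1, 1, 0) with (1, 0, 0) loses the least skipping (a loss of 1, against 8 and 4 for the other pairs), so those two vectors share a block and (0, 0, 1) keeps its own.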
Data skipping
1. Generate the query vector
2. OR it with each partition vector
3. A block with at least one 0 bit in the result can be skipped
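The three steps can be sketched as follows. The sketch assumes that bit j of the query vector is 0 exactly when feature Fj subsumes the query and 1 otherwise; that convention is inferred from step 3, since a 0 in the OR result requires a 0 in both vectors. The block vectors in `catalog` are hypothetical, chosen only to be consistent with the Condition / Skip table of the earlier example.

```python
def query_vector(subsuming_features, num_features):
    """Bit j is 0 when feature j subsumes the query, 1 otherwise."""
    return tuple(
        0 if j in subsuming_features else 1 for j in range(num_features)
    )

def can_skip(query_vec, block_vec):
    """A block is skippable when the bitwise OR has at least one 0 bit."""
    return any((q | b) == 0 for q, b in zip(query_vec, block_vec))

def blocks_to_scan(query_vec, catalog):
    """Return the ids of the blocks that cannot be skipped."""
    return [b for b, vec in catalog.items() if not can_skip(query_vec, vec)]

# Hypothetical union vectors over features (F1, F2, F3).
catalog = {"P1": (1, 1, 0), "P2": (1, 0, 1), "P3": (0, 1, 0)}

scan_f3 = blocks_to_scan(query_vector({2}, 3), catalog)        # query under F3
scan_f1_f2 = blocks_to_scan(query_vector({0, 1}, 3), catalog)  # under F1 and F2
```

A query subsumed by F3 scans only P2 (P1 and P3 are skipped), and a query subsumed by both F1 and F2 scans only P1 (P2 and P3 are skipped).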
Experiment
• Environment
  Amazon Spark EC2 cluster with 25 instances
  8 × 2.66 GHz CPU cores
  64 GB RAM
  2 × 840 GB disk storage
• Implemented and evaluated on Shark (SQL on Spark)
Datasets
• TPC-H
  600 million rows, 700 GB in size
  Query templates (q3, q5, q6, q8, q10, q12, q14, q19)
  800 queries as training workload, 100 from each template
  80 testing queries, 10 from each template
• TPC-H Skewed
  The TPC-H query generator has a uniform distribution
  800 queries as training workload, 100 from each template under a Zipf distribution
• Conviva
  User access log of video streams
  104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …
  674 training queries and 61 testing queries
  680 million tuples, 1 TB in size
Notes on TPC-H: http://blog.csdn.net/fivedoumi/article/details/12356807
TPC-H results
• Query performance
• Measure the number of tuples scanned and the response time for different blocking and skipping schemes
  Full scan: no data skipping, baseline
  Range1: filter on o_orderdate, about 2300 partitions. Shark's data skipping used
  Range2: filter on {o_orderdate, r_name, c_mkt_segment, quantity}, about 9000 partitions. Shark's data skipping used
  Fineblock: numFeature=15 features from 800 training queries, minSize=50k. Shark's data skipping and feature-based data skipping used
TPC-H results - efficiency
TPC-H results – effect of minSize
• The smaller the blocks, the more chances there are to skip data
• numFeature=15 with various minSize values
Y-axis: ratio of the number of tuples scanned to the number of tuples that must be scanned
TPC-H results – effect of numFeat
TPC-H results – blocking time
• A month partition in TPC-H
• 7.7 million tuples, 8 GB in size
• 1000 blocks
• numFeat=15, minSize=50
• One minute
Conviva results
• Query performance
  Fullscan: no data skipping
  Range: partition on date and a frequently queried column, Shark's skipping used
  Fineblock: first partition on date, numFeature=40, minSize=50k, Shark's skipping and feature-based skipping used