Fine-grained Partitioning for
Aggressive Data Skipping
Calvin, 2015-06-03
SIGMOD 2014
UC Berkeley
Contents
• Background
• Contribution
• Overview
• Algorithm
• Data skipping
• Experiment
Background
• How to get insights into enormous datasets interactively?
• How to shorten query response time on huge datasets?
• Drawbacks
1. Coarse-grained blocks (partitions)
2. Unbalanced block sizes
3. The remaining blocks still contain many tuples
4. Blocks do not match the workload skew
5. Correlation between data and query filters
• Block / Partition
• Oracle / HBase / Hive / LogBase
Prune data blocks (partitions) according to metadata
Goals
• Workload-driven blocking technique
  Fine-grained
  Balanced block sizes
  Offline
  Re-executable
  Co-exists with the original partitioning technique
Example
Extract features
Vectorization
1. Split block
2. Storage
How to choose
How to split
Condition | Skip
F3        | P1, P3
F1 ∧ F2   | P2, P3
Contribution
• Feature selection
  Identify representative filters
  Modeled as frequent itemset mining
• Optimal partitioning
  The Balanced-Max-Skip partitioning problem is NP-hard
  A bottom-up framework gives an approximate solution
Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count by vector to reduce the partitioner's input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Update each block's union feature vector in the catalog
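The six-step pipeline above can be sketched as follows (a minimal sketch, not the paper's Shark implementation: features are modeled as predicate functions, and the partitioner is a trivial stand-in that gives each distinct vector its own block):

```python
# Minimal sketch of the blocking pipeline (steps 1-5).
# Features are predicate functions extracted from the workload.

def vectorize(tuple_, features):
    """Step 2: transform a tuple into its feature bit-vector."""
    return tuple(1 if f(tuple_) else 0 for f in features)

def blocking_pipeline(table, features):
    # Step 3: count by vector to reduce the partitioner's input.
    counts = {}
    for t in table:
        v = vectorize(t, features)
        counts[v] = counts.get(v, 0) + 1

    # Step 4: generate the blocking map (vector -> blockId).
    # A real partitioner solves Balanced-Max-Skip; here each
    # distinct vector simply gets its own block.
    blocking_map = {v: i for i, v in enumerate(sorted(counts))}

    # Step 5: route each tuple to its destination block.
    blocks = {}
    for t in table:
        bid = blocking_map[vectorize(t, features)]
        blocks.setdefault(bid, []).append(t)
    return blocking_map, blocks

# Example workload features over (product, revenue) tuples.
features = [
    lambda t: t[0] == 'shoes',              # F1
    lambda t: t[0] in ('shoes', 'shirts'),  # F2
    lambda t: t[1] > 21,                    # F3
]
table = [('shoes', 10), ('shirts', 40), ('pants', 5)]
bmap, blocks = blocking_pipeline(table, features)
```

Step 6 would then store each block's union vector (bitwise OR of its tuples' vectors) in the catalog for query-time skipping.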
Workload Assumptions
• Filters in the workload's queries show commonality and stability
  Scheduled or reporting queries
  Template queries with different value ranges
Workload Modeling
• Q = {Q1, Q2, …, Qm}
• Examples:
  Q1: product = ‘shoes’
  Q2: product in (‘shoes’, ‘shirts’), revenue > 32
  Q3: product = ‘shirts’, revenue > 21
• F: all predicates in Q
• Fi: Qi's predicates
• fij: each predicate in Fi
product in (‘shoes’, ‘shirts’) vs. product = ‘shoes’
product in (‘shoes’, ‘shirts’) vs. revenue > 21
Filter augmentation
• Before augmentation:
  Q1: product = ‘shoes’
  Q2: product in (‘shoes’, ‘shirts’), revenue > 32
  Q3: product = ‘shirts’, revenue > 21
• After augmentation:
  Q1: product = ‘shoes’, product in (‘shoes’, ‘shirts’)
  Q2: product in (‘shoes’, ‘shirts’), revenue > 32, revenue > 21
  Q3: product = ‘shirts’, revenue > 21, product in (‘shoes’, ‘shirts’)
Frequent itemset mining over the augmented filter sets with support threshold T (= 2); the numFeat most frequent filter sets are kept as features
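The selection step can be sketched as naive frequent itemset mining over the augmented filter sets (the filter ids f1–f5 are hypothetical labels for the example's predicates; a real implementation would use Apriori or FP-Growth rather than full enumeration):

```python
from itertools import combinations

# Sketch of feature selection: after augmentation each query is a
# set of filter ids; filter combinations occurring in at least T
# queries are candidate features (classic frequent itemset mining,
# done naively here by enumerating all subsets).

def frequent_filter_sets(queries, T):
    """Return filter subsets appearing in >= T queries, with support."""
    result = {}
    all_filters = sorted(set().union(*queries))
    for k in range(1, len(all_filters) + 1):
        for subset in combinations(all_filters, k):
            support = sum(1 for q in queries if set(subset) <= q)
            if support >= T:
                result[subset] = support
    return result

# Augmented workload from the example (hypothetical filter ids):
#   f1: product=‘shoes’   f2: product in (‘shoes’,‘shirts’)
#   f3: revenue > 32      f4: revenue > 21   f5: product=‘shirts’
Q1 = {'f1', 'f2'}
Q2 = {'f2', 'f3', 'f4'}
Q3 = {'f5', 'f4', 'f2'}
freq = frequent_filter_sets([Q1, Q2, Q3], T=2)
```

With T = 2 this keeps f2 (support 3), f4 (support 2), and the conjunction {f2, f4} (support 2), matching the slide's threshold example.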
Partitioning problem modeling
• F = {F1, F2, …, Fm} as features, each with weight wi
• V = {v1, v2, …, vn} as transformed tuples; vij indicates whether vi satisfies Fj
• P = {P1, P2, P3} as a partition, with union vector v(Pi) = OR_{vj ∈ Pi} vj
• Cost function C(Pi): the total number of tuples that Pi can skip, summed over all queries in the workload:
  C(Pi) = |Pi| · Σ wj over features j with v(Pi)j = 0
• Maximizing C(P) subject to balanced block sizes is NP-hard
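Under this model, the cost of a candidate block can be computed directly from its union vector; a small sketch (assuming, as above, that a block is skippable for feature Fj exactly when bit j of its union vector is 0):

```python
# Sketch of the skipping cost C(Pi): a block Pi is skipped by
# feature Fj exactly when its union vector has a 0 in position j,
# saving |Pi| tuple scans, weighted by the feature's weight wj.

def union_vector(vectors):
    """Bitwise OR of all tuple vectors in a block."""
    return tuple(max(bits) for bits in zip(*vectors))

def cost(block_vectors, weights):
    """C(Pi) = |Pi| * sum of wj over features the block can skip."""
    u = union_vector(block_vectors)
    return len(block_vectors) * sum(
        w for bit, w in zip(u, weights) if bit == 0)

# Two tuples, three features; weights stand in for query frequencies.
vectors = [(1, 1, 0), (0, 1, 0)]
weights = [1.0, 2.0, 1.0]
# union = (1, 1, 0): only feature 3 is skippable, so C = 2 * 1.0
```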
The bottom up framework
• Ward’s method: hierarchical grouping to optimize an objective function
• Complexity: O(n² log n)
R: {vector -> blockId, …}
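One way to picture the bottom-up framework (a naive sketch, not the paper's O(n² log n) implementation: it recomputes every candidate merge each round instead of using a priority queue, and greedily merges the pair that loses the least skipping benefit until every group reaches the minimum size):

```python
# Ward-style bottom-up sketch: start with one group per distinct
# vector and repeatedly merge the pair whose merge sacrifices the
# least skipping benefit, until all groups reach min_size.

def union(u, v):
    return tuple(a | b for a, b in zip(u, v))

def benefit(vec, size, weights):
    """Tuples skippable for this group: size * weight of 0-bits."""
    return size * sum(w for bit, w in zip(vec, weights) if bit == 0)

def bottom_up(vector_counts, weights, min_size):
    """vector_counts: {vector: tuple count}. Returns (vector, size) groups."""
    groups = [(v, c) for v, c in vector_counts.items()]
    while True:
        small = [i for i, (_, c) in enumerate(groups) if c < min_size]
        if not small or len(groups) == 1:
            return groups
        best = None
        for i in small:
            for j in range(len(groups)):
                if i == j:
                    continue
                (vi, ci), (vj, cj) = groups[i], groups[j]
                merged = (union(vi, vj), ci + cj)
                loss = (benefit(vi, ci, weights) + benefit(vj, cj, weights)
                        - benefit(merged[0], merged[1], weights))
                if best is None or loss < best[0]:
                    best = (loss, i, j, merged)
        _, i, j, merged = best
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
```

The returned groups correspond to the blocking map R: each group's distinct vectors map to one blockId.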
Data skipping
1. Generate the query’s feature vector
2. Check it against each block’s union vector
3. A block whose union vector has a 0 bit for one of the query’s features can be skipped
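The query-time check reduces to a bit-vector comparison; a minimal sketch (the `catalog` layout is illustrative, not Shark's actual metadata format):

```python
# Sketch of the query-time skip check: a block can be skipped when
# some feature implied by the query's filter has a 0 bit in the
# block's union vector (no tuple in the block can match it).

def can_skip(query_vector, block_union_vector):
    """True if the block is guaranteed to contain no matching tuple."""
    return any(q == 1 and b == 0
               for q, b in zip(query_vector, block_union_vector))

def blocks_to_scan(query_vector, catalog):
    """catalog: {block_id: union vector}. Return ids we must read."""
    return [bid for bid, u in catalog.items()
            if not can_skip(query_vector, u)]

catalog = {0: (1, 1, 0), 1: (0, 1, 1), 2: (1, 0, 1)}
# A query implying feature F1 only needs blocks whose bit 0 is 1.
scan = blocks_to_scan((1, 0, 0), catalog)
```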
Experiment
• Environment: Amazon EC2 Spark cluster with 25 instances
  8 × 2.66 GHz CPU cores
  64 GB RAM
  2 × 840 GB disk storage
• Implemented and evaluated on Shark (SQL on Spark)
Datasets
• TPC-H
  600 million rows, 700 GB in size
  Query templates: q3, q5, q6, q8, q10, q12, q14, q19
  800 training queries (100 per template)
  80 testing queries (10 per template)
• TPC-H Skewed
  The standard TPC-H query generator draws parameters uniformly
  800 training queries (100 per template) generated under a Zipf distribution
• Conviva
  User access log of video streams
  104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …
  674 training queries and 61 testing queries
  680 million tuples, 1 TB in size
Notes on TPC-H (in Chinese): http://blog.csdn.net/fivedoumi/article/details/12356807
TPC-H results
• Query performance: measure the number of tuples scanned and the response time under different blocking and skipping schemes
• Full scan: no data skipping (baseline)
• Range1: partitioned on o_orderdate, about 2,300 partitions; Shark’s data skipping used
• Range2: partitioned on {o_orderdate, r_name, c_mkt_segment, quantity}, about 9,000 partitions; Shark’s data skipping used
• Fineblock: numFeature = 15 features from the 800 training queries, minSize = 50k; both Shark’s data skipping and feature-based data skipping used
TPC-H results - efficiency
TPC-H results – effect of minSize
• The smaller the blocks, the more data can potentially be skipped
• numFeature = 15, varying minSize
• Y-axis: ratio of tuples scanned to tuples that must be scanned
TPC-H results – effect of numFeat
TPC-H results – blocking time
• One month’s partition of TPC-H
  7.7 million tuples, 8 GB in size
  1,000 blocks
  numFeat = 15, minSize = 50
  Blocking takes about one minute
Conviva results
• Query performance
• Fullscan: no data skipping
• Range: partitioned on date and a frequently queried column; Shark’s skipping used
• Fineblock: first partitioned on date, then numFeature = 40, minSize = 50k; both Shark’s skipping and feature-based skipping used