Fine-grained Partitioning for
Aggressive Data Skipping
Calvin, 2015-06-03
SIGMOD 2014
UC Berkeley
Contents
• Background
• Contribution
• Overview
• Algorithm
• Data skipping
• Experiment
Background
• How to get insights into enormous datasets interactively?
• How to shorten query response time on huge datasets?
• Drawbacks
1. Coarse-grained blocks (partitions)
2. Unbalanced block sizes
3. The remaining blocks still contain many tuples
4. Blocks do not match the workload skew
5. Correlation between data and query filters
• Block / Partition
• Oracle / HBase / Hive / LogBase
Prune data blocks (partitions) according to metadata
Goals
• Workload-driven blocking technique
  Fine-grained
  Balanced block sizes
  Offline
  Re-executable
  Co-exists with the original partitioning technique
Example
Extract features
Vectorization
1. Split block
2. Storage
How to choose
How to split
Condition | Skip
F3        | P1, P3
F1 ∧ F2   | P2, P3
Contribution
• Feature selection
  Identify representative filters
  Modeled as frequent itemset mining
• Optimal partitioning
  The Balanced-Max-Skip partitioning problem is NP-hard
  A bottom-up framework gives an approximate solution
Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count by vector to reduce the partitioner's input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Update each block's union feature vector in the catalog
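The six-step pipeline above can be sketched as follows (a minimal sketch, not the paper's Shark implementation: features are modeled as predicate functions, and the partitioner is a trivial stand-in that gives each distinct vector its own block):

```python
# Minimal sketch of the blocking pipeline (steps 1-5).
# Features are predicate functions extracted from the workload.

def vectorize(tuple_, features):
    """Step 2: transform a tuple into its feature bit-vector."""
    return tuple(1 if f(tuple_) else 0 for f in features)

def blocking_pipeline(table, features):
    # Step 3: count by vector to reduce the partitioner's input.
    counts = {}
    for t in table:
        v = vectorize(t, features)
        counts[v] = counts.get(v, 0) + 1

    # Step 4: generate the blocking map (vector -> blockId).
    # A real partitioner solves Balanced-Max-Skip; here each
    # distinct vector simply gets its own block.
    blocking_map = {v: i for i, v in enumerate(sorted(counts))}

    # Step 5: route each tuple to its destination block.
    blocks = {}
    for t in table:
        bid = blocking_map[vectorize(t, features)]
        blocks.setdefault(bid, []).append(t)
    return blocking_map, blocks

# Example workload features over (product, revenue) tuples.
features = [
    lambda t: t[0] == 'shoes',              # F1
    lambda t: t[0] in ('shoes', 'shirts'),  # F2
    lambda t: t[1] > 21,                    # F3
]
table = [('shoes', 10), ('shirts', 40), ('pants', 5)]
bmap, blocks = blocking_pipeline(table, features)
```

Step 6 would then store each block's union vector (bitwise OR of its tuples' vectors) in the catalog for query-time skipping.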
Workload Assumptions
• Filters in the workload's queries show commonality and stability
  Scheduled or reporting queries
  Template queries with different value ranges
Workload Modeling
• Q = {Q1, Q2, …, Qm}
• Examples:
  Q1: product = ‘shoes’
  Q2: product in (‘shoes’, ‘shirts’), revenue > 32
  Q3: product = ‘shirts’, revenue > 21
• F: all predicates in Q
• Fi: Qi's predicates
• fij: each predicate in Fi
product in (‘shoes’, ‘shirts’) vs. product = ‘shoes’
product in (‘shoes’, ‘shirts’) vs. revenue > 21
Filter augmentation
• Before augmentation:
  Q1: product = ‘shoes’
  Q2: product in (‘shoes’, ‘shirts’), revenue > 32
  Q3: product = ‘shirts’, revenue > 21
• After augmentation:
  Q1: product = ‘shoes’, product in (‘shoes’, ‘shirts’)
  Q2: product in (‘shoes’, ‘shirts’), revenue > 32, revenue > 21
  Q3: product = ‘shirts’, revenue > 21, product in (‘shoes’, ‘shirts’)
Frequent itemset mining over the augmented filter sets with support threshold T (= 2); the numFeat most frequent filter sets are kept as features
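The selection step can be sketched as naive frequent itemset mining over the augmented filter sets (the filter ids f1–f5 are hypothetical labels for the example's predicates; a real implementation would use Apriori or FP-Growth rather than full enumeration):

```python
from itertools import combinations

# Sketch of feature selection: after augmentation each query is a
# set of filter ids; filter combinations occurring in at least T
# queries are candidate features (classic frequent itemset mining,
# done naively here by enumerating all subsets).

def frequent_filter_sets(queries, T):
    """Return filter subsets appearing in >= T queries, with support."""
    result = {}
    all_filters = sorted(set().union(*queries))
    for k in range(1, len(all_filters) + 1):
        for subset in combinations(all_filters, k):
            support = sum(1 for q in queries if set(subset) <= q)
            if support >= T:
                result[subset] = support
    return result

# Augmented workload from the example (hypothetical filter ids):
#   f1: product=‘shoes’   f2: product in (‘shoes’,‘shirts’)
#   f3: revenue > 32      f4: revenue > 21   f5: product=‘shirts’
Q1 = {'f1', 'f2'}
Q2 = {'f2', 'f3', 'f4'}
Q3 = {'f5', 'f4', 'f2'}
freq = frequent_filter_sets([Q1, Q2, Q3], T=2)
```

With T = 2 this keeps f2 (support 3), f4 (support 2), and the conjunction {f2, f4} (support 2), matching the slide's threshold example.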
Partitioning problem modeling
• F = {F1, F2, …, Fm} as features, each with weight wi
• V = {v1, v2, …, vn} as transformed tuples; vij indicates whether vi satisfies Fj
• P = {P1, P2, P3} as a partition, with union vector v(Pi) = OR_{vj ∈ Pi} vj
• Cost function C(Pi): the total number of tuples that Pi can skip, summed over all queries in the workload:
  C(Pi) = |Pi| · Σ wj over features j with v(Pi)j = 0
• Maximizing C(P) subject to balanced block sizes is NP-hard
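Under this model, the cost of a candidate block can be computed directly from its union vector; a small sketch (assuming, as above, that a block is skippable for feature Fj exactly when bit j of its union vector is 0):

```python
# Sketch of the skipping cost C(Pi): a block Pi is skipped by
# feature Fj exactly when its union vector has a 0 in position j,
# saving |Pi| tuple scans, weighted by the feature's weight wj.

def union_vector(vectors):
    """Bitwise OR of all tuple vectors in a block."""
    return tuple(max(bits) for bits in zip(*vectors))

def cost(block_vectors, weights):
    """C(Pi) = |Pi| * sum of wj over features the block can skip."""
    u = union_vector(block_vectors)
    return len(block_vectors) * sum(
        w for bit, w in zip(u, weights) if bit == 0)

# Two tuples, three features; weights stand in for query frequencies.
vectors = [(1, 1, 0), (0, 1, 0)]
weights = [1.0, 2.0, 1.0]
# union = (1, 1, 0): only feature 3 is skippable, so C = 2 * 1.0
```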
The bottom up framework
• Ward’s method: hierarchical grouping to optimize an objective function
• Complexity: O(n² log n)
R: {vector -> blockId, …}
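One way to picture the bottom-up framework (a naive sketch, not the paper's O(n² log n) implementation: it recomputes every candidate merge each round instead of using a priority queue, and greedily merges the pair that loses the least skipping benefit until every group reaches the minimum size):

```python
# Ward-style bottom-up sketch: start with one group per distinct
# vector and repeatedly merge the pair whose merge sacrifices the
# least skipping benefit, until all groups reach min_size.

def union(u, v):
    return tuple(a | b for a, b in zip(u, v))

def benefit(vec, size, weights):
    """Tuples skippable for this group: size * weight of 0-bits."""
    return size * sum(w for bit, w in zip(vec, weights) if bit == 0)

def bottom_up(vector_counts, weights, min_size):
    """vector_counts: {vector: tuple count}. Returns (vector, size) groups."""
    groups = [(v, c) for v, c in vector_counts.items()]
    while True:
        small = [i for i, (_, c) in enumerate(groups) if c < min_size]
        if not small or len(groups) == 1:
            return groups
        best = None
        for i in small:
            for j in range(len(groups)):
                if i == j:
                    continue
                (vi, ci), (vj, cj) = groups[i], groups[j]
                merged = (union(vi, vj), ci + cj)
                loss = (benefit(vi, ci, weights) + benefit(vj, cj, weights)
                        - benefit(merged[0], merged[1], weights))
                if best is None or loss < best[0]:
                    best = (loss, i, j, merged)
        _, i, j, merged = best
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
```

The returned groups correspond to the blocking map R: each group's distinct vectors map to one blockId.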
Data skipping
1. Generate the query’s feature vector
2. Check it against each block’s union vector
3. A block whose union vector has a 0 bit for one of the query’s features can be skipped
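The query-time check reduces to a bit-vector comparison; a minimal sketch (the `catalog` layout is illustrative, not Shark's actual metadata format):

```python
# Sketch of the query-time skip check: a block can be skipped when
# some feature implied by the query's filter has a 0 bit in the
# block's union vector (no tuple in the block can match it).

def can_skip(query_vector, block_union_vector):
    """True if the block is guaranteed to contain no matching tuple."""
    return any(q == 1 and b == 0
               for q, b in zip(query_vector, block_union_vector))

def blocks_to_scan(query_vector, catalog):
    """catalog: {block_id: union vector}. Return ids we must read."""
    return [bid for bid, u in catalog.items()
            if not can_skip(query_vector, u)]

catalog = {0: (1, 1, 0), 1: (0, 1, 1), 2: (1, 0, 1)}
# A query implying feature F1 only needs blocks whose bit 0 is 1.
scan = blocks_to_scan((1, 0, 0), catalog)
```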
Experiment
• Environment: Amazon EC2 Spark cluster with 25 instances
  8 × 2.66 GHz CPU cores
  64 GB RAM
  2 × 840 GB disk storage
• Implemented and evaluated on Shark (SQL on Spark)
Datasets
• TPC-H
  600 million rows, 700 GB in size
  Query templates: q3, q5, q6, q8, q10, q12, q14, q19
  800 training queries (100 per template)
  80 testing queries (10 per template)
• TPC-H Skewed
  The standard TPC-H query generator draws parameters uniformly
  800 training queries (100 per template) generated under a Zipf distribution
• Conviva
  User access log of video streams
  104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …
  674 training queries and 61 testing queries
  680 million tuples, 1 TB in size
Notes on TPC-H (in Chinese): http://blog.csdn.net/fivedoumi/article/details/12356807
TPC-H results
• Query performance: measure the number of tuples scanned and the response time under different blocking and skipping schemes
• Full scan: no data skipping (baseline)
• Range1: partitioned on o_orderdate, about 2,300 partitions; Shark’s data skipping used
• Range2: partitioned on {o_orderdate, r_name, c_mkt_segment, quantity}, about 9,000 partitions; Shark’s data skipping used
• Fineblock: numFeature = 15 features from the 800 training queries, minSize = 50k; both Shark’s data skipping and feature-based data skipping used
TPC-H results - efficiency
TPC-H results – effect of minSize
• The smaller the blocks, the more data can potentially be skipped
• numFeature = 15, varying minSize
• Y-axis: ratio of tuples scanned to tuples that must be scanned
TPC-H results – effect of numFeat
TPC-H results – blocking time
• One month’s partition of TPC-H
  7.7 million tuples, 8 GB in size
  1,000 blocks
  numFeat = 15, minSize = 50
  Blocking takes about one minute
Conviva results
• Query performance
• Fullscan: no data skipping
• Range: partitioned on date and a frequently queried column; Shark’s skipping used
• Fineblock: first partitioned on date, then numFeature = 40, minSize = 50k; both Shark’s skipping and feature-based skipping used