Selectivity-Based Partitioning

Alkis Polyzotis, UC Santa Cruz

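[Figure: query processing pipeline of Parser, Optimizer, and Execution Engine. The query R1 ⋈ R2 ⋈ R3 ⋈ R4 enters the pipeline and the optimizer outputs the join order ((R2 ⋈ R3) ⋈ R1) ⋈ R4.]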

Query Optimization

• Integral component of declarative query processing

• Key problem: join ordering

• Most important (and most complex!) module of a DBMS

“Monolithic” Query Optimization

• Output: a single join order based on join selectivities between tables

Plan: (P ⋈ E) ⋈ D

Partition-Based Query Optimization

• Output: multiple join orders based on selectivities between fragments of tables

Plan: ((P ⋈ D2) ⋈ E) ∪ ((E ⋈ D1) ⋈ P)
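The Divide-and-Union idea behind the partitioned plan can be checked on a toy instance. The sketch below is not from the talk: the relations, attribute names, and the split of R2 are hypothetical, and the join is a naive nested loop. It only illustrates that evaluating each fragment with its own join order and taking the union gives the same answer as a single monolithic plan.

```python
def join(left, right, key):
    """Naive nested-loop equi-join of two lists of dict rows on a shared attribute."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def as_set(rows):
    """Order-insensitive view of a result, used to compare two plans."""
    return {tuple(sorted(row.items())) for row in rows}


# Hypothetical toy relations forming the chain R1 - R2 - R3.
R1 = [{"a": 1, "x": "p"}, {"a": 2, "x": "q"}]
R2 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 2, "b": 10}]
R3 = [{"b": 10, "y": "u"}, {"b": 20, "y": "v"}]

# Divide: split R2 into two disjoint fragments (an arbitrary split, for illustration).
R21, R22 = R2[:1], R2[1:]

# Monolithic plan: one join order for all of R2.
monolithic = join(join(R1, R2, "a"), R3, "b")

# Partitioned plan: a different join order per fragment, then Union.
partitioned = (join(join(R1, R21, "a"), R3, "b")
               + join(join(R3, R22, "b"), R1, "a"))

same_answer = as_set(monolithic) == as_set(partitioned)
```

Because the fragments are disjoint, list concatenation is enough to implement the union here.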

Selectivity-Based Partitioning

• Divide-and-Union paradigm
• Optimization problem and analysis
• Partitioning algorithm
• Experimental results

Roadmap

• Preliminaries
• Problem Definition
• Partitioning Algorithm
  • Optimal Splits
  • Iterative Partitioning
• Experimental Results
• Conclusions

Data and Query Model

• Chain-join queries
  • Example: R1 ⋈ R2 ⋈ R3 ⋈ R4
• Relations may have optional selections
• Relation frequency matrix
• Left-deep evaluation plans
  • Example: ((R3 ⋈ R2) ⋈ R4) ⋈ R1

[Figure: left-deep plan tree that joins R3 with R2, then R4, then R1.]
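A small sketch of how frequency matrices can describe a chain join. The slides do not spell out the matrix layout, so this assumes that entry (m, l) of a relation's matrix counts the tuples whose left join attribute has value m and whose right join attribute has value l. Under that assumption, the size of a chain join is the sum of the entries of the matrix product. The helper names and the sample matrices are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def chain_join_size(freqs):
    """Number of result tuples of the chain join described by freqs (assumed layout)."""
    prod = freqs[0]
    for f in freqs[1:]:
        prod = matmul(prod, f)
    return sum(map(sum, prod))


# Hypothetical frequency matrices for two adjacent relations in a chain.
F1 = [[2, 0],
      [1, 3]]
F2 = [[1, 1],
      [0, 4]]

size = chain_join_size([F1, F2])  # 3 * 2 tuples on value 0 plus 3 * 4 on value 1
```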

Problem Definition

• Given: query Q, maximum partition count N
• Goal: find a partitioning of Q into n ≤ N partitions that minimizes query cost

• On-the-fly partitioning vs. Off-line partitioning

• Difficult optimization problem!
  • Determine the pivot relation
  • Determine the number of partitions
  • Compute a partitioning of the pivot
  • Determine the orderings of partitioned plans

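[Figure: the query R1 ⋈ R2 ⋈ R3 ⋈ R4 rewritten as two partitioned plans over fragments R21 and R22 of the pivot R2. Extracted labels: R1 R21 R4 R3; R3 R22 R1 R4.]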

Query Cost Function

• One possibility: optimizer’s cost model
  • Accurate cost estimation
  • Solution depends on low-level system details
  • Difficult to gain intuitions
• Our approach: query cost = number of intermediate result tuples
  • Simple function that admits analysis
  • Sound connections to realistic cost models (Cluet and Moerkotte, ICDT’95)
  • Example: Cost(R3 ⋈ R2 ⋈ R4 ⋈ R1) = |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4|
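The cost function above can be sketched for left-deep plans over a chain. This is an illustration under assumptions, not the talk's implementation: relations are summarized by frequency matrices whose entry (m, l) counts tuples with left join value m and right join value l, plans avoid cross products (so every prefix of the order is a contiguous piece of the chain), and the final result is not counted, as in the example formula. All names and matrices are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def subchain_size(freqs, lo, hi):
    """Size of the join of relations lo..hi (0-based, inclusive) of the chain."""
    prod = freqs[lo]
    for k in range(lo + 1, hi + 1):
        prod = matmul(prod, freqs[k])
    return sum(map(sum, prod))


def left_deep_cost(freqs, order):
    """Sum of intermediate result sizes for a left-deep order without cross products."""
    lo = hi = order[0]
    cost = 0
    for step, r in enumerate(order[1:], start=1):
        if r == lo - 1:
            lo = r
        elif r == hi + 1:
            hi = r
        else:
            raise ValueError("order would require a cross product")
        if step < len(order) - 1:  # the final result is not an intermediate
            cost += subchain_size(freqs, lo, hi)
    return cost


# Hypothetical frequency matrices for R1..R4 (list indices 0..3).
F = [
    [[1, 0], [0, 1]],
    [[5, 0], [0, 1]],
    [[1, 0], [0, 1]],
    [[1, 0], [0, 10]],
]

cost_r3_r2_r4_r1 = left_deep_cost(F, [2, 1, 3, 0])  # |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4|
cost_r1_r2_r3_r4 = left_deep_cost(F, [0, 1, 2, 3])  # |R1 ⋈ R2| + |R1 ⋈ R2 ⋈ R3|
```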

Partitioning Algorithm - Overview

• State space: partitioned join orders

• Partitioning algorithm:
  • Explore a set of states
  • Compute optimal partitioning for each state
  • Return global optimum
• Our approach: order joins then partition
  • Another possibility: partition then order joins

Distributing Tuples

• Goal: Distribute tuples to minimize cost

• Optimal distribution depends on:
  • Frequency matrices of other relations
  • Position (m,l)

Optimal Split Theorem

• Distribute each value (m,l) independently

• Place (m,l) in partition that minimizes g(L,T,m,l)
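The independent placement rule can be sketched as follows. The slides do not define g(L,T,m,l), so the code stands in for it with a hypothetical per-partition weight matrix: weights[k][m][l] is assumed to be the cost that one pivot tuple at position (m,l) adds to the plan of partition k. Each position then goes to the partition with the smallest weight.

```python
def optimal_split(pivot_freq, weights):
    """Place every position (m, l) of the pivot in the partition with minimal weight.

    weights[k][m][l] is a hypothetical stand-in for g(L, T, m, l) of partition k.
    """
    n_parts = len(weights)
    rows, cols = len(pivot_freq), len(pivot_freq[0])
    # Ties go to the lowest partition index.
    assignment = [[min(range(n_parts), key=lambda k: weights[k][m][l])
                   for l in range(cols)]
                  for m in range(rows)]
    fragments = [[[pivot_freq[m][l] if assignment[m][l] == k else 0
                   for l in range(cols)]
                  for m in range(rows)]
                 for k in range(n_parts)]
    return assignment, fragments


# Hypothetical pivot frequency matrix and per-partition weights.
pivot = [[4, 1],
         [2, 3]]
weights = [
    [[1, 5], [2, 2]],  # partition 0
    [[3, 1], [1, 4]],  # partition 1
]

assignment, fragments = optimal_split(pivot, weights)
split_cost = sum(weights[assignment[m][l]][m][l] * pivot[m][l]
                 for m in range(2) for l in range(2))
```

Since each position is decided on its own, the split takes one pass over the pivot's frequency matrix for a fixed set of per-partition plans.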

Search Algorithm

• Exhaustive search is impractical: [ Pivot, Leading orders, Trailing orders ]
• Search heuristics:
  • Tighter search space: [ Pivot, Optimal Leading orders ]
  • Iterative Partitioning
  • Guided search by using lower bounds on cost of partitions

Encoding of State Space

• State: [ Pivot , Optimal leading orders ]

• Transition: insert relation in a leading order

[Figure: example state transition. Extracted labels: R5 R1; R3 R4 R5.]

Iterative Partitioning

• Key idea: (Partition, Optimize)+
  • Compute optimal split for leading/trailing orders
  • Optimize trailing orders for the current split

• Theorem: query cost can only decrease

• Idea extended to more detailed cost models

[Figure: pivot R2 split into fragments R21 and R22, each with its own leading and trailing join orders over R1, R3, R4, and R5.]
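The (Partition, Optimize)+ loop can be sketched as an alternating minimization. This is a simplified illustration under assumptions, not the algorithm from the talk: each candidate join order is represented only by a hypothetical weight matrix giving the cost one pivot tuple at (m,l) would add under that order, and the loop stops when a full round no longer lowers the cost.

```python
def plan_cost(fragment, weight):
    """Cost of evaluating one fragment under one candidate order."""
    return sum(f * w for frow, wrow in zip(fragment, weight)
               for f, w in zip(frow, wrow))


def split(pivot_freq, weights):
    """Partition step: each position goes to the partition with minimal weight."""
    n = len(weights)
    frags = [[[0] * len(pivot_freq[0]) for _ in pivot_freq] for _ in range(n)]
    for m in range(len(pivot_freq)):
        for l in range(len(pivot_freq[0])):
            k = min(range(n), key=lambda p: weights[p][m][l])
            frags[k][m][l] = pivot_freq[m][l]
    return frags


def iterative_partitioning(pivot_freq, order_weights, n_parts, max_rounds=20):
    """Alternate Partition and Optimize until a round brings no improvement."""
    chosen = [k % len(order_weights) for k in range(n_parts)]  # initial orders
    history, best = [], None
    for _ in range(max_rounds):
        # Partition: best split for the currently chosen orders.
        frags = split(pivot_freq, [order_weights[c] for c in chosen])
        history.append(sum(plan_cost(f, order_weights[c])
                           for f, c in zip(frags, chosen)))
        # Optimize: best order for each fragment of the current split.
        chosen = [min(range(len(order_weights)),
                      key=lambda o: plan_cost(f, order_weights[o]))
                  for f in frags]
        cost = sum(plan_cost(f, order_weights[c]) for f, c in zip(frags, chosen))
        history.append(cost)
        if best is not None and cost >= best:
            break
        best = cost
    return frags, chosen, history


# Hypothetical pivot matrix and three candidate orders.
pivot = [[1, 1],
         [2, 6]]
orders = [
    [[1, 5], [2, 2]],
    [[3, 1], [1, 4]],
    [[2, 2], [3, 1]],
]

frags, chosen, history = iterative_partitioning(pivot, orders, n_parts=2)
```

Each step minimizes the cost with the other choice held fixed, so the recorded costs never increase, in line with the theorem on the slide.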

Search Algorithm

• Initial states: single-relation leading orders

• Search process:
  • Compute partitions with IP
  • Open more states with transition function

• Transitions are guided by lower bound on cost function

• Same lower bound can also prune states

• Stopping criteria:
  • Search space is exhausted
  • Time budget is exhausted
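The search loop described above resembles a best-first search with pruning. The sketch below is a generic version under assumptions: `expand`, `lower_bound`, and `evaluate` are hypothetical callbacks (in the talk, evaluation would run Iterative Partitioning), the toy cost table is invented, and the transition appends a relation rather than inserting it at an arbitrary position.

```python
import heapq
import itertools
import time


def guided_search(initial_states, expand, lower_bound, evaluate, time_budget_s):
    """Best-first search ordered by lower bound, with pruning and a time budget."""
    deadline = time.monotonic() + time_budget_s
    tie = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(lower_bound(s), next(tie), s) for s in initial_states]
    heapq.heapify(frontier)
    best_state, best_cost = None, float("inf")
    while frontier and time.monotonic() < deadline:
        bound, _, state = heapq.heappop(frontier)
        if bound >= best_cost:
            continue  # prune: this state cannot beat the best one found
        cost = evaluate(state)
        if cost < best_cost:
            best_state, best_cost = state, cost
        for child in expand(state):
            child_bound = lower_bound(child)
            if child_bound < best_cost:
                heapq.heappush(frontier, (child_bound, next(tie), child))
    return best_state, best_cost


# Hypothetical toy problem: a state is a leading order over three relations.
RELATIONS = ("R1", "R3", "R4")
BASE = {"R1": 30, "R3": 50, "R4": 40}   # invented cost of a one-relation order
STEP = {"R1": -5, "R3": 4, "R4": -12}   # invented cost change when adding a relation


def evaluate(state):
    return BASE[state[0]] + sum(STEP[r] for r in state[1:])


def lower_bound(state):
    # No extension can gain more than all remaining negative steps together.
    return evaluate(state) + sum(min(0, STEP[r]) for r in RELATIONS if r not in state)


def expand(state):
    return [state + (r,) for r in RELATIONS if r not in state]


result = guided_search([(r,) for r in RELATIONS], expand, lower_bound, evaluate, 1.0)
```

The same bound drives both the visiting order and the pruning test, which mirrors the two roles of the lower bound on the slide.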

System Integration

[Figure: two pipelines side by side. Monolithic: Parser, Optimizer, Execution Engine. Partition-based: Parser, Optimizer, Execution Engine, plus a Partitioner component.]

Effect of Skew

[Figure: Synthetic Data. x-axis: Data Skew (0.5 to 2); y-axis: Avg. Reduction (%) (0 to 60); series: ComputePartition, OptimalPartition.]

Execution Time

[Figure: Synthetic Data (Skew=1.5). x-axis: Maximum Partition Count (2, 3, 4); y-axis: Execution Time (sec), ticks from 0.1 to 1000; series: Iterative, Optimal.]

Varying Time Budget

[Figure: Synthetic Data (Skew=1.5). x-axis: Time (seconds) (0 to 2); y-axis: Avg. Reduction (%) (0 to 50); series: Q_H, Q_L.]

Results on Real-Life Data

[Figure: SwissProt. x-axis: Query (Q0, Q1, Q2, Q3); y-axis: Avg. Reduction (%) (0 to 30); data labels as extracted: 7.64E+05, 3.19E+05, 7.20E+05, 1.08E+06.]

Conclusions

• Monolithic optimization: missed opportunities
• Selectivity-Based Partitioning
  • Divide & Union approach
  • Multiple join orders per query
  • Join selectivity between relation fragments
• Partitioning Algorithm
  • Iterative Partitioning
• Experimental Results
  • Significant reduction of intermediate results

Future Work

• Extension to multiple pivots
• Partition-then-order optimization

• Efficient execution of partitioned plans

• Off-line workload-aware partitioning

Thank you!

Partitioning Model

• General case: Multi-relation partitioning

• Our approach: Single-relation partitioning

[Figure: partitioning of the query R1 ⋈ R2 ⋈ R3 ⋈ R4 with fragments of R2 and R3. Extracted labels: R1 R21 R4 R3; R31 R22 R1 R4; R1 R22 R32 R4.]
