Selectivity-Based Partitioning
Alkis Polyzotis, UC Santa Cruz
[Figure: query pipeline Parser → Optimizer → Execution Engine. The query R1 ⋈ R2 ⋈ R3 ⋈ R4 enters the optimizer, which outputs the join order ((R2 ⋈ R3) ⋈ R1) ⋈ R4.]
Query Optimization
• Integral component of declarative query processing
• Key problem: join ordering
• Most important (and most complex!) module of a DBMS
“Monolithic” Query Optimization
• Output: a single join order based on join selectivities between tables
Plan: (P ⋈ E) ⋈ D
Partition-Based Query Optimization
• Output: multiple join orders based on selectivities between fragments of tables
Plan: ((P ⋈ D2) ⋈ E) ∪ ((E ⋈ D1) ⋈ P)
Selectivity-Based Partitioning
• Divide-and-Union paradigm
• Optimization problem and analysis
• Partitioning algorithm
• Experimental results
Roadmap
• Preliminaries
• Problem Definition
• Partitioning Algorithm
  • Optimal Splits
  • Iterative Partitioning
• Experimental Results
• Conclusions
Data and Query Model
• Chain-join queries
  • Example: R1 ⋈ R2 ⋈ R3 ⋈ R4
• Relations may have optional selections
• Relation frequency matrix
• Left-deep evaluation plans
  • Example: ((R3 ⋈ R2) ⋈ R4) ⋈ R1
[Figure: left-deep plan tree that joins R3 with R2, then R4, then R1]
Problem Definition
• Given: query Q, maximum partition count N
• Goal: find a partitioning of Q into n ≤ N partitions that minimizes query cost
• On-the-fly partitioning vs. off-line partitioning
• Difficult optimization problem!
  • Determine the pivot relation
  • Determine the number of partitions
  • Compute a partitioning of the pivot
  • Determine the orderings of partitioned plans
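[Figure: query R1 ⋈ R2 ⋈ R3 ⋈ R4 evaluated as two partitioned plans over fragments R21 and R22 of the pivot R2, with join orders R1, R21, R4, R3 and R3, R22, R1, R4]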
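A minimal sketch of the divide-and-union idea behind the example query above, assuming each relation in the chain is a list of (left, right) value pairs and that the fragments R21 and R22 are disjoint and together cover R2. The helper names `join` and `chain_join` are illustrative and do not come from the talk. Both partitioned plans are evaluated in chain order here for brevity; in the approach described, each plan may use its own join order, which changes the cost but not the result.

```python
from functools import reduce

def join(left, right):
    """Chain join: match the last value of a left tuple with the first value of a right tuple."""
    return [l + r[1:] for l in left for r in right if l[-1] == r[0]]

def chain_join(relations):
    """Join a list of relations left to right."""
    return reduce(join, relations)

# Toy chain query R1 ⋈ R2 ⋈ R3 ⋈ R4 (made-up data).
R1 = [(0, 1), (0, 2)]
R2 = [(1, 5), (2, 6), (2, 7)]
R3 = [(5, 8), (6, 8), (7, 9)]
R4 = [(8, 0), (9, 0)]

# Split the pivot R2 into two disjoint fragments.
R21 = [t for t in R2 if t[0] == 1]
R22 = [t for t in R2 if t[0] != 1]

full = chain_join([R1, R2, R3, R4])
# Because the fragments are disjoint, concatenation acts as the union.
partitioned = chain_join([R1, R21, R3, R4]) + chain_join([R1, R22, R3, R4])
```

Sorting `full` and `partitioned` gives the same list of tuples, which is the property that lets each fragment be optimized on its own.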
Query Cost Function
• One possibility: optimizer’s cost model
  • Accurate cost estimation
  • Solution depends on low-level system details
  • Difficult to gain intuitions
• Our approach: query cost = number of intermediate results
  • Simple function that admits analysis
  • Sound connections to realistic cost models (Cluet and Moerkotte, ICDT’95)
  • Example: Cost(R3 ⋈ R2 ⋈ R4 ⋈ R1) = |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4|
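The cost function above can be sketched in a few lines. This is an illustration under stated assumptions, not the talk's implementation: each relation's frequency matrix is assumed to be a nested list where entry [m][l] counts the tuples with left join value m and right join value l, the first and last relations of the chain are a single row and a single column, and relations that are not adjacent in the chain combine as a cross product. The function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def join_size(freq, joined):
    """Result size for the chain positions in `joined`.

    Each contiguous run of positions contributes the sum of the entries of
    the product of its frequency matrices; separate runs multiply
    (cross product).
    """
    size, run = 1, None
    for i in range(len(freq)):
        if i in joined:
            run = freq[i] if run is None else matmul(run, freq[i])
        elif run is not None:
            size *= sum(map(sum, run))
            run = None
    if run is not None:
        size *= sum(map(sum, run))
    return size

def intermediate_cost(freq, order):
    """Sum of intermediate result sizes of a left-deep order (final result excluded)."""
    joined, cost = set(), 0
    for pos in order[:-1]:
        joined.add(pos)
        if len(joined) >= 2:
            cost += join_size(freq, joined)
    return cost

# Made-up frequency matrices for R1..R4 over a two-value join domain.
F = [
    [[2, 1]],            # R1: one row, only a right join attribute
    [[1, 0], [0, 3]],    # R2
    [[1, 1], [2, 0]],    # R3
    [[1], [4]],          # R4: one column, only a left join attribute
]

# Order R3, R2, R4, R1: |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4| = 8 + 11 = 19
cost_example = intermediate_cost(F, [2, 1, 3, 0])
```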
Partitioning Algorithm - Overview
• State space: partitioned join orders
• Partitioning algorithm:
  • Explore a set of states
  • Compute optimal partitioning for each state
  • Return global optimum
• Our approach: order joins, then partition
• Another possibility: partition, then order joins
Distributing Tuples
• Goal: Distribute tuples to minimize cost
• Optimal distribution depends on:
  • Frequency matrices of other relations
  • Position (m,l)
Optimal Split Theorem
• Distribute each value (m,l) independently
• Place (m,l) in partition that minimizes g(L,T,m,l)
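The per-cell placement rule can be sketched as follows. The slides do not define g, so this illustration assumes a precomputed table `cell_costs[p][m][l]` that stands in for g evaluated with the leading and trailing orders of partition p; the table, the function name, and the nested-list layout of the pivot's frequency matrix are all hypothetical.

```python
def optimal_split(pivot_freq, cell_costs):
    """Assign every cell (m, l) of the pivot's frequency matrix to one fragment.

    pivot_freq[m][l]    -- tuple count of the pivot at cell (m, l)
    cell_costs[p][m][l] -- assumed stand-in for g: cost of placing cell
                           (m, l) in partition p
    Returns one frequency matrix per partition; ties go to the lowest index.
    """
    n_parts = len(cell_costs)
    rows, cols = len(pivot_freq), len(pivot_freq[0])
    fragments = [[[0] * cols for _ in range(rows)] for _ in range(n_parts)]
    for m in range(rows):
        for l in range(cols):
            # Each cell is decided independently of all others.
            best = min(range(n_parts), key=lambda p: cell_costs[p][m][l])
            fragments[best][m][l] = pivot_freq[m][l]
    return fragments

# Made-up pivot matrix and cost tables for two partitions.
pivot = [[4, 0], [1, 2]]
costs = [
    [[1, 5], [3, 2]],   # partition 0
    [[2, 1], [1, 4]],   # partition 1
]
frag0, frag1 = optimal_split(pivot, costs)
```

Because each cell is placed on its own, the split takes time linear in the number of cells times the number of partitions.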
Search Algorithm
• Exhaustive search over [ Pivot, Leading orders, Trailing orders ] is impractical
• Search heuristics:
  • Tighter search space: [ Pivot, Optimal leading orders ]
  • Iterative Partitioning
  • Guided search using lower bounds on the cost of partitions
Encoding of State Space
• State: [ Pivot , Optimal leading orders ]
• Transition: insert relation in a leading order
[Figure: example transition that inserts a relation into a leading order; surviving labels: R5 R1 and R3 R4 R5]
Iterative Partitioning
• Key idea: (Partition, Optimize)+
  • Compute optimal split for leading/trailing orders
  • Optimize trailing orders for the current split
• Theorem: query cost can only decrease
• Idea extended to more detailed cost models
[Figure: pivot R2 split into fragments R21 and R22, each with its own leading and trailing join orders over R1, R3, R4, and R5]
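The (Partition, Optimize)+ loop can be sketched as a generic alternating procedure. This is only an outline under assumptions: `split_for`, `orders_for`, and `cost` are hypothetical callbacks standing in for the optimal-split computation, the trailing-order optimization, and the query cost function, none of which the slides spell out.

```python
def iterative_partitioning(orders, split_for, orders_for, cost, max_rounds=100):
    """Alternate between splitting the pivot and re-optimizing the join orders.

    split_for(orders)   -- assumed: best split for the given orders
    orders_for(split)   -- assumed: best orders for the given split
    cost(split, orders) -- assumed: query cost of the combination
    Stops as soon as a round fails to lower the cost.
    """
    split = split_for(orders)
    best = cost(split, orders)
    for _ in range(max_rounds):
        new_orders = orders_for(split)
        new_split = split_for(new_orders)
        new_cost = cost(new_split, new_orders)
        if new_cost >= best:
            break  # no further improvement
        split, orders, best = new_split, new_orders, new_cost
    return split, orders, best
```

Since a candidate is accepted only when it lowers the cost, the cost returned is never higher than the cost of the initial split, in line with the theorem stated above.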
Search Algorithm
• Initial states: single-relation leading orders
• Search process:
  • Compute partitions with IP
  • Open more states with transition function
• Transitions are guided by lower bound on cost function
• Same lower bound can also prune states
• Stopping criteria:
  • Search space is exhausted
  • Time budget is exhausted
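The search process above resembles a best-first search with pruning. The following sketch shows that general pattern, assuming hypothetical callbacks: `expand` plays the role of the transition function, `lower_bound` returns a valid lower bound on the cost reachable from a state, and `evaluate` stands in for computing partitions with IP and returning their cost. It is not the talk's exact procedure.

```python
import heapq
import time

def guided_search(initial_states, expand, lower_bound, evaluate, time_budget):
    """Best-first search ordered by lower bound, with pruning and a time budget."""
    deadline = time.monotonic() + time_budget
    # The counter breaks ties so that states themselves are never compared.
    frontier = [(lower_bound(s), i, s) for i, s in enumerate(initial_states)]
    heapq.heapify(frontier)
    counter = len(frontier)
    best_cost, best_state = float("inf"), None
    # Stop when the search space or the time budget is exhausted.
    while frontier and time.monotonic() < deadline:
        bound, _, state = heapq.heappop(frontier)
        if bound >= best_cost:
            continue  # prune: this state cannot beat the best one found
        cost = evaluate(state)
        if cost < best_cost:
            best_cost, best_state = cost, state
        for nxt in expand(state):
            b = lower_bound(nxt)
            if b < best_cost:
                heapq.heappush(frontier, (b, counter, nxt))
                counter += 1
    return best_state, best_cost
```

A tighter lower bound prunes more states; a trivial bound of zero degrades the search to exhaustive enumeration within the time budget.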
System Integration
[Figure: two pipelines side by side. Monolithic: Parser → Optimizer → Execution Engine. Partition-based: Parser → Optimizer → Execution Engine, with an added Partitioner component.]
Effect of Skew
[Figure: average reduction (%) vs. data skew (0.5 to 2) on synthetic data; series: ComputePartition, OptimalPartition]
Execution Time
[Figure: execution time (sec, axis from 0.1 to 1000) vs. maximum partition count (2 to 4) on synthetic data (skew = 1.5); series: Iterative, Optimal]
Varying Time Budget
[Figure: average reduction (%) vs. time (0 to 2 seconds) on synthetic data (skew = 1.5); series: Q_H, Q_L]
Results on Real-Life Data
[Figure: average reduction (%) for queries Q0 to Q3 on SwissProt; data labels on the chart, in extracted order: 7.64E+05, 3.19E+05, 7.20E+05, 1.08E+06]
Conclusions
• Monolithic optimization → missed opportunities
• Selectivity-Based Partitioning
  • Divide & Union approach
  • Multiple join orders per query
  • Join selectivity between relation fragments
• Partitioning Algorithm
  • Iterative Partitioning
• Experimental Results
  • Significant reduction of intermediate results
Future Work
• Extension to multiple pivots
• Partition-then-order optimization
• Efficient execution of partitioned plans
• Off-line workload-aware partitioning
Thank you!
Partitioning Model
• General case: Multi-relation partitioning
• Our approach: Single-relation partitioning
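[Figure: multi-relation partitioning of R1 ⋈ R2 ⋈ R3 ⋈ R4 using fragments R21, R22 of R2 and R31, R32 of R3; plans with join orders R1, R21, R4, R3; R31, R22, R1, R4; and R1, R22, R32, R4]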