6- join techniques and performance evaluation

High Performance Data Warehouse Design and Construction: Join Techniques

Upload: hamid-mahmood

Post on 04-Aug-2015


Page 1: 6- Join Techniques and Performance Evaluation

1

High Performance Data Warehouse Design and Construction

Join Techniques

Page 2: 6- Join Techniques and Performance Evaluation

2

Objectives

Illustrate fundamental differences between join techniques in DSS and OLTP environments.

Describe performance characteristics for major join techniques found in a DSS environment.

Describe resource requirements for major join techniques found in a DSS environment.

Page 3: 6- Join Techniques and Performance Evaluation

3

Topics

DSS vs. OLTP Queries.
Nested Loop Join.
Sort Merge Join.
Merge Join.
Hash Join.
Pointer-based Join.
Query Optimization.

Page 4: 6- Join Techniques and Performance Evaluation

4

DSS versus OLTP

Performance: OLTP: critical; seconds or sub-seconds. DSS: important but a secondary issue; minutes/hours.
Flexibility: OLTP: highly tuned for predefined applications and operations. DSS: "what if" analysis; changing data relationships.
Availability: OLTP: 24x7; MTTR in minutes. DSS: 16x6; MTTR sometimes hours.
Update frequency: OLTP: constant; real time; often transaction oriented. DSS: periodic; batch.
Query type: OLTP: simple; narrow, predefined set; usually primary key; often repeated. DSS: complex; ad hoc; aggregates; joins; many one-time-only queries; rarely primary key.
Index usage: OLTP: usually primary key index. DSS: secondary indexes where helpful.
Result size: OLTP: relatively small; usually <100 rows. DSS: varies; can be very large; answer sets often loaded into other tools for analysis.
Data source: OLTP: online transactions (humans). DSS: operational systems.
Referential integrity: OLTP: transaction oriented; handled in real time. DSS: handled in operational systems and checked by local programs or after loads complete.

Page 5: 6- Join Techniques and Performance Evaluation

5

Simplified OLTP Join

select customer.first_nm
      ,customer.last_nm
      ,account.balance_amt
      ,account.open_dt
      ,account.account_type_cd
      ,tx.tx_amt
      ,tx.tx_dt
      ,tx.tx_type_cd
      ,channel.channel_desc
from customer
    ,account
    ,tx
    ,channel
where customer.customer_id = '789451965'
  and customer.customer_id = account.customer_id
  and account.account_id = tx.account_id
  and account.account_type_cd = 'DDA'
  and tx.channel_id = channel.channel_id
  and tx.tx_dt > '31-DEC-2011';

Page 6: 6- Join Techniques and Performance Evaluation

6

Simplified DSS Join

create table dss_result as
select customer.customer_id
      ,customer.birth_dt
      ,customer.income_range_cd
      ,sum( case when channel.channel_type_cd = 'ATM' then 1 else 0 end ) atm_tx_qty
      ,sum( case when channel.channel_type_cd = 'ATM' then tx.tx_amt else 0.0 end ) atm_total_tx_amt
      ,sum( case when channel.channel_type_cd = 'BRA' then 1 else 0 end ) branch_tx_qty
      ,sum( case when channel.channel_type_cd = 'BRA' then tx.tx_amt else 0.0 end ) branch_total_tx_amt

Page 7: 6- Join Techniques and Performance Evaluation

7

Simplified DSS Join (cont’d)

      ,sum( case when channel.channel_type_cd not in ('ATM','BRA') then 1 else 0 end ) other_tx_qty
      ,sum( case when channel.channel_type_cd not in ('ATM','BRA') then tx.tx_amt else 0.0 end ) other_total_tx_amt
from customer
    ,account
    ,tx
    ,channel
where customer.customer_id = account.customer_id
  and account.account_id = tx.account_id
  and tx.channel_id = channel.channel_id
  and account.account_type_cd = 'DDA'
  and tx.tx_dt > '31-DEC-2011'
group by customer.customer_id
        ,customer.birth_dt
        ,customer.income_range_cd;

Page 8: 6- Join Techniques and Performance Evaluation

8

Key Points to Observe

Traditional OLTP Query: High selectivity based on specification of customer_id. Returns a relatively small number of rows.

Traditional DSS Query: Join all customers with transactions since start of year. Returns a large set of rows which will be further analyzed in subsequent queries. Aggregation of detailed data.

Page 9: 6- Join Techniques and Performance Evaluation

9

Most Common Join Techniques

Nested Loop Joins.
(Sort) Merge Joins.
Hash Joins.
Pointer-Based Joins.

Page 10: 6- Join Techniques and Performance Evaluation

10

Nested Loop Joins

Join table T1 to table T2 by taking each (qualifying) row of T1 and searching table T2 for matches.

T1 is referred to as the Outer table. T2 is referred to as the Inner table. Simplest form of join algorithm. No sorting required.

Page 11: 6- Join Techniques and Performance Evaluation

11

Nested Loop Joins

Inner table is scanned once for each qualifying row in the outer table.

Indices are (almost always) essential for high performance with nested loop joins by narrowing scan of inner table.

Alternatively, the RDBMS can use block nested loop joins to reduce the number of scans against the inner table.
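As a rough illustration, here is a minimal Python sketch of the naive and index-assisted variants described above; the table contents and column names are invented for the example, and a Python dict stands in for a real index structure.

```python
# Illustrative sketch only; not any specific RDBMS's implementation.

def nested_loop_join(outer, inner, key_outer, key_inner):
    """Naive nested loop join: scan the inner table once per outer row."""
    result = []
    for o in outer:                      # each qualifying outer row ...
        for i in inner:                  # ... triggers a full inner scan
            if o[key_outer] == i[key_inner]:
                result.append({**o, **i})
    return result

def indexed_nested_loop_join(outer, inner, key_outer, key_inner):
    """Index nested loop join: a lookup structure on the inner table's
    join key narrows each probe to the matching rows only."""
    index = {}                           # stands in for a B-tree index
    for i in inner:
        index.setdefault(i[key_inner], []).append(i)
    result = []
    for o in outer:
        for i in index.get(o[key_outer], []):
            result.append({**o, **i})
    return result

# Hypothetical rows: stores filtered on square footage, joined to sales.
stores = [{"store_id": 126, "sq_ft": 120000}, {"store_id": 50, "sq_ft": 95000}]
sales = [{"store_id": 126, "amt": 10.0}, {"store_id": 126, "amt": 5.0},
         {"store_id": 50, "amt": 7.0}]
big_stores = [s for s in stores if s["sq_ft"] > 100000]
joined = indexed_nested_loop_join(big_stores, sales, "store_id", "store_id")
```

Both variants return the same rows; the indexed form simply avoids scanning non-matching inner rows.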

Page 12: 6- Join Techniques and Performance Evaluation

12

Nested Loop Joins

“Sum dollars sold for stores > 100,000 sq. ft.?”

Each qualifying row in Store table results in an interrogation of Sales Detail table for matching rows.

[Diagram: Store and Sales Detail tables side by side; each qualifying Store row (e.g. store ids 126, 50, 12) triggers a search of the Sales Detail table.]

Page 13: 6- Join Techniques and Performance Evaluation

13

Naive nested loop join with no indices requires (R*S) I/O operations even with (simple) blocking.

Nested loop join with index on inner table may require fewer I/Os, depending on index selectivity.

Index selectivity, for these purposes, should be measured in terms of blocks, not rows.

Access to inner table using index will usually involve random I/O operations (as opposed to sequential I/O).

Note: R and S are the number of blocks in the outer and inner tables, respectively.

Nested Loop Joins

Page 14: 6- Join Techniques and Performance Evaluation

14

Nested Loop Joins

OLTP queries will almost always perform best with a nested loop join using indices:

1. Use the primary key index on customer_id to quickly fetch customer '789451965’ from the customer table.

2. Use the index on the customer_id foreign key in the account table to quickly fetch all accounts with customer_id value '789451965' (highly selective access path).

3. For each account fetched, match account_id against the foreign key index on account_id in the tx table to fetch all transactions for the desired accounts (filtering transactions by date after match).

4. For each transaction fetched, match channel_id from tx record against channel_id (primary key) index for the channel table to fetch the (unique) channel for the transaction.

Page 15: 6- Join Techniques and Performance Evaluation

15

Nested Loop Joins

Nested loop joins may be the fastest join technique when the outer table has a small number of qualifying rows and the inner table is very large.

Want number of I/Os to index and data blocks for inner table to be small compared to total number of blocks in inner table.

A break-even analysis can be approximated by comparing the number of qualifying rows in the outer table, times the sum of the indexing cost plus the average number of inner table rows retrieved per qualifying outer table row, against the total number of blocks in the inner table.

Page 16: 6- Join Techniques and Performance Evaluation

16

Nested Loop Joins

Nested loop join performs well relative to other techniques when:

(# Qualifying Rows in Outer Table) x [ (Index Depth* on Inner Table) + (Inner Table Rows Retrieved per Outer Table Qualifying Row) ]

is less than

(# Blocks in Inner Table)

Note: Break even point will be even more optimistic if we assume that substantial portions of the index on the inner table can be cached in memory (first term in square brackets goes to zero) or if a large sort can be avoided.

*Index depth = # of I/Os to get to the desired RID list.
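Plugging hypothetical numbers into the break-even inequality makes the OLTP/DSS contrast concrete; none of these figures come from a real system.

```python
# Break-even rule of thumb from the slide; all statistics are invented.

def nested_loop_favorable(qual_outer_rows, index_depth, inner_rows_per_match,
                          inner_table_blocks):
    """True when the break-even inequality favors a nested loop join."""
    cost = qual_outer_rows * (index_depth + inner_rows_per_match)
    return cost < inner_table_blocks

# OLTP-like case: 1 qualifying customer, a 3-level index, ~5 matching
# inner rows per outer row, against an inner table of 1,000,000 blocks.
oltp = nested_loop_favorable(1, 3, 5, 1_000_000)

# DSS-like case: 2,000,000 qualifying outer rows overwhelm the same table.
dss = nested_loop_favorable(2_000_000, 3, 5, 1_000_000)
```

The single-customer query is clearly on the favorable side; the all-customers scan is not.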

Page 17: 6- Join Techniques and Performance Evaluation

17

Nested Loop Joins

For traditional DSS queries, high performance with nested loop joins often depends on the ability to access data through a clustering index.

Traditional DSS queries are not typically provided with a selection criteria using a primary or foreign key value with extreme selectivity as is found in OLTP queries.

Traditional DSS queries often join tables containing (hundreds of) millions of rows to tables with tens of millions (or billions) of rows in one-to-many relationships.

Access through a clustering index implies that the data file is physically ordered on the index value, so locality of reference is guaranteed to be very high when matching on the join predicate.

Page 18: 6- Join Techniques and Performance Evaluation

18

Sort Merge Join Concept

[Diagram: T1 (Customer, filtered on age < 30) and T2 (Account, filtered on acct_type = brokerage) are each sorted, then merged on matching customer_id to produce result R.]

Q: Which customers under 30 years of age have brokerage accounts?

Page 19: 6- Join Techniques and Performance Evaluation

19

Sort Merge Joins

VLDB sorting requires divide-and-conquer algorithms that combine main-memory sorting with data set management for merging runs to/from disk.

Create initial runs with in-memory sorts and then merge runs into increasingly longer (and fewer) runs until a single sorted result set is produced.

Need substantial disk space to manage execution of the large sort.

Page 20: 6- Join Techniques and Performance Evaluation

20

Sort Merge Joins

Expected number of merge passes for sorting a large file is approximately:

log_m( R/K ), where m is the merge fan-in (m = 2 for simple two-way merging)

Each pass requires R reads and R writes, as well as CPU for merging sorted runs.

Note: K is the size (in blocks) of each sorted run produced by the in-memory sort (constrained by the availability of main memory); R is the total number of blocks in the data set to be sorted, after unneeded columns are eliminated.

[Diagram: the R input blocks are first sorted in memory into runs of size K; external merging by divide and conquer then combines runs of size K into runs of size 2K, 2K into 4K, and so on, halving the number of runs each pass until a single sorted run remains.]
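The pass-count estimate can be checked with a quick calculation, assuming the two-way merge pattern above (run length doubles each pass); the block sizes are invented.

```python
import math

def merge_passes(total_blocks, run_blocks):
    """Merge passes needed to turn initial runs of `run_blocks` blocks
    into one sorted run covering all `total_blocks` blocks (two-way)."""
    return math.ceil(math.log2(total_blocks / run_blocks))

def sort_io_cost(total_blocks, run_blocks):
    """Each pass reads and writes every block once: 2*R I/Os per pass."""
    return 2 * total_blocks * merge_passes(total_blocks, run_blocks)

# Example: R = 1,000,000 blocks, initial in-memory runs of K = 1,000 blocks.
passes = merge_passes(1_000_000, 1_000)
```

With a larger merge fan-in m (more runs merged per pass), the base of the logarithm rises and the pass count drops accordingly.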

Page 21: 6- Join Techniques and Performance Evaluation

21

Sort Merge Joins

Estimate of total I/O cost for sort merge join:

O( R * log(R/K) ) + O( S * log(S/K) ) + O( R+S )

Sort T1 Sort T2 Merge

Note that all reads/writes can make use of sequential I/O.

Page 22: 6- Join Techniques and Performance Evaluation

22

Sort Merge Joins

Make use of “in-place” sorts whenever possible to avoid additional storage overhead.

Amount of main memory available for sorting and merging can have a significant impact on performance.

Page 23: 6- Join Techniques and Performance Evaluation

23

Merge Joins (sans Sort)

Most expensive operations in a sort merge join (by far) are the sort steps prior to merging.

Eliminate this overhead whenever the sorted result from a previous join step can be “re-used”.

May be able to avoid sort step if base tables are “pre-sorted” or “pre-hashed” on join key.

[Diagram: the sort merge plan with the Sort steps bypassed; T1 and T2 are filtered and merged directly on matching IDs to produce R.]

Page 24: 6- Join Techniques and Performance Evaluation

24

Merge Join (sans Sort)

Estimate of total I/O cost for merge join: O( R+S )

Number of I/Os are linearly proportional to combined size of the two tables!

Best performance yet!
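The merge phase alone can be sketched as a single coordinated scan of two key-sorted inputs; the customer/account rows below are invented, and the sketch handles one-to-many matches.

```python
# Sketch of a merge join over pre-sorted inputs: O(R + S) I/O.

def merge_join(left, right, key):
    """Join two lists of rows, each already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit all right-side rows in the current matching key group
            j2 = j
            while j2 < len(right) and right[j2][key] == lk:
                out.append({**left[i], **right[j2]})
                j2 += 1
            i += 1   # advance left; the right group is rescanned per left row
    return out

customers = [{"cid": 1, "age": 25}, {"cid": 2, "age": 41}]           # sorted on cid
accounts = [{"cid": 1, "type": "brokerage"}, {"cid": 1, "type": "dda"},
            {"cid": 2, "type": "brokerage"}]                          # sorted on cid
rows = merge_join(customers, accounts, "cid")
```

Each input is read once, front to back, which is why all I/O can be sequential.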

Page 25: 6- Join Techniques and Performance Evaluation

25

Hash Joins

Join tables T1 and T2 by building an in-memory hash table from the smaller of the tables (after considering qualifying rows) and then probing the hash table for matches from the other table.

Table to be used in hash table construction is referred to as build input.

Table to be used to search into hash table is referred to as probe input.
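The build/probe split can be sketched in a few lines of Python, assuming the build input fits in memory; the channel/transaction rows are invented for the example.

```python
# Sketch of an in-memory hash join; illustrative only.

def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on the smaller input, then stream the larger
    (probe) input past it: one pass over each table, no sorting."""
    table = {}
    for b in build:                          # build phase
        table.setdefault(b[build_key], []).append(b)
    out = []
    for p in probe:                          # probe phase
        for b in table.get(p[probe_key], []):
            out.append({**b, **p})
    return out

channels = [{"channel_id": 1, "desc": "ATM"}, {"channel_id": 2, "desc": "BRA"}]
txs = [{"channel_id": 1, "amt": 20.0}, {"channel_id": 2, "amt": 35.0},
       {"channel_id": 1, "amt": 5.0}]
joined = hash_join(channels, txs, "channel_id", "channel_id")
```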

Page 26: 6- Join Techniques and Performance Evaluation

26

Hash Joins

When the build input can be hashed entirely into main memory, performance can be quite attractive using the hash join technique.

No sorting required. No need to build any temporary files, even with a very large probe input.

Page 27: 6- Join Techniques and Performance Evaluation

27

Hash Joins

When the build input fits entirely into main memory, the number of I/Os is linearly proportional to the combined size of both tables:

Estimate of total I/O cost for hash join: O( R+S )

Beware: Memory availability can be a major performance issue when build input is of significant size.

Page 28: 6- Join Techniques and Performance Evaluation

28

Hash Joins

When the build input has a large number of rows that must be placed into the hash table, overflow to disk must be considered.

Recursive partitioning must be used so that build input partitions can fit into memory.

Partitioning needs to be performed using same splitting algorithm for both build and probe inputs.

Binary hybrid hash operations are used to perform the partitioning.
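One level of this partitioning can be sketched as follows, assuming both inputs are split with the same hash function so that matching keys land in the same partition pair; a real system would spill partitions to disk and recurse when a partition is still too large, which this toy version omits.

```python
# Sketch of one level of hash partitioning for an overflow hash join.

def partition(rows, key, n_parts):
    """Split rows into n_parts buckets using the same hash function
    that both build and probe inputs must share."""
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        parts[hash(r[key]) % n_parts].append(r)
    return parts

def partitioned_hash_join(build, probe, key, n_parts=4):
    out = []
    for bp, pp in zip(partition(build, key, n_parts),
                      partition(probe, key, n_parts)):
        table = {}
        for b in bp:                       # per-partition build phase
            table.setdefault(b[key], []).append(b)
        for p in pp:                       # per-partition probe phase
            for b in table.get(p[key], []):
                out.append({**b, **p})
    return out

build = [{"k": i, "b": i * 10} for i in range(8)]
probe = [{"k": i % 8, "p": i} for i in range(20)]
rows = partitioned_hash_join(build, probe, "k")
```

Because partitioning is keyed identically on both sides, each probe partition only ever needs its matching build partition in memory at once.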

Page 29: 6- Join Techniques and Performance Evaluation

29

Hash Joins

Hash-based binary matching operations work particularly well when the two input sizes are quite different, because the recursion depth is proportional to the size of the smaller of the two inputs.

Estimate of total I/O cost for hash join:

O( R * log(R/K) ) + O( S * log(R/K) ) + O( R+S )

Partition T1 Partition T2 Hash Match

Assume R < S.

Critical to avoid skew in hashing function results for good performance!

Page 30: 6- Join Techniques and Performance Evaluation

30

Pointer-Based Joins

Join tables T1 and T2 by tracing explicit links between related data items.

Found primarily in object databases in the form of object identifiers (OIDs).

The links pre-compute the join path between data items.

Cost of storing and maintaining links can be very expensive.
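A toy sketch of the idea, assuming an object-style model where each row holds a direct reference to its parent; the classes and names here are invented for illustration.

```python
# Pointer-based join sketch: the "join" is just reference traversal,
# because the link pre-computes the join path. Illustrative only.

class Customer:
    def __init__(self, cid, name):
        self.cid, self.name = cid, name

class Account:
    def __init__(self, aid, owner):
        self.aid = aid
        self.owner = owner   # direct link (an OID analogue): no key lookup

alice = Customer(1, "Alice")
accounts = [Account(10, alice), Account(11, alice)]

# Joining account to customer requires no search at all:
names = [a.owner.name for a in accounts]
```

The lookup cost disappears at query time, but every insert, update, and delete must now keep the links consistent, which is where the maintenance expense comes from.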

Page 31: 6- Join Techniques and Performance Evaluation

31

Query Optimization

Quality of query optimizer is critical in VLDB environments targeted to data warehouse applications.

Rule-Based Optimizer

Page 32: 6- Join Techniques and Performance Evaluation

32

Query Optimization

A cost-based optimizer is required to make good decisions regarding join plans for sophisticated DSS queries.

Need to collect statistics on tables and individual columns within tables.

Optimizer choices should be independent of table ordering in from clause, ordering or structure of where clause predicates, and so on - but reality is that many optimizers are very sensitive to these nuances.

Optimizer should understand available resources within machine (e.g., CPU bound versus I/O bound machine configuration may influence optimization choices).

Note that the quality of optimizer software is a significant differentiator among database products currently on the market.
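In miniature, cost-based selection just means estimating each technique's cost from table statistics (using the formulas from the earlier slides) and picking the cheapest; every constant and statistic below is invented.

```python
import math

# Toy cost-based join selector; not a real optimizer.

def nlj_cost(outer_rows, index_depth, inner_rows_per_match):
    return outer_rows * (index_depth + inner_rows_per_match)

def sort_merge_cost(r_blocks, s_blocks, run_blocks):
    return (r_blocks * math.ceil(math.log2(r_blocks / run_blocks))
            + s_blocks * math.ceil(math.log2(s_blocks / run_blocks))
            + r_blocks + s_blocks)

def hash_cost(r_blocks, s_blocks):
    return r_blocks + s_blocks   # assumes the build input fits in memory

def choose_join(stats):
    costs = {
        "nested_loop": nlj_cost(stats["outer_rows"], stats["index_depth"],
                                stats["rows_per_match"]),
        "sort_merge": sort_merge_cost(stats["r_blocks"], stats["s_blocks"],
                                      stats["sort_run_blocks"]),
        "hash": hash_cost(stats["r_blocks"], stats["s_blocks"]),
    }
    return min(costs, key=costs.get), costs

# OLTP-like statistics: one qualifying outer row.
plan, _ = choose_join({"outer_rows": 1, "index_depth": 3, "rows_per_match": 5,
                       "r_blocks": 10_000, "s_blocks": 1_000_000,
                       "sort_run_blocks": 100})

# DSS-like statistics: millions of qualifying outer rows.
plan_dss, _ = choose_join({"outer_rows": 2_000_000, "index_depth": 3,
                           "rows_per_match": 5, "r_blocks": 10_000,
                           "s_blocks": 1_000_000, "sort_run_blocks": 100})
```

The same tables yield different winners as the selectivity statistics change, which is exactly why the collected statistics matter.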

Page 33: 6- Join Techniques and Performance Evaluation

33

Bottom Line

Join technique is a dominant factor in determining the performance characteristics in a VLDB data warehouse environment.

Join techniques required for a decision support environment are significantly more sophisticated than what is required in an OLTP environment.

No one join technique is appropriate in all cases - must rely on cost-based optimizer to determine optimal plan.