sql optimization tips


Upload: saikat-banerjee

Post on 03-Dec-2015


-------------------------------------------- By Yu, Fang ([email protected])

SQL Optimization Tips

Warning: The contents of this document come mainly from the book <<Database Solution>> (http://baike.baidu.com/view/4620738.htm).

Contents

1. SQL Execution Explain Plan
1.1. SQL and Optimizer
1.1.1. SQL Statement Transformation
1.2. Explain Plan
1.2.1. Scans
1.2.2. Table Join
1.2.3. Other operations
2. Create Efficient Indexes
2.1. Comparison between “Index Merge” and “Composite Index”
2.2. The characteristics of the “Composite Index”
3. Partial Range Scan
3.1. What’s Partial Range Scan
3.2. The partial range scan usage rule
3.2.1. The requirements for Partial Range Scan
3.2.2. Partial Range Scan in different optimizer mode
3.3. The principle to improve the execution speed of Partial Range Scan
3.3.1. The principle of Partial Range Scan
3.4. The ways to instruct the optimizer to choose Partial Range Scan
3.4.1. Replace SORT operation by (index) access path
3.4.2. Use index scan only for partial range scan
3.4.3. MAX and MIN functions
3.4.4. “Filter” partial range scan
3.4.5. Take advantage of “ROWNUM” for partial range scan
3.4.6. Take advantage of “Inline View/Scalar Sub Query” for partial range scan
3.4.7. Take advantage of “Function” for partial range scan
4. Table Joins
4.1. Join VS Loop Query
4.2. The impact of Join Condition on Table Join
4.2.1. Both sides of the Join Condition are valid
4.2.2. One side of the join condition is invalid
4.2.3. Neither side of the join condition is valid
4.3. Different kinds of table join
4.3.1. Nested Loop Join
4.3.2. Sort Merge Join
4.3.3. Nested Loop Join V.S. Sort Merge Join
4.3.4. Hash Join
4.3.5. Semi Join
4.3.6. Star Join
4.3.7. Star Transforming Join
4.3.8. Bitmap Join Index

1. SQL Execution Explain Plan

1.1. SQL and Optimizer

1.1.1. SQL Statement Transformation

The SQL optimizer consists of the “Query Transformer”, the “Cost Estimator” and the “Explain Plan Generator”. Before generating the execution plan, the optimizer rewrites most SQL statements to some degree in order to get the best performance.

Here are some examples of query transformation:

(1) sales_qty > 1200 / 12

(2) sales_qty > 100

(3) sales_qty * 12 > 1200

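Predicate (1) will be transformed into (2), but predicate (3) will not. The optimizer evaluates constant expressions, but it does not “move” terms across the comparison operator (>). As a result, a normal index on sales_qty can serve (1) and (2) but not (3).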

Another example, the IN list will be transformed using “OR” operator.

(1) job IN ('MANAGER', 'CLERK')

(2) job = 'CLERK' OR job = 'MANAGER'

“BETWEEN” will be transformed using “>=” and “<=” operators.

(1) sales_qty BETWEEN 100 AND 200

(2) sales_qty >= 100 AND sales_qty <=200

“ANY” operator will be transformed using “OR” operators in some cases…

(1) sales_qty > ANY (100, 200)

(2) sales_qty > 100 OR sales_qty > 200
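The predicate pairs above can be checked for equivalent results with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Oracle, the table `sales` and its rows are hypothetical, and it compares result sets rather than optimizer behaviour. The ANY pair is left out because SQLite does not accept ANY with a value list.

```python
import sqlite3

# Hypothetical table and data, used only to compare the predicate pairs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, job TEXT, sales_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO sales (job, sales_qty) VALUES (?, ?)",
    [("CLERK", 50), ("MANAGER", 100), ("ANALYST", 150), ("CLERK", 200), (None, 250)],
)


def ids(predicate):
    """Return the sorted ids of the rows that satisfy a WHERE predicate."""
    rows = conn.execute(f"SELECT id FROM sales WHERE {predicate} ORDER BY id")
    return [row[0] for row in rows]


# Each pair is (original predicate, transformed predicate) from the text.
pairs = [
    ("sales_qty > 1200 / 12", "sales_qty > 100"),
    ("job IN ('MANAGER', 'CLERK')", "job = 'CLERK' OR job = 'MANAGER'"),
    ("sales_qty BETWEEN 100 AND 200", "sales_qty >= 100 AND sales_qty <= 200"),
]
results = [(ids(a), ids(b)) for a, b in pairs]

for (a, b), (rows_a, rows_b) in zip(pairs, results):
    print(f"{a}  <=>  {b}: same rows = {rows_a == rows_b}")
```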

1.1.1.1. Transitivity principle (only available in CBO)

Comparison with constant

If the same table column is used in two different predicates (join conditions), the optimizer can generate a new predicate (join condition) and build the best explain plan based on it.

WHERE column1 comparison_operator constant AND column1 = column2

The “comparison_operator” must be one of “=”, “<>”, “>”, “<”, “>=”, “<=”. The “constant” can be a constant expression (operation), a literal, a bind variable or a SQL function.

Under such circumstances, the optimizer will generate a new predicate like the one below:

column2 comparison_operator constant

However, if the SQL statement is like this:

WHERE column1 comparison_operator column3
AND column1 = column2

then the optimizer cannot deduce the following predicate:

column2 comparison_operator column3
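The constant case of the transitivity rule can be illustrated as follows. This is a minimal sketch with sqlite3 standing in for Oracle; the table `t` and its rows are made up for the example. It only shows that adding the derived predicate does not change the result, which is why the optimizer is free to add it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, column1 INTEGER, column2 INTEGER)"
)
conn.executemany(
    "INSERT INTO t (column1, column2) VALUES (?, ?)",
    [(5, 5), (10, 10), (10, 20), (30, 30), (None, 40), (50, None)],
)


def ids(predicate):
    """Return the sorted ids of the rows that satisfy a WHERE predicate."""
    rows = conn.execute(f"SELECT id FROM t WHERE {predicate} ORDER BY id")
    return [row[0] for row in rows]


original = "column1 > 8 AND column1 = column2"
# The derived predicate "column2 > 8" follows from the two predicates above.
with_derived = original + " AND column2 > 8"

rows_original = ids(original)
rows_with_derived = ids(with_derived)
print(rows_original, rows_with_derived)
```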

Transform “OR” to “UNION ALL”

Please note that the transformation only happens when the optimizer expects better performance from it.

If every branch of the transformed SQL statement can use an index, the transformation will be performed. In this case the explain plan shows “IN-LIST ITERATOR” or “CONCATENATION”.

If some query condition cannot use an index, or the “OR” condition is used for data check (filter) only, the optimizer will not transform the query. However, we can use the hint “USE_CONCAT” to instruct the optimizer to transform the query if we are sure the new query performs better.

For example,

SELECT sal

FROM emp

WHERE job = 'CLERK'

OR deptno = 10;

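If there are indexes on the columns job and deptno, the optimizer will transform the SQL statement above into the following one:

SELECT sal FROM emp WHERE job = 'CLERK'
UNION ALL
SELECT sal FROM emp WHERE deptno = 10 AND job <> 'CLERK';

The extra condition job <> 'CLERK' keeps the second branch from returning rows that the first branch has already returned. It assumes job is NOT NULL; for a nullable column Oracle uses LNNVL(job = 'CLERK') instead, so that rows with a NULL job are not lost.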
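The OR to UNION ALL rewrite can be tried out on a few rows. The sketch below uses sqlite3 in place of Oracle, and the `emp` rows are hypothetical. It compares the OR form with two UNION ALL forms: one with `job <> 'CLERK'` in the second branch, and one that also keeps rows whose job is NULL (written here as `job <> 'CLERK' OR job IS NULL`, since LNNVL is Oracle-specific).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (empno INTEGER PRIMARY KEY, job TEXT, deptno INTEGER, sal INTEGER)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        (1, "CLERK", 10, 1300),
        (2, "CLERK", 20, 1100),
        (3, "MANAGER", 10, 2450),
        (4, "ANALYST", 20, 3000),
        (5, None, 10, 5000),  # job is NULL on purpose
    ],
)


def empnos(sql):
    """Run a query that selects empno and return the values sorted."""
    return sorted(row[0] for row in conn.execute(sql))


or_form = empnos("SELECT empno FROM emp WHERE job = 'CLERK' OR deptno = 10")

# Second branch excludes rows already returned by the first branch.
plain_union = empnos(
    "SELECT empno FROM emp WHERE job = 'CLERK' "
    "UNION ALL "
    "SELECT empno FROM emp WHERE deptno = 10 AND job <> 'CLERK'"
)

# Same rewrite, but rows with a NULL job survive the exclusion test.
null_safe_union = empnos(
    "SELECT empno FROM emp WHERE job = 'CLERK' "
    "UNION ALL "
    "SELECT empno FROM emp WHERE deptno = 10 AND (job <> 'CLERK' OR job IS NULL)"
)

print(or_form, plain_union, null_safe_union)
```

On this data the plain form loses employee 5, whose job is NULL, while the NULL-safe form matches the OR query.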

Transform sub query to table join

The optimizer will not transform every sub query into a table join. It does so only when the transformation is feasible and improves performance. Otherwise, it generates the best explain plan for the main query and the sub query separately.

For example,

SELECT *
FROM emp
WHERE deptno IN (SELECT deptno FROM dept WHERE loc = 'NEW YORK');

If there is a unique index on the column deptno in table dept, each row of the main query matches at most one row of the sub query, so the SQL above can be transformed into the table join below:

SELECT emp.*
FROM emp, dept
WHERE emp.deptno = dept.deptno AND dept.loc = 'NEW YORK';

However, consider the query below:

SELECT *
FROM emp
WHERE sal > (SELECT AVG(sal) FROM emp WHERE deptno = 20);

The optimizer cannot transform it into a table join. Instead, it generates the best plan for the main query and the sub query. If the sub query can use an index and its result can be used to probe the data of the main query (the sub query is the data provider), the optimizer generates a plan that executes the sub query first. If the sub query acts as a data filter, the query is executed as a hash join (semi).
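Why the unique index matters for the sub query to join rewrite can be seen on a tiny data set. The sketch below uses sqlite3 instead of Oracle; `dept_log` is a hypothetical table added only to show what happens when deptno is not unique on the sub query side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dept (deptno INTEGER PRIMARY KEY, loc TEXT);  -- deptno is unique
    CREATE TABLE dept_log (deptno INTEGER, loc TEXT);          -- deptno is NOT unique
    CREATE TABLE emp (empno INTEGER PRIMARY KEY, deptno INTEGER);
    INSERT INTO dept VALUES (10, 'NEW YORK'), (20, 'DALLAS');
    INSERT INTO dept_log VALUES (10, 'NEW YORK'), (10, 'NEW YORK'), (20, 'DALLAS');
    INSERT INTO emp VALUES (1, 10), (2, 20), (3, 10);
    """
)


def empnos(sql):
    """Run a query that selects empno and return the values sorted (duplicates kept)."""
    return sorted(row[0] for row in conn.execute(sql))


def in_form(table):
    return empnos(
        f"SELECT empno FROM emp WHERE deptno IN "
        f"(SELECT deptno FROM {table} WHERE loc = 'NEW YORK')"
    )


def join_form(table):
    return empnos(
        f"SELECT emp.empno FROM emp, {table} d "
        f"WHERE emp.deptno = d.deptno AND d.loc = 'NEW YORK'"
    )


unique_in, unique_join = in_form("dept"), join_form("dept")
dup_in, dup_join = in_form("dept_log"), join_form("dept_log")
print(unique_in, unique_join)  # same rows: the rewrite is safe
print(dup_in, dup_join)        # the plain join repeats employees
```

With the non-unique table the plain join returns each employee twice, which is the case where the sub query side first has to be reduced to unique values.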

1.1.1.2. View Merging

In order to generate the best execution plan for the view (inline view), the optimizer may need to

transform the SQL query. There are two ways for the transformation:

View Merging: merge the view query and the query condition (predicates)

Predicate Pushing: If the view merge cannot be performed, push the predicates into the view query

Please note that the two methods work in opposite “directions”. The former rewrites the outer query using the view query (inner query), while the latter pushes the query conditions of the outer query into the view query.

If the view query includes any of the following operations, “View Merging” is not applicable:

SET operations, like UNION, UNION ALL, INTERSECT, MINUS, etc.

CONNECT BY

ROWNUM

Aggregation function in SELECT-list, like SUM, AVG, MAX, MIN, etc.

GROUP BY (the hint MERGE can instruct the optimizer to choose view merging)

DISTINCT in SELECT-list (the hint MERGE can instruct the optimizer to choose view merging)

View merging is preferable when the outer query has conditions that narrow the query range, so that merging them into the view query reduces the data volume that needs to be processed. Otherwise view merging is not necessary.

For example,

CREATE VIEW emp_10 (e_no, e_name, job, manager, hire_date, salary)
AS SELECT empno, ename, job, mgr, hiredate, sal
FROM emp WHERE deptno = 10;

SELECT e_no, e_name, salary, hire_date

FROM emp_10 WHERE salary > 10000000;

Can be transformed using view merging as follows,

SELECT empno, ename, sal, hiredate FROM emp

WHERE deptno=10 AND sal > 10000000;

Another example,

CREATE VIEW emp_group_by_deptno

AS SELECT deptno, AVG(sal) avg_sal, min(sal) min_sal, max(sal) max_sal

FROM emp

GROUP BY deptno;

SELECT *

FROM emp_group_by_deptno WHERE deptno=10;

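Because the view contains GROUP BY, the predicate deptno=10 is applied inside the view query here, and the statement can be transformed as follows…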

SELECT deptno, AVG(sal) avg_sal, min(sal) min_sal, max(sal) max_sal

FROM emp WHERE deptno=10

GROUP BY deptno;
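Both view examples can be checked for equivalent results. The sketch below is an illustration with sqlite3 rather than Oracle: the `emp` rows and the salary threshold of 2000 are made up, and the views use column aliases in the SELECT list instead of a column list after the view name. It verifies that querying through each view returns the same rows as the transformed statement; it says nothing about which plan Oracle would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE emp (empno INTEGER PRIMARY KEY, ename TEXT, hiredate TEXT,
                      sal INTEGER, deptno INTEGER);
    INSERT INTO emp VALUES
        (1, 'CLARK',  '1981-06-09', 2450, 10),
        (2, 'KING',   '1981-11-17', 5000, 10),
        (3, 'MILLER', '1982-01-23', 1300, 10),
        (4, 'SCOTT',  '1987-04-19', 3000, 20);
    CREATE VIEW emp_10 AS
        SELECT empno AS e_no, ename AS e_name, hiredate AS hire_date, sal AS salary
        FROM emp WHERE deptno = 10;
    CREATE VIEW emp_group_by_deptno AS
        SELECT deptno, AVG(sal) AS avg_sal, MIN(sal) AS min_sal, MAX(sal) AS max_sal
        FROM emp GROUP BY deptno;
    """
)


def fetch(sql):
    """Return all rows of a query as a sorted list of tuples."""
    return sorted(conn.execute(sql).fetchall())


# Simple view: query through the view vs. the merged statement.
via_view = fetch("SELECT e_no, e_name, salary, hire_date FROM emp_10 WHERE salary > 2000")
merged = fetch("SELECT empno, ename, sal, hiredate FROM emp WHERE deptno = 10 AND sal > 2000")

# GROUP BY view: predicate on the view vs. predicate inside the view query.
grouped_view = fetch("SELECT * FROM emp_group_by_deptno WHERE deptno = 10")
pushed = fetch(
    "SELECT deptno, AVG(sal) AS avg_sal, MIN(sal) AS min_sal, MAX(sal) AS max_sal "
    "FROM emp WHERE deptno = 10 GROUP BY deptno"
)

print(via_view == merged, grouped_view == pushed)
```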

1.2. Explain Plan

1.2.1. Scans

1.2.1.1. Full Table Scans

A full table scan reads all the data blocks under the HWM (high water mark), including the empty ones. To reduce the number of physical I/O operations, the parameter DB_FILE_MULTIBLOCK_READ_COUNT can be set to a higher value so that more blocks are read per I/O.

1.2.1.2. ROWID Scans

A ROWID is composed of the data object id, the data file id, the data block id and the record slot in the data block. Access by ROWID is the fastest way to retrieve a single record from a table.

1.2.1.3. Index Scans

Index Unique Scan

Index Range Scan

Index Range Scans Descending

Index range scan descending is similar to index range scan, except that it reads the index entries in descending order instead of ascending order. The optimizer chooses this kind of index scan in two circumstances: the query uses “ORDER BY … DESC”, or the query uses the “INDEX_DESC” hint.

Index Skip Scan

Index Skip Scan was introduced to address the problem that a composite index cannot be used when its leading column does not appear in the predicates.

Index Full Scan

Index Full Scan will be used when the following two conditions are met,

All the columns in the SELECT-list are included in the index.

There is at least one NOT NULL column in the index

Index Fast Full Scan

The difference between Index Fast Full Scan and Index Full Scan is that the Index Fast Full Scan will

read multiple index blocks rather than one block in each I/O operation.

1.2.1.4. B-Tree Cluster Access

1.2.1.5. Hash Cluster Access

1.2.1.6. Sample Table Access

Sample table access is only available with Full Table Scans and Index Fast Full Scans. The basic syntax is as follows:

SELECT …
FROM table_name SAMPLE [BLOCK] (sample_percent)
WHERE …
GROUP BY … HAVING …
ORDER BY …

1.2.2. Table Join

Table joins will be detailed in Section 4.

1.2.2.1. Nested Loop Join

The most distinguishing characteristic of the Nested Loop Join is that the outer table (driving table) determines the data volume that needs to be processed. The Nested Loop Join performs well when the driving data volume is small and there are proper indexes on the join columns. Its main disadvantage is that it may cause too many random table accesses.

1.2.2.2. Sort Merge Join

Compared with the Nested Loop Join, the SORT MERGE JOIN does not cause many random table accesses, and it has no “driving table”.

If most of the join conditions use ‘LIKE’, ‘BETWEEN’, ‘>’, ‘>=’, ‘<’ or ‘<=’ instead of ‘=’, the SORT MERGE JOIN is better than the Nested Loop Join.

1.2.2.3. Hash Join

A Hash Join uses a hash function on the join columns to match rows. It can only be used when the join operator is “=”.

1.2.2.4. Semi Join

A semi join happens when there is a sub query in the SQL statement. The join between the “main query” and the “sub query” is called a semi join. The sub query is subordinate to the main query, so it must never multiply the rows of the main query. If the relationship between the main query and the sub query is “M:1”, the join between them is the same as a general table join; otherwise the sub query result is reduced to the “1” side to make sure the final result is compatible with the main query.

The sub query can be executed earlier than the main query (acting as the data provider) or later than the main query (acting as the data filter). In the first case, if the sub query is the “M” side, an operation named “SORT (UNIQUE)” is added to turn the sub query into the “1” side. In the second case, the sub query stops as soon as the first matching record is found.

Please note that an IN-list sub query is not necessarily executed before the main query.

1.2.2.5. Cartesian Join

A Cartesian join means there is no join condition between the two tables. Generally, the Cartesian join is executed as a “Sort Merge Join”.

A typical Cartesian join execution plan looks like this:

MERGE JOIN (CARTESIAN)
  TABLE ACCESS (FULL) OF 'emp'
  BUFFER (SORT)
    TABLE ACCESS (FULL) OF 'copy_t'

1.2.2.6. Outer Join

Nested Loop Outer Join

Hash Outer Join

If the outer join query contains an (inline) view, view merging will not be performed. Instead, the (inline) view is executed separately before the outer join.

If the inner table has its own query condition, the outer join needs more caution.

For example,

SELECT last_name, nvl(sum(ord_amt), 0)
FROM customers c, orders o
WHERE c.cust_id = o.cust_id(+) AND c.credit_limit > 1000
AND o.ord_type IN ('01', '03') ------ query condition on the inner table
GROUP BY last_name;

Please note that the query condition “o.ord_type IN ('01', '03')” is used as a data filter that is applied after the outer join. It removes the rows that the outer join preserved for customers without matching orders (their ord_type is NULL), which leads to wrong results.

To resolve this issue, we can use an “inline view”, because the inline view is executed first in the outer join.

SELECT last_name, nvl(sum(ord_amt), 0)
FROM customers c,
     (SELECT cust_id, ord_amt
      FROM orders
      WHERE ord_type IN ('01', '03')) o
WHERE c.cust_id = o.cust_id(+)
AND c.credit_limit > 1000
GROUP BY last_name;

Another, better solution is to use ANSI SQL:

SELECT c.last_name, nvl(sum(o.ord_amt), 0)
FROM customers c LEFT OUTER JOIN orders o
ON (c.cust_id = o.cust_id AND o.ord_type IN ('01', '03'))
WHERE c.credit_limit > 1000
GROUP BY c.last_name;

Sort Merge Outer Join

Full Outer Join
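The outer join pitfall described above, a query condition on the inner table placed in the WHERE clause, can be reproduced on a few rows. The sketch below uses sqlite3 rather than Oracle, so all three variants are written with ANSI LEFT OUTER JOIN instead of the (+) notation and COALESCE instead of nvl; the customers and orders are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, last_name TEXT, credit_limit INTEGER);
    CREATE TABLE orders (cust_id INTEGER, ord_type TEXT, ord_amt INTEGER);
    INSERT INTO customers VALUES (1, 'Smith', 2000), (2, 'Jones', 3000), (3, 'Low', 500);
    INSERT INTO orders VALUES (1, '01', 100), (1, '02', 50), (2, '02', 70);
    """
)


def totals(sql):
    """Return (last_name, total) rows of a query as a list."""
    return conn.execute(sql).fetchall()


# Inner-table condition in WHERE: applied after the outer join.
filter_in_where = totals(
    "SELECT c.last_name, COALESCE(SUM(o.ord_amt), 0) "
    "FROM customers c LEFT OUTER JOIN orders o ON c.cust_id = o.cust_id "
    "WHERE c.credit_limit > 1000 AND o.ord_type IN ('01', '03') "
    "GROUP BY c.last_name ORDER BY c.last_name"
)

# Inner-table condition inside an inline view: applied before the outer join.
inline_view = totals(
    "SELECT c.last_name, COALESCE(SUM(o.ord_amt), 0) "
    "FROM customers c LEFT OUTER JOIN "
    "     (SELECT cust_id, ord_amt FROM orders WHERE ord_type IN ('01', '03')) o "
    "     ON c.cust_id = o.cust_id "
    "WHERE c.credit_limit > 1000 "
    "GROUP BY c.last_name ORDER BY c.last_name"
)

# Inner-table condition in the ON clause: part of the join condition itself.
filter_in_on = totals(
    "SELECT c.last_name, COALESCE(SUM(o.ord_amt), 0) "
    "FROM customers c LEFT OUTER JOIN orders o "
    "     ON (c.cust_id = o.cust_id AND o.ord_type IN ('01', '03')) "
    "WHERE c.credit_limit > 1000 "
    "GROUP BY c.last_name ORDER BY c.last_name"
)

print(filter_in_where)
print(inline_view)
print(filter_in_on)
```

With the filter in the WHERE clause, Jones disappears from the result even though his credit limit qualifies; the other two forms keep him with a total of 0.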

1.2.2.7. Index Join

Index Join means that when the columns referenced by the query are covered by more than one index on the table, the optimizer can join these indexes together with a hash join to produce the final result. The table itself does not need to be accessed; the data is retrieved by the index join operation alone.

1.2.3. Other operations

1.2.3.1. IN list iterator explain plan

Please note the difference between “BETWEEN…AND” and “IN list”. The “BETWEEN…AND” means a

range while “IN list” means a list of separate values.

SELECT order_id, order_type, order_amount

FROM orders WHERE order_type IN (1, 2, 3);

Execution Plan

SELECT STATEMENT
  INLIST ITERATOR
    TABLE ACCESS (BY INDEX ROWID) OF 'orders'
      INDEX (RANGE SCAN) OF 'orders_idx1' (NON-UNIQUE)

1.2.3.2. Concatenation explain plan

The concatenation explain plan appears when the SQL statement uses the “OR” operator to combine query conditions on “different” columns. In this case, the SQL statement is split into multiple SELECT statements, the best explain plan is chosen for each query portion, and the results of the portions are combined (concatenated) at the end.

Please note that the optimizer chooses the concatenation explain plan only if the “OR” query condition is used as the driving condition; otherwise the “OR” query condition is used as a filter only.

The “query portions” are executed starting from the last predicate (query condition) in the “OR” list.

For example,

SELECT *
FROM table1
WHERE A = '10'
OR B = '123';

Execution plan

CONCATENATION
  TABLE ACCESS (BY INDEX ROWID) OF 'table1'
    INDEX (RANGE SCAN) OF 'b_idx' (NON-UNIQUE) ---- b is executed first
  TABLE ACCESS (BY INDEX ROWID) OF 'table1'
    INDEX (RANGE SCAN) OF 'a_idx' (NON-UNIQUE)

1.2.3.3. Sort explain plan

SORT (UNIQUE)

This operation appears in two cases: there is a DISTINCT operation in the SELECT-list, or a sub-query acts as the data provider for the main query.

SORT (AGGREGATE)

An aggregation function is used in the SELECT-list without a GROUP BY clause.

SORT (GROUP BY)

There is a GROUP BY clause in the SQL statement.

SORT (JOIN)

The statement is executed as a Sort Merge Join.

SORT (ORDER BY)

There is an ORDER BY clause in the SQL statement.

1.2.3.4. SET operation explain plan

Union/Union-All explain plan

Intersection explain plan

Minus explain plan

1.2.3.5. COUNT (STOPKEY) explain plan

When ROWNUM is used in the WHERE clause of the SQL statement, the explain plan shows the “COUNT (STOPKEY)” operation.

SELECT *

FROM orders WHERE order_date = :b1

AND ROWNUM <= 20;

Execution Plan

SELECT STATEMENT
  COUNT (STOPKEY)
    TABLE ACCESS (BY INDEX ROWID) OF 'orders'
      INDEX (RANGE SCAN) OF 'order_idx2' (NON-UNIQUE)

2. Create Efficient Indexes

2.1. Comparison between “Index Merge” and “Composite Index”

“Index Merge” works well when the indexes to be merged have similar density. A “Composite Index” works well when the query conditions (predicates in the WHERE clause) use the “=” operator.

When the query condition doesn’t use the first column of the composite index, the composite index will generally perform badly.

2.2. The characteristics of the “Composite Index”

When the leading column (the first column in the index) isn’t used in the query condition, the composite index will most likely not be used. Even when an “index skip scan” can use the composite index in such cases, the performance is usually poor.

To create a composite index, two factors should be considered. One is which columns should be

included in the index, the other one is the order of the columns in the index. These two factors have

great influence on the performance of the index.

The relationship between the density and the order of the columns

If the indexed columns are only used with the “=” operator, the density of the columns has little impact on the order of the columns.

The impact of “=” operation on the order of the columns

If the query condition doesn’t use the “=” operator on the first column of the composite index, the index will not perform well even if the other indexed columns are used with the “=” operator.

The “=” operation is more important than the density of the column when deciding the order of columns in the composite index. To make the best use of the composite index, we need to take both the density and the column usage (“=” or not) into consideration.

IN list iterator

Sometimes, if the leading column of the composite index is used in a “BETWEEN...AND” or “LIKE” operation, we can use an “IN list” to rewrite the SQL and improve the performance.

For example, suppose there is an index idx_tab1 (col1, col2) on the table TAB1:

SELECT * FROM TAB1

WHERE col1 between 10 and 20 AND col2 = 'A';

If only a limited set of values can satisfy the predicate (col1 between 10 and 20), for example when col1 only takes the values 10, 15 and 20 in that range, we can rewrite the SQL as follows:

SELECT * FROM TAB1
WHERE col1 IN (10, 15, 20) AND col2 = 'A';

The SQL statement above is equivalent to:

SELECT * FROM TAB1 WHERE (col1 = 10 AND col2 = 'A') OR
(col1 = 15 AND col2 = 'A') OR (col1 = 20 AND col2 = 'A');

This way, the composite index idx_tab1 is used efficiently because the SQL engine scans fewer index entries.

Another example, suppose there is one index idx_tab1 (col1, col3, col2) on table TAB1,

SELECT * FROM TAB1 WHERE col1 = 'A' and col2='222';

This time, even though the leading column col1 is used with the “=” operator, the second column col3 is not used in the WHERE clause. As a result, col2 = '222' can only be used as a “filter” to check the index entries, which is not very efficient.

If we know that the column col3 only has a few values, like 1, 2 and 3, the SQL statement above can be rewritten as follows:

SELECT * FROM TAB1
WHERE col1 = 'A' and col2 = '222' and col3 in (1, 2, 3);

It is equivalent to the following SQL statement:

SELECT * FROM TAB1
WHERE (col1 = 'A' and col3 = 1 and col2 = '222') OR (col1 = 'A' and col3 = 2 and col2 = '222')
OR (col1 = 'A' and col3 = 3 and col2 = '222');

This way, both col3 and col2 can be used for index entry access, which is much more efficient than using col2 as a data “filter”.
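The IN list rewrite is only safe while the listed values really are all the values the column can take. The sketch below illustrates this with sqlite3 and hypothetical rows; it compares result sets only and does not show how Oracle would use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tab1 (id INTEGER PRIMARY KEY, col1 TEXT, col3 INTEGER, col2 TEXT);
    CREATE INDEX idx_tab1 ON tab1 (col1, col3, col2);
    INSERT INTO tab1 (col1, col3, col2) VALUES
        ('A', 1, '222'), ('A', 3, '222'), ('A', 2, '111'), ('B', 1, '222');
    """
)

ORIGINAL = "SELECT id FROM tab1 WHERE col1 = 'A' AND col2 = '222'"
REWRITTEN = ORIGINAL + " AND col3 IN (1, 2, 3)"


def ids(sql):
    """Run a query that selects id and return the values sorted."""
    return sorted(row[0] for row in conn.execute(sql))


# While col3 only holds 1, 2 or 3, both statements return the same rows.
before = (ids(ORIGINAL), ids(REWRITTEN))

# A new col3 value breaks the assumption behind the rewrite.
conn.execute("INSERT INTO tab1 (col1, col3, col2) VALUES ('A', 4, '222')")
after = (ids(ORIGINAL), ids(REWRITTEN))

print(before)
print(after)
```

After the row with col3 = 4 is inserted, the rewritten statement silently misses it, so the list of values has to be kept in step with the data.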

3. Partial Range Scan

3.1. What’s Partial Range Scan

A Partial Range Scan does not read all the data that meets the conditions (predicates) in the WHERE clause before returning results. Instead, the SQL engine returns the first set of data to the user as soon as it has found enough rows. This is similar to what the optimizer mode “FIRST_ROWS” indicates.

Partial Range Scan is very helpful for OLTP operations, but this doesn’t mean it cannot be used in batch processing.

Partial Range Scan reduces the data volume that needs to be scanned, and the amount it scans does not depend on the total data volume that meets the conditions (predicates) in the WHERE clause. This is the most attractive characteristic of the Partial Range Scan.

3.2. The partial range scan usage rule

If we can change a Full Range Scan into a Partial Range Scan, the SQL execution performance will most likely improve greatly. However, not every Full Range Scan can be converted to a Partial Range Scan.

If the SQL execution plan has “SORT” operations, like SORT(UNIQUE), SORT(JOIN), SORT(AGGREGATE), SORT(ORDER BY), SORT(GROUP BY), etc., we can conclude that the optimizer has chosen the Full Range Scan rather than the Partial Range Scan.

Besides, if the SQL statement has set operations like UNION, MINUS or INTERSECT, it cannot be executed via Partial Range Scan, because the set operation sorts all the data (SORT (UNIQUE)) to remove duplicate records. UNION ALL, however, can be executed via Partial Range Scan.

3.2.1. The requirements for Partial Range Scan

Generally, if the SQL statement has an ORDER BY clause, it cannot be executed via Partial Range Scan. However, if the column in the ORDER BY clause is indexed and that index is used as the driving index, the SQL statement can be executed via Partial Range Scan.

SELECT ord_date, ordqty * 1000 FROM orders
WHERE ord_date like '200512%' ORDER BY ord_date;

If the column ord_date is indexed, the index already returns the rows in ord_date order, so the optimizer can skip the sort for the ORDER BY clause and the SQL statement can be executed via Partial Range Scan.

As a result, an ORDER BY clause does not always rule out the Partial Range Scan. The Partial Range Scan cannot be applied only when a SORT operation actually appears in the SQL execution plan.
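The difference between the two scan types can be modelled in a few lines of Python. This is a conceptual sketch, not how Oracle is implemented: a shuffled list stands in for the table, a sorted list of (key, position) pairs stands in for an index on ord_date, and the function names and the batch size of 10 are assumptions made for the illustration. It counts how many rows each approach has to visit before the first batch for '200512%' can be returned in ord_date order.

```python
import bisect
import random

random.seed(0)

# Hypothetical table of (ord_date, ordqty) rows, stored in no particular order.
table = [
    (f"2005{month:02d}{day:02d}", day)
    for month in range(1, 13)
    for day in range(1, 29)
]
random.shuffle(table)

# A sorted list of (key, row position) plays the role of an index on ord_date.
index = sorted((ord_date, pos) for pos, (ord_date, _) in enumerate(table))


def full_range_scan(prefix, batch):
    """Read every row, keep the matches, sort them, then return the first batch."""
    visited = 0
    matches = []
    for row in table:
        visited += 1
        if row[0].startswith(prefix):
            matches.append(row)
    matches.sort()
    return matches[:batch], visited


def partial_range_scan(prefix, batch):
    """Walk the index in key order and stop as soon as the batch is full."""
    visited = 0
    out = []
    start = bisect.bisect_left(index, (prefix, -1))
    for key, pos in index[start:]:
        if len(out) == batch or not key.startswith(prefix):
            break
        visited += 1
        out.append(table[pos])
    return out, visited


first_full, full_visited = full_range_scan("200512", 10)
first_partial, partial_visited = partial_range_scan("200512", 10)
print(full_visited, partial_visited)
```

Both functions return the same first ten rows, but the index-driven version stops after visiting exactly those ten, however many rows match the predicate.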

3.2.2. Partial Range Scan in different optimizer mode

Generally, a SQL statement in FIRST_ROWS mode will be executed via Partial Range Scan, and in ALL_ROWS mode via Full Range Scan. If we want to instruct the optimizer to choose Partial Range Scan, we can use hints like INDEX or FIRST_ROWS(n). In general, set the optimizer mode to “FIRST_ROWS” in an OLTP system.

3.3. The principle to improve the execution speed of Partial Range Scan Take a look at an example first,

SELECT * FROM order;

Generally, the SQL statement above returns its first set of results quickly, but the SQL statement below returns them much more slowly.

SELECT * FROM order ORDER BY item;

The reason is not merely that the second statement contains a SORT operation. More importantly, the SORT operation forces the SQL engine to perform a FULL RANGE (table) SCAN before the first set of data can be returned.

If there is an index whose leading column is “item”, the SQL statement above can be rewritten as follows,

SELECT * FROM order WHERE item > ' ';

This way, the optimizer can use the index to scan the data in item order, which makes the partial range scan possible. (Rows whose item is NULL do not satisfy item > ' ' and are therefore not returned.) We can also use the INDEX hint to force the use of the index, like…

SELECT /*+ index (order item_index) */ * FROM order WHERE item > ' ';
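
The difference between the two access paths can be illustrated outside the database. The following Python sketch is only an analogy under stated assumptions: the list `rows` stands in for the table, the pre-sorted list `item_index` stands in for an index on item, and the counter `reads` is a hypothetical measure of the work done. It does not describe how the SQL engine is implemented.

```python
import itertools

# Hypothetical stand-in for the "order" table: (ordno, item) pairs.
# 37 and 1000 are coprime, so every item value is distinct.
rows = [(n, "item%04d" % ((n * 37) % 1000)) for n in range(1000)]

# Hypothetical stand-in for an index on item: row positions kept in item order.
item_index = sorted(range(len(rows)), key=lambda pos: rows[pos][1])

reads = {"full": 0, "partial": 0}


def full_range_first(n):
    """ORDER BY without a usable index: read every row, sort, then return n rows."""
    fetched = []
    for row in rows:
        reads["full"] += 1
        fetched.append(row)
    fetched.sort(key=lambda row: row[1])
    return fetched[:n]


def partial_range_scan():
    """Walk the index in item order and hand back rows one at a time."""
    for pos in item_index:
        reads["partial"] += 1
        yield rows[pos]


first_full = full_range_first(10)
first_partial = list(itertools.islice(partial_range_scan(), 10))

# Both paths return the same first ten rows, but with very different read counts.
print(reads)
```

In this sketch both functions return the same ten rows; the index-driven generator simply stops reading once the caller has what it asked for, which is the essence of a partial range scan.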

3.3.1. The principle of Partial Range Scan

If the data range that meets the “driving” query condition is small, the execution cost is low, because few rows have to be read. If the data range that meets the “filtering” query condition is large, the cost of returning the first set of rows is also low, because most of the rows that are read pass the filter and the first result set fills up quickly.

To make the query condition that yields the smaller data volume the “driving” condition, we can use hints (INDEX, etc.) or other methods.

For example, suppose there are indexes on the columns “ordno” and “custno”, but the condition on “custno” yields the smaller data volume and is therefore the better driving condition. We can instruct the optimizer to follow our intent, either by suppressing the index on “ordno” or by naming the index on “custno” in a hint…

SELECT * FROM order WHERE RTRIM(ordno) BETWEEN 1 AND 1000 AND custno LIKE 'DN%';

SELECT /*+ INDEX(order custno_index) */ *
FROM order WHERE ordno BETWEEN 1 AND 1000 AND custno LIKE 'DN%';

Driving condition data range | Filtering condition data range | Execution speed | Solution
Small                        | Small                          | Fast            |
Small                        | Big                            | Fast            |
Big                          | Small                          | Slow            | Swap the “driving” and “filtering” roles
Big                          | Big                            | Fast            |

3.4. The ways to instruct the optimizer to choose Partial Range Scan

3.4.1. Replace SORT operation by (index) access path

In order to eliminate the “SORT” operation from the SQL execution plan, we can add the columns used

in the ORDER BY clause into the index. This way, we can take advantage of this index to avoid the Full

Range Scan operation.

SELECT ord_dept, ordqty * 1000

FROM order

WHERE ord_date like '2005%' ORDER BY ord_dept DESC

In the SQL statement above, the condition used to filter (drive) the data set is on the column “ord_date”, while the column in the ORDER BY clause is “ord_dept”. If the data set returned by the condition “ord_date like '2005%'” is large, the Full Range Scan will respond slowly. However, if there is also an index on the column ord_dept, we can rewrite the SQL statement as follows,

SELECT /*+ INDEX_DESC(a ord_dept_index) */ a.ord_dept, a.ordqty * 1000
FROM order a WHERE a.ord_date LIKE '2005%' AND a.ord_dept > ' ';

This way, we not only remove the ORDER BY clause (the INDEX_DESC hint makes the index be read in descending order), which makes Partial Range Scan possible, but also make ord_dept the driving column and ord_date the “filter” column. According to the principle of Partial Range Scan, if the filtering condition matches a large data volume, the first rows are returned quickly.

3.4.2. Use index scan only for partial range scan

If all the columns used by the SQL statement are included in the index, the optimizer can get the data by accessing only the index. There is no need to access the table in this case.

This is very efficient, as the I/O is reduced.

As a result, to lead the optimizer to choose this index-only scan, we need to think carefully about which candidate columns to include in the index.

3.4.3. MAX and MIN functions

Since MAX and MIN are aggregate functions, it seems that Partial Range Scan is impossible for a SQL statement that uses them.

However, the newer optimizer has a special operation for the MAX/MIN functions that uses Partial Range Scan, which gives MAX/MIN a good response time.

For example, the index pk_order is based on the column (deptno, seq)…

SELECT MAX(seq) + 1 FROM order

WHERE deptno = '1234';

EXECUTION PLAN

SELECT STATEMENT
  SORT (AGGREGATE)
    FIRST ROW
      INDEX (RANGE SCAN (MIN/MAX)) OF 'pk_order' (UNIQUE)

Please note the “FIRST ROW” and “RANGE SCAN (MIN/MAX)” operations in the execution plan. Because of them, the SQL engine does not need to scan all the rows for deptno '1234' before returning the result.

The optimizer executes the statement above almost as if it were written as follows…

SELECT /*+ INDEX_DESC(order pk_order) */ NVL(MAX(seq), 0) + 1
FROM order WHERE deptno = '1234' AND ROWNUM = 1;

Please note that the hint INDEX_DESC and the ROWNUM predicate in this statement explicitly tell the optimizer to choose Partial Range Scan.


3.4.4. “Filter” partial range scan

The “EXISTS” sub-query returns as soon as the first matching record is found. The sub-query is not executed fully, which means that not all of its records are joined with the main query. This is valuable for Partial Range Scan.

SELECT 1 INTO :cnt FROM DUAL WHERE EXISTS

(SELECT NULL

FROM item_tab WHERE dept='101'

AND seq > 100);

Generally, the “EXISTS” sub-query is correlated with the main query. However, as in the example above, it can also be non-correlated. This SQL statement checks whether the table “item_tab” contains at least one record that matches the predicates (dept = '101' AND seq > 100).

The execution plan is as follows,

Execution Plan

SELECT STATEMENT
  TABLE ACCESS (FULL) OF 'dual'
  TABLE ACCESS (BY INDEX ROWID) OF 'item_tab'
    INDEX (RANGE SCAN) OF 'item_dept_idx' (NOT UNIQUE)
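
The early exit of EXISTS can be mimicked in a few lines of Python. This is a rough analogy, not the engine's implementation: the list `item_tab` and the counter in `stats` are hypothetical, and `any()` plays the role of EXISTS because it stops consuming its input at the first row it receives.

```python
# Hypothetical stand-in for item_tab: (dept, seq) pairs.
item_tab = [("101", seq) for seq in range(1, 501)] + \
           [("202", seq) for seq in range(1, 501)]

stats = {"examined": 0}


def matching_rows():
    """Yield rows that satisfy dept = '101' AND seq > 100, counting rows examined."""
    for dept, seq in item_tab:
        stats["examined"] += 1
        if dept == "101" and seq > 100:
            yield (dept, seq)


# EXISTS-like check: stop at the first matching row.
found = any(True for _ in matching_rows())
examined_by_exists = stats["examined"]

# Counting all matches, by contrast, has to examine every row.
stats["examined"] = 0
match_count = sum(1 for _ in matching_rows())
examined_by_count = stats["examined"]

print(found, examined_by_exists, match_count, examined_by_count)
```

The existence check touches only the rows up to the first match, while the full count touches the whole list, which is why an EXISTS test is often preferable to a COUNT(*) when only presence matters.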

3.4.5. Take advantage of “ROWNUM” for partial range scan

ROWNUM is a pseudocolumn that is usually used to limit the number of records returned by a query.

Please note that ROWNUM is not the sequence number of the record being processed, but the sequence number of the record being returned by the query. That is to say, even if the query has the predicate ROWNUM <= 10, the number of records actually processed by the query is most likely more than 10.
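
The gap between rows processed and rows returned can be seen in a small Python sketch. The table `orders`, its status values and the filter are hypothetical; the `break` plays the role of the ROWNUM <= 10 predicate.

```python
# Hypothetical stand-in for a table of (ordno, status) rows:
# every fifth order is OPEN, the rest are CLOSED.
orders = [(n, "OPEN" if n % 5 == 0 else "CLOSED") for n in range(1, 1001)]

processed = 0   # rows the scan had to look at
returned = []   # rows that passed the filter and were handed back

for ordno, status in orders:
    processed += 1
    if status == "OPEN":
        returned.append(ordno)
        if len(returned) == 10:  # plays the role of ROWNUM <= 10
            break

print(len(returned), processed)
```

Ten rows are returned, but the scan had to look at fifty, because only one row in five passes the filter. The less selective the filter, the closer the two numbers are.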

3.4.6. Take advantage of “Inline View/Scalar Sub Query” for partial range scan

Put the part of the query that must be processed via “Full Range Scan” inside an inline view, so that the rest of the statement can still be processed via “Partial Range Scan”. Otherwise, the whole SQL query would be processed via “Full Range Scan”.

SELECT a.dept_name, b.empno, b.emp_name, c.sal_ym, c.sal_tot
FROM department a, employee b, salary c
WHERE b.deptno = a.deptno
AND c.empno = b.empno
AND a.location = 'SEOUL'
AND b.job = 'MANAGER'
AND c.sal_ym = '200512'
ORDER BY a.dept_name, b.hire_date, c.sal_ym;

Since the SQL statement above has an ORDER BY clause, it seems that it can only be executed via “Full Range Scan”. But considering that the tables department and employee are not very large, we can join these two tables first and then join the table salary. What’s more, to avoid sorting by the column sal_ym of the table salary, we can create an index on the columns (empno + sal_ym). This way, the SQL statement above can be rewritten as follows,

SELECT /*+ ORDERED USE_NL(x y) */
       x.dept_name, x.empno, x.emp_name, y.sal_ym, y.sal_tot
FROM (SELECT a.dept_name, b.hire_date, b.empno, b.emp_name
      FROM department a, employee b
      WHERE b.deptno = a.deptno
      AND a.location = 'SEOUL'
      AND b.job = 'MANAGER'
      ORDER BY a.dept_name, b.hire_date) x, salary y
WHERE y.empno = x.empno
AND y.sal_ym = '200512';

Another example,

SELECT a.product_cd, product_name, avg_stock
FROM product a,
     (SELECT product_cd, SUM(stock_qty) / (:b2 - :b1) avg_stock
      FROM prod_stock
      WHERE stock_date BETWEEN :b1 AND :b2
      GROUP BY product_cd) b
WHERE b.product_cd = a.product_cd
AND a.category_cd = '20';

Can be rewritten as follows,

SELECT a.product_cd, product_name,

(SELECT SUM(stock_qty) / (:b2 - :b1)

FROM prod_stock b WHERE b.product_cd = a.product_cd

AND b.stock_date BETWEEN :b1 AND :b2 ) avg_stock

FROM product a

WHERE category_cd = '20';

3.4.7. Take advantage of “Function” for partial range scan

Take a look at the following SQL statement…

SELECT y.cust_no, y.cust_name, x.bill_tot
FROM (SELECT a.cust_no, SUM(bill_amt) bill_tot
      FROM account a, charge b
      WHERE a.acct_no = b.acct_no
      AND b.bill_cd = 'FEE'
      AND b.bill_ym BETWEEN :b1 AND :b2
      GROUP BY a.cust_no
      HAVING SUM(b.bill_amt) > 1000000) x, customer y
WHERE y.cust_no = x.cust_no
AND y.cust_status = 'ARR'
AND ROWNUM <= 30;

Although the SQL statement only needs the customers whose status is 'ARR', the inline view still has to group the charges of all customers. Obviously, this is not very efficient, as the inline view performs a lot of useless work. What’s more, the statement cannot return the first set of data until the inline view has been completely processed.

To resolve this issue, we can take advantage of a function as follows…

CREATE OR REPLACE FUNCTION cust_arr_fee_func
  (v_cust_no IN VARCHAR2, v_start_ym IN VARCHAR2, v_end_ym IN VARCHAR2)
RETURN NUMBER
AS
  ret_val NUMBER(14);
BEGIN
  -- Total 'FEE' charges billed to one customer within the given period
  SELECT SUM(b.bill_amt) INTO ret_val
  FROM account a, charge b
  WHERE a.acct_no = b.acct_no
  AND a.cust_no = v_cust_no
  AND b.bill_cd = 'FEE'
  AND b.bill_ym BETWEEN v_start_ym AND v_end_ym;
  RETURN ret_val;
END cust_arr_fee_func;

SELECT cust_no, cust_name, CUST_ARR_FEE_FUNC(cust_no, :b1, :b2)
FROM customer
WHERE cust_status = 'ARR'
AND CUST_ARR_FEE_FUNC(cust_no, :b1, :b2) > 1000000
AND ROWNUM <= 30;

This SQL statement references the function twice. It can be rewritten with an inline view so that the function appears only once…

SELECT cust_no, cust_name, bill_tot
FROM (SELECT ROWNUM rn, cust_no, cust_name,
             CUST_ARR_FEE_FUNC(cust_no, :b1, :b2) bill_tot
      FROM customer
      WHERE cust_status = 'ARR')
WHERE bill_tot > 1000000
AND ROWNUM <= 30;

Please note that the inline view includes the pseudocolumn ROWNUM, which is used to prevent view merging.

4. Table Joins

Table joins are set operations; they are not merely a way to retrieve data through the foreign keys defined on the tables.

4.1. Join VS Loop Query

A “Loop Query” uses procedural processing logic in place of the table join operation. It first queries the data from one table and then, in a loop, uses each result (as a constant value) to probe the other table for the final result.

For example, the SQL statement (using table join)

SELECT t1.col1, t2.col2 FROM tab1 t1, tab2 t2

WHERE t1.key# = t2.join_field;

…can be rewritten by using “Loop Query” like below…

FOR rec IN (SELECT key#, col1 FROM tab1)
LOOP
  SELECT col2 FROM tab2 WHERE join_field = rec.key#;
END LOOP;

If the SQL (table join) statement involves operations (such as ORDER BY or GROUP BY) that prevent it from returning the first set of results before the whole data set has been processed, the “loop query” may sometimes perform better than the join. However, we can rewrite the general table join with techniques such as “inline view” and “scalar sub-query” so that the table join also performs well.

Example 1:

SELECT a.fld1, ……, b.col1, …..

FROM tab2 b, tab1 a WHERE a.key1 = b.key2

AND a.fld1 = '10'

ORDER BY a.fld2

Can be rewritten using inline view as follows,

SELECT x.fld1, …., x.fldn, y.col1….., y.coln
FROM (SELECT fld1, ….., fldn
      FROM tab1
      WHERE fld1 = '10'
      ORDER BY fld2) x, tab2 y
WHERE y.key2 = x.key1

Example 2:

SELECT b.dept#, b.dept_name, SUM(a.sale_money)
FROM tab1 a, tab2 b
WHERE a.dept# = b.dept#
AND a.sale_date LIKE '200503%'
GROUP BY b.dept#, b.dept_name

Can be rewritten as follows…

SELECT x.dept#, y.dept_name, sale_money

FROM (SELECT dept#, sum(sale_money) sale_money

FROM tab1 WHERE sale_date like '200503%'

GROUP BY dept#) x, TAB2 y

WHERE y.dept# = x.dept#

Example 3:

SELECT a.*, DECODE(a.type, '1', b.client_name, '2', c.project_name) name
FROM tab a, clients b, projects c
WHERE a.issue_date LIKE '200503%'
AND b.client_no(+) = DECODE(a.type, '1', a.type_code)
AND c.project_no(+) = DECODE(a.type, '2', a.type_code)

Can be rewritten using scalar sub-query as follows…

SELECT a.*,
       DECODE(a.type,
              '1', (SELECT client_name FROM clients b WHERE b.client_no = a.type_code),
              '2', (SELECT project_name FROM projects c WHERE c.project_no = a.type_code)) name
FROM tab a
WHERE a.issue_date LIKE '200503%'

4.2. The impact of Join Condition on Table Join

The “join condition” here mainly refers to whether there is a valid, suitable index on the join columns, which is very important for the optimizer to generate an efficient execution plan.

4.2.1. Both sides of the Join Condition are valid

In this case, there are suitable, valid indexes on the join columns of both tables. Either table can be the “driving” table, and under most circumstances neither choice yields a bad execution plan.

However, bear in mind that to get the best performance, we need to filter out as much data as possible before joining the two tables. That is to say, we should choose the table whose conditions filter out more data as the driving table.

If the optimizer chooses the wrong join order, we can instruct it to take the right one by using hints (like ORDERED) or by rewriting the SQL statement.

For example, suppose there are indexes on tab2 (fld2 + key2) and tab1 (fld1 + key1), and we know that tab2 is the better driving table. We can write either of the following SQL statements to make the optimizer follow our intent,

SELECT a.*, b.*
FROM tab2 b, tab1 a
WHERE a.key1 = b.key2
AND b.fld2 LIKE 'ABC%'
AND RTRIM(a.fld1) = '10';

SELECT /*+ ORDERED */ a.*, b.*
FROM tab2 b, tab1 a
WHERE a.key1 = b.key2
AND b.fld2 LIKE 'ABC%'
AND a.fld1 = '10';

4.2.2. One side of the join condition is invalid

In this case, only one of the join columns is indexed, so the join order is very important. Generally, if a NESTED LOOP join is used, the table whose join column is indexed should be the inner table; otherwise, use a SORT MERGE JOIN or HASH JOIN, which does not depend on an index on the join columns.

4.2.3. Neither side of the join condition is valid

In this case, neither of the join columns is indexed. As a result, the NESTED LOOP JOIN will not perform well, and the SORT MERGE JOIN or HASH JOIN will be a better choice.

4.3. Different kinds of table join

4.3.1. Nested Loop Join

4.3.1.1. The characteristics of Nested Loop Join

- The data sets are processed in order: the records in the driving table are processed one after another, and the tables are joined in a specific order.
- The data volume that needs to be processed in the driving table determines the total data volume that needs to be processed. So it is better to choose the table with the smaller data volume to process as the driving data set.
- Not all the indexes on the columns in the predicates will be used in the table join.
- The join condition (i.e. valid indexes on the join columns) is very important for the Nested Loop join.
- Partial range scan is permitted.

4.3.1.2. The rules of applying Nested Loop Join

- If partial range scan is possible, it is better to choose nested loop join.
- If one of the tables cannot reduce the data volume to be scanned by itself (i.e. it depends on the other table to reduce it), it is better to choose nested loop join.
- If the data volume that needs to be processed is not very large, it is better to choose nested loop join.
- If the query range of the driving table is large, or the join causes too many random table accesses, it is better not to choose nested loop join.
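
The characteristics and rules above can be made concrete with a small Python sketch. It is an analogy under stated assumptions, not the engine's implementation: `tab1` is a hypothetical driving table, `tab2` a hypothetical inner table, and the dictionary `tab2_index` stands in for an index on the inner table's join column.

```python
from collections import defaultdict

# Hypothetical data: tab1 is the driving (outer) table, tab2 the inner table.
tab1 = [{"key1": k, "fld1": "10" if k % 10 == 0 else "20"} for k in range(100)]
tab2 = [{"key2": k % 100, "col1": "c%d" % k} for k in range(300)]

# Stand-in for an index on the inner table's join column.
tab2_index = defaultdict(list)
for row in tab2:
    tab2_index[row["key2"]].append(row)


def nested_loop_join(outer_rows, inner_index, outer_filter):
    """Yield joined rows one at a time, driving from outer_rows."""
    for outer in outer_rows:
        if not outer_filter(outer):
            continue
        # One index probe per qualifying outer row (a "random access").
        for inner in inner_index.get(outer["key1"], []):
            yield {**outer, **inner}


join = nested_loop_join(tab1, tab2_index, lambda r: r["fld1"] == "10")
first = next(join)   # the first joined row is available immediately
rest = list(join)    # the remaining rows are produced only on demand

print(first, len(rest))
```

Because the join is a generator, the first row is produced after a single probe, which mirrors why nested loop join permits partial range scan. The number of probes grows with the number of qualifying driving rows, which mirrors why a large driving range makes it a poor choice.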

4.3.2. Sort Merge Join

Sort Merge Join sorts the two data sets on the join columns before joining the two tables.

4.3.2.1. The characteristics of Sort Merge Join

- It greatly reduces random table accesses.
- It is processed via full range scan: the join cannot start before the sort operations have finished.
- The join order of the tables is irrelevant.
- The join condition (i.e. the index on the join columns) is not important.

4.3.2.2. The rules of applying Sort Merge Join

- If the data volume is large and partial range scan is impossible, it is better to use sort merge join.
- It is better to create an efficient index that reduces the data volume to be sorted than an index on the join columns.
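
A minimal Python sketch of the sort and merge phases follows. It is illustrative only: the function name `sort_merge_join` and the sample tables are hypothetical, and in-memory lists stand in for the sorted data sets.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Join two lists of dicts on left_key = right_key by sorting, then merging."""
    # Sort phase: both inputs must be fully sorted before any row can be joined.
    left_sorted = sorted(left, key=lambda r: r[left_key])
    right_sorted = sorted(right, key=lambda r: r[right_key])

    # Merge phase: advance two cursors over the sorted inputs.
    result = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk = left_sorted[i][left_key]
        rk = right_sorted[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row having this key with the whole right run.
            while i < len(left_sorted) and left_sorted[i][left_key] == lk:
                for r in right_sorted[j:j_end]:
                    result.append({**left_sorted[i], **r})
                i += 1
            j = j_end
    return result


tab1 = [{"key1": 3, "fld1": "c"}, {"key1": 1, "fld1": "a"}, {"key1": 2, "fld1": "b"}]
tab2 = [{"key2": 2, "col1": "x"}, {"key2": 3, "col1": "y"},
        {"key2": 2, "col1": "z"}, {"key2": 9, "col1": "w"}]

joined = sort_merge_join(tab1, tab2, "key1", "key2")
print([(r["key1"], r["col1"]) for r in joined])
```

Neither input is probed at random and no index on the join columns is used, but nothing can be returned until both sorts are complete, which is why the method is tied to full range scan.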

4.3.3. Nested Loop Join V.S. Sort Merge Join

4.3.3.1. If only one side of the joined tables has query conditions

In this case, the Nested Loop join can work well, since the query condition on one table still reduces the data volume (and the random table accesses) that needs to be joined. It works even better when partial range scan is possible, because with no query condition (filter) on the inner table, the first set of data is returned more quickly.

However, this is bad for Sort Merge Join: the table without a query condition has nothing to reduce its data volume, so much more data has to be sorted before the join.

4.3.3.2. If neither side of the joined tables has query conditions

In this case, the Nested Loop Join will perform badly, as both tables are scanned via Full Table Scan and many more random table accesses are introduced.

The Sort Merge Join will be the better choice in this case.

4.3.4. Hash Join

The most distinctive advantage of the Hash Join is that it avoids a large number of random table accesses and sort operations when processing a huge amount of data.

Please note that two hash functions are applied during a hash join: the first determines the “partition” a record belongs to, and the second calculates the “hash value” on which the hash table is built. The hash table stores the hash values and the positions of the corresponding “clusters” (also called slots) in the partitions.

The following are some terms related to Hash Join:

4.3.4.1. Hash Area

Hash Area is the memory space allocated for the hash join to work. It consists of the “bitmap vector”, the “hash table”, the “partition table” and the space occupied by the “partitions”. The “bitmap vector” stores the unique values generated from the join column values of the “build input”; it is used to filter the data from the “probe input” before the table join.

4.3.4.2. Partition

A partition is the bucket of records from the “build input” whose hash values are the same. One partition can be further divided into multiple “clusters”, each of which is the unit of one I/O operation.

4.3.4.3. Cluster

A cluster is contained in a partition and is the unit of the I/O operation. It is also called a slot. A cluster stores not only the join columns but also the columns of the “build input” that are referenced in the SELECT list.

4.3.4.4. Build Input and Probe Input

The Build Input is the data set used to build the hash table. The Probe Input is the data set that uses the hash table to perform the join. Generally, the smaller data set is chosen as the build input.

4.3.4.5. In-memory hash join and Delayed hash join

If the build input fits entirely in the Hash Area, the hash join is an in-memory hash join; otherwise it is called a delayed hash join.

4.3.4.6. Bitmap Vector

The Bitmap Vector is created while the partitions for the build input are being created. It stores the unique (hash) values of the join columns of the build input in the Hash Area. When the partitions for the Probe Input are built, the Bitmap Vector is used to filter the data set.

4.3.4.7. Hash Table

The hash table is created in memory (in the Hash Area) and serves as the “index” for the “probe input” during the table join. The “probe input” uses the hash table to look up the “addresses” of the “clusters” that contain the join columns and the SELECT-list columns of the “build input”, and then joins to those “clusters”.

4.3.4.8. Partition Table

The Partition Table stores information (e.g. the address) about each “partition” when the “Build Input” cannot be held fully in memory. This information can be used to reload a “partition” from the temporary segments into memory.
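
The build and probe roles described in this section can be sketched in Python. This is a simplified in-memory illustration only: a Python dictionary does the hashing, and the partitions, clusters, bitmap vector and partition table are deliberately not modelled. The function name `hash_join` and the sample tables are hypothetical.

```python
from collections import defaultdict


def hash_join(build_input, probe_input, build_key, probe_key):
    """In-memory hash join: build a hash table on one input, probe with the other."""
    # Build phase: hash table keyed on the join column of the build input.
    hash_table = defaultdict(list)
    for row in build_input:
        hash_table[row[build_key]].append(row)

    # Probe phase: one lookup per probe row; no sort and no index is needed.
    for row in probe_input:
        for match in hash_table.get(row[probe_key], []):
            yield {**match, **row}


dept = [{"deptno": 10, "dept_name": "SALES"}, {"deptno": 20, "dept_name": "R&D"}]
orders = [
    {"ordno": 1, "saldept": 20},
    {"ordno": 2, "saldept": 10},
    {"ordno": 3, "saldept": 30},  # no matching department
    {"ordno": 4, "saldept": 20},
]

# The smaller data set (dept) is used as the build input.
joined = list(hash_join(dept, orders, "deptno", "saldept"))
print([(r["ordno"], r["dept_name"]) for r in joined])
```

Using the smaller input for the build phase keeps the hash table small, which in the real engine is what makes it more likely to fit in the Hash Area.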

4.3.5. Semi Join

The SQL optimizer often chooses a “semi join” when the SQL statement contains a sub-query. It is called a “semi join” because joining a sub-query is quite different from a general table join.

4.3.5.1. What’s the semi join and what’s the characteristics

A semi-join joins the “sub-query” to the “main query”; it is a table join in a broad sense. The sub-query is the child query while the main query is the parent query. Just like inheritance in OO, the sub-query can reference the fields (columns) of the main query, but not vice versa.

Though a table join and a sub-query are similar, they are quite different in essence. The sets (tables) in a table join are peers, but those in a sub-query are not.

4.3.5.2. The execution plans for semi-join

Nested Loop Semi-join

In a semi-join, the sub-query can be executed before or after the main query. In the former case, the records (literal values) returned by the sub-query are used to probe the main query for the final results; in the latter case, the results returned by the main query are further checked by the sub-query.

Suppose there are two tables, TAB1 and TAB2, with a 1-to-M relationship, which means that one record in TAB1 may have more than one matching record in TAB2.

The following SQL statement (pseudo snippet)…

SELECT col1, col2…
FROM TAB1 x
WHERE key1 IN (SELECT key2
               FROM TAB2 y
               WHERE y.col1…
               AND y.col2…)

The corresponding SQL execution plan can be like…

NESTED LOOPS
  VIEW
    SORT (UNIQUE)
      TABLE ACCESS BY ROWID OF TAB2
        INDEX RANGE SCAN OF col1_idx
  TABLE ACCESS BY ROWID OF TAB1
    INDEX RANGE SCAN OF key1_idx

From the SQL execution plan, we can see that the sub-query (TAB2) is executed first. The records returned by the sub-query are then joined (NESTED LOOPS) with the main query (TAB1). Please note the SORT (UNIQUE) operation in the sub-query: it eliminates the duplicate values returned by the sub-query, as TAB2 is the child table of TAB1. This is the difference between a table join and a sub-query. If TAB1 and TAB2 were joined as two tables, there would be no SORT (UNIQUE) operation. However, if the column (key2) returned by the sub-query is a primary key, which is unique, there is no need for the SORT (UNIQUE) operation, and the sub-query then behaves the same as a table join.

The SQL statement above is a “non-correlated sub-query” because the sub-query does not reference any column of the main query. For such sub-queries, the optimizer decides, depending on the statistics, whether to execute the sub-query before the main query. However, if the sub-query is “correlated”, which means that it references columns of the main query, the optimizer will execute the main query first and use the sub-query to filter the results. The reason is that the sub-query depends on the main query: it cannot know the values of the join columns before the main query is executed. As programmers, we can take advantage of this fact to steer the execution plan. If we rewrite the SQL statement above as follows,

SELECT col1, col2…

FROM TAB1 x WHERE key1 IN (SELECT key2

FROM TAB2 y

WHERE y.col1… AND y.col2…

AND x.key1 = y.key2)

The main query will be executed first. Please note that the join condition x.key1 = y.key2 does not need to be specified explicitly, as the optimizer can deduce it from the query. But if we intend to make the SQL engine execute the main query first, we can add it to make the sub-query “correlated”.
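
Why the sub-query side needs the SORT (UNIQUE) step when it acts as the provider can be shown with a small Python sketch. The two lists are hypothetical stand-ins for TAB1 and TAB2 with a 1-to-M relationship; this is an analogy, not the engine's implementation.

```python
# Hypothetical data: one TAB1 row can match several TAB2 rows (1 to M).
tab1 = [{"key1": 1, "col1": "a"}, {"key1": 2, "col1": "b"}, {"key1": 3, "col1": "c"}]
tab2 = [{"key2": 1}, {"key2": 1}, {"key2": 3}, {"key2": 3}, {"key2": 3}]

# Plain join: every matching pair produces a row, so TAB1 rows are repeated.
plain_join = [t1 for t2 in tab2 for t1 in tab1 if t1["key1"] == t2["key2"]]

# Sub-query as provider: make the key list unique first (the SORT (UNIQUE) step),
# then use each distinct key once to probe TAB1.
distinct_keys = sorted({t2["key2"] for t2 in tab2})
tab1_by_key = {t1["key1"]: t1 for t1 in tab1}
semi_join = [tab1_by_key[k] for k in distinct_keys if k in tab1_by_key]

print(len(plain_join), [r["col1"] for r in semi_join])
```

The plain join returns five rows for two distinct TAB1 records, while the de-duplicated key list returns each TAB1 record once, which is the result an IN sub-query is expected to give.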

Sort Merge Semi-Join


“Filter” Semi-Join

As mentioned under “Nested Loop Semi-join”, the sub-query can be the provider (executed before the main query) or the filter (executed after the main query, i.e. a correlated sub-query). Generally, the SQL optimizer chooses the “Filter” type of semi-join if the sub-query is introduced by the “EXISTS” operator.

A typical “Filter” semi-join statement and its execution plan are as follows,

SELECT …
FROM order x
WHERE orddate LIKE '200506%'
AND EXISTS (SELECT NULL
            FROM dept y
            WHERE y.deptno = x.saldept
            AND y.type1 = '1');

FILTER
  TABLE ACCESS (BY ROWID) OF 'order'
    INDEX (RANGE SCAN) OF 'orddate_index' (NON_UNIQUE)
  TABLE ACCESS (BY ROWID) OF 'dept'
    INDEX (UNIQUE SCAN) OF 'dept_pk' (UNIQUE)

Note the “FILTER” operation where “NESTED LOOPS” would usually be seen. This is the most obvious difference between a Nested Loop join and a “Filter” semi-join in terms of the execution plan. In the “filter” semi-join, the join ends as soon as a matching record is found in the sub-query (dept, in this case), while in a nested loop join all the matching records of the two tables (dept and order, in this case) are joined together. Compared with “Nested Loops”, the “Filter” semi-join can reduce the number of random table accesses for the tables in the sub-query. So if the sub-query acts merely as a “filter”, it may outperform the table join. However, if the sub-query acts as more than a “filter”, or if the optimizer tries to change the execution sequence between the main query and the sub-query, consider changing the sub-query into a table join.

(For the SQL statement above, if there is an index on order (orddate, saldept), the statement will be more efficient, as this makes the best use of the cache to reduce the random table accesses to the table dept.)

Hash Semi-Join

Just like the Nested Loop join, the “Filter” semi-join introduces many random table accesses for the sub-query tables, which is not efficient for a large volume of data. When processing a huge amount of data, a sort merge join or hash join is usually a better choice.

SELECT …
FROM order x
WHERE orddate LIKE '200506%'
AND EXISTS (SELECT /*+ hash_sj(x, y) */ NULL
            FROM dept y
            WHERE y.deptno = x.saldept
            AND y.type1 = '1');

HASH JOIN SEMI
  TABLE ACCESS (BY ROWID) OF 'order'
    INDEX (RANGE SCAN) OF 'orddate_index' (NON_UNIQUE)
  TABLE ACCESS (FULL) OF 'dept'

Please note the hint “hash_sj” in the sub-query, which instructs the optimizer to choose the “Hash Join Semi” operation.

However, there are some restrictions on using the hash semi-join. For example, the sub-query cannot have more than one table; the join condition can only be ‘=’; and the sub-query cannot contain GROUP BY, CONNECT BY, ROWNUM, etc.

Anti Semi-Join

The anti semi-join is chosen when the “NOT” operator is used between the main query and the sub-query. If the sub-query is introduced by “NOT”, whether as “NOT IN” or “NOT EXISTS”, it acts as a data “filter”.

Under most circumstances, the optimizer chooses the “filter” semi-join for an anti semi-join, which is good when the data volume is not large. When the data volume gets large, sort merge join and hash join are better choices for the anti semi-join, because they reduce the number of random table accesses. We can use hints such as MERGE_AJ or HASH_AJ to instruct the optimizer to choose these join methods.

For example,

SELECT COUNT(*)
FROM tab1
WHERE col1 LIKE 'ABC%'
AND col2 IS NOT NULL
AND col2 NOT IN
    (SELECT /*+ MERGE_AJ */ fld2
     FROM tab2
     WHERE fld3 BETWEEN '20050101' AND '20050131'
     AND fld2 IS NOT NULL)

MERGE JOIN (ANTI)
  SORT (JOIN)
    TABLE ACCESS (BY ROWID) OF 'tab1'
      INDEX (RANGE SCAN) OF 'col1_index' (NON-UNIQUE)
  SORT (UNIQUE)
    VIEW
      TABLE ACCESS (BY ROWID) OF 'tab2'
        INDEX (RANGE SCAN) OF 'fld3_index' (NON-UNIQUE)

Please note the filter condition in the sub-query (fld2 IS NOT NULL). It is there because the main query uses NOT IN to filter out the rows whose col2 equals a value of “fld2” returned by the sub-query. If the results returned by the sub-query include NULL values, the main query will not return the expected rows: a comparison with NULL is never true, so the NOT IN condition cannot be satisfied for any row.
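
The effect of a NULL in a NOT IN sub-query can be reproduced with a tiny experiment. The sketch below uses SQLite through Python's standard `sqlite3` module rather than Oracle, on the assumption that only the standard three-valued logic of NOT IN matters here; the tables and values are hypothetical, and no optimizer hints are involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (col2 TEXT);
    CREATE TABLE tab2 (fld2 TEXT);
    INSERT INTO tab1 VALUES ('A'), ('B'), ('C');
    INSERT INTO tab2 VALUES ('A'), (NULL);
""")

# Without the filter, the NULL in tab2 keeps NOT IN from ever being true.
unfiltered = conn.execute(
    "SELECT COUNT(*) FROM tab1 WHERE col2 NOT IN (SELECT fld2 FROM tab2)"
).fetchone()[0]

# With fld2 IS NOT NULL, the rows 'B' and 'C' are returned as intended.
filtered = conn.execute(
    "SELECT COUNT(*) FROM tab1 WHERE col2 NOT IN "
    "(SELECT fld2 FROM tab2 WHERE fld2 IS NOT NULL)"
).fetchone()[0]

print(unfiltered, filtered)
conn.close()
```

The unfiltered query counts no rows at all, while the filtered one counts the two rows that are genuinely absent from tab2, which is the behaviour the IS NOT NULL condition in the example protects.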

4.3.6. Star Join

Star Join is not a brand new join method; it still uses the normal join methods, such as nested loop join, sort merge join and hash join. The special characteristic of the star join is that it joins the tables in a special order.

Though the star join mostly comes up in data warehouses (data marts), this does not mean that a normal OLTP database cannot have such a join operation. The star join works well when several small tables join to one big table and the small tables do not join to each other directly. The tables form a star shape, which is why it is called a “star” join. If the big table is joined to each small table one by one, this is very inefficient, as it leads to too much I/O overhead. (In a data warehouse, the small tables correspond to dimension tables and the big table corresponds to the fact table.)

To resolve this issue, the star join uses a “Cartesian join” to join the small tables first and then joins the resulting data set to the big table. Since each small table has a small data volume, the Cartesian product will not be too large. However, if the small tables are not small enough, the Cartesian join produces too much data, which makes the star join perform badly.

The execution plan below shows what a typical star join is like…

SELECT STATEMENT Optimizer=ALL_ROWS
  HASH JOIN
    MERGE JOIN (CARTESIAN)
      TABLE ACCESS (FULL) OF 'dept'
      BUFFER (SORT)
        TABLE ACCESS (FULL) OF 'products'
    TABLE ACCESS (FULL) OF 'sales'

The tables ‘dept’ and ‘products’ are dimension tables, while the table ‘sales’ is the fact table. Please note that the star join is only available in the CBO and when statistics have been gathered. The hint /*+ STAR */ can be used to instruct the optimizer to choose the star join.

Under most circumstances, the Cartesian product is generated by a sort merge join.

Since there is generally no suitable composite index on the big table (fact table), the join between the big table and the Cartesian product is usually a hash join.
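
The two steps of the star join, a Cartesian product of the small tables followed by a single join to the big table, can be sketched in Python. The tables and values below are hypothetical, and a dictionary lookup stands in for the hash join to the fact table; this is an analogy, not the engine's implementation.

```python
import itertools
from collections import defaultdict

# Hypothetical dimension tables (already filtered) and fact table.
dept = [{"dept_no": 1, "dept_name": "SEOUL-1"}, {"dept_no": 2, "dept_name": "SEOUL-2"}]
products = [{"product_cd": "PA001"}, {"product_cd": "DR210"}, {"product_cd": "ZZ999"}]
sales = [
    {"sales_dept": 1, "product_cd": "PA001", "amount": 100},
    {"sales_dept": 2, "product_cd": "DR210", "amount": 250},
    {"sales_dept": 3, "product_cd": "PA001", "amount": 999},  # department not selected
    {"sales_dept": 1, "product_cd": "PA001", "amount": 50},
]

# Step 1: Cartesian product of the small tables (2 x 3 = 6 combinations).
combos = list(itertools.product(dept, products))

# Step 2: join the product to the big table once, via a lookup keyed on both columns.
combo_table = {(d["dept_no"], p["product_cd"]): (d, p) for d, p in combos}

totals = defaultdict(int)
for s in sales:  # a single pass over the fact table
    hit = combo_table.get((s["sales_dept"], s["product_cd"]))
    if hit is not None:
        d, p = hit
        totals[(d["dept_name"], p["product_cd"])] += s["amount"]

print(len(combos), dict(totals))
```

The size of `combos` is the product of the dimension table sizes, so the approach stays cheap only while each dimension remains small, which matches the limitation described above.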

4.3.7. Star Transforming Join

The star transforming join was introduced to make up for some drawbacks of the star join. It is not a replacement for the star join.

As noted above, the star join performs badly when the Cartesian product is very large. The star transforming join uses bitmap indexes instead, which removes the need for both the Cartesian product and composite indexes on the big (fact) table.

The “transforming” in the name means that the optimizer rewrites the SQL query into another form, based on the idea that a subquery can act as a data provider for the main query.

Let’s see an example,

SELECT d.dept_name, c.cust_city, p.product_name, SUM(s.amount) sales_amount
FROM sales s, products p, customers c, dept d
WHERE s.product_cd = p.product_cd
  AND s.cust_id = c.cust_id
  AND s.sales_dept = d.dept_no
  AND c.cust_grade BETWEEN '10' AND '15'
  AND d.location = 'SEOUL'
  AND p.product_name IN ('PA001', 'DR210')
GROUP BY d.dept_name, c.cust_city, p.product_name;

The SQL statement above can be transformed into the following one…

SELECT d.dept_name, c.cust_city, p.product_name, SUM(s.amount) sales_amount
FROM sales s, products p, customers c, dept d
WHERE s.product_cd = p.product_cd
  AND s.cust_id = c.cust_id
  AND s.sales_dept = d.dept_no
  AND c.cust_grade BETWEEN '10' AND '15'
  AND d.location = 'SEOUL'
  AND p.product_name IN ('PA001', 'DR210')
  AND s.product_cd IN (SELECT product_cd FROM products
                       WHERE product_name IN ('PA001', 'DR210'))
  AND s.cust_id IN (SELECT cust_id FROM customers
                    WHERE cust_grade BETWEEN '10' AND '15')
  AND s.sales_dept IN (SELECT dept_no FROM dept WHERE location = 'SEOUL')
GROUP BY d.dept_name, c.cust_city, p.product_name;
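The added subqueries only restrict the fact table rows earlier, so they should not change the result. The following sketch checks that on a tiny data set using Python's built-in sqlite3 module. SQLite has no star transformation or bitmap indexes, so this verifies logical equivalence only; the table definitions and sample rows are assumptions made for illustration, the alias p is used for products, and an ORDER BY is added so the two results can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (dept_no INTEGER PRIMARY KEY, dept_name TEXT, location TEXT);
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_city TEXT, cust_grade TEXT);
CREATE TABLE products (product_cd TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE sales (product_cd TEXT, cust_id INTEGER, sales_dept INTEGER, amount INTEGER);
INSERT INTO dept VALUES (1, 'Sales A', 'SEOUL'), (2, 'Sales B', 'BUSAN');
INSERT INTO customers VALUES (1, 'Incheon', '12'), (2, 'Daegu', '20');
INSERT INTO products VALUES ('P1', 'PA001'), ('P2', 'DR210'), ('P3', 'ZZ999');
INSERT INTO sales VALUES
  ('P1', 1, 1, 100), ('P1', 1, 1, 40), ('P2', 1, 1, 70),
  ('P3', 1, 1, 10), ('P1', 2, 1, 99), ('P2', 1, 2, 55);
""")

# The original query, without GROUP BY so predicates can be appended.
BASE_QUERY = """
SELECT d.dept_name, c.cust_city, p.product_name, SUM(s.amount) AS sales_amount
FROM sales s, products p, customers c, dept d
WHERE s.product_cd = p.product_cd
  AND s.cust_id = c.cust_id
  AND s.sales_dept = d.dept_no
  AND c.cust_grade BETWEEN '10' AND '15'
  AND d.location = 'SEOUL'
  AND p.product_name IN ('PA001', 'DR210')
"""

# The subquery predicates added by the transformation.
EXTRA_PREDICATES = """
  AND s.product_cd IN (SELECT product_cd FROM products
                       WHERE product_name IN ('PA001', 'DR210'))
  AND s.cust_id IN (SELECT cust_id FROM customers
                    WHERE cust_grade BETWEEN '10' AND '15')
  AND s.sales_dept IN (SELECT dept_no FROM dept WHERE location = 'SEOUL')
"""

TAIL = """
GROUP BY d.dept_name, c.cust_city, p.product_name
ORDER BY d.dept_name, c.cust_city, p.product_name
"""

original_rows = conn.execute(BASE_QUERY + TAIL).fetchall()
transformed_rows = conn.execute(BASE_QUERY + EXTRA_PREDICATES + TAIL).fetchall()
print(original_rows)
print(original_rows == transformed_rows)   # True: both forms return the same rows
```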

The execution plan is as follows,

SELECT STATEMENT Optimizer=ALL_ROWS
  HASH JOIN
    HASH JOIN
      HASH JOIN
        TABLE ACCESS (FULL) OF 'dept'
        TABLE ACCESS (BY INDEX ROWID) OF 'sales'
          BITMAP CONVERSION (TO ROWIDS)
            BITMAP AND
              BITMAP MERGE
                BITMAP KEY ITERATION
                  TABLE ACCESS (FULL) OF 'products'
                  BITMAP INDEX (RANGE SCAN) OF 'sales_product_bx'
              BITMAP MERGE
                BITMAP KEY ITERATION
                  TABLE ACCESS (FULL) OF 'dept'
                  BITMAP INDEX (RANGE SCAN) OF 'sales_dept_bx'
              BITMAP MERGE
                BITMAP KEY ITERATION
                  TABLE ACCESS (BY INDEX ROWID) OF 'customers'
                    INDEX (RANGE SCAN) OF 'cust_grade_idx'
                  BITMAP CONVERSION (FROM ROWIDS)
                    INDEX (RANGE SCAN) OF 'sales_cust_idx'
      TABLE ACCESS (FULL) OF 'products'
    TABLE ACCESS (BY INDEX ROWID) OF 'customers'
      INDEX (RANGE SCAN) OF 'cust_state_province_idx'
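To make the bitmap operations in this plan more concrete, here is a minimal Python sketch that models each bitmap as an integer whose bit i stands for fact row i. It is a simplified model, not Oracle's actual storage format; the sample rows and the helper names (build_bitmap_index, key_iteration_and_merge, to_rowids) are hypothetical, and only two of the three dimensions are modelled.

```python
from functools import reduce

# Hypothetical fact rows: (product_cd, sales_dept); list position plays the role of the rowid.
sales = [("P1", 1), ("P2", 1), ("P3", 1), ("P1", 2), ("P2", 2), ("P1", 1)]

def build_bitmap_index(rows, column):
    """Map each distinct key to an int whose bit i is set when row i holds that key."""
    index = {}
    for rowid, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << rowid)
    return index

sales_product_bx = build_bitmap_index(sales, 0)
sales_dept_bx = build_bitmap_index(sales, 1)

def key_iteration_and_merge(index, keys):
    """BITMAP KEY ITERATION: fetch one bitmap per key. BITMAP MERGE: OR them together."""
    return reduce(lambda acc, key: acc | index.get(key, 0), keys, 0)

def to_rowids(bitmap):
    """BITMAP CONVERSION (TO ROWIDS): list the positions of the set bits."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

product_keys = ["P1", "P2"]   # keys returned by the filtered 'products' subquery
dept_keys = [1]               # keys returned by the filtered 'dept' subquery

merged_product = key_iteration_and_merge(sales_product_bx, product_keys)
merged_dept = key_iteration_and_merge(sales_dept_bx, dept_keys)
matching_rowids = to_rowids(merged_product & merged_dept)   # BITMAP AND
print(matching_rowids)   # [0, 1, 5]
```

Only the rows whose bits survive the AND are fetched from the fact table, so no Cartesian product of the dimension tables is needed.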

Several preconditions should be met before the optimizer can use the star transforming join:

- There must be one fact table and more than 2 dimension tables.
- Bitmap indexes should be created on the join columns of the fact table.
- Statistics should be gathered on the fact table.
- The parameter “star_transformation_enabled” should be set to TRUE or TEMP_DISABLE, or the hint STAR_TRANSFORMATION should be used in the SQL statement.

Note that if the SQL statement uses bind variables, the optimizer will not use the star transforming join. The optimizer needs the fact table statistics to evaluate the transformation, and with bind variables the actual values are unknown at optimization time, so it cannot apply those statistics.

4.3.8. Bitmap Join Index

A bitmap join index is created to improve the performance of the star transforming join. With a bitmap join index in place, the star transforming join can skip the “BITMAP MERGE” operation for the dimension covered by that index.

Suppose we create one bitmap join index…

CREATE BITMAP INDEX sales_cust_job_bjix
  ON sales (customers.cust_job)
  FROM sales, customers
  WHERE sales.cust_id = customers.cust_id
  LOCAL NOLOGGING COMPUTE STATISTICS;

And the SQL execution plan will be like below…

SELECT STATEMENT
  SORT GROUP BY
    HASH JOIN
      TABLE ACCESS FULL CHANNELS
      TABLE ACCESS BY LOCAL INDEX ROWID SALES
        BITMAP CONVERSION TO ROWIDS
          BITMAP AND
            BITMAP INDEX SINGLE VALUE sales_cust_join_bjix
            BITMAP MERGE
              BITMAP KEY ITERATION
                TABLE ACCESS FULL products
                BITMAP INDEX RANGE SCAN sales_prod_bix
            BITMAP MERGE
              BITMAP KEY ITERATION
                TABLE ACCESS FULL dept
                BITMAP INDEX RANGE SCAN sales_dept_bix
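The reason the customers branch needs no BITMAP MERGE can be illustrated with a small Python sketch. It assumes a simplified model in which the join between the fact table and the dimension table is resolved once, when the index is built, so a query on a dimension attribute becomes a single bitmap lookup. The data and the function name build_bitmap_join_index are hypothetical.

```python
# Hypothetical toy data.
customers = {1: "Engineer", 2: "Teacher", 3: "Engineer"}   # cust_id -> cust_job
sales = [1, 2, 3, 1, 2]   # cust_id of each sales row; list position plays the role of the rowid

def build_bitmap_join_index(fact_cust_ids, cust_job_by_id):
    """Key the fact table bitmaps by the dimension attribute (cust_job)."""
    index = {}
    for rowid, cust_id in enumerate(fact_cust_ids):
        job = cust_job_by_id[cust_id]   # the join happens once, at index build time
        index[job] = index.get(job, 0) | (1 << rowid)
    return index

bjix = build_bitmap_join_index(sales, customers)

# At query time a predicate such as cust_job = 'Engineer' needs one lookup
# (compare BITMAP INDEX SINGLE VALUE in the plan), with no key iteration or merge.
engineer_bitmap = bjix.get("Engineer", 0)
rowids = [i for i in range(engineer_bitmap.bit_length()) if engineer_bitmap >> i & 1]
print(rowids)   # [0, 2, 3]
```

The trade-off in this model is that the index has to be maintained whenever either table changes, since it stores the result of the join.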