10053 – NULL is not “NOTHING”
“How many times have you ignored NULL and replaced it with a special value?”
Have you ever done that before? Or maybe you are doing it right now?
Please, stop it now! Take a seat and read this small example before you continue with what you are doing.
Mr. A is working for a research company and his job is to enter employees’ details into the database. He is
working with the PERSON table. These are the details of that table:
- It has 10,000 rows with 4 columns
- Column NAME (varchar) has 10,000 distinct values, it holds the employee’s name
- Column BIRTH (number) has 12 distinct values, it represents employee’s month of birth
- Column ID (number) has 10,000 distinct values, it holds employee’s ID
- Column CATEGORY (number) has 1,000 distinct values, it holds employee’s category
Here is the situation:
- Mr. A finds that 5 rows in the CATEGORY column do not have any value yet (it should be a value between
10,001 and 11,000)
- Since his manager is not in the office, Mr. A decides to store “0” for those undefined employee
categories (“0” is the special value in this case)
- As a result, column CATEGORY now has 1,001 distinct values.
Is he doing something right, or wrong?
Let’s answer that question using the example below.
p.s. Another good example: people use “1 January 1900” as a placeholder date of birth. Who has lived for more than 100 years now?
Start the Exercise
I ran 15 scenarios for this exercise; the comparison table can be found below. Later I will go through the
scenarios one by one. In general, the scenarios fall into 3 categories:
1. SELECT * FROM person WHERE category <= 10001
Query with the “special value” in the range.
2. SELECT * FROM person WHERE category >= 11000
Query without the “special value” in the range, with a one-sided closed predicate.
3. SELECT * FROM person WHERE category BETWEEN 10999 AND 11000
Query without the “special value” in the range, with a two-sided closed predicate.
The main objective of this exercise is to show how Oracle calculates cardinality in the following situations:
- When there are no statistics (to see the impact when the dynamic sampling feature is turned on and off)
- When there is a “special value” (I use “0” as the “special value”) which sits at an extreme distance from the
lower bound of the data
- When there is a histogram on the column
- The impact of the histogram’s bucket count
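Before diving into the trace files, the data distribution itself can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the original test data: I assume each of the 1,000 category values appears in exactly 10 rows, and that the 5 zeroed rows originally belonged to a mid-range value (10,500 is an arbitrary choice). Under those assumptions, the actual row counts behind the three predicate categories are easy to check:

```python
# Hypothetical reconstruction of the PERSON.CATEGORY distribution:
# 1,000 distinct values (10,001..11,000), 10 rows each, with 5 rows
# overwritten by the special value 0 (taken from value 10,500 here;
# which rows were zeroed is an assumption, not stated in the article).
categories = [v for v in range(10_001, 11_001) for _ in range(10)]
zeroed = 0
for i, v in enumerate(categories):
    if v == 10_500 and zeroed < 5:
        categories[i] = 0
        zeroed += 1

assert len(categories) == 10_000

# Row counts that each predicate category should really return:
print(sum(1 for c in categories if c <= 10_001))            # category 1: special value falls in range
print(sum(1 for c in categories if c >= 11_000))            # category 2: special value out of range
print(sum(1 for c in categories if 10_999 <= c <= 11_000))  # category 3: BETWEEN 10999 AND 11000
```

The first count (15) matches the figure dynamic sampling reports below: 10 rows at 10,001 plus the 5 special-value rows that the predicate accidentally sweeps in.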
autotrace outputs.zip
10053 trace files.zip
The Scenarios
Category 1, query with less than or equal
1. With Special Value and Without Statistics.
Without statistics on the table, Oracle uses dynamic sampling to gather statistics at run time (dynamic
sampling is enabled by default). This can be seen in the autotrace output or the 10053 trace file. With only
10K rows in the table, the sample size is 100% (all rows). During the dynamic sampling process, Oracle runs the
following query, which gives the correct answer in this case. For a bigger table, things can be different.
SELECT /* OPT_DYN_SAMP */
       /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB)
           opt_param('parallel_execution_enabled', 'false')
           NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */
       NVL(SUM(C1),0), NVL(SUM(C2),0)
FROM   (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("PERSON")
               FULL("PERSON") NO_PARALLEL_INDEX("PERSON") */
               1 AS C1,
               CASE WHEN "PERSON"."CATEGORY"<=10001 THEN 1 ELSE 0 END AS C2
        FROM   "PERSON" "PERSON") SAMPLESUB;
NVL(SUM(C1),0) NVL(SUM(C2),0)
-------------- --------------
10000 15
(The trace output also shows the dynamic sampling level, the sample size in percent and number of rows, and the original and computed cardinality after the predicate is applied.)
In the above example, Oracle calculates the cardinality perfectly when the dynamic sampling feature is turned
on. For comparison, when dynamic sampling is disabled, the calculation is far from perfect. As can be seen in
the example below, the cardinality goes to 176 and the cost of a full table scan (11.22) is very close to the cost of an
index scan (10.01), which is critical since Oracle can easily switch to a full table scan for a different
predicate (in this example, the index range scan is more efficient).
nostats_nodyn.LST
orcl10_ora_6584_nostats_nodyn.trc
Lastly, when the table is big (with a lot of rows), the sample size can be different. In the next example, the
same query is executed against a table with 10,000,000 rows in it; based on its calculation, Oracle uses a
full table scan instead of the more efficient index range scan, and the cardinality calculation is wrong.
nostats_big.LST
orcl10_ora_4008_nostats_big.trc
2. With Special Value and Statistics but Without Histogram.
In the second scenario, I gather both table and index statistics using the command below.
In the absence of a histogram on the column, as we can see above, the density is simply 1/num_distinct.
Another interesting fact is that Oracle decides to use all rows as the sample (the same effect as when we use
estimate_percent=>100 in dbms_stats.gather_table_stats).
The cardinality calculation in this scenario is the worst of all: the computed cardinality is 9,102 and Oracle uses a
full table scan as the access method. This happens because of the skewed data in column CATEGORY (the data
is not evenly distributed).
We can create a histogram on the skewed column so that Oracle will be able to calculate the cardinality
better than before.
3. With Special Value, Statistics and Histogram of Bucket 50.
I will combine the explanation of scenarios 3 and 4, since the difference is only in the number of buckets in the
histogram.
4. With Special Value, Statistics and Histogram of Bucket 250.
In these scenarios, I create a histogram on the CATEGORY column with 50 and with 250 buckets. With 50 buckets,
Oracle calculates the cardinality as 200 (10,000 * 0.019962), and with 250 buckets as 40
(10,000 * 0.0039988); the values in brackets are taken from the 10053 trace file. So when
we have more buckets, the calculated cardinality is likely to get closer to the actual number of filtered
rows.
For the histogram with 50 buckets, the calculation of cardinality and selectivity is as follows:
Selectivity = ((required range) / (high value – low value) + density) / number of bucket
= ((10,001 – 0) / (10,020 – 0) + 0.0009993) / 50
= (0.9981038 + 0.0009993) / 50
= 0.9991031 / 50
= 0.019982
Cardinality = selectivity * number of rows
= 0.019982 * 10,000
= 199.82 = 200
When the number of buckets is increased to 250, the calculated cardinality is 39.96; here are the
details:
Selectivity = ((required range) / (high value – low value) + density) / number of bucket
= ((10,001 – 0) / (10,020 – 0) + 0.0009993) / 250
= (0.9981038 + 0.0009993) / 250
= 0.9991031 / 250
= 0.003996
Cardinality = selectivity * number of rows
= 0.003996 * 10,000
= 39.96 = 40
In the above 2 scenarios (with a histogram in place), Oracle correctly chooses an index range scan as the access
method.
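The two bucket-count calculations above can be replayed in a few lines of Python. This is only a sketch of the formula as the article states it; the low/high values (0 and 10,020) and the density (0.0009993) come from the 10053 trace figures above, not from anything computed here:

```python
# Selectivity formula from the article, for "category <= 10001" with a
# height-balanced histogram:
#   ((required range) / (high - low) + density) / number_of_buckets
def hist_selectivity(pred_hi, low, high, density, buckets):
    return ((pred_hi - low) / (high - low) + density) / buckets

for buckets in (50, 250):
    sel = hist_selectivity(10_001, 0, 10_020, 0.0009993, buckets)
    print(buckets, round(sel, 6), round(sel * 10_000))  # selectivity and cardinality
```

Running it reproduces the figures above: selectivity 0.019982 and cardinality 200 with 50 buckets, 0.003996 and 40 with 250 buckets.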
5. With NULL and Statistics but Without Histogram.
Next we will update the “0” records with NULL (which is the good and recommended way of storing an
undefined value). This method does not create a huge gap in the data distribution.
Now we can see that Oracle calculates the cardinality perfectly and also chooses an index range scan as the access
method. The output of the 10053 trace file is shown below; since there is no histogram on the column, the density
is 1/num_distinct = 1/1,000 = 0.001.
Selectivity = ((10,001 – 10,001) / (11,000 – 10,001) + 0.001)
= (0 + 0.001)
= 0.001
Cardinality = 0.001 * 10,000
= 10
6. With NULL, Statistics and Histogram of Bucket 50.
If we have evenly distributed data with a normal distribution and no popular values, a histogram is not
needed on that kind of column, because in most cases Oracle is able to calculate the cardinality
correctly. In this scenario, a histogram is created with 50 buckets and Oracle calculates the cardinality
perfectly. But if we look at the output of the 10053 trace file, Oracle uses 0.001 as the selectivity, since
Oracle thinks that the predicate is out of range.
So, the cardinality will be 0.001 * 10,000 = 10.
For comparison, let’s try one more query with a different predicate, for example 10,019. When there is a
histogram with 50 buckets, we can calculate the selectivity and cardinality as below.
Selectivity = ((10,019 – 10,001) / (10,021 – 10,001) + 0.001) / 50
            = ((18 / 20) + 0.001) / 50 = 0.01802
Cardinality = 0.01802 * 10,000 = 180.2 = 180
Whereas when we don’t have a histogram, the calculation is as below. Both results are close to the real
number of returned rows (185).
Selectivity = ((10,019 – 10,001) / (11,000 – 10,001) + 0.001)
            = ((18 / 999) + 0.001) = 0.01902
Cardinality = 0.01902 * 10,000 = 190.2 = 190
Category 2, query with greater than or equal
7. With Special Value and Statistics but Without Histogram.
8. With Special Value, Statistics and Histogram of Bucket 50.
9. With Special Value, Statistics and Histogram of Bucket 250.
I will combine the explanation of the above 3 scenarios (7–9) here. In this query (where category >=
11000) there is no “special value”, and since the predicate is at the upper bound of the range, Oracle uses the
prorated density as the selectivity, so I don’t need to show how selectivity and cardinality are calculated.
Now, let’s again take another predicate to simulate the calculation, for example 10,995. The calculation
when there is no histogram will be:
Selectivity = ((11,000 – 10,995) / (11,000 – 10,001) + 0.001)
= ((5/999) + 0.001) = 0.006005
Cardinality = 0.006005 * 10,000 = 60.05 = 60
And here is the calculation when we create histogram with 50 buckets:
Selectivity = ((11,000 – 10,995) / (11,000 – 10,981) + 0.001) / 50
= ((5/19) + 0.001) / 50
= 0.264158 / 50
= 0.005283
Cardinality = 0.005283 * 10,000 = 52.83 = 53
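The same kind of sanity check works for the greater-than-or-equal case; the required range is now measured down from the high value. A sketch, where the 10,981 endpoint is taken from the 50-bucket calculation above:

```python
def ge_selectivity(pred_lo, low, high, density, buckets=1):
    # Selectivity for "column >= pred_lo", per the article's formula;
    # buckets=1 models the no-histogram case.
    return ((high - pred_lo) / (high - low) + density) / buckets

no_hist = ge_selectivity(10_995, 10_001, 11_000, 0.001)
hist_50 = ge_selectivity(10_995, 10_981, 11_000, 0.001, buckets=50)

print(round(no_hist * 10_000))  # cardinality without a histogram
print(round(hist_50 * 10_000))  # cardinality with a 50-bucket histogram
```

Again the two estimates (60 and 53) are close to each other, so the histogram adds little here.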
So, similar to scenario 6: when we have evenly distributed data in the column, we don’t need to
create a histogram on it. Just leave it alone and Oracle will do the job nicely.
Category 3
10. With Special Value and Statistics but Without Histogram.
11. With Special Value, Statistics and Histogram of Bucket 50.
12. With Special Value, Statistics and Histogram of Bucket 250.
13. With NULL and Statistics but Without Histogram.
14. With NULL, Statistics and Histogram of Bucket 50.
15. With NULL, Statistics and Histogram of Bucket 250.
In category 3, all the scenarios are not really relevant to the objective of this exercise (NULL is not
“Nothing”); those scenarios are here only to show that when we don’t have the “special value” in the range of
the queried predicate, a histogram doesn’t have a significant impact on the cardinality and selectivity
calculation, and again it looks like the result is better when we don’t have a histogram.
The comparison table below shows that when we don’t have a histogram on the column, the result is close to the
real number of rows, and since the “special value” is out of range, it doesn’t have any impact either.
Conclusion
1. NULL is not “Nothing”; it is something and was created for a purpose, so please do not be afraid to use
it in your tables. The optimizer also includes NULLs in the calculation of selectivity and cardinality.
2. Oracle’s dynamic sampling feature is good enough for a “small” table or index which does not have
statistics. For a bigger table, the result is unpredictable and can produce a wrong execution plan.
3. A histogram can help Oracle choose a better execution plan for the query, by giving a better calculation
of the density value. Depending on the presence of a histogram, selectivity is calculated as
1/num_distinct or as the density.
4. The relation between the number of buckets in the histogram and the calculated cardinality is roughly linear:
the more buckets, the better the cardinality estimate. The number of buckets in a histogram is hard-limited to 254.
5. Another “popular bad habit” is storing date values in a varchar data type (for example, storing dates as
strings formatted like YYYYDDMM). I would like to do this exercise as well, but it would be great if someone
has already done it.
-heri-