10053 – NULL is not “NOTHING”
“How many times have you ignored NULL and replaced it with a special value?”
Have you ever done that before? Or maybe you are doing it right now?
Please, stop it now! Take a seat and read this small example before you continue with what you are doing.
Mr. A is working for a research company and his job is to enter employees’ details into the database. He is
working with the PERSON table. These are the details of that table:
- It has 10,000 rows with 4 columns
- Column NAME (varchar) has 10,000 distinct values, it holds the employee’s name
- Column BIRTH (number) has 12 distinct values, it represents employee’s month of birth
- Column ID (number) has 10,000 distinct values, it holds employee’s ID
- Column CATEGORY (number) has 1,000 distinct values, it holds employee’s category
Here is the situation:
- Mr. A finds that 5 rows in the CATEGORY column do not have any value yet (it should be a value between
10,001 and 11,000)
- Since his manager is not in the office, Mr. A decides to store “0” for those undefined employee
categories (“0” is the special value in this case)
- As a result, column CATEGORY now has 1,001 distinct values.
Is he doing something right, or wrong?
Let’s answer that question using the example below.
p.s. Another good example: people use “1 January 1900” as a placeholder date of birth. Who has lived for more than 100 years now?
Start the Exercise
I ran 15 scenarios for this exercise; the comparison table can be found below. Later I will go through the
scenarios one by one. In general, the scenarios fall into 3 categories:
1. SELECT * FROM person WHERE category <= 10001
Query with the “special value” in the range.
2. SELECT * FROM person WHERE category >= 11000
Query without the “special value” in the range, with a one-sided closed predicate.
3. SELECT * FROM person WHERE category BETWEEN 10999 AND 11000
Query without the “special value” in the range, with a two-sided closed predicate.
The main objective of this exercise is to show how Oracle calculates cardinality in the following situations:
- When there are no statistics (to see the impact when the dynamic sampling feature is turned on and off)
- When there is a “special value” (I use “0” as the “special value”) which sits at an extreme distance from the
lower bound of the data
- When there is a histogram on the column
- The impact of the histogram’s bucket count
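Before diving into the trace files, the data distribution itself can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the original test data: I assume each of the 1,000 category values appears in exactly 10 rows, and that the 5 zeroed rows originally belonged to a mid-range value (10,500 is an arbitrary choice). Under those assumptions, the actual row counts behind the three predicate categories are easy to check:

```python
# Hypothetical reconstruction of the PERSON.CATEGORY distribution:
# 1,000 distinct values (10,001..11,000), 10 rows each, with 5 rows
# overwritten by the special value 0 (taken from value 10,500 here;
# which rows were zeroed is an assumption, not stated in the article).
categories = [v for v in range(10_001, 11_001) for _ in range(10)]
zeroed = 0
for i, v in enumerate(categories):
    if v == 10_500 and zeroed < 5:
        categories[i] = 0
        zeroed += 1

assert len(categories) == 10_000

# Row counts that each predicate category should really return:
print(sum(1 for c in categories if c <= 10_001))            # category 1: special value falls in range
print(sum(1 for c in categories if c >= 11_000))            # category 2: special value out of range
print(sum(1 for c in categories if 10_999 <= c <= 11_000))  # category 3: BETWEEN 10999 AND 11000
```

The first count (15) matches the figure dynamic sampling reports below: 10 rows at 10,001 plus the 5 special-value rows that the predicate accidentally sweeps in.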
autotrace outputs.zip
10053 trace files.zip
The Scenarios
Category 1, query with less than or equal
1. With Special Value and Without Statistics.
Without statistics on the table, Oracle uses dynamic sampling to gather statistics at run time (dynamic
sampling is enabled by default). This can be seen in the autotrace output or the 10053 trace file. With only
10K rows in the table, the sample size is 100% (all rows). During the dynamic sampling process, Oracle runs the
following query, which gives the correct answer in this case. For a bigger table, things can be different.
SELECT /* OPT_DYN_SAMP */
       /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB)
           opt_param('parallel_execution_enabled', 'false')
           NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */
       NVL(SUM(C1),0), NVL(SUM(C2),0)
FROM   (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("PERSON")
               FULL("PERSON") NO_PARALLEL_INDEX("PERSON") */
               1 AS C1,
               CASE WHEN "PERSON"."CATEGORY"<=10001 THEN 1 ELSE 0 END AS C2
        FROM   "PERSON" "PERSON") SAMPLESUB;
NVL(SUM(C1),0) NVL(SUM(C2),0)
-------------- --------------
10000 15
(The trace output also shows the dynamic sampling level, the sample size in percent and number of rows, and the original and computed cardinality after the predicate is applied.)
In the above example, Oracle calculates the cardinality perfectly when the dynamic sampling feature is turned
on. For comparison, when dynamic sampling is disabled, the calculation is far from perfect. As can be seen in
the example below, the cardinality goes to 176 and the cost of a full table scan (11.22) is very close to the cost of an
index scan (10.01), which is critical since Oracle can easily switch to a full table scan for a different
predicate (in this example, the index range scan is more efficient).
nostats_nodyn.LST
orcl10_ora_6584_nostats_nodyn.trc
Lastly, when the table is big (with a lot of rows), the sample size can be different. In the next example, the
same query is executed against a table with 10,000,000 rows in it; based on its calculation, Oracle uses a
full table scan instead of the more efficient index range scan, and the cardinality calculation is wrong.
nostats_big.LST
orcl10_ora_4008_nostats_big.trc
2. With Special Value and Statistics but Without Histogram.
In the second scenario, I gather both table and index statistics using the command below.
In the absence of a histogram on the column, as we can see above, the density is simply 1/num_distinct.
Another interesting fact is that Oracle decides to use all rows as the sample (the same effect as when we use
estimate_percent=>100 in dbms_stats.gather_table_stats).
The cardinality calculation in this scenario is the worst of all: the computed cardinality is 9,102 and Oracle uses a
full table scan as the access method. This happens because of the skewed data in column CATEGORY (the data
is not evenly distributed).
We can create a histogram on the skewed column so that Oracle will be able to calculate the cardinality
better than before.
3. With Special Value, Statistics and Histogram of Bucket 50.
I will combine the explanation of scenarios 3 and 4, since the difference is only in the number of buckets in the
histogram.
4. With Special Value, Statistics and Histogram of Bucket 250.
In these scenarios, I create a histogram on the CATEGORY column with 50 and with 250 buckets. With 50 buckets,
Oracle calculates the cardinality as 200 (10,000 * 0.019962), and with 250 buckets as 40
(10,000 * 0.0039988); the values in brackets are taken from the 10053 trace file. So when
we have more buckets, the calculated cardinality is likely to get closer to the actual number of filtered
rows.
For the histogram with 50 buckets, the calculation of cardinality and selectivity is as follows:
Selectivity = ((required range) / (high value – low value) + density) / number of bucket
= ((10,001 – 0) / (10,020 – 0) + 0.0009993) / 50
= (0.9981038 + 0.0009993) / 50
= 0.9991031 / 50
= 0.019982
Cardinality = selectivity * number of rows
= 0.019982 * 10,000
= 199.82 = 200
When the number of buckets is increased to 250, the calculated cardinality is 39.96; here are the
details:
Selectivity = ((required range) / (high value – low value) + density) / number of bucket
= ((10,001 – 0) / (10,020 – 0) + 0.0009993) / 250
= (0.9981038 + 0.0009993) / 250
= 0.9991031 / 250
= 0.003996
Cardinality = selectivity * number of rows
= 0.003996 * 10,000
= 39.96 = 40
In the above 2 scenarios (with a histogram in place), Oracle correctly chooses an index range scan as the access
method.
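The two bucket-count calculations above can be replayed in a few lines of Python. This is only a sketch of the formula as the article states it; the low/high values (0 and 10,020) and the density (0.0009993) come from the 10053 trace figures above, not from anything computed here:

```python
# Selectivity formula from the article, for "category <= 10001" with a
# height-balanced histogram:
#   ((required range) / (high - low) + density) / number_of_buckets
def hist_selectivity(pred_hi, low, high, density, buckets):
    return ((pred_hi - low) / (high - low) + density) / buckets

for buckets in (50, 250):
    sel = hist_selectivity(10_001, 0, 10_020, 0.0009993, buckets)
    print(buckets, round(sel, 6), round(sel * 10_000))  # selectivity and cardinality
```

Running it reproduces the figures above: selectivity 0.019982 and cardinality 200 with 50 buckets, 0.003996 and 40 with 250 buckets.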
5. With NULL and Statistics but Without Histogram.
Next we will update the “0” records with NULL (which is the good and recommended way of storing an
undefined value). This method does not create a huge gap in the data distribution.
Now we can see that Oracle calculates the cardinality perfectly and also chooses an index range scan as the access
method. The output of the 10053 trace file is shown below; since there is no histogram on the column, the density
is 1/num_distinct = 1/1,000 = 0.001.
Selectivity = ((10,001 – 10,001) / (11,000 – 10,001) + 0.001)
= (0 + 0.001)
= 0.001
Cardinality = 0.001 * 10,000
= 10
6. With NULL, Statistics and Histogram of Bucket 50.
If we have evenly distributed data with a normal distribution and no popular values, a histogram is not
needed on that kind of column, because in most cases Oracle is able to calculate the cardinality
correctly. In this scenario, a histogram is created with 50 buckets and Oracle calculates the cardinality
perfectly. But if we look at the output of the 10053 trace file, Oracle uses 0.001 as the selectivity, since
Oracle thinks that the predicate is out of range.
So, the cardinality will be 0.001 * 10,000 = 10.
For comparison, let’s try one more query with a different predicate, for example 10,019. When there is a
histogram with 50 buckets, we can calculate the selectivity and cardinality as below.
Selectivity = ((10,019 – 10,001) / (10,021 – 10,001) + 0.001) / 50
            = ((18 / 20) + 0.001) / 50 = 0.01802
Cardinality = 0.01802 * 10,000 = 180.2 = 180
Whereas when we don’t have a histogram, the calculation is as below. Both results are close to the real
number of returned rows (185).
Selectivity = ((10,019 – 10,001) / (11,000 – 10,001) + 0.001)
            = ((18 / 999) + 0.001) = 0.01902
Cardinality = 0.01902 * 10,000 = 190.2 = 190
Category 2, query with greater than or equal
7. With Special Value and Statistics but Without Histogram.
8. With Special Value, Statistics and Histogram of Bucket 50.
9. With Special Value, Statistics and Histogram of Bucket 250.
I will combine the explanation of the above 3 scenarios (7–9) here. In this query (where category >=
11000) there is no “special value”, and since the predicate is at the upper bound of the range, Oracle uses the
prorated density as the selectivity, so I don’t need to show how selectivity and cardinality are calculated.
Now, let’s again take another predicate to simulate the calculation, for example 10,995. The calculation
when there is no histogram will be:
Selectivity = ((11,000 – 10,995) / (11,000 – 10,001) + 0.001)
= ((5/999) + 0.001) = 0.006005
Cardinality = 0.006005 * 10,000 = 60.05 = 60
And here is the calculation when we create histogram with 50 buckets:
Selectivity = ((11,000 – 10,995) / (11,000 – 10,981) + 0.001) / 50
= ((5/19) + 0.001) / 50
= 0.264158 / 50
= 0.005283
Cardinality = 0.005283 * 10,000 = 52.83 = 53
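The same kind of sanity check works for the greater-than-or-equal case; the required range is now measured down from the high value. A sketch, where the 10,981 endpoint is taken from the 50-bucket calculation above:

```python
def ge_selectivity(pred_lo, low, high, density, buckets=1):
    # Selectivity for "column >= pred_lo", per the article's formula;
    # buckets=1 models the no-histogram case.
    return ((high - pred_lo) / (high - low) + density) / buckets

no_hist = ge_selectivity(10_995, 10_001, 11_000, 0.001)
hist_50 = ge_selectivity(10_995, 10_981, 11_000, 0.001, buckets=50)

print(round(no_hist * 10_000))  # cardinality without a histogram
print(round(hist_50 * 10_000))  # cardinality with a 50-bucket histogram
```

Again the two estimates (60 and 53) are close to each other, so the histogram adds little here.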
So, similar to scenario 6: when we have evenly distributed data in the column, we don’t need to
create a histogram on it. Just leave it alone and Oracle will do the job nicely.
Category 3
10. With Special Value and Statistics but Without Histogram.
11. With Special Value, Statistics and Histogram of Bucket 50.
12. With Special Value, Statistics and Histogram of Bucket 250.
13. With NULL and Statistics but Without Histogram.
14. With NULL, Statistics and Histogram of Bucket 50.
15. With NULL, Statistics and Histogram of Bucket 250.
In category 3, all the scenarios are not really relevant to the objective of this exercise (NULL is not
“Nothing”); those scenarios are here only to show that when we don’t have the “special value” in the range of
the queried predicate, a histogram doesn’t have a significant impact on the cardinality and selectivity
calculation, and again it looks like the result is better when we don’t have a histogram.
The comparison table below shows that when we don’t have a histogram on the column, the result is close to the
real number of rows, and since the “special value” is out of range, it doesn’t have any impact either.
Conclusion
1. NULL is not “Nothing”; it is something and was created for a purpose, so please do not be afraid to use
it in your tables. The optimizer also includes NULLs in the calculation of selectivity and cardinality.
2. Oracle’s dynamic sampling feature is good enough for a “small” table or index which does not have
statistics. For a bigger table, the result is unpredictable and can produce a wrong execution plan.
3. A histogram can help Oracle choose a better execution plan for the query, by giving a better calculation
of the density value. Depending on the presence of a histogram, selectivity is calculated as
1/num_distinct or as the density.
4. The relation between the number of buckets in the histogram and the calculated cardinality is roughly linear:
the more buckets, the better the cardinality estimate. The number of buckets in a histogram is hard-limited to 254.
5. Another “popular bad habit” is storing date values in a varchar data type (for example, storing dates as
strings formatted like YYYYDDMM). I would like to do this exercise as well, but it would be great if someone
has already done it.
-heri-