Billion Goods in Few Categories: How Histograms Save a Life?
Sveta Smirnova, Percona
• Introduction
• The Use Case
  • The Cardinality: Two Levels
  • Example
• Why the Difference?
• Even Worse Use Case
  • ANALYZE TABLE Limitations
  • Example
• How Histograms Work?
• Left Overs
Table of Contents
The column_statistics data dictionary table stores histogram statistics about column values, for use by the optimizer in constructing query execution plans.
MySQL User Reference Manual
Optimizer Statistics aka Histograms
• MySQL Support engineer
• Author of
  • MySQL Troubleshooting
  • JSON UDF functions
  • FILTER clause for MySQL
• Speaker
  • Percona Live, OOW, Fosdem, DevConf, HighLoad...
Sveta Smirnova
Introduction
• Hardware
• Wise options
• Optimized queries
• Brain
Everything Can Be Resolved!
• This talk is about
  • How I spent the last three years
  • Resolving the same issue
  • For different customers
• Task was to speed up the query
Not Everything
• Specific data distribution
• Access on different fields
  • ON goods.shop_id = shop.id
  • WHERE shop.location IN (...)
  • GROUP BY goods.category, shop.profile
  • ORDER BY shop.distance, goods.quantity
• Index cannot be used effectively
Not All the Queries Can be Optimized
• Data distribution varies
  • Big difference between number of values
    Red 1,000,000, Green 2, Blue 100,000
  • Constantly changing
    Red 100,000, Green 1,000,000, Blue 10
    Red 1,000, Green 2,000, Blue 50,000
• Cardinality is not correct
  • Was not updated in time
  • Updates too often
  • Calculated wrongly
• Index maintenance is expensive
  • Hardware resources
  • Slow updates
  • Window to run CREATE INDEX
• Optimizer does not work as we wish
Examples in my talk @Percona Live Frankfurt
Latest Support Tickets
• Topic based on real Support cases
  • Couple of them are still in progress
• All examples are 100% fake
  • They are created so that no customer can be identified
  • Everything generated: table names, column names, data
  • Use case itself is fictional
• All examples are simplified
  • Only columns required to show the issue
  • Everything extra removed
  • Real tables usually store much more data
• All disasters happened with version 5.7
Disclaimer
The Use Case
• categories
  • Less than 20 rows
• goods
  • More than 1M rows
  • 20 unique cat_id values
  • Many other fields
    Price
    Date: added, last updated, etc.
    Characteristics
    Store
    ...
Two Tables
select *
from
goods
join
categories
on
(categories.id=goods.cat_id)
where
date_added between '2018-07-01' and '2018-08-01'
and
cat_id in (16,11)
and
price >= 1000 and price <= 10000 [ and ... ]
[ GROUP BY ... [ORDER BY ... [ LIMIT ...]]]
;
JOIN
• Select from the small table
• For each cat_id select from the large table
• Filter result on date_added[ and price[...]]
• Slow with many items in the category
Option 1: Select from the Small Table First
Option 1: Illustration
• Filter rows by date_added[ and price[...]]
• Get cat_id values
• Retrieve rows from the small table
• Slow if number of rows, filtered by date_added, is larger than number of goods in the selected categories
Option 2: Select From the Large Table First
Option 2: Illustration
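The trade-off between the two options can be sketched with a small simulation. This is only an illustration under stated assumptions: the data set is made up (about 90% of goods in category 16), the dictionary and the list comprehension merely stand in for index access, and the names `make_goods`, `small_table_first` and `large_table_first` are hypothetical. It counts how many rows of the large table each strategy has to read; it does not model the MySQL optimizer.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def make_goods(n_rows, seed=0):
    """Build hypothetical (cat_id, date_added) rows with skewed categories."""
    rng = random.Random(seed)
    first_day = date(2018, 1, 1)
    rows = []
    for _ in range(n_rows):
        # Assumption: roughly 90% of all goods sit in category 16.
        cat_id = 16 if rng.random() < 0.9 else rng.randint(1, 20)
        rows.append((cat_id, first_day + timedelta(days=rng.randrange(365))))
    return rows


def small_table_first(goods, cat_ids, lo, hi):
    """Option 1: read every row of each wanted category, then filter by date."""
    by_cat = defaultdict(list)  # stands in for an index on cat_id
    for row in goods:
        by_cat[row[0]].append(row)
    examined = matched = 0
    for cat_id in cat_ids:
        for _, added in by_cat[cat_id]:
            examined += 1
            if lo <= added <= hi:
                matched += 1
    return examined, matched


def large_table_first(goods, cat_ids, lo, hi):
    """Option 2: read every row in the date range, then filter by category."""
    # The comprehension stands in for a range scan on date_added.
    in_range = [row for row in goods if lo <= row[1] <= hi]
    wanted = set(cat_ids)
    matched = sum(1 for cat_id, _ in in_range if cat_id in wanted)
    return len(in_range), matched


goods = make_goods(10_000)
july = (date(2018, 7, 1), date(2018, 8, 1))
print(small_table_first(goods, [16, 11], *july))  # (rows examined, rows matched)
print(large_table_first(goods, [16, 11], *july))
```

With a popular category and a narrow date range, Option 2 reads far fewer rows; with a rare category and a wide date range, Option 1 wins. Which one is cheaper depends entirely on the data distribution.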
• CREATE INDEX index_everything ON goods(cat_id, date_added[, price[, ...]])
• It resolves the issue
• But not in all cases
What if We Use Combined Indexes?
• Maintenance cost
  • Slower INSERT/UPDATE/DELETE
  • Disk space
• Index not useful for selecting rows
JOIN categories ON (categories.id=goods.cat_id)
JOIN shops ON (shops.id=goods.shop_id)
[ JOIN ... ]
WHERE
date_added between '2018-07-01' and '2018-08-01'
AND
cat_id in (16,11) AND price >= 1000 AND price <=10000 [ AND ... ]
GROUP BY product_type
ORDER BY date_updated DESC
LIMIT 50,100
• Tables may have wrong cardinality
The Problem
The Use Case
The Cardinality: Two Levels
The Query
↓ Parser
↓ Optimizer
↓ Storage Engine
↓ Data
MySQL Architecture
• Optimizer
• Engine
  • MyRocks
  • InnoDB
  • Any
MySQL Has a Layered Architecture
• Number of unique values in the index
• Optimizer uses it for the query execution plan
• Example
  • ID: 1,2,3,4,5
    • Number of rows: 5
    • Cardinality: 5
  • Gender: m,f,f,f,f,m,m,m,m,m,m,f,f,m,f,m,f
    • Number of rows: 17
    • Cardinality: 2
Cardinality
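The two examples can be checked with a few lines of Python. This is a minimal sketch of the definition only (count of distinct values); the helper name `cardinality` is hypothetical and has nothing to do with how InnoDB estimates the number.

```python
def cardinality(values):
    """Number of distinct values, which is what index cardinality describes."""
    return len(set(values))


ids = [1, 2, 3, 4, 5]
genders = "m,f,f,f,f,m,m,m,m,m,m,f,f,m,f,m,f".split(",")

print(len(ids), cardinality(ids))          # 5 5
print(len(genders), cardinality(genders))  # 17 2
```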
• Stores statistics on disk
  • mysql.innodb_table_stats
  • mysql.innodb_index_stats
• Returns statistics to Optimizer
  • In ha_innobase::info
  • handler/ha_innodb.cc
• When it opens a table
  • flag = HA_STATUS_CONST
  • Reads data from disk
  • Stores it in memory
• Subsequent table accesses
  • flag = HA_STATUS_VARIABLE
  • Statistics from memory
  • Up to date Primary Key data
InnoDB: Overview
• Table created with option STATS_AUTO_RECALC = 0
• Before ANALYZE TABLE
mysql> show index from test\G
...
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 64
...
• After ANALYZE TABLE
mysql> show index from test\G
...
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 2
...
• After inserting rows
mysql> show index from test\G
...
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 16
...
• After restart
mysql> show index from test\G
...
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 2
...
InnoDB: Flow
• Takes data from the engine
  • Class ha_statistics
  • sql/handler.h
• Does not have Cardinality field at all
• Uses formula to calculate Cardinality
Optimizer: Overview
• n_rows: number of rows in the table
  • Naturally up to date
  • Constantly changing!
• rec_per_key: number of duplicates per key
  • Calculated by InnoDB at the time of ANALYZE
  • rec_per_key = n_rows / unique values
  • Does not change!
• Cardinality = n_rows / rec_per_key
Optimizer: Formula
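A short sketch shows why this formula drifts. The row counts below are hypothetical (the talk does not give them), the function names are made up, and the rounding is an assumption for display only. The point is that rec_per_key is frozen at ANALYZE time while n_rows keeps moving, so the reported Cardinality grows with the table even if no new distinct values appear.

```python
def rec_per_key_at_analyze(n_rows, unique_values):
    """Value computed once, when ANALYZE TABLE runs."""
    return n_rows / unique_values


def reported_cardinality(n_rows, rec_per_key):
    """The slide's formula: Cardinality = n_rows / rec_per_key."""
    return round(n_rows / rec_per_key)


# Hypothetical table: 1,000 rows and 2 distinct values when ANALYZE runs.
rec_per_key = rec_per_key_at_analyze(n_rows=1000, unique_values=2)

print(reported_cardinality(1000, rec_per_key))  # 2
# 7,000 more rows are inserted, still only 2 distinct values.
print(reported_cardinality(8000, rec_per_key))  # 16
```

This mirrors the InnoDB: Flow output earlier in the talk, where Cardinality goes from 2 to 16 after inserting rows and returns to 2 after a restart.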
• Engine stores persistent statistics
  InnoDB
  Storage      Tables
  Statistics   As Calculated
  Row Count    Only in Memory
• Optimizer calculates Cardinality every time it accesses engine statistics
• Weak user control
Persistent Statistics Are Not Persistent
The Use Case
Example
• EXPLAIN without histograms
mysql> explain select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01' -- Large range
-> order by goods.cat_id
-> limit 10\G -- We ask for 10 rows only!
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: categories -- Small table first
partitions: NULL
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 20
filtered: 70.00
Extra: Using where; Using index;
Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: goods -- Large table
partitions: NULL
type: ref
possible_keys: cat_id_2
key: cat_id_2
key_len: 5
ref: orig.categories.id
rows: 51827
filtered: 11.11 -- Default value
Extra: Using where
2 rows in set, 1 warning (0.01 sec)
• Execution time without histograms
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> order by goods.cat_id
-> limit 10;
ab9f9bb7bc4f357712ec34f067eda364 -
10 rows in set (56.47 sec)
• Engine statistics without histograms
mysql> show status like 'Handler%';
+----------------------------+--------+
| Variable_name | Value |
+----------------------------+--------+
...
| Handler_read_next | 964718 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 10 |
| Handler_read_rnd_next | 951671 |
...
| Handler_write | 951670 |
+----------------------------+--------+
18 rows in set (0.01 sec)
• Now let's add the histogram
mysql> analyze table goods update histogram on date_added;
+------------+-----------+----------+-------------------------------------------------------+
| Table      | Op        | Msg_type | Msg_text                                              |
+------------+-----------+----------+-------------------------------------------------------+
| orig.goods | histogram | status   | Histogram statistics created for column 'date_added'. |
+------------+-----------+----------+-------------------------------------------------------+
1 row in set (2.01 sec)
• EXPLAIN with the histogram
mysql> explain select goods.* from goods
-> join categories
-> on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> order by goods.cat_id
-> limit 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: goods -- Large table first
partitions: NULL
type: index
possible_keys: cat_id_2
key: cat_id_2
key_len: 5
ref: NULL
rows: 10 -- Same as we asked
filtered: 98.70 -- True numbers
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: categories -- Small table
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: orig.goods.cat_id
rows: 1
filtered: 100.00
Extra: Using index
2 rows in set, 1 warning (0.01 sec)
• Execution time with the histogram
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> order by goods.cat_id
-> limit 10;
eeb005fae0dd3441c5c380e1d87fee84 -
10 rows in set (0.00 sec) -- 56/0 times faster!
• Engine statistics with the histogram
mysql> show status like 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 3     |
| Handler_read_last          | 0     |
| Handler_read_next          | 9     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
18 rows in set (0.00 sec)
Why the Difference?
Indexes: Number of Items with Same Value (chart; x-axis: 1 to 10, y-axis: 0 to 800)
Indexes: Cardinality (chart; x-axis: 1 to 10, y-axis: 0 to 800)
Histograms: Number of Values in Each Bucket (chart; x-axis: 1 to 10, y-axis: 0 to 800)
Histograms: Data in the Histogram (chart; x-axis: 1 to 10, y-axis: 0 to 1)
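The difference between the two views can be sketched in Python. The counts are the `size` distribution of the goods_characteristics table used later in this talk; everything else is a simplification and an assumption: a single cardinality number only lets you assume an average number of rows per value, while a singleton-style histogram keeps a fraction per value. The variable names are hypothetical and this is not MySQL's internal representation.

```python
# Rows per distinct `size` value (from the goods_characteristics example).
size_counts = {7: 65536, 8: 8187, 9: 8190, 10: 8188, 11: 8192, 12: 8189,
               13: 8189, 14: 8191, 15: 8190, 16: 10, 17: 10}

n_rows = sum(size_counts.values())  # 131072
cardinality = len(size_counts)      # 11 distinct values

# Index view: one number, so every value looks equally common.
avg_rows_per_value = n_rows / cardinality

# Histogram view: fraction of the table held by each value.
histogram = {size: count / n_rows for size, count in size_counts.items()}

for size in (7, 16):
    print(size, round(avg_rows_per_value), round(histogram[size] * n_rows))
# 7 11916 65536
# 16 11916 10
```

The average is off by more than five times for size 7 and by three orders of magnitude for size 16, which is the kind of error that leads to a bad table order.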
Even Worse Use Case
ANALYZE TABLE Limitations
• ANALYZE TABLE often
• Use large number of STATS_SAMPLE_PAGES
Solutions in 5.7-
• Counts number of pages in the table
• Takes STATS_SAMPLE_PAGES
• Counts number of unique values in secondary index in these pages
• Divides number of pages in the table by number of sample pages and multiplies result by number of unique values
How ANALYZE TABLE Works with InnoDB?
• Number of pages in the table: 20,000
• STATS_SAMPLE_PAGES: 20 (default)
• Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11
• Cardinality: 20,000 * 10 / 20 = 10,000
Example
• Number of pages in the table: 20,000
• STATS_SAMPLE_PAGES: 5,000
• Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11
• Cardinality: 20,000 * 10 / 5,000 = 40
Example 2
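Both examples follow the simplified formula from the previous slide, which can be written as a one-line function. The name `estimated_cardinality` is hypothetical, and this is only the arithmetic shown in the talk, not the actual InnoDB sampling code.

```python
def estimated_cardinality(total_pages, sample_pages, unique_in_sample):
    """Scale the distinct values seen in the sample up to the whole table."""
    return total_pages * unique_in_sample / sample_pages


print(estimated_cardinality(20_000, 20, 10))     # 10000.0
print(estimated_cardinality(20_000, 5_000, 10))  # 40.0
```

The real number of unique values in the table is 11, so even the larger sample overestimates it, only by much less.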
• Time consuming
mysql> select count(*) from goods;
+----------+
| count(*) |
+----------+
| 80303000 |
+----------+
1 row in set (35.95 sec)
  • With default STATS_SAMPLE_PAGES
mysql> analyze table goods;
+------------+---------+----------+----------+
| Table      | Op      | Msg_type | Msg_text |
+------------+---------+----------+----------+
| test.goods | analyze | status   | OK       |
+------------+---------+----------+----------+
1 row in set (0.32 sec)
  • With bigger number
mysql> alter table goods STATS_SAMPLE_PAGES=5000;
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> analyze table goods;
+------------+---------+----------+----------+
| Table      | Op      | Msg_type | Msg_text |
+------------+---------+----------+----------+
| test.goods | analyze | status   | OK       |
+------------+---------+----------+----------+
1 row in set (27.13 sec)
  • 27.13/0.32 = 85 times slower!
• Not always a solution
Use Larger STATS_SAMPLE_PAGES?
Even Worse Use Case
Example
• goods_characteristics
CREATE TABLE `goods_characteristics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`good_id` varchar(30) DEFAULT NULL,
`size` int(11) DEFAULT NULL,
`manufacturer` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `good_id` (`good_id`,`size`,`manufacturer`),
KEY `size` (`size`,`manufacturer`)
) ENGINE=InnoDB AUTO_INCREMENT=196606
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Two Similar Tables
• goods_shops
CREATE TABLE `goods_shops` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`good_id` varchar(30) DEFAULT NULL,
`location` varchar(30) DEFAULT NULL,
`delivery_options` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `good_id` (`good_id`,`location`,`delivery_options`),
KEY `location` (`location`,`delivery_options`)
) ENGINE=InnoDB AUTO_INCREMENT=131071
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
• Size
mysql> select count(*) from goods_characteristics;
+----------+
| count(*) |
+----------+
| 131072 |
+----------+
1 row in set (0.08 sec)
mysql> select count(*) from goods_shops;
+----------+
| count(*) |
+----------+
| 65536 |
+----------+
1 row in set (0.04 sec)
• Data Distribution: goods_characteristics
mysql> select count(*) num_rows, good_id, size
-> from goods_characteristics group by good_id, size;
+----------+---------+------+
| num_rows | good_id | size |
+----------+---------+------+
|    65536 | laptop  |    7 |
|     8187 | laptop  |    8 |
|     8190 | laptop  |    9 |
|     8188 | laptop  |   10 |
|     8192 | laptop  |   11 |
|     8189 | laptop  |   12 |
|     8189 | laptop  |   13 |
|     8191 | laptop  |   14 |
|     8190 | laptop  |   15 |
|       10 | laptop  |   16 |
|       10 | laptop  |   17 |
+----------+---------+------+
• Data Distribution: goods_characteristics
mysql> select count(*) num_rows, good_id, manufacturer
-> from goods_characteristics group by good_id, manufacturer order by num_rows desc;
+----------+---------+--------------+
| num_rows | good_id | manufacturer |
+----------+---------+--------------+
|    65536 | laptop  | Noname       |
|     8191 | laptop  | Samsung      |
|     8191 | laptop  | Acer         |
|     8189 | laptop  | Dell         |
|     8189 | laptop  | HP           |
|     8189 | laptop  | Lenovo       |
|     8189 | laptop  | Toshiba      |
|     8189 | laptop  | Apple        |
|     8189 | laptop  | Asus         |
|       10 | laptop  | Sony         |
|       10 | laptop  | Casper       |
+----------+---------+--------------+
• Data Distribution: goods_shops
mysql> select count(*) num_rows, good_id, location
-> from goods_shops group by good_id, location order by num_rows desc;
+----------+---------+---------------+
| num_rows | good_id | location      |
+----------+---------+---------------+
|     8191 | laptop  | New York      |
|     8191 | laptop  | San Francisco |
|     8189 | laptop  | Paris         |
|     8189 | laptop  | Berlin        |
|     8189 | laptop  | Brussels      |
|     8189 | laptop  | Tokio         |
|     8189 | laptop  | Istanbul      |
|     8189 | laptop  | London        |
|       10 | laptop  | Moscow        |
|       10 | laptop  | Kiev          |
+----------+---------+---------------+
• Data Distribution: goods_shops
mysql> select count(*) num_rows, good_id, delivery_options
-> from goods_shops group by good_id, delivery_options order by num_rows desc;
+----------+---------+------------------+
| num_rows | good_id | delivery_options |
+----------+---------+------------------+
|     8192 | laptop  | DHL              |
|     8191 | laptop  | PTT              |
|     8190 | laptop  | Normal Post      |
|     8190 | laptop  | Tracked          |
|     8189 | laptop  | Fedex            |
|     8189 | laptop  | Gruzovichkof     |
|     8188 | laptop  | Courier          |
|     8187 | laptop  | No delivery      |
|       10 | laptop  | Premium          |
|       10 | laptop  | Urgent           |
+----------+---------+------------------+
Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:
• An index must be updated when table data is modified.
• A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.
MySQL User Reference Manual
Optimizer Statistics aka Histograms
mysql> alter table goods_characteristics stats_sample_pages=5000;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table goods_shops stats_sample_pages=5000;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> analyze table goods_characteristics, goods_shops;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| test.goods_characteristics | analyze | status | OK |
| test.goods_shops | analyze | status | OK |
+----------------------------+---------+----------+----------+
2 rows in set (0.35 sec)
Index Statistics is More than Good
• The query
mysql> select count(*) from goods_shops join goods_characteristics
-> using (good_id)
-> where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
Performance
• Handlers
mysql> show status like 'Handler%';
+----------------------------+-------------+
| Variable_name | Value |
+----------------------------+-------------+
| Handler_commit | 0 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 4 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 13043 |
| Handler_read_last | 0 |
| Handler_read_next | 854,767,916 |
...
• Table order
mysql> explain select count(*) from goods_shops join goods_characteristics
-> using (good_id) where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));
+----+-----------------------+-------+---------+--------+----------+---------------+
| id | table                 | type  | key     | rows   | filtered | Extra         |
+----+-----------------------+-------+---------+--------+----------+---------------+
|  1 | goods_characteristics | index | good_id | 131072 |    25.00 | Using...      |
|  1 | goods_shops           | ref   | good_id |  65536 |    36.00 | Using...      |
+----+-----------------------+-------+---------+--------+----------+---------------+
2 rows in set, 1 warning (0.00 sec)
• Table order matters
mysql> explain select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id) where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));
+----+-----------------------+-------+---------+--------+----------+---------------+
| id | table                 | type  | key     | rows   | filtered | Extra         |
+----+-----------------------+-------+---------+--------+----------+---------------+
|  1 | goods_shops           | index | good_id |  65536 |    36.00 | Using...      |
|  1 | goods_characteristics | ref   | good_id | 131072 |    25.00 | Using...      |
+----+-----------------------+-------+---------+--------+----------+---------------+
2 rows in set, 1 warning (0.00 sec)
• Table order matters
mysql> select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id)
-> where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));
+----------+
| count(*) |
+----------+
|   816640 |
+----------+
1 row in set (2.11 sec)
• Table order matters
mysql> show status like 'Handler_read_next';
+-------------------+-----------+
| Variable_name     | Value     |
+-------------------+-----------+
| Handler_read_next | 5,308,416 |
+-------------------+-----------+
1 row in set (0.00 sec)
• Not for all data
mysql> select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id)
-> where (size > 15 or manufacturer in ('Sony', 'Casper'))
-> and location in
-> ('New York', 'San Francisco', 'Paris', 'Berlin', 'Brussels', 'London')
-> and delivery_options in
-> ('DHL','Normal Post', 'Tracked', 'Fedex', 'No delivery');
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
• Not for all data
mysql> show status like 'Handler%';
+----------------------------+------------+
| Variable_name | Value |
+----------------------------+------------+
| Handler_commit | 10 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 28 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 143 |
| Handler_read_last | 0 |
| Handler_read_next | 16,950,265 |
mysql> analyze table goods_shops update histogram
-> on location, delivery_options;
+-------------+-----------+----------+-------------------------------------------------------------+
| Table       | Op        | Msg_type | Msg_text                                                    |
+-------------+-----------+----------+-------------------------------------------------------------+
| goods_shops | histogram | status   | Histogram statistics created for column 'delivery_options'. |
| goods_shops | histogram | status   | Histogram statistics created for column 'location'.         |
+-------------+-----------+----------+-------------------------------------------------------------+
2 rows in set (0.18 sec)
Histograms to The Rescue
mysql> analyze table goods_characteristics update histogram
-> on size, manufacturer;
+-----------------------+-----------+----------+------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------------+-----------+----------+------------------------------+
| goods_characteristics | histogram | status | Histogram statistics created for column 'manufacturer'. |
| goods_characteristics | histogram | status | Histogram statistics created for column 'size'. |
+-----------------------+-----------+----------+------------------------------+
2 rows in set (0.23 sec)
• The query
mysql> select count(*) from goods_shops join goods_characteristics
-> using (good_id)
-> where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));
+----------+
| count(*) |
+----------+
| 816640 |
+----------+
1 row in set (2.16 sec)
• The query
mysql> show status like 'Handler_read_next';
+-------------------+-----------+
| Variable_name | Value |
+-------------------+-----------+
| Handler_read_next | 5,308,418 |
+-------------------+-----------+
1 row in set (0.00 sec)
• Filtering effect
mysql> explain select count(*) from goods_shops join goods_characteristics
-> using (good_id)
-> where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));
+----+-----------------------+-------+---------+--------+----------+----------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+----------+
| 1 | goods_shops | index | good_id | 65536 | 0.06 | Using... |
| 1 | goods_characteristics | ref | good_id | 131072 | 15.63 | Using... |
+----+-----------------------+-------+---------+--------+----------+----------+
2 rows in set, 1 warning (0.00 sec)
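The filtered column is easier to read as a row count. The MySQL manual describes rows × filtered as the number of rows joined with the following table, so a rough sketch of that arithmetic for the plan above could look like this. The helper name rows_passed_on is hypothetical, and the numbers are copied from the EXPLAIN output; this is an illustration of the arithmetic, not of the optimizer's internal cost model.

```python
def rows_passed_on(rows, filtered_percent):
    """Estimated rows that survive the WHERE conditions for one table.

    rows and filtered_percent are the values EXPLAIN prints in the
    'rows' and 'filtered' columns (filtered is a percentage).
    """
    return rows * filtered_percent / 100.0


# (table, rows, filtered) copied from the EXPLAIN output in the talk
plan = [
    ("goods_shops", 65536, 0.06),
    ("goods_characteristics", 131072, 15.63),
]

for table, rows, filtered in plan:
    estimate = rows_passed_on(rows, filtered)
    print(f"{table}: about {estimate:.1f} of {rows} rows pass the filter")
```

With a filter of 0.06 percent on goods_shops, only a few dozen of the 65536 examined rows are expected to reach the join, which is the effect the histograms were created for.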
How Histograms Work?
↓ sql/sql_planner.cc
↓ calculate_condition_filter
↓ Item_func_*::get_filtering_effect
• get_histogram_selectivity
• Seen as a percent of filtered rows in EXPLAIN
Low Level
50
• Example data
mysql> create table example(f1 int) engine=innodb;
mysql> insert into example values(1),(1),(1),(2),(3);
mysql> select f1, count(f1) from example group by f1;
+------+-----------+
| f1 | count(f1) |
+------+-----------+
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
+------+-----------+
3 rows in set (0.00 sec)
Filtered Rows
51
• Without a histogram
mysql> explain select * from example where f1 > 0\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
• Without a histogram
mysql> explain select * from example where f1 > 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
• Without a histogram
mysql> explain select * from example where f1 > 2\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
• Without a histogram
mysql> explain select * from example where f1 > 3\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
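Without a histogram, EXPLAIN reports the same filtered value of 33.33 for every one of these range conditions. A small sketch makes the gap between that fixed guess and the real data visible. The five values are the rows inserted into the example table; the helper names are hypothetical and the 33.33 figure is simply taken from the output above.

```python
# Rows inserted into the example table in the talk
example = [1, 1, 1, 2, 3]

# Percentage reported by EXPLAIN for every f1 > x query without a histogram
FILTERED_WITHOUT_HISTOGRAM = 33.33


def estimated_rows(total_rows, filtered_percent=FILTERED_WITHOUT_HISTOGRAM):
    """Row estimate implied by EXPLAIN: rows * filtered / 100."""
    return total_rows * filtered_percent / 100.0


def actual_rows(values, x):
    """Rows that really satisfy f1 > x."""
    return sum(1 for v in values if v > x)


for x in (0, 1, 2, 3):
    print(
        f"f1 > {x}: estimated {estimated_rows(len(example)):.2f} rows, "
        f"actual {actual_rows(example, x)} rows"
    )
```

The estimate does not move while the real count drops from five rows to none, which is the mismatch a histogram is meant to correct.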
• With the histogram
mysql> analyze table example update histogram on f1 with 3 buckets;
+-----------------+-----------+----------+------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------+-----------+----------+------------------------------+
| hist_ex.example | histogram | status | Histogram statistics created for column 'f1'. |
+-----------------+-----------+----------+------------------------------+
1 row in set (0.03 sec)
• With the histogram
mysql> select * from information_schema.column_statistics
-> where table_name='example'\G
*************************** 1. row ***************************
SCHEMA_NAME: hist_ex
TABLE_NAME: example
COLUMN_NAME: f1
HISTOGRAM: {
"buckets": [[1, 0.6], [2, 0.8], [3, 1.0]],
"data-type": "int", "null-values": 0.0, "collation-id": 8,
"last-updated": "2018-11-07 09:07:19.791470",
"sampling-rate": 1.0, "histogram-type": "singleton",
"number-of-buckets-specified": 3
}
1 row in set (0.00 sec)
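The buckets array of a singleton histogram holds one entry per distinct value together with its cumulative frequency. The sketch below reproduces the three buckets above from the five example rows. It is a plain Python illustration of that arithmetic; the function name singleton_histogram is hypothetical and this is not how the server builds or stores the histogram.

```python
from collections import Counter


def singleton_histogram(values):
    """Return [[value, cumulative_frequency], ...] for the given values.

    Each distinct value gets one bucket; the frequency is the share of
    rows whose value is less than or equal to the bucket value.
    """
    counts = Counter(values)
    total = len(values)
    buckets = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        buckets.append([value, round(running / total, 6)])
    return buckets


# Rows inserted into the example table in the talk
example = [1, 1, 1, 2, 3]
buckets = singleton_histogram(example)
print(buckets)
```

Three of five rows hold the value 1, so the first bucket ends at 0.6; one more row per value brings the next buckets to 0.8 and 1.0.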
• With the histogram
mysql> explain select * from example where f1 > 0\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 100.00 -- all rows
Extra: Using where
1 row in set, 1 warning (0.00 sec)
• With the histogram
mysql> explain select * from example where f1 > 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 40.00 -- 2 rows
Extra: Using where
1 row in set, 1 warning (0.00 sec)
• With the histogram
mysql> explain select * from example where f1 > 2\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 20.00 -- one row
Extra: Using where
1 row in set, 1 warning (0.00 sec)
• With the histogram
mysql> explain select * from example where f1 > 3\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 20.00 -- one row
Extra: Using where
1 row in set, 1 warning (0.00 sec)
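The filtered values for f1 > 0, f1 > 1 and f1 > 2 can be read straight off the cumulative frequencies in the buckets array. The sketch below shows that arithmetic for a singleton histogram; the function name is hypothetical and the code is only an approximation of the idea, not the server's implementation. Note that for f1 > 3 the raw arithmetic yields zero, while the EXPLAIN output above reports 20.00; whatever adjustment the optimizer applies there is not modelled here.

```python
def selectivity_greater_than(buckets, x):
    """Share of rows with value > x, from cumulative singleton buckets.

    buckets is a list of [value, cumulative_frequency] pairs sorted by
    value, as stored in the "buckets" array of the histogram.
    """
    at_or_below = 0.0
    for value, cumulative in buckets:
        if value <= x:
            at_or_below = cumulative
        else:
            break
    return round(1.0 - at_or_below, 6)


# Buckets as shown in information_schema.column_statistics in the talk
buckets = [[1, 0.6], [2, 0.8], [3, 1.0]]

for x in (0, 1, 2, 3):
    share = selectivity_greater_than(buckets, x)
    print(f"f1 > {x}: raw selectivity {share * 100:.2f}%")
```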
[Chart: values 1, 2, 3 on the horizontal axis; vertical axis from 0 to 2]
Indexes: Cardinality
52
[Chart: values 1, 2, 3 on the horizontal axis; vertical axis from 0 to 1]
Histograms
53
Left Overs
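+----------------+-----------------+----------------+
|                | Histograms      | Indexes        |
+----------------+-----------------+----------------+
| Maintained by  | Optimizer       | Storage Engine |
| Updated        | On Demand       | On every DML ∗ |
| Storage        | Light           | Heavy          |
| Optimizer Uses | Real Numbers ∗∗ | Cardinality    |
+----------------+-----------------+----------------+
∗ Unless persistent statistics used
∗∗ For up to 1024 buckets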
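The bucket limit matters because it decides which kind of histogram is built. According to the MySQL reference manual (this rule is not stated on the slide), a singleton histogram with one bucket per value is created when the number of distinct values does not exceed the requested bucket count, and an equi-height histogram otherwise; the bucket count defaults to 100 and may range from 1 to 1024. A minimal sketch of that rule, with a hypothetical function name:

```python
MAX_BUCKETS = 1024      # upper limit named on the slide
DEFAULT_BUCKETS = 100   # default of WITH N BUCKETS per the MySQL manual


def histogram_type(distinct_values, buckets=DEFAULT_BUCKETS):
    """Return which histogram type would be expected for a column.

    Illustration only: one bucket per value is possible when the
    distinct values fit into the requested number of buckets.
    """
    if not 1 <= buckets <= MAX_BUCKETS:
        raise ValueError(f"buckets must be between 1 and {MAX_BUCKETS}")
    if distinct_values <= buckets:
        return "singleton"
    return "equi-height"


# The example table has 3 distinct values and was analyzed WITH 3 BUCKETS
print(histogram_type(3, buckets=3))
# A column with more distinct values than the maximum bucket count
print(histogram_type(5000, buckets=MAX_BUCKETS))
```

The first call matches the "histogram-type": "singleton" entry shown for the example table earlier in the talk.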
Histograms vs Indexes
55
• CREATE INDEX
  • Metadata lock
  • Can be blocked by any query
• UPDATE HISTOGRAM
  • Backup lock
  • Can be blocked only by a backup
  • Can be created any time without fear
Maintenance: Locking
56
• CREATE INDEX
  • Locks writes
  • Locks reads ∗
  • Every DML updates the index
• UPDATE HISTOGRAM
  • Uses up to histogram_generation_max_mem_size
  • Persistent after creation
  • DML does not touch it
∗ PS-2503: Before Percona Server 5.6.38-83.0/5.7.20-18; Upstream
Maintenance: Load
57
• Helps if query plan can be changed
• Not a replacement for the index:
  • GROUP BY
  • ORDER BY
  • Query on a single table ∗
∗ Only if filtering effect can change the plan
Histograms
58
• Data distribution is uniform
• Range optimization can be used
• Full table scan is fast
When Are Histograms Not Helpful?
59
• Index statistics are collected by the engine
• Optimizer calculates Cardinality each time it accesses statistics
• Indexes don't always improve performance
• Histograms can help
  • Still a new feature
• Histograms do not replace other optimizations!
Conclusion
60
MySQL User Reference Manual
Blog by Erik Froseth
Blog by Frederic Descamps
Talk by Oystein Grovlen @Fosdem
Talk by Sergei Petrunia @PerconaLive
WL #8707
More information
61
www.slideshare.net/SvetaSmirnova
twitter.com/svetsmirnova
github.com/svetasmirnova
Thank you!
62