
:: IBM Informix indexing techniques: which one to use when? Eric Vercelletto, Session A12, Begooden IT Consulting, 4/23/2013 3:35 PM


DESCRIPTION

This presentation covers the index-related techniques available with IBM Informix as of April 2013, including the indexing techniques introduced with IBM Informix 12.10. See all IIUG presentations in the member area of http://www.iiug.com

TRANSCRIPT

Page 1: A12 vercelletto indexing_techniques

:: IBM Informix indexing techniques: which one to use when?

Eric Vercelletto Session A12

Begooden IT Consulting 4/23/2013 3:35 PM

Page 2: A12 vercelletto indexing_techniques

• Introduction to Response Time measuring

• Identify the relevant indexing techniques

• Describe implementation method

• Confirm/recognize its use by accurate monitoring

• Measure its efficiency in terms of response time and effective use in the database (sqltrace, sqexplain)

• Identify pros and cons

Agenda / methodology

4/24/2013 Session A12

Page 3: A12 vercelletto indexing_techniques

Introduction

• Begooden IT Consulting is an IBM ISV company, mainly focused on Informix technology services.

• Our 15+ years of experience within Informix Software France and Portugal helped us acquire in-depth product knowledge as well as solid field experience.

• Our services include Informix implementation auditing, performance tuning, issue management, administration mentoring …

• We also happen to be the Querix reseller for France and French speaking countries (except Québec and Louisiana)

• The company is based in Pont l’Abbé, Finistère, France


Page 4: A12 vercelletto indexing_techniques

Some basics not to forget

There are 2 ways to measure response times:

• The « cold » measure: the response time is measured just after starting the engine, when data and index pages are not yet loaded into the Shared Memory IFMX buffers. Disk IO must be performed to read the data and index pages, which increases the RT.

• The « hot » measure: RT is measured when data and index pages are already loaded into SHMEM. No or little disk IO => RT is much shorter.

• This point can often explain surprising RT differences depending on how the data is accessed.

• Broad-range or DS queries most often access data and/or indexes in disk pages

• OLTP queries mostly access data and indexes in SHMEM pages


Page 5: A12 vercelletto indexing_techniques

Derived thoughts and facts

• Reading data pages and/or index pages on disk always takes longer than in SHMEM. Full table scans can take minutes or more, depending on table size

• Reading data pages in SHMEM is very fast. A full scan of a table in SHMEM takes fractions of seconds or seconds, rarely more.

• Reading index pages in SHMEM is also very fast. In addition, due to the B-TREE structure, reading index pages generally covers more content than reading data pages.

• This often makes it difficult to compare the efficiency of 2 different indexes on the same table when reading in SHMEM.


Page 6: A12 vercelletto indexing_techniques

Derived thoughts and facts (continued)

• When running hot measures on indexes, the differences can be as low as milliseconds BUT …

• Repeating 3 useless milliseconds millions of times can make a difference!

• When the Response Times get to such a low level, sqltrace is the tool you need to understand the query behaviour.

• In certain situations, saving milliseconds on a query will make the difference. In other situations, saving seconds will not make the difference.

• A bad response time can be caused by inappropriate indexing, but it can also be caused by some « unusual » logic adding useless work for the applications and the server.
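When response times get down to the millisecond range, sqltrace is what exposes the per-statement figures. A minimal sketch of turning it on through the sysadmin database (the buffer count and trace level below are illustrative values, not recommendations):

```sql
-- Enable global SQL tracing from the sysadmin database
-- (1000 trace buffers of 1 KB each, "low" detail level)
DATABASE sysadmin;
EXECUTE FUNCTION task("set sql tracing on", 1000, 1, "low", "global");
-- run the workload to analyze, then read the traces with: onstat -g his
EXECUTE FUNCTION task("set sql tracing off");
```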


Page 7: A12 vercelletto indexing_techniques

Comparing cold measure with hot measure (1)

• Full scan of a mid-sized table, tpcc:order_line, containing 24 million rows: select * from order_line

onstat -g his output:

• « Cold » read, performed just after oninit -v: many disk pages read, 47.4 secs

• « Hot » read, performed after the first scan: zero disk pages read, all buffer reads, 19.4 secs


Page 8: A12 vercelletto indexing_techniques

Comparing cold measure with hot measure (2)

• Cold use of a poor-selectivity index: select * from order_line where ol_w_id = 10 (duplicate index on ol_w_id, 50 distinct values)

• Cold read: many disk reads, execution time 5.9 secs

• Hot read: few disk reads, execution time 1.1 secs


Page 9: A12 vercelletto indexing_techniques

BATCHEDREAD_INDEX: description

• This feature has been taken from XPS and introduced in 11.50xC5.

• The purpose is to optimize index key access by grouping the reads of many index keys into large buffers, then fetching the rows associated with those keys

• This technique brings strong savings in terms of CPU and IO, therefore reducing Response Time.

• This technique is suitable and efficient for massive index reads (DS/OLAP), not for pinpoint-type (OLTP) index access.


Page 10: A12 vercelletto indexing_techniques

BATCHEDREAD_INDEX: the test

• We will run the following query against a 30-million-row clients table. The table has an index on ‘lastname’. Row size is 328 bytes.

output to /dev/null
select lastname, count(*)
from clients
group by 1

• This query returns 2,188,286 rows


Page 11: A12 vercelletto indexing_techniques

BATCHEDREAD_INDEX: facts

• All these response times are measured « cold »

• AUTO_READAHEAD 0, BATCHEDREAD_INDEX 0

• AUTO_READAHEAD 0, BATCHEDREAD_INDEX 1

• AUTO_READAHEAD 1, BATCHEDREAD_INDEX 1

See the difference


Page 12: A12 vercelletto indexing_techniques

BATCHEDREAD_INDEX: how ?

• BATCHEDREAD_INDEX can be set, as well as BATCHEDREAD_TABLE, either in the onconfig file

• Or used as an environment variable before launching the application export IFX_BATCHEDREAD_INDEX=1

• Or as an SQL statement SET ENVIRONMENT IFX_BATCHEDREAD_INDEX '1';

• Monitor index scan activity with onstat –g scn
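As a sketch, the session-level setting combined with the test query from slide 10 could look like this (the clients table is the example from the earlier slides):

```sql
-- Enable batched index reads for the current session only,
-- then run the DS-style aggregate; watch scan activity with: onstat -g scn
SET ENVIRONMENT IFX_BATCHEDREAD_INDEX '1';

SELECT lastname, COUNT(*)
FROM   clients
GROUP  BY 1;
```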


Page 13: A12 vercelletto indexing_techniques

Attached or Detached Index?

• The « Antique Informix Disk Layout » used to create the index pages in the same extents as the data pages for the attached indexes. The expected result was reducing disk IO.

• This layout happened to become a problem because the data pages were often located far from the index pages, causing the opposite effect of increasing disk IO. The official recommendation at that time was to create detached indexes for this reason.

• Nowadays, index pages are created in a different partition than the data pages, causing the attached indexes to have the same level of performance as the detached indexes.

• But… if you have the possibility to create the data dbspaces and the index dbspaces on independent disks and channels, you will increase your disk IO performance by reducing disk contention.

• This gain will be observed mainly during intensive sessions doing massive data changes.

• Watch the output of onstat –g iof and look for low IO throughput per second.


Page 14: A12 vercelletto indexing_techniques

Few columns or many columns in the same index? Key points to consider

• Remember the « cold » and « hot » reads when testing the efficiency of an index. Results can be dramatically different between cold and hot.

• The choice is often a hard-to-obtain trade-off, and definitely a long subject to discuss!

• Many columns in an index can make it more selective, but it will also consume more CPU/disk resources when updating keys (see b-tree cleaner tuning)

• Few columns in an index can make it less selective, but it will consume fewer CPU/disk resources when updating keys

• Integrity constraints are not negotiable, but some integrity-constraint indexes can be negotiated…

Page 15: A12 vercelletto indexing_techniques

Few columns or many columns? Techniques to evaluate efficiency

• time dbaccess dbname queryfile gives an indication on the efficiency of an index, but can be misleading due to cold and hot measure huge differences.

• onmode –Y sessnum 1 will identify which index(es) are used, and will also report how many rows were scanned versus how many rows were returned

• onstat –g his (sqltrace) will give fine detail about response time, buffer and disk access, lock waits etc…

• A complete diagnostic will be done with the 3 tools.


Page 16: A12 vercelletto indexing_techniques

Few columns or Many columns? Let’s analyze a real case: one column

• 1-column index: rows scanned 4,913; buffer reads 5,900; response time 0.0368’’

Page 17: A12 vercelletto indexing_techniques

Few columns or many columns? Same case, index with 2 columns

• 2-column index: rows scanned 106; buffer reads 122; response time 0.0047’’

Page 18: A12 vercelletto indexing_techniques

Highly duplicated lead columns indexes: how was life before?

• The Antique Informix Rule said to avoid multi-column indexes with low selectivity on the leading keys, due to poor efficiency. Ex: warehouse_id, district_id, order_id, order_line

• Querying on order_line required specifying the lead columns in the query predicate, or creating another index with order_line as the lead column

• Restructuring indexes to follow those rules was a complex, long and risky task, not to mention that any downtime due to index rebuilding was poorly accepted by Operations Managers…


Page 19: A12 vercelletto indexing_techniques

Index key first & self join : it’s magic!

• The key-first scan was introduced in 7.3. It has been enhanced so that an index can be used even when the lead columns are not specified in the where clause

• The index self join technique was introduced in IDS 11.10, although many DBAs didn’t even notice it!

• By scanning subsets of the poorly selective composite index, the engine manages to use the non-subsequent index keys as index filters, transforming the index into a highly selective index.

• Hierarchical-like indexes with highly duplicated lead columns now need no redefinition to be efficient.

• You do not need to build new indexes with highly selective lead columns. This saves optimizer work and disk space.

• Index self join is enabled by default. You can, if you persist in not using it, disable it either by setting INDEX_SELFJOIN 0 in onconfig or with an optimizer directive {+AVOID_INDEX_SJ}
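For comparison runs, the directive can be sketched on the TPC-C query used on the next slides (the index name ix_ol is hypothetical):

```sql
-- Force the optimizer to avoid index self join on order_line,
-- e.g. to measure the difference against the default plan
SELECT {+AVOID_INDEX_SJ(order_line, ix_ol)}
       ol_d_id, ol_o_id, AVG(ol_quantity), AVG(ol_amount)
FROM   order_line
GROUP  BY 1, 2
ORDER  BY 2, 3;
```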


Page 20: A12 vercelletto indexing_techniques

Index self-join: the test

• We will use the order_line TPC-C table, that contains 23,735,211 rows

• The index follows the hierarchy, which was formerly considered a poor implementation:
ol_w_id: warehouse id (50 distinct values)
ol_d_id: district id (10 distinct values)
ol_o_id: order number (9,279 distinct values)
ol_number: order line number (14 distinct values)

• The challenging query is:
SELECT ol_d_id, ol_o_id, avg(ol_quantity), avg(ol_amount)
FROM order_line
GROUP BY 1,2
ORDER BY 2,3


Page 21: A12 vercelletto indexing_techniques

No Self join

• Use onmode -wm INDEX_SELFJOIN=0 to disable self join


• The index is used, but key-first only
• Many rows scanned
• Response time: 11.258’’

Page 22: A12 vercelletto indexing_techniques

Self join: find the differences!

• Key-first + self-join access
• Rows scanned: roughly 100 times fewer
• RT: 3.313’’

Page 23: A12 vercelletto indexing_techniques

The Antique Informix Rule says: “you will use only one index per table”

Page 24: A12 vercelletto indexing_techniques

The AIR says: “you will use only one index per table”

• The Antique Informix Rule stated that only one index per table could be used

• The optimizer had to choose only one index among several indexes for the same table, even though several indexes were needed.

• Many not-so-unrealistic query cases had to be drastically rewritten in order to provide acceptable response times

• The trick was generally to use a UNION or a nested query, but the query code’s readability and maintainability suffered from it.


Page 25: A12 vercelletto indexing_techniques

What A.I.R. obliged you to do

• Generally, the best way to work around the RT issue was to use either UNION or nested queries

• The trick could be efficient in terms of Response Time, but the code became more complex to read and maintain

• This workaround required heavy modifications to the application code, and detailed, accurate tests to ensure the same results as the initial query


Page 26: A12 vercelletto indexing_techniques

The optimizer constantly getting smarter across releases

• An optimizer enhancement introduced the use of several indexes on the same table, but only if the WHERE conditions were linked with the ‘OR’ operator.

• The query path is like a usual INDEX PATH, the difference being the use of several indexes


Page 27: A12 vercelletto indexing_techniques

Measure with INDEX PATH

• Simple INDEX PATH, using 3 indexes!
• Scanned rows: 376,000
• Disk reads: 34,136
• RT: 2.489’’

Page 28: A12 vercelletto indexing_techniques

Multi index: a different path, 33% gain in RT

• Multi-index / skip scan enabled
• Response Time is shorter
• 3 indexes used
• Disk reads: 1,984

Page 29: A12 vercelletto indexing_techniques

Multiple indexes: what should be done?

• Generally, the optimizer decides correctly which path is best

• You can compare the results with the UNION rewrite, then decide whether keeping the hard-to-maintain code is worth it

• You can nonetheless use optimizer directives to force the access method, like {+ AVOID_MULTI_INDEX (clients)} to force an INDEX PATH

• Or {+ MULTI_INDEX (clients)} to force a multi-index SKIP SCAN path

• Choosing yourself can get tricky when AND and OR conditions are set on the involved indexes

• The difference is almost invisible in a hot measure

• Statistics on indexes are very important: the access method can change according to them!
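A sketch of both directives on an OR-connected query (table, column and filter values are hypothetical):

```sql
-- Force the multi-index SKIP SCAN path
SELECT {+MULTI_INDEX(clients)} *
FROM   clients
WHERE  lastname = 'MARTIN' OR zipcode = '29120';

-- Force a single INDEX PATH instead, for comparison
SELECT {+AVOID_MULTI_INDEX(clients)} *
FROM   clients
WHERE  lastname = 'MARTIN' OR zipcode = '29120';
```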


Page 30: A12 vercelletto indexing_techniques

Star join

• Star join is an extension of the MULTI INDEX concept

• It combines this technique with DYNAMIC HASH JOINS

• The technique has been ported from XPS to IDS 11.70

• It is used exclusively for DS/OLAP queries where a FACT table is the center point of many dimension tables

• Requires PDQPRIORITY ( Ultimate Edition or Enterprise Edition )

• If you consider using Star Join, you are an excellent candidate to see a demo of Informix Warehouse Accelerator!
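As an illustration only, a star-join-shaped query could look like the sketch below: sales plays the FACT table, product and store are dimensions (all names hypothetical), and PDQ is switched on first:

```sql
-- PDQPRIORITY must be > 0 for star join to be considered
SET PDQPRIORITY 50;

SELECT   p.category, s.region, SUM(f.amount)
FROM     sales f, product p, store s
WHERE    f.product_id = p.product_id
AND      f.store_id   = s.store_id
AND      p.category IN ('BOOKS', 'MUSIC')
GROUP BY 1, 2;
```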


Page 31: A12 vercelletto indexing_techniques

The A.I.R says: « you will avoid indexes with too many tree levels »

• Ok, but what can I do about it? My indexes are built from the data they contain, and nothing or almost nothing can be done

• Databases and tables are getting bigger and bigger, and splitting/archiving part of the data is not always an acceptable solution


Page 32: A12 vercelletto indexing_techniques

FOREST OF TREES INDEXES

• The forest of trees index type has been introduced in 11.70 xC1

• It replicates the model of a traditional B-TREE, but with several root nodes instead of only one root node

• The forest of trees brings benefits when contention against the root node is observed


Page 33: A12 vercelletto indexing_techniques

Reducing the number of b-tree levels on index « lastname, firstname »

• create index "informix".id_clients_02 on "informix".clients (lastname, firstname) using btree
=> The initial number of b-tree levels is 6

• create index "informix".id_clients_02 on "informix".clients (lastname, firstname) using btree hash on (lastname) with 10 buckets
=> The number of b-tree levels decreases to 5

• create index "informix".id_clients_02 on "informix".clients (lastname, firstname) using btree hash on (lastname) with 100 buckets
=> The number of b-tree levels decreases to 4

• create index "informix".id_clients_02 on "informix".clients (lastname, firstname) using btree hash on (lastname) with 1000 buckets
=> The number of b-tree levels decreases to 3


Page 34: A12 vercelletto indexing_techniques

Tpcc with regular b-tree indexes

• Index iu_stock_01 has 4 levels

• Tpcc result is 14,093 tpmC

• High contention on iu_stock_01: 8,704,052 spins in 4 min

Page 35: A12 vercelletto indexing_techniques

Tpcc with FOT on iu_stock_01

• create unique index iu_stock_01 on stock (s_w_id, s_i_id) using btree in data03 HASH on (s_w_id) with 50 buckets;

• Index iu_stock_01 now has 3 levels

• The result grew to 16,413 tpmC

• Contention on iu_stock_01 decreased from 8,704,000 to 149,600 spins in 4 min

• iu_oorder_01 is now a good candidate for FOT!

Page 36: A12 vercelletto indexing_techniques

Main facts on FOT indexes

• FOT is very efficient at reducing concurrency on index access => better RT in an OLTP context

• FOT is very efficient at reducing the number of B-TREE levels => better overall RT

• Ideal for primary keys and foreign keys in a high-concurrency OLTP context

• Implementation is easy and fast

• Supports the main index functionality: ER, PK, FK, b-tree cleaning…

• Does not support aggregate queries or range scans on HASH ON columns

• Also does not support index clustering, index fillfactor or functional (UDR-based) indexes


Page 37: A12 vercelletto indexing_techniques

Optimizing big index creation: PSORT_NPROCS

• The PSORT_NPROCS env variable is used to allocate more threads to the sort package, which is also used for parallel index creation.

• Significant performance improvements on index creation can be obtained on multi-core/multi-processor servers

• It can be used even with non PDQPRIORITY-enabled editions if the server has more than one core/CPU.

• PSORT_NPROCS can drive up memory consumption: please check the available memory on the server.

• The onconfig parameter DS_NONPDQ_QUERY_MEM has to be checked if using PSORT_NPROCS.


Page 38: A12 vercelletto indexing_techniques

Optimizing big index creation DBSPACETEMP or PSORT_DBTEMP

• The DBSPACETEMP environment variable overrides the onconfig parameter of the same name.

• Generally raw-device based temp dbspaces offer more performance than file system based files.

• PSORT_DBTEMP writes temporary sort files to the specified file-system directories instead of DBSPACETEMP.

• It is useful to spread the temporary sort files to a wider list of directories mounted on different spindles


Page 39: A12 vercelletto indexing_techniques

PSORT_NPROCS/PSORT_DBTEMP: facts

• create index id_clients_02 on clients(lastname, firstname)

• unset PSORT_NPROCS
unset PSORT_DBTEMP
=> 13m28.709s

• export PSORT_NPROCS=3
export PSORT_DBTEMP=/tmp:/ids_chunks/ids_space01:/ids_chunks/ids_space02:/ids_chunks/ids_space03
=> 6m19s

• A RAM disk, or even an SSD drive, can improve performance a lot:
export PSORT_NPROCS=3
export PSORT_DBTEMP=/mnt/myramdisk
=> 4m22.030s

• To check the environment of the session: onstat –g env SessionNumber

Page 40: A12 vercelletto indexing_techniques

Index disable: What happens?

• Disabling an existing index will prevent the server from using this index, but it will « remember » the index schema.

• This technique can be applied before executing massive data insert or update, since it will alleviate the index keys update workload.

• Heavy side effects can be expected: loss of key uniqueness, loss of performance…

• If you run a query on a disabled index, the optimizer will probably choose a sequential scan unless a better path is found.

• The index will be shown as ‘disabled’ in dbschema, but will not be seen in oncheck –pT nor in oncheck –pe

• Disabling an index will make its former disk space available in the dbspace

• Disabling an index is immediate

• Syntax is: set indexes IndexName disabled


Page 41: A12 vercelletto indexing_techniques

Index enable: what happens?

• Enabling an index will rebuild the index physically, with the same definition as before

• Enabling an index takes as much time as creating the same index

• But the enable statement is simpler to type than the create index statement

• Plus, you do not have to remember the initial create index statement

• Syntax is: set indexes IndexName enabled
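The disable / reload / enable cycle described in these two slides can be sketched as follows (index name taken from the earlier examples; remember the side effects listed on the previous slide):

```sql
-- Remove the index-key update workload during a massive load
SET INDEXES id_clients_02 DISABLED;

-- ... run the massive INSERT / UPDATE batch here ...

-- Rebuild the index; this takes as long as the original CREATE INDEX
SET INDEXES id_clients_02 ENABLED;
```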


Page 42: A12 vercelletto indexing_techniques

Digging for more performance: Disable foreign key indexes

• Many times, foreign key indexes are part of the same table’s primary key.

• order_line primary key: (ol_w_id, ol_d_id, ol_o_id, ol_number)
order_line foreign key: (ol_w_id, ol_d_id, ol_o_id)

• Using ‘index disabled’ in the add constraint statement avoids creating a redundant index, because its structure already exists in the primary key.

• ALTER TABLE order_line ADD CONSTRAINT (FOREIGN KEY (ol_w_id, ol_d_id, ol_o_id) REFERENCES oorder (o_w_id, o_d_id, o_id) CONSTRAINT ol2 INDEX DISABLED);

• This implementation saves disk space by dropping an index

• CPU resources are saved when updating/deleting/creating index keys,

• and consequently disk IO is also saved.

• Check that disabling the constraint index has no hidden side effects; a mistake can have expensive consequences!

Page 43: A12 vercelletto indexing_techniques

I need to create a new index, but users are always connected to the table!

• Sometimes a new index needs to be created, but the tables are accessed by users or batches.

• IDS 11.10 introduced the possibility to create an index without putting an exclusive lock on the table, called index online.

• Users can SELECT, INSERT, UPDATE or DELETE rows in the table while the index is being created

• Syntax is: create index id_clients_01 on clients(lastname,firstname)ONLINE

• Drop index online is also available in the same conditions

43

Page 44: A12 vercelletto indexing_techniques

Create index online: precautions & restrictions

• The create index online is a complex operation, involving table snapshot, base index build catch up and more.

• It will request additional resources, such as disk space, CPU and memory in order to make the operation safe and as fast as possible.

• Long transactions may happen: check logical logs size before diving

• The index pre-image pool memory size is managed with the onconfig parameter ONLIDX_MAXMEM, updatable with onmode –wm

• Not applicable to cluster indexes, UDT columns or UDR-based indexes

• Only one create index online per table at the same time


Page 45: A12 vercelletto indexing_techniques

Index compression

• IDS introduced table compression in 11.50 xC4. This technology is now used successfully in large database implementations.

• Index compression is a new feature of IDS 12.10. It is based on the same technology as table compression.

• The principle is to compress the key columns values at b-tree leaf level, but not the rowids attached to these key values

• Index compression is very effective for indexes having large key values: names, item names etc…

• The compression dictionary must contain at least 2000 unique key values

• Index compression is an excellent way to save disk space, and…

• Since more key values fit in an index page, more key values can be read in one IO cycle => IO is more efficient

• Reducing IO should enhance index access performance in large queries

Page 46: A12 vercelletto indexing_techniques

Index compression: Disk space gained

• execute function task("index compress", "id_clients_01", "staging");

• Or execute function task("index compress", "j", "testdb");

• Or create index id_clients_01 on clients(lastname, firstname) compressed

More than 50% compression rate

Page 47: A12 vercelletto indexing_techniques

Cluster index

• The creation or alter of a cluster index will physically sort the table data by the first column of this index at creation time

• Accessing a table data with a cluster index will read already sorted data pages.

• Generally makes IO on data pages easier because they are contiguous => Decrease RT

• The cluster level will decrease as new rows are inserted

• High cost of administration: re-clustering this index will rewrite the table data pages

• A cluster index can be good for stable tables accessed in an ordered, sequential way
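A sketch of the create / re-cluster cycle (index and column names are hypothetical):

```sql
-- Physically sort the table rows in index order at creation time
CREATE CLUSTER INDEX id_clients_03 ON clients (zipcode);

-- Later, once inserts have degraded the cluster level,
-- re-clustering rewrites the table's data pages (expensive!):
ALTER INDEX id_clients_03 TO NOT CLUSTER;
ALTER INDEX id_clients_03 TO CLUSTER;
```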


Page 48: A12 vercelletto indexing_techniques

Statistics on indexes

• Introduced in 11.70: when one creates an index, the distributions for this index are automatically created

• High mode statistics are generated for the lead column

• Index levels statistics are also generated in low mode

• This does not exempt you from regularly updating statistics for those indexes, but it is no longer required to do it just after the index creation
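A sketch of that regular refresh, using the clients index from the earlier examples:

```sql
-- Refresh the lead-column distributions in HIGH mode,
-- and the table/index structural statistics in LOW mode
UPDATE STATISTICS HIGH FOR TABLE clients (lastname) DISTRIBUTIONS ONLY;
UPDATE STATISTICS LOW FOR TABLE clients;
```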

Page 49: A12 vercelletto indexing_techniques

Questions?

Indexing techniques: which one to use when Eric Vercelletto Begooden IT Consulting [email protected]