tuning - rensselaer polytechnic institute

Tuning CSCI 4380 - Database Systems

Post on 21-Feb-2022


Database Tuning

• Changing an application and DBMS environment to improve system performance

• A workload consists of various operations performed by the system and their frequency

• Performance is usually measured in terms of response time

• It is important to identify and address the system bottlenecks: what are the most time-consuming operations?

Database Tuning

• Step 1: buy more hardware

• memory is crucial for buffering query operations and caches for various operations

• hard disk speed is crucial, buy faster and more disks to improve the parallelism

• Step 2: tune the system installation

• databases provide a large number of tunable parameters, read database administration books

Disk caches

• A cache is a set of buffer pages maintained by the DBMS for a specific purpose

• Data cache for reading pages containing the index or the relation

• Procedure cache for storing previously constructed query plans

• Caches are usually shared between concurrent users

• Any requested item must be brought from disk into the cache before it can be read or modified

• If it is already in the cache, then the cache has a hit, otherwise the cache has a miss

• Since each hit is a savings in time, hit ratio must be maximized (some application designers seek 90% hit ratio)

Disk caches

• If a new item has to be inserted into the cache, another item might need to be removed.

• A cache replacement algorithm decides what should be removed, for example LRU (least recently used) or MRU (most recently used)

• A recently used page may be needed again for an update in the near future (favors LRU)

• A page read during a table scan is no longer needed once it has been processed (favors MRU)

• Sophisticated caches may take the access pattern of the algorithm that is using the data into account

• How would you use the cache for an index page?

• A dirty page is a page that has been modified in the cache but whose changes have not yet been written to disk. If such a page is moved out of the cache, it must first be written back to disk
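The difference between the two replacement policies can be sketched with a small simulation. This is an illustrative model only, not how any particular DBMS implements its buffer manager; the `simulate` function and the workload (repeated scans of a 5-page table with a 4-page cache) are assumptions made up for the example.

```python
from collections import OrderedDict


def simulate(requests, capacity, policy):
    """Replay page requests against a cache and return the hit ratio.

    policy is "LRU" (evict the least recently used page) or
    "MRU" (evict the most recently used page).
    """
    cache = OrderedDict()  # page ids, ordered from oldest to newest use
    hits = 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
            continue
        if len(cache) >= capacity:
            # last=False pops the oldest entry (LRU), last=True the newest (MRU)
            cache.popitem(last=(policy == "MRU"))
        cache[page] = True
    return hits / len(requests)


# Hypothetical workload: scan a 5-page table four times with a 4-page cache.
scan = list(range(5)) * 4
lru_ratio = simulate(scan, capacity=4, policy="LRU")
mru_ratio = simulate(scan, capacity=4, policy="MRU")
print(f"LRU hit ratio: {lru_ratio:.2f}, MRU hit ratio: {mru_ratio:.2f}")
```

On this scan-heavy workload LRU evicts each page just before it is requested again, while MRU keeps most of the table resident, which is the intuition behind preferring MRU for table scans.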

Tuning the cache

• Divide the cache and bind a specific item to a cache (different tables may be cached in different caches)

• Divide the cache into pools of varied size, 2K, 4K, 8K, etc.

• The query processor can choose the best available cache for a query (retrieve large sequences for table scans, even prefetch disk pages that are expected to be requested next)

• Procedure cache may use previously optimized query plans

• Hint: use program variables to increase possible reuse

SELECT P.name FROM Professor P WHERE P.deptId = :deptid
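A minimal sketch of the hint, using Python's built-in sqlite3 module as a stand-in for the DBMS. The table contents are made up, and whether a plan is actually reused depends on the system; the point is only that the SQL text stays constant while the bound value changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Professor (name TEXT, deptId TEXT)")
conn.executemany(
    "INSERT INTO Professor VALUES (?, ?)",
    [("Ada", "CS"), ("Grace", "CS"), ("Emmy", "MATH")],  # made-up rows
)

# The SQL text is identical for every department, only the bound value
# changes, so a DBMS with a plan cache can match it against an earlier plan.
QUERY = "SELECT P.name FROM Professor P WHERE P.deptId = :deptid ORDER BY P.name"


def professors_in(dept):
    """Run the shared query text with a department id bound to :deptid."""
    return [row[0] for row in conn.execute(QUERY, {"deptid": dept})]


cs_names = professors_in("CS")
math_names = professors_in("MATH")
print(cs_names, math_names)
```

Building the query by pasting the department id into the string would instead produce a different SQL text for each value, which defeats text-based plan reuse.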

Partitioning

• Step 3: partition your data

• Vertical partitioning divides the attributes in the relation and distributes them to different disks or tablespaces

• Frequently queried attributes could be separated from infrequently queried attributes.

• Horizontal partitioning divides the tuples in the relation across multiple disks

• Allows parallelism in reading data from disk

• Some optimizers are able to concentrate on a single partition given a specific query
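How an optimizer might concentrate on a subset of horizontal partitions can be sketched in plain Python. The films, the partitioning by decade, and the `titles_before` helper are all hypothetical; a real DBMS does this internally from the partition definitions.

```python
from collections import defaultdict

# Hypothetical Films tuples: (filmid, title, year).
films = [
    (1, "Film A", 1958),
    (2, "Film B", 1964),
    (3, "Film C", 1984),
    (4, "Film D", 1995),
    (5, "Film E", 2003),
]


def decade(year):
    return year // 10 * 10


# Horizontal partitioning: each tuple goes to exactly one partition,
# chosen here by the decade of its year attribute.
partitions = defaultdict(list)
for film in films:
    partitions[decade(film[2])].append(film)


def titles_before(year):
    """Answer 'year < X' while skipping partitions that cannot match."""
    scanned, result = [], []
    for key in sorted(partitions):
        if key >= year:  # every tuple here has year >= key >= X
            continue
        scanned.append(key)
        result += [title for _, title, y in partitions[key] if y < year]
    return result, scanned


old_titles, scanned_partitions = titles_before(1965)
print(old_titles, scanned_partitions)
```

Only the partitions that can contain matching tuples are read, and in a real system those partitions could also be read in parallel from different disks.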

Denormalization

• Step 4: change your data model

• Normalization reduces redundancy and null values

• Lower storage requirements, simple queries and updates will be faster

• Results in more tables, hence complex queries need more joins

SELECT FAN.Title

FROM Films F, FilmAlternateNames FAN

WHERE F.filmid = FAN.filmid

• Denormalization deliberately stores relations in a redundant, non-normalized form

• Store alternate names in a string and use application code to update and print the alternate names

Denormalization

• Add extra columns for frequently accessed information

• Number of movies per actor:

SELECT A.stagename, COUNT(DISTINCT C.filmid)

FROM Actors A, Casts C

WHERE A.actorid=C.actorid

GROUP BY A.actorid, A.stagename

• Add a column “NumMovies” instead.

• This column must be updated in the application anytime an update is made to the casts relation
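A minimal sketch of keeping such a redundant column consistent from application code, using sqlite3 as a stand-in DBMS. The schema, the `add_cast` helper, and the sample rows are assumptions for illustration; recomputing the count on each insert is just one option, and an incremental update would also work when duplicates are ruled out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Actors (actorid INTEGER PRIMARY KEY, stagename TEXT,
                         NumMovies INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE Casts (actorid INTEGER, filmid INTEGER, role TEXT);
    INSERT INTO Actors (actorid, stagename) VALUES (1, 'Kevin Bacon');
""")


def add_cast(actorid, filmid, role):
    """Insert a Casts tuple and refresh the redundant counter in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("INSERT INTO Casts VALUES (?, ?, ?)", (actorid, filmid, role))
        conn.execute(
            """UPDATE Actors
               SET NumMovies = (SELECT COUNT(DISTINCT filmid) FROM Casts
                                WHERE Casts.actorid = Actors.actorid)
               WHERE actorid = ?""",
            (actorid,),
        )


add_cast(1, 10, "Role A")
add_cast(1, 10, "Role B")  # second role in the same film, count stays at 1
add_cast(1, 20, "Role C")
num_movies = conn.execute(
    "SELECT NumMovies FROM Actors WHERE actorid = 1").fetchone()[0]
print(num_movies)
```

Doing both statements in one transaction matters: if the counter update were skipped or failed separately, the denormalized column would silently drift from the Casts relation.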

Denormalization

• Certain attributes might be duplicated if they are used often

• Store stagename attribute in the casts relation

• Queries involving this attribute are now fully answered from casts (avoiding a costly join)

• Attributes of Actors other than stagename might be queried rarely but take a lot of space, which makes reading stagename from the Actors relation expensive

• Anytime a new actor is added or a stagename is changed, the changes must be propagated to the casts relation by updating multiple tuples (this may be rare compared to the queries)

• The CASTS relation now stores redundant information and is larger in size

Query Restructuring

• Nested queries are hard to optimize.

• Inner and outer expressions are optimized separately.

• For correlated expressions, the inner query is executed once for each tuple in the outer expression, so it may run many times.

• Certain possible optimizations could be missed with a nested query (suppose an index for casts on (actorid, filmid) existed)

SELECT DISTINCT F.title FROM Films F, Casts C

WHERE F.filmid=C.filmid AND

EXISTS (SELECT * FROM Actors A

WHERE A.stagename like '%Bacon' AND A.actorid=C.actorid)

Restructure queries

• Step 5: avoid nested queries, use joins whenever possible

All queries below are equivalent:

SELECT C.filmid FROM Casts C

WHERE EXISTS (SELECT * FROM Actors A

WHERE A.stagename='Kevin Bacon' AND A.actorid=C.actorid)

SELECT C.filmid FROM Casts C

WHERE C.Actorid IN (SELECT A.actorid FROM Actors A WHERE A.stagename='Kevin Bacon')

SELECT C.filmid FROM Casts C, Actors A

WHERE C.Actorid = A.actorid AND A.stagename='Kevin Bacon'
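The equivalence of the three formulations can be checked on a small example. The sketch below uses sqlite3 with made-up rows and assumes that actorid is the key of Actors, which is what keeps the join version from producing extra duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Actors (actorid INTEGER PRIMARY KEY, stagename TEXT);
    CREATE TABLE Casts (actorid INTEGER, filmid INTEGER);
    INSERT INTO Actors VALUES (1, 'Kevin Bacon'), (2, 'Someone Else');
    INSERT INTO Casts VALUES (1, 10), (1, 20), (2, 10), (2, 30);
""")

queries = {
    "exists": """SELECT C.filmid FROM Casts C
                 WHERE EXISTS (SELECT * FROM Actors A
                               WHERE A.stagename = 'Kevin Bacon'
                                 AND A.actorid = C.actorid)""",
    "in": """SELECT C.filmid FROM Casts C
             WHERE C.actorid IN (SELECT A.actorid FROM Actors A
                                 WHERE A.stagename = 'Kevin Bacon')""",
    "join": """SELECT C.filmid FROM Casts C, Actors A
               WHERE C.actorid = A.actorid
                 AND A.stagename = 'Kevin Bacon'""",
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in queries.items()}
print(results)
```

All three return the same film ids here; the join form simply gives the optimizer the most freedom in choosing a join order and access path.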

Drops of wisdom

• Avoid sorts (distinct, order by, group by, union, except), they are costly

• Some queries do not need a distinct or can be rewritten to avoid sorts

• Avoid full table scans

• A search on a condition like A <> 3 or A like '%Bacon' might result in a table scan

• A search like A in (1,2,4) might be preferable (depending on the availability of statistics)

• Avoid retrieving tuples into application code. Stored procedures, and even complicated queries, might be preferable to the added communication cost

Use views wisely

• Even though views are useful in application development, use a view in an application only when it fits the given query

CREATE VIEW together(actorid1, stagename1, actorid2, stagename2) AS

SELECT A1.actorid, A1.stagename, A2.actorid, A2.stagename

FROM Actors A1, Actors A2, Casts C1, Casts C2

WHERE A1.actorid=C1.actorid AND C1.filmid=C2.filmid AND

A2.actorid=C2.actorid AND A1.actorid<>A2.actorid

SELECT t.actorid1 FROM together t WHERE t.stagename1 = 'Kevin Bacon'

• None of the joins are necessary to answer this query. The optimizer might miss some faster query plans

The use of indices

• Indices speed up queries, but slow down insert/delete/update operations

• A clustered index allows fast access to a range query

• A relation can have at most one clustered index

• Databases usually create one for the primary key by default

• Reconstruction of clustered indices is costly

The use of indices

• Step 6: choose indices

• Find the most useful clustered indices and use them if they are very useful for a range of queries and supported by the database

• Next, find the most selective indices to add

• Finally, find indices that might help with index-only scans

Clustered Indices

• We can create clustered indices in Oracle with:

CREATE CLUSTER xyz (A int, B char(10)) ;

CREATE INDEX xyz_idx ON CLUSTER xyz ;

CREATE TABLE myTable

(Id Int, Name Char(10), Phone Char(7), ...)

CLUSTER xyz (Id, Name) ;

Clusters in PostgreSQL

• Clusters in PostgreSQL are generated once and are not modified incrementally.

• Clusters are created based on an index

• CLUSTER tablename USING indexname (releases before 8.3 used CLUSTER indexname ON tablename)

• The table must be re-clustered every time it needs to be reorganized, which may be too costly for an application.

Indices

• Create a clustered index on attributes that are frequently queried with a range condition or that have many matching tuples for a single value

SELECT A.firstname, A.lastname FROM Actors A WHERE A.stagename = 'Kevin Bacon'

SELECT C.filmid FROM Actors A, Casts C

WHERE A.stagename = 'Kevin Bacon' AND A.actorid=C.actorid

SELECT F.title FROM Films F WHERE F.year < 1965

• For the Casts relation, suppose there exists a clustered index on (FILMID, ACTORID, ROLE). Is this the most useful clustering of these attributes?

Indices

• Clustered indices also provide a sorted order to the relation

SELECT F.year, Count(DISTINCT F.filmid)

FROM Films F, Casts C, Actors A

WHERE C.filmid=F.filmid AND C.actorid = A.actorid AND

A.stagename = 'Kevin Bacon'

GROUP BY F.year

• Suppose the FILMS relation is clustered with respect to YEAR. Then, using an index on “filmid” might not be the best query execution choice.

Indices

• Create unclustered indices on attributes with high selectivity

SELECT A.stagename FROM Actors A WHERE A.gender = 'F'

SELECT A.firstname, A.lastname FROM Actors A WHERE A.stagename = 'Kevin Bacon'

• Index nested loop join is also beneficial when there is a highly selective index

SELECT C.filmid FROM Actors A, Casts C

WHERE A.stagename = 'Kevin Bacon' AND A.actorid=C.actorid
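One rough way to compare the selectivity of two candidate attributes is to look at how many tuples share a single value on average. The sketch below does this with sqlite3 on made-up rows; the helper name `avg_matches_per_value` is hypothetical, and real optimizers rely on their own collected statistics rather than on ad hoc counts like these.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Actors (actorid INTEGER PRIMARY KEY, stagename TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO Actors (stagename, gender) VALUES (?, ?)",
    [(f"Actor {i}", "F" if i % 2 else "M") for i in range(6)],  # made-up rows
)


def avg_matches_per_value(column):
    """Average number of Actors tuples that share one value of the column."""
    if column not in ("stagename", "gender"):  # guard the interpolated name
        raise ValueError(column)
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}) FROM Actors"
    ).fetchone()
    return total / distinct


stagename_matches = avg_matches_per_value("stagename")
gender_matches = avg_matches_per_value("gender")
print(stagename_matches, gender_matches)
```

A lookup on stagename touches about one tuple per value, so an unclustered index pays off; a lookup on gender matches a large share of the relation, where a scan is usually competitive.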

Indices

• For frequently asked queries, indices might be created to allow index-only searches.

• For example, given an index on (stagename, actorid) for Actors, answering a query like the one below now requires only an index search for Actors.

SELECT C.filmid FROM Actors A, Casts C

WHERE A.stagename = 'Kevin Bacon' AND A.actorid=C.actorid
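An index-only search can be observed in SQLite, used here purely as a convenient stand-in: its EXPLAIN QUERY PLAN output reports a covering index when the query can be answered without touching the table. The index name is made up, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Actors (actorid INTEGER PRIMARY KEY, stagename TEXT,
                         firstname TEXT, lastname TEXT);
    CREATE INDEX actors_stagename_actorid ON Actors (stagename, actorid);
""")

# Only stagename and actorid are referenced, and both are stored in the index.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT A.actorid FROM Actors A WHERE A.stagename = 'Kevin Bacon'"
).fetchall()

# The last column of each plan row holds the human-readable description.
plan_text = " | ".join(row[-1] for row in plan_rows)
print(plan_text)
```

If the query also selected firstname, the index would no longer cover it and the plan would have to fetch the matching tuples from the relation.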

Indices

• For example, given a query like the one below:

SELECT A2.stagename

FROM Actors A1, Casts C1, Casts C2, Actors A2

WHERE A1.stagename = 'Kevin Bacon' AND A1.actorid=C1.actorid AND

C1.filmid = C2.filmid AND A2.actorid=C2.actorid

• For A1, the index is searched in the usual way. For A2, the index on (stagename, actorid) can be scanned in full instead of the relation to speed up the query.

Indices

• Indices do not always help reduce the cost of queries,

• they must be selective

• they must be significantly smaller in size than the relation they are indexing

• they must be used often in queries where they make a difference

• Foreign keys introduce hidden costs to updates since they must be checked for all updates that relate to them

• Count queries can be answered using indices on attributes with a “NOT NULL” constraint (check whether the index stores null values)

Other hints

• Partition data to multiple disks

• Place data that is accessed sequentially on its own disk

• Invoke parallel query processing when multiple CPUs are available

• Create more detailed statistics (histograms)

• Recompute statistics periodically as needed

• Examine the query plans generated by the system and influence them as necessary

Oracle Optimizer

• Oracle's CBO (cost-based optimizer) relies heavily on table statistics being available for all tables used in a query.

• A table does not have statistics available until you ask the database to compute them for you.

• In Oracle 9i this can be accomplished by running the DBMS_STATS package:

SQL> EXEC DBMS_STATS.GATHER_SCHEMA_STATS(USER);

• This will gather statistics for every table in your USER’s schema.

Postgres optimizer

• ANALYZE [ VERBOSE ] [ table [ (column [, ...] ) ] ]

• The statistics collected by ANALYZE usually include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column.

• Must be run periodically for updated statistics
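The effect of gathering statistics can be seen in miniature with SQLite, which also has an ANALYZE command and stores its results in the sqlite_stat1 table. This is a stand-in, not PostgreSQL: the statistics SQLite keeps are much simpler than the common-value lists and histograms described above, and the table and index here are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Actors (actorid INTEGER PRIMARY KEY, gender TEXT)")
conn.executemany(
    "INSERT INTO Actors VALUES (?, ?)",
    [(i, "F" if i % 2 else "M") for i in range(100)],  # made-up rows
)
conn.execute("CREATE INDEX actors_gender ON Actors (gender)")

# Before ANALYZE the optimizer has no collected statistics for this index.
conn.execute("ANALYZE")

# Each row describes one index: the stat string starts with the row count,
# followed by the average number of rows per distinct indexed value.
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'actors_gender'"
).fetchall()
print(stats)
```

As with PostgreSQL, these numbers are a snapshot: after heavy inserts or deletes they no longer describe the data until the statistics are gathered again.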

Oracle Optimizer

• You can ask the optimizer to give you the query plan for a query.

• You need to first create a table called PLAN_TABLE with a specific schema.

• For each query, you can now run the command:

• EXPLAIN PLAN FOR QUERY-STATEMENT ;

• which will insert the query plan into the plan_table.

Oracle optimizer

Example query for reading from the plan table:

SELECT

SUBSTR(LPAD(' ', LEVEL-1)

|| operation

|| NVL2(options,' (O:' || options || ')','')

|| NVL2(filter_predicates,' (F:' || filter_predicates || ')','')

|| NVL2(access_predicates,' (A:' || access_predicates || ')','')

, 1,40) AS operation

,object_name

FROM plan_table

START WITH id = 0

CONNECT BY PRIOR id = parent_id

Oracle optimizer

Example result from the PLAN_TABLE:

EXPLAIN PLAN FOR SELECT * FROM dual WHERE 1=2;

OPERATION                                OBJECT_NAME
---------------------------------------- ------------------------------
SELECT STATEMENT
 FILTER (F:1=2)
  TABLE ACCESS (O:FULL)                  DUAL

Postgres optimizer

• The same method (no extra table is required for plans)

explain analyze select * from books b, book_comments bc where b.id=bc.book_id ;

QUERY PLAN

---------------------------------------------------------------------------

Hash Join (cost=1.04..2.11 rows=2 width=1154) (actual time=0.285..0.306 rows=2 loops=1)

Hash Cond: (b.id = bc.book_id)

-> Seq Scan on books b (cost=0.00..1.03 rows=3 width=371) (actual time=0.073..0.080 rows=3 loops=1)

-> Hash (cost=1.02..1.02 rows=2 width=783) (actual time=0.084..0.084 rows=2 loops=1)

-> Seq Scan on book_comments bc (cost=0.00..1.02 rows=2 width=783) (actual time=0.058..0.063 rows=2 loops=1)

Total runtime: 0.440 ms

Oracle optimizer

• You can also provide optimizer hints, for example telling it to use specific indices.

• The optimizer is free to ignore your hints.

• select /*+ INDEX(dept) */ ....    Use an index on table dept

• select /*+ ALL_ROWS */ ...        Execute to return all rows asap

• select /*+ FIRST_ROWS */ ...      Execute to return first row asap

• select /*+ USE_NL */ ...          Use nested loop join on all tables

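PostgreSQL optimizer

• The PostgreSQL core has no comment-style hints: a /*+ ... */ comment is simply ignored. The planner can instead be steered per session with run-time settings such as enable_nestloop:

• set enable_nestloop = off ; explain analyze select * from books b, book_comments bc where b.id=bc.book_id ;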

• You must know something the optimizer does not for this to make a difference.