AWR DB Performance Data Mining - Collaborate 2015
Yury Velikanov Oracle DBA
Mission
Help you remember to consider AWR the next time you troubleshoot
a performance issue!
AWR Agenda
• Introduction & Background
• Examples, Examples, Examples
• Concept & Approach
• More examples
• Q & A
Background
• AWR is one of many RDBMS performance data sources
• Sometimes it isn’t the best source (aggregation)
  • SQL Extended trace (event 10046)
    • RAW trace
    • tkprof
    • TRCAnlzr [ID 224270.1]
    • Method-R state-of-the-art tools
  • PL/SQL Profiler
  • LTOM (Session Trace Collector)
  • others
• Sometimes it is the best/efficient source!
• Sometimes it is the only one available!
Background
• Once I was called to troubleshoot high load
  • Connected to the database and saw 8 active processes that had been running for 6 hours on average
  • Used the 10046 event on all 8 processes for 15 minutes
  • Found several SQLs returning 1 row a million times
  • Passed the results to development, asking them to fix the logic
  • Spent ~2 hours finding where the issue was
• Next day a colleague asked me
  • Why did you use 10046 and spend 2 hours?
  • He used an AWR report and came up with the same answer in less than 5 minutes
• Lesson learned: the right tool for the right case!
When should you consider AWR mining?
• General resource tuning (high CPU, IO utilization)
• Find TOP resource consuming SQLs
• You are asked to reduce server load X times
• You would like to analyze load patterns/trends
• You need to travel back in time and see how things progressed
• You don’t have any other source of performance information
• AWR reports don’t provide the information at the right angle/dimension or are not available (Grid Control, awrrpt.sql)
• AWR SQL Execution Plans historical information analysis
When it is better to use other methods?
• You need to tune a procedure/function/activity
• You have a repeatable test case
• The problem could be repeated in an idle environment
• There is no concurrent resource usage
• SQL Trace (10046) is a much better troubleshooting method in such cases
• When the application doesn’t use bind variables
TOP CPU/IO Consuming SQLs ?
select
SQL_ID,
sum(CPU_TIME_DELTA),
sum(DISK_READS_DELTA),
count(*)
from
DBA_HIST_SQLSTAT
group by
SQL_ID
order by
sum(CPU_TIME_DELTA) desc
/
SQL_ID SUM(CPU_TIME_DELTA) SUM(DISK_READS_DELTA) COUNT(*)
------------- ------------------- --------------------- ----------
05s9358mm6vrr 27687500 2940 1
f6cz4n8y72xdc 7828125 4695 2
5dfmd823r8dsp 6421875 8 15
3h1rjtcff3wy1 5640625 113 1
92mb1kvurwn8h 5296875 0 1
bunssq950snhf 3937500 18 15
7xa8wfych4mad 2859375 0 2
...
TOP CPU Consuming SQLs ?
select
s.SQL_ID,
sum(s.CPU_TIME_DELTA),
sum(s.DISK_READS_DELTA),
count(*)
from
DBA_HIST_SQLSTAT s
group by
s.SQL_ID
order by
sum(s.CPU_TIME_DELTA) desc
TOP CPU Consuming SQLs ?
select * from
(
select
s.SQL_ID,
sum(s.CPU_TIME_DELTA),
sum(s.DISK_READS_DELTA),
count(*)
from
DBA_HIST_SQLSTAT s
group by
s.SQL_ID
order by
sum(s.CPU_TIME_DELTA) desc
)
where rownum < 11
/
TOP CPU Consuming SQLs ?
select * from
(
select
s.SQL_ID,
sum(s.CPU_TIME_DELTA),
sum(s.DISK_READS_DELTA),
count(*)
from
DBA_HIST_SQLSTAT s, DBA_HIST_SNAPSHOT p
where 1=1
and s.SNAP_ID = p.SNAP_ID
and EXTRACT(HOUR FROM p.END_INTERVAL_TIME) between 8 and 16
group by
s.SQL_ID
order by
sum(s.CPU_TIME_DELTA) desc
)
where rownum < 11
/
TOP CPU Consuming SQLs ?
select * from
(
select
s.SQL_ID,
sum(s.CPU_TIME_DELTA),
sum(s.DISK_READS_DELTA),
count(*)
from
DBA_HIST_SQLSTAT s, DBA_HIST_SNAPSHOT p
where 1=1
and s.SNAP_ID = p.SNAP_ID
and EXTRACT(HOUR FROM p.END_INTERVAL_TIME) between 8 and 16
and p.END_INTERVAL_TIME between SYSDATE-7 and SYSDATE
group by
s.SQL_ID
order by
sum(s.CPU_TIME_DELTA) desc
)
where rownum < 11
/
TOP CPU Consuming SQLs ?
select * from
(
select
s.SQL_ID,
sum(s.CPU_TIME_DELTA),
sum(s.DISK_READS_DELTA),
count(*)
from
DBA_HIST_SQLSTAT s, DBA_HIST_SNAPSHOT p, DBA_HIST_SQLTEXT t
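where 1=1
and s.SNAP_ID = p.SNAP_ID
and s.DBID = p.DBID -- SNAP_ID alone is not unique across databases
and s.INSTANCE_NUMBER = p.INSTANCE_NUMBER -- nor across RAC instances
and s.SQL_ID = t.SQL_ID
and s.DBID = t.DBID
and EXTRACT(HOUR FROM p.END_INTERVAL_TIME) between 8 and 16
and t.COMMAND_TYPE != 47 -- Exclude PL/SQL blocks from output
and p.END_INTERVAL_TIME between SYSDATE-7 and SYSDATE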
group by
s.SQL_ID
order by
sum(s.CPU_TIME_DELTA) desc
)
where rownum < 11
/
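For readers without an AWR repository at hand, the filter-then-rank logic of the query above can be tried on made-up rows. The Python below is only an illustrative sketch: the sample rows, the `top_cpu` helper and its parameters are hypothetical and merely mimic a join of DBA_HIST_SQLSTAT and DBA_HIST_SNAPSHOT.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows standing in for DBA_HIST_SQLSTAT joined to
# DBA_HIST_SNAPSHOT: (sql_id, snapshot end time, CPU_TIME_DELTA in microseconds).
ROWS = [
    ("aaaa1", datetime(2015, 4, 13, 9, 0), 4_000_000),
    ("aaaa1", datetime(2015, 4, 13, 10, 0), 6_000_000),
    ("bbbb2", datetime(2015, 4, 13, 11, 0), 7_000_000),
    ("cccc3", datetime(2015, 4, 13, 23, 0), 90_000_000),  # outside 8..16, filtered out
    ("bbbb2", datetime(2015, 4, 13, 16, 0), 1_000_000),
]


def top_cpu(rows, first_hour=8, last_hour=16, limit=10):
    """Sum CPU per sql_id for snapshots ending in the given hours, largest first."""
    totals = defaultdict(int)
    for sql_id, end_time, cpu_delta in rows:
        # Same inclusive test as EXTRACT(HOUR FROM ...) between 8 and 16.
        if first_hour <= end_time.hour <= last_hour:
            totals[sql_id] += cpu_delta
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]  # plays the role of "where rownum < 11"


TOP = top_cpu(ROWS)
print(TOP)
```

Note that the heaviest statement overall (cccc3) disappears once the business-hours filter is applied, which is the kind of change of angle the extra predicates are meant to give.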
Concept & Approach
AWR = DBA_HIST_% objects
• 223 => 11.2.0.4.0
• 243 => 12.1.0.1.0
• I use just a few on a regular basis
  • DBA_HIST_ACTIVE_SESS_HISTORY - V$ACTIVE_SESSION_HISTORY
  • DBA_HIST_SEG_STAT - V$SEGMENT_STATISTICS
  • DBA_HIST_SQLSTAT - V$SQL
  • DBA_HIST_SQL_PLAN - V$SQL_PLAN
  • DBA_HIST_SYSSTAT - V$SYSSTAT ( ~SES~ )
  • DBA_HIST_SYSTEM_EVENT - V$SYSTEM_EVENT ( ~SESSION~ )
• Most of the views contain data snapshots from V$___ views
• DELTA columns (e.g. DISK_READS_DELTA)
  • DBA_HIST_SEG_STAT
  • DBA_HIST_SQLSTAT
AWR Things to keep in mind …
• The data are just snapshots of V$ views
• Data is collected based on thresholds (default top 30)
• Some data is excluded based on thresholds
• Some data may not be in the SGA at the time of the snapshot
• The longer the time between snapshots, the more data gets excluded
• For data mining use ALL snapshots available
AWR Things to keep in mind …
• Forget about AWR if there are literals in the code
  • An indicator is a high parse count (hard) rate (10-50 per second)
  • cursor_sharing = FORCE (use very carefully)
• In a RAC configuration do not forget the instance column in joins (INSTANCE_NUMBER in DBA_HIST views, INST_ID in GV$ views)
• Most of the V$ (DBA_HIST) performance views have cumulative counters; work with END - BEGIN values
  • You may get wrong results (sometimes negative)
  • Sometimes counters reach the max value and get reset
  • Counters get reset at instance restart time
• Time between snapshots may be different
  • Suggestion (ENDv - BEGINv)/(ENDs - BEGINs)=value/sec
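The END - BEGIN arithmetic and the per-second normalisation suggested above can be sketched outside the database. The Python below is a minimal illustration under assumptions: the function name `per_second_rates` and the sample counter values are made up, and a negative delta is simply treated as a counter reset and reported as None instead of a rate.

```python
def per_second_rates(snapshots):
    """Turn cumulative counter readings into per-second rates.

    snapshots: list of (time_in_seconds, cumulative_value), ordered by time.
    Returns one entry per pair of neighbouring snapshots; None marks a pair
    where the counter went backwards (reset or instance restart).
    """
    rates = []
    for (t_begin, v_begin), (t_end, v_end) in zip(snapshots, snapshots[1:]):
        delta = v_end - v_begin          # ENDv - BEGINv
        seconds = t_end - t_begin        # ENDs - BEGINs
        if delta < 0 or seconds <= 0:
            rates.append(None)           # do not report a misleading negative rate
        else:
            rates.append(delta / seconds)
    return rates


# Hypothetical readings: uneven snapshot spacing and one counter reset.
SNAPSHOTS = [(0, 1000), (3600, 37000), (5400, 55000), (9000, 200), (12600, 7400)]
RATES = per_second_rates(SNAPSHOTS)
print(RATES)
```

The second interval is only half as long as the first, yet both come out at the same rate; comparing raw deltas would have suggested that the load halved.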
AWR Things to keep in mind …
• Seconds count between 2 snapshots
select
s.BEGIN_INTERVAL_TIME,
s.END_INTERVAL_TIME,
s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME DTIME, -- Returns an INTERVAL DAY TO SECOND
EXTRACT(HOUR FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME) H,
EXTRACT(MINUTE FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME) M,
EXTRACT(SECOND FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME) S,
EXTRACT(DAY FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME)*24*60*60+ -- needed when the gap exceeds 24 hours
EXTRACT(HOUR FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME)*60*60+
EXTRACT(MINUTE FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME)*60+
EXTRACT(SECOND FROM s.END_INTERVAL_TIME-s.BEGIN_INTERVAL_TIME) SECS,
phy_get_secs(s.END_INTERVAL_TIME,s.BEGIN_INTERVAL_TIME), -- Write your own function
(cast(s.END_INTERVAL_TIME as date) - cast(s.BEGIN_INTERVAL_TIME as date))
*24*60*60 -- DATE arithmetic; fractional seconds are lost in the cast
from
DBA_HIST_SNAPSHOT s
where 1=1
and s.INSTANCE_NUMBER = (select INSTANCE_NUMBER from V$INSTANCE)
and s.DBID = (select DBID from V$DATABASE)
order by
s.BEGIN_INTERVAL_TIME;
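One detail worth checking in any seconds calculation is the day part of the interval: an expression that adds up only the hour, minute and second fields gives too small a number once two snapshots are more than 24 hours apart (for example after the instance was down). Python's timedelta shows the same effect, as a rough analogy only; the timestamps below are invented for illustration.

```python
from datetime import datetime

# Hypothetical snapshot boundaries: 1 day, 1 hour and 15 seconds apart.
begin = datetime(2013, 8, 13, 19, 0, 30)
end = datetime(2013, 8, 14, 20, 0, 45)
gap = end - begin

# timedelta.seconds covers only the hour/minute/second part of the gap,
# much like summing EXTRACT(HOUR), EXTRACT(MINUTE) and EXTRACT(SECOND).
HMS_ONLY = gap.seconds

# total_seconds() also counts the whole days.
FULL = gap.total_seconds()

print(HMS_ONLY, FULL)
```

Dividing a counter delta by the smaller figure would overstate the per-second rate for that interval, so the day component should be part of whatever helper is used.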
AWR Things to keep in mind …
select SNAP_INTERVAL, RETENTION
from
DBA_HIST_WR_CONTROL c, V$DATABASE d
where
c.DBID = d.DBID;
SNAP_INTERVAL RETENTION
------------------------------ ------------------------------
+00000 01:00:00.0 +00007 00:00:00.0
select DBID, INSTANCE_NUMBER, count(*) C,
min(BEGIN_INTERVAL_TIME) OLDEST, max(BEGIN_INTERVAL_TIME) YOUNGEST
from
DBA_HIST_SNAPSHOT
group by
DBID,
INSTANCE_NUMBER;
DBID INSTANCE_NUMBER C OLDEST YOUNGEST
---------- --------------- ---------- ------------------------- -------------------------
3244685755 1 179 13-AUG-13 07.00.30.233 PM 21-AUG-13 05.00.01.855 AM
3244685755 2 179 13-AUG-13 07.00.30.309 PM 21-AUG-13 05.00.01.761 AM
Trends Analysis Example (1) …
select
s.BEGIN_INTERVAL_TIME, s.END_INTERVAL_TIME,
(
t.VALUE-
LAG (t.VALUE) OVER (ORDER BY s.BEGIN_INTERVAL_TIME)
) DVALUE,
(t.VALUE-LAG (t.VALUE) OVER (ORDER BY s.BEGIN_INTERVAL_TIME))/
phy_get_secs(s.END_INTERVAL_TIME, s.BEGIN_INTERVAL_TIME) VAL_SEC
from
DBA_HIST_SNAPSHOT s,
DBA_HIST_SYSSTAT t
where 1=1
and s.SNAP_ID = t.SNAP_ID
and s.DBID = t.DBID
and s.INSTANCE_NUMBER = t.INSTANCE_NUMBER
and s.INSTANCE_NUMBER = (select INSTANCE_NUMBER from V$INSTANCE)
and s.DBID = (select DBID from V$DATABASE)
and t.STAT_NAME = 'parse count (hard)'
order by
s.BEGIN_INTERVAL_TIME;
DBA_HIST_SYSSTAT & DBA_HIST_SYSTEM_EVENT
Trends Analysis Example (1) …
select
s.BEGIN_INTERVAL_TIME, s.END_INTERVAL_TIME,
(
t.VALUE-
LAG (t.VALUE) OVER (ORDER BY s.END_INTERVAL_TIME)
) DVALUE,
(t.VALUE-LAG (t.VALUE) OVER (ORDER BY s.END_INTERVAL_TIME))/
phy_get_secs(s.END_INTERVAL_TIME, s.BEGIN_INTERVAL_TIME) VAL_SEC
from
DBA_HIST_SNAPSHOT s,
DBA_HIST_SYSSTAT t
where 1=1
and s.SNAP_ID = t.SNAP_ID
and s.DBID = t.DBID
and s.INSTANCE_NUMBER = t.INSTANCE_NUMBER
and s.INSTANCE_NUMBER = (select INSTANCE_NUMBER from V$INSTANCE)
and s.DBID = (select DBID from V$DATABASE)
and t.STAT_NAME = 'parse count (hard)'
order by
s.END_INTERVAL_TIME;
DBA_HIST_SYSSTAT & DBA_HIST_SYSTEM_EVENT
SQL Bad performance Example (2) …
• Called by a user to troubleshoot a badly performing SQL
• Sometimes the SQL hangs (never finishes) and needs to be killed
and re-executed
• Upon re-execution, it always finishes successfully in a few
minutes
• The client demanded a resolution ASAP …
select
st.SQL_ID
, st.PLAN_HASH_VALUE
, sum(st.EXECUTIONS_DELTA) EXECUTIONS
, sum(st.ROWS_PROCESSED_DELTA) CROWS
, trunc(sum(st.CPU_TIME_DELTA)/1000000/60) CPU_MINS
, trunc(sum(st.ELAPSED_TIME_DELTA)/1000000/60) ELA_MINS
from DBA_HIST_SQLSTAT st
where st.SQL_ID in (
'5ppdcygtcw7p6'
,'gpj32cqd0qy9a'
)
group by st.SQL_ID , st.PLAN_HASH_VALUE
order by st.SQL_ID, CPU_MINS;
DBA_HIST_SQLSTAT
SQL Bad performance Example (2) …
SQL_ID PLAN_HASH_VALUE EXECUTIONS CROWS CPU_MINS ELA_MINS
------------- --------------- ---------- ---------- ---------------- ----------------
5ppdcygtcw7p6 436796090 20 82733 1 3
5ppdcygtcw7p6 863350916 71 478268 5 11
5ppdcygtcw7p6 2817686509 9 32278 2,557 2,765
gpj32cqd0qy9a 3094138997 30 58400 1 3
gpj32cqd0qy9a 1700210966 36 69973 1 7
gpj32cqd0qy9a 1168845432 2 441 482 554
gpj32cqd0qy9a 2667660534 4 1489 1,501 1,642
DBA_HIST_SQLSTAT
SQL Bad performance Example (2) …
• As a result …
• Two different jobs were gathering statistics on a daily basis
1. “ANALYZE …” as part of another batch job (developer)
2. “DBMS_STATS…” traditional (DBA)
• Sometimes “DBMS_STATS…” did not complete before the
batch job started (+/- 10 minutes).
• After the job got killed (typically 10 minutes after it started) the
new “correct” statistics were in place.
• Takeaways …
A. Don’t change your statistics that frequently (they should be consistent)
B. AWR data helps to spot such issues easily
SQL Plan flipping Example (3) …
• I asked myself:
• If we find that the execution plan for one SQL has changed
from a good (fast) one to a bad (slow) one, are there other SQLs
affected by a similar issue?
• And if there are, how many are there?
• Would SQL Profiles (baselines, outlines) help address
those?
SELECT st2.SQL_ID ,
st2.PLAN_HASH_VALUE ,
st_long.PLAN_HASH_VALUE l_PLAN_HASH_VALUE ,
st2.CPU_MINS ,
st_long.CPU_MINS l_CPU_MINS ,
st2.ELA_MINS ,
st_long.ELA_MINS l_ELA_MINS ,
st2.EXECUTIONS ,
st_long.EXECUTIONS l_EXECUTIONS ,
st2.CROWS ,
st_long.CROWS l_CROWS ,
st2.CPU_MINS_PER_ROW ,
st_long.CPU_MINS_PER_ROW l_CPU_MINS_PER_ROW
FROM
(SELECT st.SQL_ID ,
st.PLAN_HASH_VALUE ,
SUM(st.EXECUTIONS_DELTA) EXECUTIONS ,
SUM(st.ROWS_PROCESSED_DELTA) CROWS ,
TRUNC(SUM(st.CPU_TIME_DELTA) /1000000/60) CPU_MINS ,
DECODE( SUM(st.ROWS_PROCESSED_DELTA), 0 , 0 , (SUM(st.CPU_TIME_DELTA)/1000000/60)/SUM(st.ROWS_PROCESSED_DELTA) ) CPU_MINS_PER_ROW ,
TRUNC(SUM(st.ELAPSED_TIME_DELTA) /1000000/60) ELA_MINS
FROM DBA_HIST_SQLSTAT st
WHERE 1 =1
AND ( st.CPU_TIME_DELTA !=0
OR st.ROWS_PROCESSED_DELTA !=0)
GROUP BY st.SQL_ID,
st.PLAN_HASH_VALUE
) st2,
(SELECT st.SQL_ID ,
st.PLAN_HASH_VALUE ,
SUM(st.EXECUTIONS_DELTA) EXECUTIONS ,
SUM(st.ROWS_PROCESSED_DELTA) CROWS ,
TRUNC(SUM(st.CPU_TIME_DELTA) /1000000/60) CPU_MINS ,
DECODE( SUM(st.ROWS_PROCESSED_DELTA), 0 , 0 , (SUM(st.CPU_TIME_DELTA)/1000000/60)/SUM(st.ROWS_PROCESSED_DELTA) ) CPU_MINS_PER_ROW ,
TRUNC(SUM(st.ELAPSED_TIME_DELTA) /1000000/60) ELA_MINS
FROM DBA_HIST_SQLSTAT st
WHERE 1 =1
AND ( st.CPU_TIME_DELTA !=0
OR st.ROWS_PROCESSED_DELTA !=0)
HAVING TRUNC(SUM(st.CPU_TIME_DELTA)/1000000/60) > 10
GROUP BY st.SQL_ID,
st.PLAN_HASH_VALUE
) st_long
WHERE 1 =1
AND st2.SQL_ID = st_long.SQL_ID
AND st_long.CPU_MINS_PER_ROW/DECODE(st2.CPU_MINS_PER_ROW,0,1,st2.CPU_MINS_PER_ROW) > 2
ORDER BY l_CPU_MINS DESC,
st2.SQL_ID,
st_long.CPU_MINS DESC,
st2.PLAN_HASH_VALUE;
SQL Plan flipping Example (3) …
SELECT
...
FROM
(SELECT st.SQL_ID ,
st.PLAN_HASH_VALUE ,
...
DECODE( SUM(st.ROWS_PROCESSED_DELTA), 0 , 0 ,
(SUM(st.CPU_TIME_DELTA)/1000000/60)/SUM(st.ROWS_PROCESSED_DELTA) ) CPU_MINS_PER_ROW ,
...
FROM DBA_HIST_SQLSTAT st
WHERE 1 =1
...
GROUP BY st.SQL_ID,
st.PLAN_HASH_VALUE
) st2,
(SELECT st.SQL_ID ,
st.PLAN_HASH_VALUE ,
...
HAVING trunc(sum(st.CPU_TIME_DELTA)/1000000/60) > 10
GROUP BY st.SQL_ID,
st.PLAN_HASH_VALUE
) st_long
WHERE 1 =1
AND st2.SQL_ID =
st_long.SQL_ID
AND st_long.CPU_MINS_PER_ROW/DECODE(st2.CPU_MINS_PER_ROW,0,1,st2.CPU_MINS_PER_ROW) > 2
ORDER BY l_CPU_MINS DESC,
st2.SQL_ID,
st_long.CPU_MINS DESC,
st2.PLAN_HASH_VALUE;
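The comparison the query above performs (pair every plan of a statement with each of its plans that burned more than 10 CPU minutes, and keep pairs whose CPU cost per row differs by more than a factor of 2) can be replayed on a handful of rows. The Python sketch below is an illustration, not the author's tooling: the helper names are hypothetical, and the input is the DBA_HIST_SQLSTAT output shown earlier for Example (2), where CPU_MINS is already truncated to whole minutes, so the per-row figures are approximate.

```python
# (sql_id, plan_hash_value, rows_processed, cpu_mins) taken from the
# Example (2) output listed earlier in this deck.
PLANS = [
    ("5ppdcygtcw7p6", 436796090, 82733, 1),
    ("5ppdcygtcw7p6", 863350916, 478268, 5),
    ("5ppdcygtcw7p6", 2817686509, 32278, 2557),
    ("gpj32cqd0qy9a", 3094138997, 58400, 1),
    ("gpj32cqd0qy9a", 1700210966, 69973, 1),
    ("gpj32cqd0qy9a", 1168845432, 441, 482),
    ("gpj32cqd0qy9a", 2667660534, 1489, 1501),
]


def cpu_per_row(rows, cpu_mins):
    # Mirrors DECODE(rows, 0, 0, cpu / rows).
    return 0 if rows == 0 else cpu_mins / rows


def find_flips(plans, min_cpu_mins=10, min_ratio=2):
    """Return (sql_id, plan, long_plan) for plans that look much costlier per row."""
    flips = []
    for sql_id, plan, rows, cpu in plans:                  # the st2 side
        base = cpu_per_row(rows, cpu)
        divisor = 1 if base == 0 else base                 # DECODE(x, 0, 1, x)
        for l_sql_id, l_plan, l_rows, l_cpu in plans:      # the st_long side
            if l_sql_id != sql_id or l_cpu <= min_cpu_mins:
                continue                                   # join and HAVING filters
            if cpu_per_row(l_rows, l_cpu) / divisor > min_ratio:
                flips.append((sql_id, plan, l_plan))
    return flips


FLIPS = find_flips(PLANS)
for flip in FLIPS:
    print(flip)
```

On these rows the sketch pairs each cheap plan with every expensive plan of the same statement, while two expensive plans with similar cost per row are not reported against each other.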
SQL Plan flipping Example (3) …
SQL_ID PLAN_HASH_VALUE L_PLAN_HASH_VALUE CPU_MINS L_CPU_MINS ELA_MINS L_ELA_MINS EXECUTIONS L_EXECUTIONS CROW
------------- --------------- ----------------- ---------- ---------- ---------- ---------- ---------- ------------ ----------
db8yz0rfhvufm 3387634876 619162475 17 2673 21 4074 3106638 193 212138
5ppdcygtcw7p6 436796090 2817686509 1 2557 3 2765 20 9 8273
5ppdcygtcw7p6 863350916 2817686509 5 2557 11 2765 71 9 47826
1tab7mjut8j9h 875484785 911605088 9 2112 23 2284 980 1436 80
1tab7mjut8j9h 2484900321 911605088 6 2112 6 2284 1912 1436 151
1tab7mjut8j9h 3141038411 911605088 50 2112 57 2284 32117 1436 2604
gpj32cqd0qy9a 1700210966 2667660534 1 1501 7 1642 36 4 6997
gpj32cqd0qy9a 3094138997 2667660534 1 1501 3 1642 30 4 5840
2tf4p2anpwpk2 825403357 1679851684 6 824 71 913 17 13 2155
csvwu3kqu43j4 3860135778 2851322291 0 784 0 874 1 2 154
0q9hpmtk8c1hf 3860135778 2851322291 0 779 0 867 1 2 407
2frwhbxvg1j69 3860135778 2851322291 0 776 0 865 1 2 195
4nzsxm3d9rspt 3860135778 2851322291 0 754 0 846 1 2 190
1pc2npdb1kbp6 9772089 2800812079 0 511 0 3000 7 695 38
gpj32cqd0qy9a 1700210966 1168845432 1 482 7 554 36 2 6997
gpj32cqd0qy9a 3094138997 1168845432 1 482 3 554 30 2 5840
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
4bcx6kbbrg6bv 3781789023 2248191382 0 11 0 41 2 2 3
6wh3untj05apd 3457450300 3233890669 0 11 0 131 1 20 2
6wh3untj05apd 3477405755 3233890669 0 11 1 131 2 20 1
8pzsjt5p64xfu 3998876049 3667423051 0 11 5 44 3 18 1
bpfzx2hxf5x7f 1890295626 774548604 0 11 0 26 1 24 48858
g67nkxd2nqqqd 1308088852 4202046543 0 11 1 57 1 49 3
g67nkxd2nqqqd 1308088852 1991738870 0 11 1 39 1 38 3
g67nkxd2nqqqd 2154937993 1991738870 1 11 27 39 72 38 37
g67nkxd2nqqqd 2154937993 4202046543 1 11 27 57 72 49 37
92 rows selected.
Elapsed: 00:00:02.53
SQL>
SQL Plan flipping Example (3) …
• As a result …
• Load on the system was reduced 5 times
• Takeaways …
A. SQL Plans may flip from good plans to …
B. SQL Outlines/Profiles may help sometimes
C. AWR provides good input for such analysis
• Why may SQL Plans flip?
1. Bind variable peeking / adaptive cursor sharing
2. Statistics change (including difference in partitions stats)
3. Adding/Removing indexes
4. Session/System init.ora parameters (nls_sort/optimizer_mode)
5. Dynamic statistics gathering (sampling)
6. Profiles/Outlines/Baselines evolution
Conclusions …
• AWR = DBA_HIST% views ( snapshots from V$% views )
• Sometimes it is the only source of information
• AWR contains much more information than the default AWR reports
and Grid Control can provide
• Be careful mining data (there are some gotchas)
• Don’t be afraid to discover/mine the AWR data
I can show you the door …
… but it is you who should walk through it
Thank you!