tanel poder - troubleshooting complex oracle performance issues - part 2

www.enkitec.com 1

More Complex Performance Issues I’ve seen (Part 2)

Tanel Põder Enkitec

http://www.enkitec.com

http://blog.tanelpoder.com


Intro: About me

• Tanel Põder
  • Former Oracle Database Performance geek
  • Present Exadata Performance geek ;-)

• My Exadata experience
  • … in the last couple of years…
  • … I have had the luck to work with all possible Exadata configurations out there
  • Exadata V1 … X2-8
  • Multi-rack Exadatas …
  • Even a mixed rack Exadata (V2 <-> X2-2 :)

Expert Oracle Exadata book

(with Kerry Osborne and Randy Johnson)


Random Session Hangs

• Environment:
  • Oracle 11.1 Database on Linux
  • 4 CPUs
  • Running on ASM

• Symptoms:
  • Sudden slowness of the database, lasting for a few minutes
  • Some long TX, TM, CF enqueue waits
  • Some sessions completely stuck, some showing IO waits, some CPU

• First hunch:
  • Seemed quite random and weird
  • Maybe slow/hung IO?


AWR Report

• Application locking problem, right?

• Very long IO waits…
  • Perhaps the application locks are held for so long because some other task, running while holding the lock, takes so long?


Finding the blockers from ASH

    SELECT
        SESSION_STATE
      , EVENT
      , P1TEXT ||'='||P1 p1
      , BLOCKING_SESSION_STATUS
      , BLOCKING_INST_ID
      , BLOCKING_SESSION
      , BLOCKING_SESSION_SERIAL#
      , COUNT(DISTINCT session_id)
      , COUNT(*)
    FROM
        dba_hist_active_sess_history w
    WHERE
        sample_time BETWEEN timestamp'2011-03-22 13:48:00'
                        AND timestamp'2011-03-22 13:50:00'
    AND event LIKE 'enq%'
    GROUP BY
        SESSION_STATE
      , EVENT
      , P1TEXT ||'='||P1
      , BLOCKING_SESSION_STATUS
      , BLOCKING_INST_ID
      , BLOCKING_SESSION
      , BLOCKING_SESSION_SERIAL#
    ORDER BY
        COUNT(*) DESC

You can use one of the many lock tree scripts / tools for finding the ultimate blocker. Note that Oracle 11.2 has a FINAL_BLOCKING_SESSION column in V$SESSION and ASH showing the ultimate blocker.

Finding the blocking session SID is just the first step. You really want to know what that blocking session itself is doing: in other words, where it is stuck so that it hasn’t released that lock yet.
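The idea of walking from a blocked session to its ultimate blocker can be sketched in a few lines of shell. This is only an illustration: the input format (one "sid blocking_sid" pair per line, with "-" for a session that is not blocked) and the function name ultimate_blocker are assumptions made for this sketch, not the output of any Oracle tool or lock tree script.

```shell
# Hypothetical input on stdin: lines of "<sid> <blocking_sid>",
# where "-" means the session is not blocked by anyone.
# Prints the SID found at the root of the blocking chain for the SID in $1.
ultimate_blocker() {
    awk -v start="$1" '
        { blocker[$1] = $2 }
        END {
            sid = start
            # Follow blocker links; the hop counter stops us in case of a cycle.
            for (hops = 0; hops < NR && (sid in blocker) && blocker[sid] != "-"; hops++)
                sid = blocker[sid]
            print sid
        }'
}
```

For example, feeding it the pairs "101 55", "55 12" and "12 -" and asking for session 101 follows 101 to 55 to 12 and prints 12. That SID is the one whose stack or wait event is worth looking at next.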


Blocking Tree

Many blocked sessions are waiting for the read by other session IO wait event. These waits had lasted for minutes.

Many of the blockers ultimately wait for the SQL*Net message to client wait event ?!!!


More blocking trees

Looks like an IO issue!


Suspecting an I/O issue?

• As these I/O waits suggested…

• The obvious next things to check:
  • How much I/O are you actually doing, from OS level metrics

• The OS level metrics (iostat and sar) did not indicate an obvious IO problem!
  • The IO times (in iostat) were OK:

$ iostat -x 5
Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       49.72   43.03  55.69  22.40  3166.72  553.49     47.64      0.24   3.11   0.23   1.80
sda1       0.01    0.03   0.00   0.00     0.01    0.07     60.26      0.00   2.18   0.66   0.00
sda2      49.71   42.99  55.69  22.40  3166.70  553.43     47.64      0.24   3.11   0.23   1.80
dm-0       0.00    0.00  30.91  10.46  2570.75  106.94     64.71      0.07   1.58   0.25   1.03
dm-1       0.00    0.00  74.49  55.81   595.95  446.48      8.00      0.62   4.77   0.08   1.05
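When there are many devices, the "were the IO times OK?" check can be scripted. A minimal sketch, assuming the column layout of the iostat -x output shown above, where await is the 10th field of a 12-field line. Other sysstat versions order the columns differently, so the field number is an assumption, and the function name slow_devices is made up for this example.

```shell
# slow_devices <await_limit_ms>
# Reads `iostat -x` device lines on stdin and prints "<device> <await>"
# for every device whose await exceeds the limit.
# Assumes the 12-column layout shown above (await is field 10).
slow_devices() {
    awk -v limit="$1" '
        NF == 12 && $1 != "Device:" && $10 + 0 > limit + 0 { print $1, $10 }'
}
```

Typical use would be something like `iostat -x 5 | slow_devices 20`. Keep in mind that this only sees IOs that have completed, so a clean result here does not rule out hung IO.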

NB! This example is from my laptop as I don’t have the real output anymore.


OEM ASH Activity breakdown

• We know that many of the sessions were hung waiting for IO related wait events
  • …but some were ultimately blocked by SQL*Net message to client

• Let’s peek around further in the database:

What, CPU load over 50 on a 4 CPU machine?!


OEM Performance page

• No CPU problem. User I/O, Commit and Application waits

But the Performance Page doesn’t show CPU load that we saw on the (ASH) Activity Detail?

This data comes from OS (v$osstat) so should be reliable


Go to the source (OS)

• Sar & vmstat data: No CPU starvation!

• Whom to trust?
  • OS tools, V$OSSTAT and Load figures in EM Performance Page ask for the CPU usage figures from OS, so are ok
  • The ASH data (and EM Activity page) assume that any session not marked as “waiting” in V$SESSION is on CPU!

• http://blog.tanelpoder.com/2007/06/18/advanced-oracle-troubleshooting-guide-when-the-wait-interface-is-not-enough-part-1/

• http://oracledoug.com/serendipity/index.php?/archives/1432-Time-Matters-DB-CPU.html

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0 143272 264980 261616 6970896    0    0  3164   319    3    1 14  3 77  6  0
 0  0 143272 257376 261616 6970568    0    0   840    23 6351 8484 11  8 73  8  0
 0  0 143272 261748 261616 6970952    0    0     0  3535 5303 7740  7  1 89  2  0
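The "whom to trust" reasoning can be written down as a small sanity check: if ASH claims more sessions ON CPU than there are CPUs, while the OS says the CPUs are mostly idle, the sessions are probably stuck in an uninstrumented wait rather than burning CPU. The sketch below is just that heuristic in shell. The function name ash_cpu_check and the decision rule are assumptions made for illustration, not anything Oracle or EM computes.

```shell
# ash_cpu_check <ash_on_cpu_sessions> <cpu_count> <os_idle_pct>
# Hypothetical helper: flags ASH "ON CPU" counts that the OS figures cannot support.
ash_cpu_check() {
    awk -v ash="$1" -v ncpu="$2" -v idle="$3" 'BEGIN {
        busy = ncpu * (100 - idle) / 100      # CPUs the OS reports as busy
        if (ash + 0 > ncpu + 0 && ash + 0 > busy)
            printf "suspect: ASH shows %d on CPU, OS shows %.2f of %d CPUs busy\n", ash, busy, ncpu
        else
            print "plausible"
    }'
}
```

With the numbers from this case (about 50 sessions "on CPU", 4 CPUs, 77% idle in vmstat), `ash_cpu_check 50 4 77` reports the ASH figure as suspect.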


Checkpoint

• Something weird is going on!
  • Oracle says many sessions are on CPU, when they really aren’t

• Oracle says sessions are waiting for IO wait events a lot, while not much IO activity was going on
  • … and the OS level IO metrics didn’t show long IO completion times either

• Many sessions reported doing IO by Oracle are blocked by a session waiting for SQL*Net message to client

The IO hangs can be tricky to diagnose as the iostat figures are updated only when the IO actually completes, so an IO that is still hung does not show up as a long IO time.


Time to get systematic

• … and see what these hung processes are really doing!
  • Did I hear someone saying process stack?! ;-)

• Reminder: Oracle is just another program written in C

• The process stack will show in which of those C functions the problematic process execution (hang) currently is!

• All major Unix platforms have pstack (procstack on AIX)
  • Oradebug short_stack / dump errorstack would also do it *

• However the hangs were sporadic… I prefer OS pstack as it’s cleaner


@mon_stack.sql: waiting for the hang and listing the blocker

• When the time is right:

-- STEP 1
DEFINE threshold=&1

-- First sleep and monitor V$SESSION to find long waits
-- in the database
-- This PL/SQL block will just keep running until a
-- long enough wait is seen
DECLARE
    l_threshold NUMBER := &threshold;
    l_max_wait  NUMBER;
BEGIN
    WHILE TRUE LOOP
        SELECT MAX(seconds_in_wait)
        INTO   l_max_wait
        FROM   v$session
        WHERE  state = 'WAITING'
        AND    wait_class != 'Idle';

        IF l_max_wait > l_threshold THEN
            EXIT;
        END IF;

        DBMS_LOCK.SLEEP(30);
    END LOOP;
END;
/

-- STEP 2
SELECT
    'ULTIMATE_BLOCKER_'||TRIM(s.type) ||'= '||TRIM(osid) blocking_spid
  , w.in_wait_secs
  , w.pid
  , w.sid
  , w.in_wait
  , w.wait_event
  , w.p1
  , w.p2
  , w.p3
FROM
    v$wait_chains w
  , v$session s
WHERE
    w.sid = s.sid
AND w.sess_serial# = s.serial#
AND w.blocker_sid IS NULL
AND w.num_waiters > 0
/

-- Exit script
EXIT


mon_hang_stack.sh

sqlplus -s $ORA_LOGON @mon_stack $THRESHOLD > $TMPFILE

ULTIMATE_BLOCKERS=`grep ULTIMATE_BLOCKER_USER $TMPFILE | awk '{ print $2 }'`

echo >> $LOGFILE
cat $TMPFILE >> $LOGFILE
echo >> $LOGFILE
echo DATE=`date +"%Y-%d-%m %H:%M:%S"` ULTIMATE_BLOCKERS=$ULTIMATE_BLOCKERS >> $LOGFILE

for i in $ULTIMATE_BLOCKERS ; do
    echo >> $LOGFILE
    echo DATE=`date +"%Y-%d-%m %H:%M:%S"` running pstack on PID=$i >> $LOGFILE
    echo >> $LOGFILE
    echo Stack trace 1 >> $LOGFILE
    echo >> $LOGFILE
    pstack $i >> $LOGFILE
    sleep 1
    echo >> $LOGFILE
    echo Stack trace 2 >> $LOGFILE
    echo >> $LOGFILE
    pstack $i >> $LOGFILE
    sleep 1
    echo >> $LOGFILE
    echo Stack trace 3 >> $LOGFILE
    echo >> $LOGFILE
    pstack $i >> $LOGFILE
done

Wait until the hang happens (the mon_stack script exits only when it detects a long enough wait) and then proceed to pstack’ing the ultimate blocker PIDs
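The step that turns the mon_stack output into a list of PIDs is worth a closer look, because those values are handed straight to pstack. Below is a slightly more defensive variant of the grep/awk pipeline, as a sketch only. It assumes the blocker lines look like "ULTIMATE_BLOCKER_USER= <ospid> …", as produced by the STEP 2 query, and the function name extract_blocker_pids is made up for this example.

```shell
# extract_blocker_pids
# Reads mon_stack output on stdin and prints the OS PIDs of user-session
# ultimate blockers, one per line, without duplicates.
# Lines whose second field is not purely numeric are ignored, so stray
# sqlplus output never reaches pstack.
extract_blocker_pids() {
    awk '/^ULTIMATE_BLOCKER_USER=/ && $2 ~ /^[0-9]+$/ { print $2 }' | sort -u
}
```

In the script above this could stand in for the grep/awk pair, for example `ULTIMATE_BLOCKERS=$(extract_blocker_pids < $TMPFILE)`. Background process blockers (ULTIMATE_BLOCKER_BACKGROUND) are deliberately skipped, just as in the original grep.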


The stack trace

#1 0x00000000032ed6d7 in ntevpque ()
#2 0x00000000032e9e30 in ntevqone ()
#3 0x0000000003281c10 in nsevwait ()
#4 0x00002af99df1758d in clsc_nswait ()
#5 0x00002af99df13daa in clsc_select_ext ()
#6 0x00002af99df124e6 in clsc_receive_wait ()
#7 0x00002af99df120a3 in clscreceive ()
#8 0x00002af99df10c18 in clscconnect ()
#9 0x00002af99defa9e0 in clsssInitNative ()
#10 0x00002af99defbad3 in clsssinit ()
#11 0x0000000006eaf951 in kgxgncin ()
#12 0x0000000003aff6c2 in kfmsInit ()
#13 0x0000000003b00f07 in kfmsSlvReg ()
#14 0x0000000003ae342c in kfmdSlvOpPriv ()
#15 0x0000000003adcf9f in kfmEnslave ()
#16 0x0000000003a0f19b in kfddsGet ()
#17 0x0000000005c7fb57 in kfioTranslateIO ()
#18 0x0000000005c81f3a in kfioRqSetPrepare ()
#19 0x0000000005c7e5e9 in kfioSubmitIO ()
#20 0x0000000005c7b9ad in kfioRequestPriv ()
#21 0x0000000005c7b366 in kfioRequest ()
#22 0x0000000005c4c1db in ksfd_kfioRequest ()
#23 0x0000000005c48574 in ksfd_osmio ()
#24 0x0000000007c3f837 in ksfd_io ()
#25 0x0000000007c3df1d in ksfdread1 ()
#26 0x0000000001a01286 in kcfrbd ()
#27 0x0000000000e01407 in kcbzib ()
#28 0x0000000007a7451b in kcbgcur ()
#29 0x0000000000d130de in ktbgcur ()
#30 0x0000000007a15b5f in ktspfpblk ()
#31 0x0000000007a1458d in ktspfsrch ()
#32 0x0000000007a13f01 in ktspscan_bmb ()
#33 0x0000000007a1351f in ktspgsp_main ()
#34 0x0000000001427ec4 in kdisnew ()
#35 0x00000000014258cc in kdisnewle ()
#36 0x000000000140ca16 in kdisle ()
#37 0x00000000013c6899 in kdiins0 ()
#38 0x00000000013d7d6c in kdiinsp ()
#39 0x0000000007aa8516 in kauxsin ()
#40 0x0000000007ca0881 in qesltcLoadIndexList ()
#41 0x0000000007ca04f9 in qesltcLoadIndexes ()
#42 0x0000000007c806fc in __PGOSF606_qerltcNoKdtBufferedInsRowCBK ()
#43 0x0000000007c7e753 in qerltcSingleRowLoad ()
#44 0x0000000007c7d5e7 in qerltcFetch ()
#45 0x0000000007bc742d in insexe ()
#46 0x0000000007c8f4c7 in opiexe ()
......
#82 0x0000000001762cea in sou2o ()
#83 0x0000000000975483 in opimai_real ()
#84 0x00000000017682a1 in ssthrdmain ()
#85 0x00000000009753af in main ()

This is the “outmost” function in the stack

This is where the process execution currently is

ASM file -> Physical device address translation!

Cluster services communication functions
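When comparing several stack samples, or searching bug notes for a matching stack, it helps to reduce a pstack dump to just the function names. A minimal sketch, assuming frames formatted like the ones above ("#N 0xADDRESS in function ()"); the function name stack_functions is made up for this example.

```shell
# stack_functions
# Reads pstack output on stdin and prints one function name per frame,
# in stack order. Lines that are not frames (such as "......") are skipped.
stack_functions() {
    awk '$1 ~ /^#[0-9]+$/ && $3 == "in" { print $4 }'
}
```

For example, `stack_functions < trace.txt | grep -n '^kfio'` shows how deep the ASM IO layer sits in the stack, and running `diff` on the reduced output of two samples quickly tells whether the process has moved between them. The file name trace.txt is hypothetical.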


Intermission: How ASM works

Components in the diagram: DATABASE INSTANCE (DB Server Process), ASM INSTANCE (ASM Server Process) and two DISK DEVICEs (/dev/rdsk/xyz)

1. “Read ASM datafile X block Y”

2. Check ASM extent pointer cache in the DB instance

3. Fetch ASM -> Physical disk block mapping from ASM instance! (IPC call!)

4. Read the block from disk device

SQL> @sgastat asm

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  ASM KFFD SO                      2648
shared pool  ASM extent pointer array      1784672
shared pool  ASM generic network state        2584
shared pool  ASM kfk state object            31544
shared pool  ASM map operations              30368
large pool   ASM map operations hashta      393216
shared pool  ASM rollback operations          2648
shared pool  ASM scan context                 3544


A bug!

Often the already diagnosed bugs have a relevant stack trace listed in the note!


A bug!

• Bug 8429716 Clusterware process (ocssd/crsd/etc) may get too low file descriptor limit (hangs etc..)

• 11gR2 Grid Infrastructure Does not Use ulimit Setting Appropriately [ID 983715.1]

# ls -l /proc/`pgrep cssd.bin`/fd
…
lrwx------ 1 root root 64 Mar 29 04:11 96 -> socket:[55795]
lrwx------ 1 root root 64 Mar 29 04:11 97 -> socket:[55802]
lrwx------ 1 root root 64 Mar 29 04:11 98 -> socket:[55835]
lrwx------ 1 root root 64 Mar 29 04:11 99 -> socket:[55842]
…

The file descriptor array and maximum limits for user processes had been configured correctly (/etc/sysctl.conf and /etc/security/limits.conf). However, thanks to the Oracle bug, the cssd daemon didn't pick up that limit, as it is started via rc startup scripts as root and not by a regular user logging on. Thus the fix has to be applied in the cssd startup script, which must bump up the file descriptor limit for itself.

Number of fds in use reached 1024!
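On Linux you can compare a process's open file descriptor count with the limit it is actually running under, which is what matters here, rather than what limits.conf says it should be. A minimal sketch using only /proc; the function name fd_usage is made up for this example, and the field position in /proc/PID/limits is assumed from the usual "Max open files <soft> <hard> files" layout.

```shell
# fd_usage <pid>
# Prints "<open fds> <soft limit>" for a running Linux process.
# Reading another user's /proc/<pid>/fd normally requires root.
fd_usage() {
    pid="$1"
    # Each entry under /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # The soft limit is the 4th field of the "Max open files" line.
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open $limit"
}
```

For the case above one would run something like `fd_usage $(pgrep cssd.bin)` as root. An open count sitting right at the soft limit is the smoking gun.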


Uninstrumented Wait events?!

• None of these wait events exist in Oracle 11.1:

SQL> @sed css
Show wait event descriptions matching %css%..

EVENT# EVENT_NAME                  WAIT_CLASS PARAMETER1
------ --------------------------- ---------- -----------
   848 CSS Xgrp shared operation   Other      function_id
   845 CSS group membership query  Other      group_name1
   844 CSS group registration      Other      group_name1
   843 CSS initialization          Other
   850 CSS operation: action       Other      function_id
   846 CSS operation: data query   Other      function_id
   847 CSS operation: data update  Other      function_id
   851 CSS operation: diagnostic   Other      function_id
   849 CSS operation: query        Other      function_id
   571 GCR CSS join retry          Other      retry count
   572 GCR member Data from CSS    Other      retry count
   565 IMR CSS join retry          Other      retry count

SQL> @sed crs
Show wait event descriptions matching %crs%..

EVENT# EVENT_NAME                  WAIT_CLASS PARAMETER1
------ --------------------------- ---------- -----------
   861 CRS call completion         Other      clsrrestype


Thank you – and oh, wait!

• New seminars in 2013
  • Advanced Oracle Troubleshooting v2.1 (Feb-March)
  • Advanced Oracle SQL Tuning v2.0 (March-April)
  • Advanced Exadata Performance: Troubleshooting and Optimization

• http://blog.tanelpoder.com/seminar/

• Questions: [email protected]
• Blog: http://blog.tanelpoder.com
• Consulting: http://www.enkitec.com
