tanel poder - troubleshooting complex oracle performance issues - part 2


Upload: tanel-poder

Post on 12-Apr-2017


Page 1: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

www.enkitec.com     1      

More Complex Performance Issues I've seen (Part 2)

Tanel Põder
Enkitec

http://www.enkitec.com
http://blog.tanelpoder.com

Page 2: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Intro: About me

• Tanel Põder
  • Former Oracle Database Performance geek
  • Present Exadata Performance geek ;-)

• My Exadata experience
  • … in the last couple of years …
  • … I have had the luck to work with all possible Exadata configurations out there
  • Exadata V1 … X2-8
  • Multi-rack Exadatas …
  • Even a mixed rack Exadata (V2 <-> X2-2 :)

Expert Oracle Exadata book (with Kerry Osborne and Randy Johnson)

Page 3: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Random Session Hangs

• Environment:
  • Oracle 11.1 Database on Linux
  • 4 CPUs
  • Running on ASM

• Symptoms:
  • Sudden slowness of the database, lasting for a few minutes
  • Some long TX, TM, CF enqueue waits
  • Some sessions completely stuck, some showing IO waits, some CPU

• First hunch:
  • Seemed quite random and weird
  • Maybe slow/hung IO?

Page 4: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

AWR Report

• Application locking problem, right?

• Very long IO waits…
• The application locks may be held for so long because some other task (performed while holding the lock) takes so long?

Page 5: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Finding the blockers from ASH

    SELECT
        SESSION_STATE
      , EVENT
      , P1TEXT ||'='||P1 p1
      , BLOCKING_SESSION_STATUS
      , BLOCKING_INST_ID
      , BLOCKING_SESSION
      , BLOCKING_SESSION_SERIAL#
      , COUNT(DISTINCT session_id)
      , COUNT(*)
    FROM
        dba_hist_active_sess_history w
    WHERE
        sample_time BETWEEN timestamp'2011-03-22 13:48:00'
                        AND timestamp'2011-03-22 13:50:00'
    AND event LIKE 'enq%'
    GROUP BY
        SESSION_STATE
      , EVENT
      , P1TEXT ||'='||P1
      , BLOCKING_SESSION_STATUS
      , BLOCKING_INST_ID
      , BLOCKING_SESSION
      , BLOCKING_SESSION_SERIAL#
    ORDER BY
        COUNT(*) DESC

You can use one of the many lock tree scripts / tools for finding the ultimate blocker. Note that Oracle 11.2 has a FINAL_BLOCKING_SESSION column in V$SESSION and ASH showing the ultimate blocker.

Finding the blocking session SID is just the first step; you really want to know what that blocking session itself is doing. In other words, where is it stuck so that it hasn't released that lock yet?

Page 6: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Blocking Tree

Many blocked sessions are waiting on the read by other session IO wait. Such waits had lasted for minutes.

Many of the blockers ultimately wait on the SQL*Net message to client wait event ?!!!

Page 7: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

More blocking trees

Looks like an IO issue!

Page 8: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Suspecting an I/O issue?

• As these I/O waits suggested…

• The obvious next things to check:
  • How much I/O are you actually doing – from OS level metrics

• The OS level metrics (iostat and sar) did not indicate an obvious IO problem!
  • The IO times (in iostat) were OK:

    $ iostat -x 5
    Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
    sda       49.72   43.03  55.69  22.40  3166.72  553.49    47.64     0.24   3.11   0.23   1.80
    sda1       0.01    0.03   0.00   0.00     0.01    0.07    60.26     0.00   2.18   0.66   0.00
    sda2      49.71   42.99  55.69  22.40  3166.70  553.43    47.64     0.24   3.11   0.23   1.80
    dm-0       0.00    0.00  30.91  10.46  2570.75  106.94    64.71     0.07   1.58   0.25   1.03
    dm-1       0.00    0.00  74.49  55.81   595.95  446.48     8.00     0.62   4.77   0.08   1.05

NB! This example is from my laptop as I don't have the real output anymore.

Page 9: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

OEM ASH Activity breakdown

• We know that many of the sessions were hung waiting for IO related wait events
  • …but some were ultimately blocked by SQL*Net message to client

• Let's peek around further in the database:

What, CPU load over 50 on a 4 CPU machine?!

Page 10: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

OEM Performance page

• No CPU problem. User I/O, Commit and Application waits

But the Performance Page doesn't show the CPU load that we saw on the (ASH) Activity Detail?

This data comes from the OS (v$osstat), so it should be reliable

Page 11: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Go to the source (OS)

• Sar & vmstat data: No CPU starvation!

• Whom to trust?
  • OS tools, V$OSSTAT and Load figures in EM Performance Page ask for the CPU usage figures from the OS, so they are OK
  • The ASH data (and EM Activity page) assume that any session not marked as "waiting" in V$SESSION is on CPU!

• http://blog.tanelpoder.com/2007/06/18/advanced-oracle-troubleshooting-guide-when-the-wait-interface-is-not-enough-part-1/

• http://oracledoug.com/serendipity/index.php?/archives/1432-Time-Matters-DB-CPU.html

    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd   free   buff   cache  si  so    bi    bo    in    cs us sy id wa st
     5  0 143272 264980 261616 6970896   0   0  3164   319     3     1 14  3 77  6  0
     0  0 143272 257376 261616 6970568   0   0   840    23  6351  8484 11  8 73  8  0
     0  0 143272 261748 261616 6970952   0   0     0  3535  5303  7740  7  1 89  2  0
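One quick way to cross-check a claimed "CPU load" against the OS is to look at the vmstat run queue (the r column) and compare it with the number of CPUs. The following is only a minimal sketch, assuming vmstat output in the layout shown above; the function name max_runq is made up for illustration.

```shell
# max_runq: read vmstat output on stdin and print the highest value seen
# in the "r" (runnable processes) column. Header lines are skipped because
# their first field is not a plain number.
max_runq() {
    awk '$1 ~ /^[0-9]+$/ { if ($1 > max) max = $1 } END { print max + 0 }'
}

# Hypothetical usage: take five one-second samples, then print the CPU count
# to compare against.
#   vmstat 1 5 | max_runq
#   getconf _NPROCESSORS_ONLN
```

If the highest run queue value stays well below the CPU count while ASH claims dozens of sessions "on CPU", the OS figures are the ones to believe.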

Page 12: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Checkpoint

• Something weird is going on!
  • Oracle says many sessions are on CPU, when they really aren't

• Oracle says sessions are waiting for IO wait events a lot, while not much IO activity was going on
  • … and the OS level IO metrics didn't show long IO completion times either

• Many sessions that Oracle reports as doing IO are blocked by a session waiting for SQL*Net message to client

The IO hangs can be tricky to diagnose as the iostat figures are updated when the IO actually completes*
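Because completion-based figures cannot show an I/O that never completes, it can help to look at requests that are still in flight. On Linux, column 12 of /proc/diskstats is the number of I/Os currently in progress per device. The sketch below is Linux-specific and the function name inflight_io is hypothetical; it was not part of the original investigation.

```shell
# inflight_io: print "device in_flight" for every device that currently has
# I/O requests issued but not yet completed. Reads /proc/diskstats format
# (Linux only) from the file given as $1, defaulting to /proc/diskstats.
# Column 3 is the device name, column 12 the number of I/Os in progress.
inflight_io() {
    awk '$12 > 0 { print $3, $12 }' "${1:-/proc/diskstats}"
}
```

Running it a few seconds apart gives a rough picture: a device that keeps reporting in-flight requests while its completed read/write counters do not move would be a candidate for a genuinely hung I/O.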

Page 13: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Time to get systematic

• … and see what these hung processes are really doing!
  • Did I hear someone saying process stack?! ;-)

• Reminder: Oracle is just another program written in C

• The process stack will show in which of those C functions the problematic process execution (hang) currently is!

• All major Unix platforms have pstack (procstack on AIX)
  • Oradebug short_stack / dump errorstack would also do it *

• However the hangs were sporadic… I prefer OS pstack as it's cleaner

Page 14: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

@mon_stack.sql: waiting for a hang and listing the blocker

• When the time is right:

    -- STEP 1
    DEFINE threshold=&1

    -- First sleep and monitor V$SESSION to find long waits
    -- in the database
    -- This PL/SQL block will just keep running until a
    -- long enough wait is seen
    DECLARE
        l_threshold NUMBER := &threshold;
        l_max_wait  NUMBER;
    BEGIN
        WHILE TRUE LOOP
            SELECT MAX(seconds_in_wait) INTO l_max_wait
            FROM v$session
            WHERE state = 'WAITING'
            AND wait_class != 'Idle';

            IF l_max_wait > l_threshold THEN
                EXIT;
            END IF;

            DBMS_LOCK.SLEEP(30);
        END LOOP;
    END;
    /

    -- STEP 2
    SELECT
        'ULTIMATE_BLOCKER_'||TRIM(s.type) ||'= '||TRIM(osid) blocking_spid
      , w.in_wait_secs
      , w.pid
      , w.sid
      , w.in_wait
      , w.wait_event
      , w.p1
      , w.p2
      , w.p3
    FROM
        v$wait_chains w
      , v$session s
    WHERE
        w.sid = s.sid
    AND w.sess_serial# = s.serial#
    AND w.blocker_sid IS NULL
    AND w.num_waiters > 0
    /

    -- Exit script
    EXIT

Page 15: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

mon_hang_stack.sh

    # Blocks until mon_stack.sql detects a long enough wait
    sqlplus -s $ORA_LOGON @mon_stack $THRESHOLD > $TMPFILE

    # Pick the OS PIDs of the ultimate blockers that are user sessions
    ULTIMATE_BLOCKERS=`grep ULTIMATE_BLOCKER_USER $TMPFILE | awk '{ print $2 }'`

    echo >> $LOGFILE
    cat $TMPFILE >> $LOGFILE
    echo >> $LOGFILE
    echo DATE=`date +"%Y-%d-%m %H:%M:%S"` ULTIMATE_BLOCKERS=$ULTIMATE_BLOCKERS >> $LOGFILE

    # Take three stack traces of each blocker, one second apart
    for i in $ULTIMATE_BLOCKERS ; do
        echo >> $LOGFILE
        echo DATE=`date +"%Y-%d-%m %H:%M:%S"` running pstack on PID=$i >> $LOGFILE
        echo >> $LOGFILE
        echo Stack trace 1 >> $LOGFILE
        echo >> $LOGFILE
        pstack $i >> $LOGFILE
        sleep 1
        echo >> $LOGFILE
        echo Stack trace 2 >> $LOGFILE
        echo >> $LOGFILE
        pstack $i >> $LOGFILE
        sleep 1
        echo >> $LOGFILE
        echo Stack trace 3 >> $LOGFILE
        echo >> $LOGFILE
        pstack $i >> $LOGFILE
    done

Wait until the hang happens (the mon_stack script exits only when it detects a long enough wait), and then proceed to pstack'ing the ultimate blocker PIDs.

Page 16: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

The stack trace

    #1 0x00000000032ed6d7 in ntevpque ()
    #2 0x00000000032e9e30 in ntevqone ()
    #3 0x0000000003281c10 in nsevwait ()
    #4 0x00002af99df1758d in clsc_nswait ()
    #5 0x00002af99df13daa in clsc_select_ext ()
    #6 0x00002af99df124e6 in clsc_receive_wait ()
    #7 0x00002af99df120a3 in clscreceive ()
    #8 0x00002af99df10c18 in clscconnect ()
    #9 0x00002af99defa9e0 in clsssInitNative ()
    #10 0x00002af99defbad3 in clsssinit ()
    #11 0x0000000006eaf951 in kgxgncin ()
    #12 0x0000000003aff6c2 in kfmsInit ()
    #13 0x0000000003b00f07 in kfmsSlvReg ()
    #14 0x0000000003ae342c in kfmdSlvOpPriv ()
    #15 0x0000000003adcf9f in kfmEnslave ()
    #16 0x0000000003a0f19b in kfddsGet ()
    #17 0x0000000005c7fb57 in kfioTranslateIO ()
    #18 0x0000000005c81f3a in kfioRqSetPrepare ()
    #19 0x0000000005c7e5e9 in kfioSubmitIO ()
    #20 0x0000000005c7b9ad in kfioRequestPriv ()
    #21 0x0000000005c7b366 in kfioRequest ()
    #22 0x0000000005c4c1db in ksfd_kfioRequest ()
    #23 0x0000000005c48574 in ksfd_osmio ()
    #24 0x0000000007c3f837 in ksfd_io ()
    #25 0x0000000007c3df1d in ksfdread1 ()
    #26 0x0000000001a01286 in kcfrbd ()
    #27 0x0000000000e01407 in kcbzib ()
    #28 0x0000000007a7451b in kcbgcur ()
    #29 0x0000000000d130de in ktbgcur ()
    #30 0x0000000007a15b5f in ktspfpblk ()
    #31 0x0000000007a1458d in ktspfsrch ()
    #32 0x0000000007a13f01 in ktspscan_bmb ()
    #33 0x0000000007a1351f in ktspgsp_main ()
    #34 0x0000000001427ec4 in kdisnew ()
    #35 0x00000000014258cc in kdisnewle ()
    #36 0x000000000140ca16 in kdisle ()
    #37 0x00000000013c6899 in kdiins0 ()
    #38 0x00000000013d7d6c in kdiinsp ()
    #39 0x0000000007aa8516 in kauxsin ()
    #40 0x0000000007ca0881 in qesltcLoadIndexList ()
    #41 0x0000000007ca04f9 in qesltcLoadIndexes ()
    #42 0x0000000007c806fc in __PGOSF606_qerltcNoKdtBufferedInsRowCBK ()
    #43 0x0000000007c7e753 in qerltcSingleRowLoad ()
    #44 0x0000000007c7d5e7 in qerltcFetch ()
    #45 0x0000000007bc742d in insexe ()
    #46 0x0000000007c8f4c7 in opiexe ()
    ......
    #82 0x0000000001762cea in sou2o ()
    #83 0x0000000000975483 in opimai_real ()
    #84 0x00000000017682a1 in ssthrdmain ()
    #85 0x00000000009753af in main ()

Callouts on the slide:

• This is the "outmost" function in the stack
• This is where the process execution currently is
• ASM file -> Physical device address translation!
• Cluster services communication functions
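When several stack traces are collected, it is easier to compare them, or to search bug notes for them, once the addresses are stripped and only the function names remain. A minimal sketch, assuming frames of the form "#N 0xADDR in func ()" as in the trace above; the helper name stack_funcs is made up for illustration.

```shell
# stack_funcs: read pstack/gdb-style frames ("#N 0xADDR in func ()") on stdin
# and print one function name per line, in the order the frames appear
# (innermost frame first). Lines that are not frames are ignored.
stack_funcs() {
    awk '$1 ~ /^#[0-9]+$/ && $3 == "in" { print $4 }'
}

# Hypothetical usage: show the innermost five functions of a saved trace.
#   stack_funcs < trace1.txt | head -5
```

If the innermost functions are identical across traces taken a second apart, the process is most likely stuck in the same place rather than making progress.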

Page 17: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Intermission: How ASM works

[Diagram: a DB server process in the DATABASE INSTANCE, an ASM server process in the ASM INSTANCE, and two disk devices (/dev/rdsk/xyz)]

1. "Read ASM datafile X block Y"
2. Check ASM extent pointer cache in the DB instance
3. Fetch ASM -> Physical disk block mapping from ASM instance! (IPC call!)
4. Read the block from disk device

    SQL> @sgastat asm

    POOL         NAME                            BYTES
    ------------ -------------------------- ----------
    shared pool  ASM KFFD SO                      2648
    shared pool  ASM extent pointer array      1784672
    shared pool  ASM generic network state        2584
    shared pool  ASM kfk state object            31544
    shared pool  ASM map operations              30368
    large pool   ASM map operations hashta      393216
    shared pool  ASM rollback operations          2648
    shared pool  ASM scan context                 3544

Page 18: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

A bug!

Often the already diagnosed bugs have a relevant stack trace listed in the note!

Page 19: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

A bug!

• Bug 8429716 Clusterware process (ocssd/crsd/etc) may get too low file descriptor limit (hangs etc..)

• 11gR2 Grid Infrastructure Does not Use ulimit Setting Appropriately [ID 983715.1]

    # ls -l /proc/`pgrep cssd.bin`/fd
    …
    lrwx------ 1 root root 64 Mar 29 04:11 96 -> socket:[55795]
    lrwx------ 1 root root 64 Mar 29 04:11 97 -> socket:[55802]
    lrwx------ 1 root root 64 Mar 29 04:11 98 -> socket:[55835]
    lrwx------ 1 root root 64 Mar 29 04:11 99 -> socket:[55842]
    …

Even though the file descriptor array and maximum limits for user processes had been configured correctly (/etc/sysctl.conf and /etc/security/limits.conf), thanks to the Oracle bug, the cssd daemon didn't pick up such a limit (as it's started via rc startup scripts as root, not by regular users logging on). Thus the fix would have to be applied to the cssd startup script, so that it bumps up the file descriptor limit for itself.

Number of fds in use reached 1024!
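A check like the one above can be scripted so that the number of open descriptors is printed next to the limit the process is actually running with. This is only a sketch for Linux: it counts entries under /proc/PID/fd and reads the soft "Max open files" value from /proc/PID/limits. The function name fd_usage and its optional second argument (an alternative proc root) are assumptions added for illustration.

```shell
# fd_usage: print "open_fds soft_limit" for the process with the given PID.
# $1 = PID, $2 = proc filesystem root (optional, defaults to /proc).
# Reading another user's /proc/PID/fd usually requires root.
fd_usage() {
    pid=$1
    proc=${2:-/proc}
    # Each entry in the fd directory is one open file descriptor
    open=$(ls "$proc/$pid/fd" | wc -l)
    # The limits file has "Max open files <soft> <hard> files"
    limit=$(awk '/^Max open files/ { print $4 }' "$proc/$pid/limits")
    echo $open $limit
}

# Hypothetical usage against the cssd daemon:
#   fd_usage `pgrep cssd.bin`
```

When the two numbers are equal, or nearly so, the process cannot open further sockets or files, which would fit the hang described here.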

Page 20: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Uninstrumented Wait events?!

• None of these wait events exist in Oracle 11.1:

    SQL> @sed css
    Show wait event descriptions matching %css%..

    EVENT# EVENT_NAME                      WAIT_CLASS  PARAMETER1
    ------ ------------------------------- ----------- -----------
       848 CSS Xgrp shared operation       Other       function_id
       845 CSS group membership query      Other       group_name1
       844 CSS group registration          Other       group_name1
       843 CSS initialization              Other
       850 CSS operation: action           Other       function_id
       846 CSS operation: data query       Other       function_id
       847 CSS operation: data update      Other       function_id
       851 CSS operation: diagnostic       Other       function_id
       849 CSS operation: query            Other       function_id
       571 GCR CSS join retry              Other       retry count
       572 GCR member Data from CSS        Other       retry count
       565 IMR CSS join retry              Other       retry count

    SQL> @sed crs
    Show wait event descriptions matching %crs%..

    EVENT# EVENT_NAME                      WAIT_CLASS  PARAMETER1
    ------ ------------------------------- ----------- -----------
       861 CRS call completion             Other       clsrrestype

Page 21: Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

Thank you – and oh, wait!

• New seminars in 2013
  • Advanced Oracle Troubleshooting v2.1 (Feb-March)
  • Advanced Oracle SQL Tuning v2.0 (March-April)
  • Advanced Exadata Performance: Troubleshooting and Optimization

• http://blog.tanelpoder.com/seminar/

• Questions: [email protected]
• Blog: http://blog.tanelpoder.com
• Consulting: http://www.enkitec.com