TRANSCRIPT
© 2017 IBM Corporation
Avoiding PMRs: To blame or not to blame... DB2
Saghi Amirsoleymani, WW Principal Solution Architect [email protected]
Adrian Burke, DB2 SWAT Team SVL, [email protected]
Agenda
Overview RMF Spreadsheet Reporter
CPU and WLM Concerns
I/O and DASD Subsystem
zIIP
REAL and Virtual Storage
Access Path Issues
Customer Scenarios
Resources
Holistic approach, top down = RMF, SMF, Traces/Dumps
If SWAT team or L2 performance team is involved this is where we start
Get the big picture then drive to root cause
Often problems are intermittent and require an understanding of the entire environment
If there is a perceived DB2 subsystem, or data sharing group performance issue
Rule out Sysplex/ CF/ CEC/ LPAR constraints first
If there is a workload or period of the day suffering
WLM/ CPU constraint/ DB2 internals
If there is a single job, or group of transactions suffering
Object contention
Access path or DB2 component
Storage subsystem
Premise
A tool to post-process RMF reports and analyze them as Excel spreadsheets; a graph is worth a thousand words, especially if it is red
SMF records you need (reports can be run from the tool, or on MVS then pulled down and post-processed):
70-1: % CPU busy, zIIP busy, weightings, number of MSUs consumed, WLM capping, physical and logical busy
70-2: crypto HW busy
71: paging activity
72-3: workload activity
73: channel path activity
74-1: device activity
74-4: coupling facility
74-5: cache activity
74-8: disk systems
75: page data sets
78-2: virtual storage
78-3: I/O queueing
RMF Spreadsheet Reporter
No charge!
LPAR Trend report
REPORTS(CPU)
Can see a stacked picture of a single LPAR (GP/zIIP/IFL)
This is useful to get an idea of the CEC utilization across processors
Look at the CEC's CPU trend over the time period with GP and specialty engines
You can superimpose the max CPU % the LPAR will achieve based on weightings
Also see entire CEC saturation
RMF Reports - CPU
LPAR Trend report
REPORTS(CPU)
Can see relative weights between LPARs to determine if one is exceeding its share
i.e. who will be punished when a CPU constraint occurs
LPAR Design tool very helpful in getting the right mix of vertical High/Med/Low processors
RMF Reports - CPU
• WLM activity report: SYSRPTS(WLMGL(POLICY,WGROUP,SCLASS,SCPER,RCLASS,RCPER,SYSNAM(SWCN)))
• Look at all service classes during a certain interval, or one class over the course of several intervals
– Yellow missed its goal, Red is a PI of >2
• See the reason for delays across all service classes in an interval
– I/O, CPU, zIIP
• Looking at the raw report is tedious; it could be hundreds of MBs of data
RMF Reports - WLM
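The yellow/red coloring above follows from the WLM performance index (PI): a PI of 1 means the goal is exactly met. A minimal sketch of how PI works for the two goal types (function names and thresholds are illustrative, not RMF field names):

```python
# Sketch: WLM performance index (PI) for the two common goal types.
# PI <= 1 means the goal is met; the slide's coloring flags a missed
# goal ("yellow") and PI > 2 ("red").

def pi_response_time(actual_ms: float, goal_ms: float) -> float:
    """For response-time goals: PI = achieved time / goal time."""
    return actual_ms / goal_ms

def pi_velocity(goal_velocity: float, actual_velocity: float) -> float:
    """For velocity goals: PI = goal velocity / achieved velocity."""
    return goal_velocity / actual_velocity

def color(pi: float) -> str:
    """Reproduce the slide's traffic-light coloring."""
    if pi > 2.0:
        return "red"
    if pi > 1.0:
        return "yellow"
    return "ok"

print(color(pi_response_time(actual_ms=450, goal_ms=500)))       # goal met -> "ok"
print(color(pi_velocity(goal_velocity=60, actual_velocity=25)))  # PI 2.4 -> "red"
```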
Look for potential zIIP offload that landed on a GP
APPL% IIPCP – in red
See what % of GCP processor the workload consumed
Remember to normalize if GCPs are subcapacity
Actual response times can be seen and graphed
# per second and average elapsed
RMF Reports - WLM
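The normalization note can be made concrete. Since zIIPs always run at full speed while GCPs on a subcapacity model do not, a % measured on the GCPs must be scaled before comparing against zIIP capacity. A sketch, with a purely illustrative speed ratio (your ratio would come from the CEC model's ratings):

```python
# Sketch: normalizing a GCP-based APPL% so it is comparable with
# zIIP capacity (zIIPs always run at full speed; subcapacity GCPs
# do not). The speed ratio below is illustrative only.

def normalize_appl_pct(appl_pct: float, gcp_speed: float, full_speed: float = 1.0) -> float:
    """Scale a percentage measured on subcapacity GCPs to full-speed terms."""
    return appl_pct * (gcp_speed / full_speed)

# e.g. a subcapacity model running GCPs at ~16% of full speed:
print(normalize_appl_pct(50.0, gcp_speed=0.16))  # 8.0
```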
This will show the CPU/zIIP utilization broken down by service/report class, as well as CPU utilized by zIIP-eligible work during the intervals
It can show you when certain workloads collide and WLM goals are missed: who is driving the CPU % through the roof
By using the RMF Spreadsheet Reporter you can generate the Overview Records
• Then create and run the Overview Report from your desktop
ApplOvwTrd tab now included in the spreadsheet, no need to create WLM OVWs
Overview Records
[Chart: Application Utilization Overview – Metric: Appl% IIP ON CP; Reporting Category: By Service Class; 15-minute intervals from 03/01/2016 23:45 through 03/02/2016 23:15; service classes: DB2VEL, DDF, PBATHOT, PBATMED, PDEFAULT, STCDEF, SYSSTC]
FTP down your WLM policy in .txt format
Import the WLM policy into a spreadsheet to analyze and filter
Overview of total classes, periods, resource groups** (resource group capping results in unpredictable response times)
ROT is 30-40 service class periods running at once
The policy itself can be filtered: so why do we have 9 Importance 1, Velocity 60 service classes?
This is redundant work for WLM to monitor and manage these identical classes
Easy to search through the rules to determine what work is in what service class
Reports – WLM service definition formatter
DASD Activity Report
REPORTS(DEVICE(DASD))
Gives you an overview of the top 5 Logical Control Units
See what volumes are on there, and what DB2 data is on those volumes
Device Summary Top10 shows the top 10 volumes based on criteria you specify, and you can manipulate the graphs
RMF Reports - DASD
CF activity report
SYSRPTS(CF)
Look at CPU/storage utilization over entire day
See comparison of sync vs. async across intervals
Look for delays due to sub-channels being unavailable
Look for directory reclaims
Look at all metrics for all structures during a single interval
RMF Reports - CF
RMF Summary
Look at CPU Busy (remember this is usually a 15-minute interval though)
DASD response, taking into account the rate; a very low rate could show increased response time due to missing cache, etc.
Demand paging
• Nowadays we don't want to see paging at all, as storage gets cheaper and the price paid by the online applications in response time is not proportional to the 'paging rate'
• z/OS measured the CPU cost of a sync I/O at 20-70 µs
RMF Summary Report
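Using the 20-70 µs per sync I/O figure quoted above, a rough sketch of what a given demand-paging rate costs in CPU (the rate used below is illustrative, not a measurement):

```python
# Sketch: rough CPU cost of demand paging, using the 20-70 microsecond
# per-sync-I/O range quoted in the slide.

def paging_cpu_pct(pages_per_sec: float, us_per_page: float) -> float:
    """Percent of one engine consumed servicing page-ins."""
    return pages_per_sec * us_per_page / 1_000_000 * 100

# 500 demand page-ins/sec at the low and high ends of the range:
print(paging_cpu_pct(500, 20))  # 1.0 (% of an engine)
print(paging_cpu_pct(500, 70))  # 3.5
```

The CPU cost is small; the real damage, as the slide notes, is the synchronous delay the online application sees per page-in.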
These 2 LPARs, LP11 and LP15, are consuming every MIP on the box, borrowing back and forth
This was meant to be a load test, and you can see where the test LPAR (green) ran out of steam as the production LPAR took the CPU cycles
In internal benchmarks maximum throughput is achieved between 92-94% busy; determining root cause is almost impossible at 100%, as there is no consistency
CPU constraints (1)
ROT: DB2 threads should not end up in a service class which uses WLM resource group capping
Resource group capping will ensure that this workload does not get over 'x' service units a second, and this includes all the DB2 subsystems in the plex
Blocked workload support cannot help these capped transactions, so if there is a serious CPU constraint all DDF work could be starved, and could be suspended while holding important DB2 locks/latches
• In general we suggest avoiding resource group capping in favor of lowering the priority of the work
• The CAP delay is the % of delays due to resource group capping
WLM
We do not want the goals to be too loose: if >90% of transactions complete in less than ½ of their goal, the goal should be adjusted tighter, to avoid violent swings in response time under CPU constraint
The goal here is 10 seconds for a Portal application that must render its page in 3 seconds, and the transactions are finishing in 4 milliseconds
• The WLM goals should align with the business goals/ SLAs
Make the goal around 20 milliseconds so the service level can be maintained
Response time goals… too loose
The goals need to be reasonable, i.e. attainable by the workload
WLM cannot shorten the response time to something lower than the CPU time needed for the transaction to complete
With a performance index of 5 all day long this workload could be skip-clocked (ignored) if there were CPU constraints
Response time goals… too stringent
Look at the response time buckets in the WLM activity report to gauge reality
No amount of CPU could bring these transactions back in line with the others
The goal is 95%, but only 90% complete in time, so take these outlying transactions and break them out into another service class, or adjust the goal to 90%
WLM Buckets
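Gauging reality from the buckets can be sketched as a cumulative-percentage check. The bucket boundaries and counts below are illustrative, not RMF's actual bucket layout:

```python
# Sketch: what percentage of transactions completed within the goal,
# from WLM-style response-time buckets given as (boundary_ms, count).

def pct_within_goal(buckets, goal_ms):
    """Cumulative % of transactions whose bucket boundary is within the goal."""
    total = sum(count for _, count in buckets)
    within = sum(count for boundary_ms, count in buckets if boundary_ms <= goal_ms)
    return 100.0 * within / total

buckets = [(250, 400), (500, 450), (750, 50), (1000, 40), (2000, 60)]
# Goal: 95% within 500 ms -- but only 85% actually make it:
print(round(pct_within_goal(buckets, 500), 1))  # 85.0
```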
For transactions and most business processes a response time goal is much more effective/predictable during times of CPU constraint than a velocity goal – percentile or average goal?
Transaction classes with outliers of 2x the average or more should use percentile goals
When determining a good response time goal you need to trend it out
Determine where the business goal is relative to what is being achieved
z/OS 1.13 includes average response time info even for velocity goals
Response time goals vs. velocity goals
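The 2x-outlier rule of thumb above can be sketched as a simple check over observed response times (sample data is illustrative):

```python
# Sketch: the slide's rule of thumb -- if a transaction class has
# outliers at 2x the average or more, prefer a percentile goal over
# an average goal.

def suggest_goal_type(response_times_ms):
    """Return which response-time goal type the rule of thumb suggests."""
    avg = sum(response_times_ms) / len(response_times_ms)
    has_outliers = any(t >= 2 * avg for t in response_times_ms)
    return "percentile" if has_outliers else "average"

steady = [40, 45, 50, 55, 60]
spiky = [40, 45, 50, 55, 400]
print(suggest_goal_type(steady))  # average
print(suggest_goal_type(spiky))   # percentile
```

The intuition: a single 400 ms outlier drags an average goal around, while a percentile goal simply tolerates the tail.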
Many zIIP eligible workloads are ‘spikey’ in nature – look in CPU activity
Rush of DRDA requests, Utilities, or SQL CPU parallelism leads to overflow
Hiperdispatch is VERY sensitive to the relative LPAR weights (HIGH/MED/LOW)
Key is to apportion weights based on actual utilization – not share zIIPs with everyone – LPAR design tool can help here
Otherwise engines will remain parked causing work to spill over to the GCPs
zIIP Overflow…LPAR Weights
[Chart: zIIP Utilization with zIIP overflow to GCPs; annotations show 2-4 zIIPs parked at various points]
What if I have lots of not-accounted-for time?
OMPE accounting report (parallel tasks on zIIP)
RMF Spreadsheet Reporter response delay report
Part of the WLM activity trend report
SYS1.PARMLIB (IEAOPTxx) setting:
IIPHONORPRIORITY=NO (not recommended)
• Meaning all zIIP-eligible work will queue waiting for a zIIP
• Normally the 'needs help' algorithm re-dispatches work to a GCP
• Parallel tasks are 80% zIIP eligible, so PARAMDEG should be influenced by the number of zIIPs you have
• Important in V10 and V11, and if you have zAAP-on-zIIP – no zAAP on z13
• V10 includes prefetch and deferred writes
• V11 includes GBP writes, castout/notify, log prefetch/write
• In V11 if you turn it off, NO system agents will be dispatched to the zIIP
Discretionary work always waits on the zIIP
zIIP Shortages
CPU delay at about 33%, and zIIP delay at 34%.
• What happens if prefetch engines are starved of zIIP?
– Other Read I/O events and time per event will increase
– PREF. DISABLED – NO READ ENG could increase
• Customers have seen batch programs miss their window
– solution is to add zIIP capacity
• Prefetch may be scheduled even if all the pages are resident, so the application still sees delays with a 100% BP hit ratio and no I/Os
– Increased elapsed time
zIIPs and Prefetch
Sometimes you need the entire picture when going after response time issues
After migration to DB2 10, the customer's applications were experiencing 'good' and 'bad' days
Some access path regressions… but was this related?
Here are two top 5 logical control unit reports from the same time each day
The activity rate is quite close (same work going on)
Where does the increase in response time come from? – DISC (disconnect time)
Synchronous remote copy (Metro Mirror) where the target cannot keep up, and asynchronous copy with write pacing (XRC), can cause high DISC time
DASD response time
• What is an acceptable paging rate? 0 for DB2 storage
• REAL storage is a one-time charge which will save CPU cycles, which you pay for on a monthly basis
– z/OS measured the CPU cost of a sync I/O to be 20-70 µs
• Even if you want to wait until V11 to tune your buffer pools with the simulation capabilities, you can save CPU today by avoiding paging
• Impact customers have seen from being short on REAL storage:
– Transaction times begin to climb; customers see sub-second transactions take 10's of seconds (a buffer pool hit might require a page-in from AUX)
– # of concurrent threads in DB2 begins to climb; CTHREAD/MAXDBAT might be hit
– SYSPROG and DBA perception is of a system slowdown
• If an SVC dump occurs (SDUMP,Q) the workload may be non-dispatchable until the dump finishes
• If however you are real-storage rich (i.e. > MAXSPACE in reserve and 0 paging), look at turning REALSTORAGE_MANAGEMENT from AUTO (the default) to OFF
– Have seen cases with large memory and high thread deallocations where DB2 ASID CPU could be saved
DB2 and Storage
The REALSTORAGE_MANAGEMENT (RSM) zparm controls when DB2 discards 64-bit storage (gives it back to z/OS): AUTO (the default) or OFF
• 64-bit shared is almost all thread storage
Storage is normally discarded at deallocation or after 120 commits
If there is ample storage on the LPAR, investigate turning this off
Need to monitor DBM1 storage to understand if there is storage growth after the change
Possible CPU savings:
Poor thread reuse means more threads are deallocated and created constantly
• The release of storage from these deallocated threads costs cycles in the address space where the thread lives, as well as MSTR
• If POOLINAC constantly throws away threads, DIST SRB time could increase
• With local attach threads DBM1 could be aggravated as well
• There is also some CPU in z/OS RSM which is used to reclaim the storage
• This customer saw a 4x reduction in MSTR SRB time
REALSTORAGE_MANAGEMENT
Spikes in MSTR SRB relative to discarding thread storage
In the graphic we can see DB2 storage goes out from REAL to AUX when the real available drops to '0' on the LPAR
Using IFCID 225 and MEMU2 to look at AUX vs. what's in REAL
Worst case in this example to get those pages back in: 700 MB = 179,200 4 KB pages; at ~1 ms sync I/O each, 0.001 × 179,200 = 179 seconds
MAXSPACE suggestion of 16 GB… could not be supported here
DB2 and Storage
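The worst-case arithmetic above can be checked with a short sketch, assuming 4 KB pages and ~1 ms per synchronous page-in as the slide does:

```python
# Sketch: the slide's worst-case arithmetic for recovering DB2 pages
# from AUX, assuming 4 KB pages and ~1 ms per synchronous page-in.

def aux_recovery_seconds(mb_in_aux: float, ms_per_page_in: float = 1.0,
                         page_kb: int = 4) -> float:
    """Worst-case elapsed seconds to page everything back from AUX."""
    pages = mb_in_aux * 1024 / page_kb
    return pages * ms_per_page_in / 1000

# 700 MB in AUX = 179,200 pages; at 1 ms each that is ~179 seconds:
print(aux_recovery_seconds(700))  # 179.2
```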
So who caused me to get paged out? If you run a WLM activity report and look at the Storage Trend graph in the reporter, you can see the actual frames used by a service or report class
The Page-In Rates were also high during this time for DB2 as it recovered from AUX
DB2 and Storage
By default DFSORT and other sort products usually take as much storage as they can get, to help performance… but what about everyone else?
DFSORT parameters affecting storage use (APAR II13495) are a means to protect DB2
• These can be dynamically changed for workloads using the ICEPRMxx member
EXPMAX = % of storage for memory object and hyperspace sorting; somewhat depends on EXPOLD and EXPRES – how much can you spare?
EXPOLD = % of storage allowed to be pushed to AUX – consider 0
EXPRES = % of storage to preserve, e.g. for DUMPSPACE/MAXSPACE – 16 GB minimum in V10
For DB2SORT the PAGEMON parameter limits use of central storage
Real storage and Sort products
This shows EXPMAX = 25 GB, effectively capping what DFSORT can consume.
DB2 (DP2C) has over 1 GB in AUX – paging of 15/second
EXPMAX = 20%, so SORT can use ~50 GB max
The ODDLxx job is using 45 GB, which is actually 31% of the available (256-114)
EXPMAX looks at the total storage configured on the LPAR, regardless of PGFIX
Look at EXPOLD = 0 and lowering EXPMAX to stop DFSORT from stealing old storage and pushing us out to AUX
Ensure you do not end up with >70% of the LPAR page fixed
If 80% is fixed (IRA400E), address spaces become non-dispatchable
Sort: 256GB LPAR ~114GB page fixed
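The arithmetic in this scenario can be sketched directly from the slide's numbers; note that EXPMAX is taken against total configured storage, not what is actually free:

```python
# Sketch: the slide's EXPMAX arithmetic. EXPMAX is a percentage of the
# total storage configured on the LPAR, regardless of how much is page
# fixed, so the effective SORT ceiling can dwarf what is actually free.

lpar_gb = 256          # configured real storage
page_fixed_gb = 114    # page-fixed frames
expmax_pct = 20        # EXPMAX setting
sort_usage_gb = 45     # what the ODDLxx job is using

sort_ceiling_gb = lpar_gb * expmax_pct / 100
available_gb = lpar_gb - page_fixed_gb

print(sort_ceiling_gb)                                    # 51.2 -- the slide's ~50 GB
print(round(100 * sort_usage_gb / available_gb, 1))       # 31.7 -- the slide's ~31%
print(round(100 * page_fixed_gb / lpar_gb, 1))            # 44.5 -- under the 70% ROT
```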
AUTOSIZE(YES) allows WLM to alter the BPs automatically +/- 25% each time DB2 is recycled, in order to avoid sync I/O delay
Customer had this on, and PGFIX(YES) for all buffer pools
z/OS 2.1 allows them to shrink
V11 allows a MINSIZE and MAXSIZE to be set
Over time the page-fixed BPs grew to consume 80% of the LPAR, causing address spaces to become non-dispatchable = sysplex slowdown!!
AUTOSIZE
This line shows 80% of the 25 GB LPAR. This is a stacked graph of the DB2 buffer pools on the LPAR
Access Path Degradation
How do customers currently manage DB2 for z/OS performance?
A. Manually  B. IBM monitors  C. Other monitors  D. Phone call/email complaints  E. They don't  F. Ad hoc
Example: Access path tracking for dynamic SQL
1. Determine dynamic access path degradation (using parameter markers)
a. How?
i. In test, using the statement cache table's historical values for average elapsed time – or ranking and browsing them?
• IFCID 316, 317, 318 must be started for useful stats (best practice)
ii. Application folks provide a top 10 and do their own testing
iii. MainView/OMPE reports – how do you know who it is?
iv. Phone call from the end user (well, of course)
2. Run the SQL through Data Studio or DSM (or other monitoring tools)
a. Does EXPLAIN in Data Studio match what you see in the STATEMENT_CACHE/PLAN tables from a previous execution?
• Meaning you must have historical records of this; look for index scans and table space scans. EXPLAINing the STMTID will just do another EXPLAIN, not show you the current access path
i. YES – why is this execution noticeably worse – one set of very skewed literals?
ii. NO – obviously RUNSTATS must have invalidated the statement and you have prepared it again, which should be when the end user noticed the degradation
• Could better RUNSTATS help?
What is IBM Data Studio?
Comprehensive tool
An integrated environment for managing databases
and developing database applications
Replaces Control Center in DB2 LUW
Built on the popular Eclipse framework
Runs on Red Hat Linux, SUSE Linux, Windows
Web and Eclipse components
Data Studio Web console: health and availability
monitoring
No charge to download!
Review and visualize the access path
– See the flow of query processing
– See indexes and operations
– See optimizer rationale
Assess access path stability to reduce risk of performance regression
– Is the optimizer able to apply the filtering early?
– Are there indexes that support an efficient path?
– Do statistics allow distinction between the choices?
http://www.ibm.com/developerworks/data/library/techarticle/dm1006optimquerytuner1/
#2 Analyze Access Plans: Avoid Common Mistakes
#2 Access Path – What is good / bad AP
No definitive answer exists for this question – except for: IT DEPENDS!
Let's look at some issues in order to find the "correct" answer
One example: one SQL statement costs 0.050 CPU-sec while another costs 2 CPU-minutes – which one do you want to spend time tuning?
#2 Access Path – What is good / bad AP
The "cheap" SQL statement is executed 100,000 times per hour in an online transaction, while the other is executed once every day in a batch job
One week's consumption:
Online SQL: 0.050 × 100,000 × 24 × 7 = 840,000 CPU-sec
Batch SQL: 2 × 60 × 7 = 840 CPU-sec
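The one-week comparison above can be reproduced as a two-line sketch, making the thousand-fold difference explicit:

```python
# Sketch: the slide's one-week CPU consumption comparison.

online_sec = 0.050 * 100_000 * 24 * 7   # 0.050 CPU-sec, 100k/hour, all week
batch_sec = 2 * 60 * 7                  # 2 CPU-min once a day, 7 days

print(online_sec)  # 840000.0
print(batch_sec)   # 840
```

Despite being 2,400x cheaper per execution, the online statement consumes 1,000x more CPU over the week: frequency, not per-execution cost, picks the tuning target.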
Improve statistics quality and collection
Results:
• Accurate estimated costs
• Better query performance
• Less CPU consumption
• Improved maintenance window throughput
"80% of access path PMRs could be resolved by the statistics advisor before calling IBM support." – IBM Support
Generates RUNSTATS control statements
Conflicting statistics explanation
How are statistics currently collected?
Why are histogram statistics not collected?
Distribution statistics
Are you over-collecting statistics? There is no simple answer
Some collect no or insufficient statistics
• Prime reason for poorly performing access paths
Do you want to collect statistics on every column and every permutation of column combinations?
• No way!
Requires similar analysis of the SQL as for index design
Have to include columns which you may not benefit from adding to an index
Analysis of queries is labor intensive without QWT
Iterative process analyzing explain data (as always)
What to do next
Indexes are decided at the design stage…
A lot of effort is spent making the SQL use the provided indexes
But what if the SQL is "right" and it's the indexes that are "wrong"?
How do you simply test your hypotheses without impacting production?
What are indexes for?
Access paths
Allow high-performance random access
Are potentially bad for sequential access
Cost resources to maintain
#3 What are indexes for?
Indexes are used to increase performance
#3 What to do next… improve database design
Testing applications
Fine tuning
Deploying new applications
Removing obsolete indexes
Redesigning for performance
Troubleshooting
Re-deploying applications
Indexes are used to increase performance
Modify existing indexes
4. Previous attempts were futile; need to engage Level 2, as we believe it is an optimizer issue: send doc to Level 2
Capture the SQL from the statement cache, along with the SQL environment capture via stored procedure, Data Studio, or DSM – along with the literal values
5. As a last-ditch effort, assisted by Level 2, use access path hints in DB2 – determine if BIND QUERY is reasonable – determine which access path you want if it was not the one that was prepared, or fake your stats
Put it in DSN_USERQUERY, BIND QUERY, then flush the user query table
What to do next….
#4 Capture Statement Cache
Filter Rows
TIP: Query Tuner issues dynamic SQL; if you check the box, we will not capture that information
Schedule
Enable environment reproduction
DDL
Catalog statistics
Zparms (if the DSNWZP stored procedure is available)
Explain tables information (PLAN_TABLE, DSN_DETCOST_TABLE, DSN_PREDICAT_TABLE) (if SQL is used as input)
Speed up the service process
Environment Capture Facilitates Collaboration
IBM DB2 Performance Management Solution provides:
Fast identification with automated alerts, proactive notification, and 24x7 monitoring
Proactive tuning of queries and workloads
Expert advice with built-in advisors
A diverse set of capabilities managed via Data Server Manager (DSM)
An easy-to-use integrated view of overall DB2 performance management
Seamless navigation and movement via functional capabilities versus individual products
DSM
References
RMF spreadsheet reporting tool
– http://www-03.ibm.com/systems/z/os/zos/features/rmf/tools/rmftools.html
– http://www-03.ibm.com/systems/z/os/zos/features/rmf/tools/index.html#spr_win
ATS CICS and DB2 Performance Workshop Wildfire class
http://w3-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1778
MEMU2
– https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270000K6H5
**Trials and tribulations endured by customers via trial and error tuning
Methods for capturing documentation for all releases are documented here: https://www.ibm.com/support/docview.wss?uid=swg21206998
SYSPROC.ADMIN_INFO_SQL supports V8 -> V10 (required); excellent developerWorks article here:
• http://www.ibm.com/developerworks/data/library/techarticle/dm-1012capturequery/index.html
It is installed in the V10 base and is subject to the installation verification process
• DB2HLQ.SDSNSAMP(DSNTESR) will create and bind it
• The calling program is DSNADMSB, and sample JCL is in DSNTEJ6I
Ensure DB2 9 and DB2 10 have APAR PM39871 applied
Data Server Manager incorporates this procedure into a web UI (best practice)
• http://www.ibm.com/developerworks/downloads/im/data/#optional
• No-charge product; replacement for OSC, Visual Explain, and Data Studio
• Single option to download
• Incorporates Statistics Advisor
• Query Environment Capture used to collect doc
• FTP doc directly to DB2 Level 2 in the tool
• Can be used to duplicate stats in TEST
Capturing documentation for IBM for access path regression