© 2017 IBM Corporation

Avoiding PMRs… To blame or not to blame… DB2

Saghi Amirsoleymani, WW Principal Solution Architect, [email protected]
Adrian Burke, DB2 SWAT Team SVL, [email protected]



Agenda

Overview RMF Spreadsheet Reporter

CPU and WLM Concerns

I/O and DASD Subsystem

zIIP

REAL and Virtual Storage

Access Path Issues

Customer Scenarios

Resources

Holistic approach, top down: RMF, SMF, traces/dumps

If the SWAT team or the L2 performance team is involved, this is where we start

Get the big picture, then drive to root cause

Often problems are intermittent and require an understanding of the entire environment

If there is a perceived DB2 subsystem or data sharing group performance issue:
– Rule out Sysplex/CF/CEC/LPAR constraints first

If a workload, or a period of the day, is suffering:
– WLM/CPU constraint/DB2 internals

If a single job or group of transactions is suffering:
– Object contention
– Access path or DB2 component
– Storage subsystem

Premise

… A tool to post-process RMF reports and analyze them as Excel spreadsheets; a graph is worth 1,000 words, especially if it is red

SMF records you need (reports can be run from the tool, or on MVS and then pulled down and post-processed):

70-1: % CPU and zIIP busy, weightings, number of MSUs consumed, WLM capping, physical and logical busy
70-2: crypto HW busy
71: paging activity
72-3: workload activity
73: channel path activity
74-1: device activity
74-4: coupling facility
74-5: cache activity
74-8: disk systems
75: page data sets
78-2: virtual storage
78-3: I/O queueing

RMF Spreadsheet Reporter — No Charge!
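As a quick reference for the list above, the record types can be kept in a small lookup table. This is a sketch; the dictionary and helper function are our own convenience, not part of the RMF Spreadsheet Reporter.

```python
# SMF record types post-processed by the RMF Spreadsheet Reporter,
# as listed on the slide above.
SMF_RECORDS = {
    "70-1": "% CPU and zIIP busy, weightings, MSUs consumed, WLM capping, physical/logical busy",
    "70-2": "crypto HW busy",
    "71":   "paging activity",
    "72-3": "workload activity",
    "73":   "channel path activity",
    "74-1": "device activity",
    "74-4": "coupling facility",
    "74-5": "cache activity",
    "74-8": "disk systems",
    "75":   "page data sets",
    "78-2": "virtual storage",
    "78-3": "I/O queueing",
}

def records_for(keyword):
    """Return the SMF types whose description mentions the keyword."""
    kw = keyword.lower()
    return [t for t, desc in SMF_RECORDS.items() if kw in desc.lower()]

print(records_for("paging"))   # ['71']
print(records_for("crypto"))   # ['70-2']
```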

LPAR Trend report: REPORTS(CPU)

Can see a stacked picture of a single LPAR (GP/zIIP/IFL)

This is useful to get an idea of the CEC utilization across processors

Look at the CEC's CPU trend over the time period, with GP and specialty engines

You can superimpose the max CPU % the LPAR will achieve based on weightings

Also see entire CEC saturation

RMF Reports - CPU

LPAR Trend report: REPORTS(CPU)

Can see the relative weights between LPARs to determine if one is exceeding its share, i.e. who will be punished when a CPU constraint occurs

The LPAR Design tool is very helpful in getting the right mix of vertical High/Med/Low processors

RMF Reports - CPU

• WLM activity report: SYSRPTS(WLMGL(POLICY,WGROUP,SCLASS,SCPER,RCLASS,RCPER,SYSNAM(SWCN)))

• Look at all service classes during a certain interval, or one class over the course of several intervals
– Yellow missed its goal; Red is a PI of >2

• See the reason for delays across all service classes in an interval
– I/O, CPU, zIIP

• Looking at the raw report is tedious; it could be hundreds of MBs of data

RMF Reports - WLM

Look for potential zIIP offload that landed on a GP: APPL% IIPCP, in red

See what % of a GCP the workload consumed
– Remember to normalize if the GCPs are subcapacity

Actual response times can be seen and graphed: # per second and average elapsed

RMF Reports - WLM

This will show the CPU/zIIP utilization broken down by service/report class, as well as the CPU utilized by zIIP-eligible work, during the intervals

It can show you when certain workloads collide and WLM goals are missed: who is driving the CPU % through the roof

By using the RMF Spreadsheet Reporter you can generate the Overview Records
• Then create and run the Overview Report from your desktop

The ApplOvwTrd tab is now included in the spreadsheet, so there is no need to create WLM OVWs

Overview Records

[Chart: Application Utilization Overview; metric: Appl% IIP ON CP, by service class (DB2VEL, DDF, PBATHOT, PBATMED, PDEFAULT, STCDEF, SYSSTC); 30-minute intervals from 03/01/2016-23.45.00 to 03/02/2016-23.15.00]

FTP down your WLM policy in .txt format

Import the WLM policy into a spreadsheet to analyze and filter

Overview of total classes, periods, resource groups (** resource group capping results in unpredictable response times)

ROT is 30-40 service class periods running at once

The policy itself can be filtered: so why do we have 9 Importance 1, velocity 60 service classes?

This is redundant work for WLM, monitoring and managing these identical classes

Easy to search through rules to determine what work is in which service class

Reports – WLM service definition formatter

DASD Activity Report: REPORTS(DEVICE(DASD))

Gives you an overview of the top 5 Logical Control Units

See what volumes are on there, and what DB2 data is on those volumes

Device Summary Top 10 shows the top 10 volumes based on criteria you specify, and you can manipulate the graphs

RMF Reports - DASD

CF activity report: SYSRPTS(CF)

Look at CPU/storage utilization over entire day

See comparison of sync vs. async across intervals

Look for delays due to sub-channels being unavailable

Look for directory reclaims

Look at all metrics for all structures during a single interval

RMF Reports - CF

Look at CPU Busy (remember this is usually a 15-minute interval though)

DASD response, taking the rate into account; a very low rate could show increased response time due to missing cache, etc.

Demand paging
• Nowadays we don't want to see paging at all: storage gets cheaper, and the price paid by the online applications in response time is not proportional to the 'paging rate'
• z/OS measured the CPU cost of a sync I/O at 20-70us

RMF Summary Report

These two LPARs, LP11 and LP15, are consuming every MIP on the box, borrowing back and forth

This was meant to be a load test, and you can see where the test LPAR (green) ran out of steam as the production LPAR took the CPU cycles

In internal benchmarks maximum throughput is achieved between 92-94% busy; determining root cause is almost impossible at 100%, as there is no consistency

CPU constraints (1)

ROT: DB2 threads should not end up in a service class which uses WLM resource group capping

Resource group capping will ensure that this workload does not get over 'x' service units a second, and this includes all the DB2 subsystems in the plex.

Blocked workload support cannot help these capped transactions, so if there is a serious CPU constraint all DDF work could be starved, and could be suspended while holding important DB2 locks/latches

• In general we suggest avoiding resource group capping in favor of lowering the priority of the work

• The CAP delay is the % of delays due to resource group capping

WLM

We do not want the goals to be too loose: if >90% of transactions complete in less than half of their goal, the goal should be tightened to avoid violent swings in response time under CPU constraint

The goal here is 10 seconds for a Portal application that must render its page in 3 seconds, and the transactions are finishing in 4 milliseconds
• The WLM goals should align with the business goals/SLAs

Make the goal around 20 milliseconds so the service level can be maintained

Response time goals… too loose
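The half-the-goal rule of thumb above can be expressed as a small check. This is our own framing of the slide's ROT, not a WLM API; the function name and the 90% threshold parameter are illustrative.

```python
def goal_too_loose(response_times_ms, goal_ms, pct=0.90):
    """ROT from the slide: if more than `pct` of transactions complete in
    less than half the goal, the goal is too loose and should be tightened."""
    within_half = sum(1 for t in response_times_ms if t < goal_ms / 2)
    return within_half / len(response_times_ms) > pct

# Portal example from the slide: goal of 10 s, transactions finishing in ~4 ms.
times = [4.0] * 99 + [80.0]           # milliseconds
print(goal_too_loose(times, 10_000))  # True -> tighten toward ~20 ms
```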

The goals need to be reasonable, i.e. attainable by the workload

WLM cannot shorten the response time to something lower than the CPU time needed for the transaction to complete

With a performance index of 5 all day long, this workload could be skip-clocked (ignored) if there were CPU constraints

Response time goals… too stringent

Look at the response time buckets in the WLM activity report to gauge reality

No amount of CPU could bring these transactions back in line with the others

The goal is 95%, but only 90% complete in time, so either break these outlying transactions out into another service class, or adjust the goal to 90%

WLM Buckets
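Reading the buckets can be sketched as below. The bucket layout here is illustrative (upper-bound/count pairs), not the exact RMF report format.

```python
def achieved_fraction(buckets, goal_ms):
    """Given response time buckets as (upper_bound_ms, count) pairs,
    return the fraction of transactions completing within the goal."""
    total = sum(count for _, count in buckets)
    met = sum(count for upper, count in buckets if upper <= goal_ms)
    return met / total

# Slide's situation: 95% goal, but 10% of transactions are far outliers.
buckets = [(50, 700), (100, 200), (200, 0), (400, 0), (1000, 100)]
print(achieved_fraction(buckets, 100))  # 0.9 -> miss vs. a 95% goal
```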

For transactions and most business processes, a response time goal is much more effective and predictable during times of CPU constraint than a velocity goal. Percentile or average goal?

Transaction classes with outliers of 2x the average or more should use percentile goals

When determining a good response time goal you need to trend it out

Determine where the business goal is relative to what is being achieved

z/OS 1.13 includes average response time information even for velocity goals

Response time goals vs. velocity goals

Many zIIP-eligible workloads are 'spiky' in nature; look in the CPU activity report

A rush of DRDA requests, utilities, or SQL CPU parallelism leads to overflow

HiperDispatch is VERY sensitive to the relative LPAR weights (HIGH/MED/LOW)

The key is to apportion weights based on actual utilization, not to share zIIPs with everyone; the LPAR Design tool can help here

Otherwise engines will remain parked, causing work to spill over to the GCPs

zIIP Overflow…LPAR Weights

[Chart: zIIP utilization and zIIP overflow to GCPs, with 2-4 zIIPs parked at various points]

What if I have lots of not-accounted-for time?

OMPE accounting report (parallel tasks on zIIP)

RMF Spreadsheet Reporter response delay report, part of the WLM activity trend report

SYS1.PARMLIB (IEAOPTxx) setting IIPHONORPRIORITY = NO (not recommended)
• Meaning all zIIP-eligible work will queue waiting for a zIIP; normally the 'Needs Help' algorithm re-dispatches work to a GCP
• Parallel tasks are 80% zIIP eligible, so PARAMDEG should be influenced by the number of zIIPs you have
• Important in V10 and V11, and if you have zAAP on zIIP (no zAAP on z13)
• V10 includes prefetch and deferred writes
• V11 includes GBP writes, castout/notify, log prefetch/write
• In V11, if you turn it off NO system agents will be dispatched to the zIIP

Discretionary work always waits on the zIIP

zIIP Shortages

[Chart: CPU delay at about 33%, zIIP delay at 34%]

• What happens if prefetch engines are starved of zIIP?
– Other Read I/O events, and the time per event, will increase
– PREF. DISABLED – NO READ ENG could increase

• Customers have seen batch programs miss their window; the solution is to add zIIP capacity

• Prefetch may be scheduled even if all the pages are resident, so the application still sees delays with a 100% buffer pool hit ratio and no I/Os
– Increased elapsed time

zIIPs and Prefetch

Sometimes you need the entire picture when going after response time issues

After migration to DB2 10, the customer's applications were experiencing 'good' and 'bad' days

There were some access path regressions… but were they related?

Here are two top-5 logical control unit reports from the same time each day

The activity rate is quite close (same work going on)

So where does the increase in response time come from? DISC (disconnect time)

Synchronous remote copy (Metro Mirror) where the target cannot keep up, and asynchronous copy with write pacing (XRC), can cause high DISC time

DASD response time

• What is an acceptable paging rate? 0 for DB2 storage

• REAL storage is a one-time charge which will save CPU cycles, which you pay for on a monthly basis
– z/OS measured the CPU cost of a sync I/O to be 20-70us

• Even if you want to wait until V11 to tune your buffer pools with the simulation capabilities, you can save CPU today by avoiding paging

• Impact customers have seen from being short on REAL storage:
– Transaction times begin to climb; customers see sub-second transactions take tens of seconds (a buffer pool hit might require a page-in from AUX)
– The number of concurrent threads in DB2 begins to climb; CTHREAD/MAXDBAT might be hit
– SYSPROG and DBA perception is of a system slowdown

• If an SVC dump occurs (SDUMP,Q), the workload may be non-dispatchable until the dump finishes

• If however you are real-storage rich (i.e. > MAXSPACE in reserve and 0 paging), look at turning REALSTORAGE_MANAGEMENT = AUTO to OFF
– We have seen cases with large memory and high thread deallocations where DB2 ASID CPU could be saved

DB2 and Storage

The RSM zparm controls when DB2 discards 64-bit storage (gives it back to z/OS): REALSTORAGE_MANAGEMENT = AUTO (default) or OFF
• 64-bit shared is almost all thread storage

Storage is normally discarded at deallocation or after 120 commits

If there is ample storage on the LPAR, investigate turning this off
• Need to monitor DBM1 storage to understand if there is storage growth after the change

Possible CPU savings: poor thread reuse means more threads are constantly deallocated and created
• The release of storage from these deallocated threads costs cycles in the address space where the thread lives, as well as in MSTR
• If POOLINAC constantly throws away threads, DIST SRB time could increase
• With locally attached threads, DBM1 could be aggravated as well
• There is also some CPU in z/OS RSM which is used to reclaim the storage
• This customer saw a 4x reduction in MSTR SRB time

REALSTORAGE_MANAGEMENT

[Chart: spikes in MSTR SRB relative to discarding thread storage]

In the graphic we can see DB2 storage going out from REAL to AUX when the real available drops to '0' on the LPAR

Use IFCID 225 and MEMU2 to look at what is in AUX vs. what is in REAL

Worst case in this example to get those pages back in: 700 MB = 179,200 4KB pages; at a sync I/O time of ~1ms, 0.001 * 179,200 = 179 seconds

The MAXSPACE suggestion of 16GB could not be supported here

DB2 and Storage
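The worst-case arithmetic above, restated as a sketch (4KB frames; variable names are our own):

```python
# Worst case to recover DB2 pages from AUX: 700 MB of 4 KB frames
# at ~1 ms per synchronous page-in, per the slide.
PAGE_SIZE = 4 * 1024               # bytes per frame
aux_bytes = 700 * 1024 * 1024      # 700 MB pushed out to AUX
pages = aux_bytes // PAGE_SIZE     # number of frames to page back in
seconds = pages / 1000             # ~1 ms (0.001 s) per page-in
print(pages, seconds)              # 179200 179.2
```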

So who caused me to get paged out? If you run a WLM activity report and look at the Storage Trend graph in the Spreadsheet Reporter, you can see the actual frames used by a service or report class

The page-in rates were also high during this time for DB2 as it recovered from AUX

DB2 and Storage

By default, DFSORT and other sort products usually take as much storage as they can get to help performance… but what about everyone else?

DFSORT parameters affecting storage use (APAR II13495) are the means to protect DB2
• These can be dynamically changed for workloads using the ICEPRMxx member

EXPMAX = % of storage for memory object and hyperspace sorting; how much you can spare depends somewhat on EXPOLD and EXPRES

EXPOLD = % of storage allowed to be pushed to AUX (consider 0)

EXPRES = % of storage to preserve, e.g. for DUMPSPACE/MAXSPACE (16GB minimum in V10)

For DB2SORT, the PAGEMON parameter limits use of central storage

Real storage and Sort products

[Chart: shows EXPMAX = 25GB, effectively capping what DFSORT can consume]

DB2 (DP2C) has over 1GB in AUX, with paging of 15 per second

EXPMAX = 20%, so SORT can use ~50GB max

The ODDLxx job is using 45GB, which is actually 31% of the available real (256-114)

EXPMAX looks at the total storage configured on the LPAR, regardless of PGFIX

Look at setting EXPOLD to 0 and lowering EXPMAX to stop DFSORT from stealing old storage and pushing us out to AUX

Ensure you do not end up with >70% of the LPAR page fixed; at 80% fixed (IRA400E) address spaces become non-dispatchable

Sort: 256GB LPAR ~114GB page fixed
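A back-of-envelope check of the numbers above (all values from the slide; variable names are our own):

```python
# EXPMAX is taken against the total storage configured on the LPAR,
# regardless of how much of it is page fixed.
lpar_gb = 256
page_fixed_gb = 114
expmax_pct = 20                              # EXPMAX = 20%
sort_cap_gb = lpar_gb * expmax_pct / 100     # ~51 GB DFSORT may use
oddl_gb = 45                                 # what the ODDLxx job consumed
share_of_free = oddl_gb / (lpar_gb - page_fixed_gb)
print(sort_cap_gb, round(share_of_free, 2))  # 51.2 0.32
```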

AUTOSIZE(YES) allows WLM to alter the buffer pools automatically, +/- 25% each time DB2 is recycled, in order to avoid sync I/O delay

The customer had this on, with PGFIX(YES), for all buffer pools

z/OS 2.1 allows the pools to shrink

V11 allows a MINSIZE and MAXSIZE to be set

Over time the page-fixed BPs grew to consume 80% of the LPAR, causing address spaces to become non-dispatchable = sysplex slowdown!!

AUTOSIZE

[Chart: a stacked graph of DB2 buffer pools on the LPAR; the line shows 80% of the 25GB LPAR]


Access Path Degradation

How do customers currently manage DB2 for z/OS performance?

A. Manually
B. IBM monitors
C. Other monitors
D. Phone call/email complaints
E. They don't
F. Ad hoc

Example: access path tracking for dynamic SQL

1. Determine dynamic access path degradation (using parameter markers)
a. How?
• i. In test, using statement cache table historical values for average elapsed time? Or ranking and browsing them?
– IFCID 316, 317, 318 must be started for useful stats (best practice)
• ii. Application folks provide a top 10 and do their own testing
• iii. MainView/OMPE reports; but how do you know who it is?
• iv. Phone call from the end user (well, of course)

2. Run the SQL through Data Studio or DSM (or other monitoring tools)
a. Does EXPLAIN in Data Studio match what you see in the STATEMENT_CACHE/PLAN tables from a previous execution?
– Meaning you must have historical records of this; look for index scans and table space scans. EXPLAINing the STMTID will just do another EXPLAIN, not show you the current access path
• i. YES: so why is this execution noticeably worse? One set of very skewed literals?
• ii. NO: obviously RUNSTATS must have invalidated the statement and you have prepared the statement again, which should be when the end user noticed the degradation
– Could better RUNSTATS help?

What is IBM Data Studio?

A comprehensive tool: an integrated environment for managing databases and developing database applications

Replaces Control Center in DB2 LUW

Built on the popular Eclipse framework

Runs on Red Hat Linux, SUSE Linux, Windows

Web and Eclipse components

Data Studio web console: health and availability monitoring

No charge to download!

Review and visualize the access path
– See the flow of query processing
– See indexes and operations
– See the optimizer rationale

Assess access path stability to reduce the risk of performance regression
– Is the optimizer able to apply the filtering early?
– Are there indexes that support an efficient path?
– Do statistics allow distinction between the choices?

http://www.ibm.com/developerworks/data/library/techarticle/dm-1006optimquerytuner1/

#2 Analyze Access Plans: Avoid Common Mistakes

April 18, 2017

#2 Access Path – what is a good / bad access path?

No definitive answer exists for this question, except for: IT DEPENDS!

Let's look at some issues in order to find the "correct" answer

One example: one SQL statement costs 0.050 CPU-sec while another one costs 2 CPU-minutes. Which one do you want to spend time tuning?

#2 Access Path – what is a good / bad access path?

The "cheap" SQL statement is executed 100,000 times per hour in an online transaction, while the other is executed once every day in a batch job

One week's consumption:

Online SQL: 0.050 x 100,000 x 24 x 7 = 840,000 CPU-sec
Batch SQL: 2 x 60 x 7 = 840 CPU-sec
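The weekly totals above can be checked in a few lines; the 1000:1 ratio is why the "cheap" statement is the one worth tuning first.

```python
# Weekly CPU consumption: a 0.050 CPU-sec statement run 100,000 times/hour
# vs. a 2 CPU-minute statement run once per day, per the slide.
online_weekly = 0.050 * 100_000 * 24 * 7   # CPU-sec
batch_weekly = 2 * 60 * 7                  # CPU-sec
print(online_weekly, batch_weekly, online_weekly / batch_weekly)
# 840000.0 840 1000.0
```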

Improve statistics quality and collection

Results:
• Accurate estimated costs
• Better query performance
• Less CPU consumption
• Improved maintenance window throughput

"80% of access path PMRs could be resolved by the statistics advisor before calling IBM support." – IBM Support

Generates RUNSTATS control statements

Explains conflicting statistics

How are statistics currently collected?

Why are histogram statistics not collected? (Distribution statistics)

Are you over-collecting statistics?

The simple answer: some collect no or insufficient statistics
• The prime reason for poorly performing access paths

Do you want to collect statistics on every column and every permutation of column combinations?
• No way!

It requires similar analysis of the SQL as for index design

You would have to include columns which you may not benefit from adding to an index

Analysis of queries is labor intensive without QWT

It is an iterative process of analyzing explain data (as always)

What to do next

Indexes are decided at the design stage…

A lot of effort is spent making the SQL use the provided indexes

But what if the SQL is "right" and it's the indexes that are "wrong"?

How do you simply test your hypotheses without impacting production?

What are indexes for?
– Access paths: allow high-performance random access
– Potentially bad for sequential access
– Cost resources to maintain

#3 What are Indexes for?

Indexes are used to increase performance

#3 What to do next…. Improve database design
– Testing applications
– Fine tuning
– Deploying new applications
– Removing obsolete indexes
– Redesigning for performance
– Troubleshooting
– Re-deploying applications
– Modifying existing indexes

Indexes are used to increase performance

4. Previous attempts were futile and we believe it is an optimizer issue, so engage Level 2: send doc to Level 2
– Capture the SQL from the statement cache, along with an SQL environment capture via stored procedure, Data Studio, or DSM, including the literal values

5. As a last-ditch effort, assisted by Level 2, use access path hints in DB2: determine if BIND(QUERY) is reasonable, determine which access path you want if it was not the one that was prepared, or fake your stats
– Put it in DSN_USERQUERY, BIND QUERY, then flush the user query table

What to do next….

#4 Capture Statement Cache

Filter rows

TIP: Query Tuner itself issues dynamic SQL; if you check the box, that information will not be captured

Schedule

Enable environment reproduction
– DDL
– Catalog statistics
– Zparms (if the DSNWZP stored procedure is available)
– Explain tables information (PLAN_TABLE, DSN_DETCOST_TABLE, DSN_PREDICAT_TABLE) (if SQL is used as input)

Speed up the service process

Environment Capture Facilitates Collaboration

The IBM DB2 Performance Management Solution provides:

Fast identification with automated alerts, proactive notification, and 24x7 monitoring

Proactive tuning of queries and workloads

Expert advice with built-in advisors

A diverse set of capabilities managed via Data Server Manager (DSM)

An easy-to-use integrated view of overall DB2 performance management

Seamless navigation and movement via functional capabilities versus individual products

DSM


References

RMF Spreadsheet Reporter tool
– http://www-03.ibm.com/systems/z/os/zos/features/rmf/tools/rmftools.html
– http://www-03.ibm.com/systems/z/os/zos/features/rmf/tools/index.html#spr_win

ATS CICS and DB2 Performance Workshop Wildfire class
– http://w3-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1778

MEMU2
– https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270000K6H5

** Trials and tribulations endured by customers via trial-and-error tuning

Methods for capturing documentation for all releases are documented here: https://www.ibm.com/support/docview.wss?uid=swg21206998

SYSPROC.ADMIN_INFO_SQL supports V8 -> V10 (required); excellent developerWorks article here:
• http://www.ibm.com/developerworks/data/library/techarticle/dm-1012capturequery/index.html

It is installed in the V10 base and is subject to the installation verification process
• DB2HLQ.SDSNSAMP(DSNTESR) will create and bind it
• The calling program is DSNADMSB; sample JCL is in DSNTEJ6I

Ensure DB2 9 and DB2 10 have APAR PM39871 applied

Data Server Manager incorporates this procedure into a web UI (best practice)
• http://www.ibm.com/developerworks/downloads/im/data/#optional
• No-charge product; replacement for OSC, Visual Explain, and Data Studio
• Single option to download
• Incorporates the Statistics Advisor
• Query Environment Capture is used to collect doc
• FTP doc directly to DB2 Level 2 from the tool
• Can be used to duplicate stats in TEST

Capturing documentation for IBM for access path regression