
Post on 31-Jul-2018






  • Availability and Storage Intelligence: What it can do for you

    Ros Schulman and John Ticic, HDS and IntelliMagic

    Date of presentation: 01/11/2016. Session LE

  • Methodology for Monitoring Mainframe


  • Characterize the Issue: Dimensions of Performance Measurement

    Online workloads
      Throughput: IOPS, MB/s
      Response time
      Achieved by employing low resource utilization and minimal queuing

    Batch workloads
      Throughput: IOPS, MB/s
      Achieved by employing maximum resource utilization with moderate queuing

    Optimizing utilization of a storage access resource for a batch workload and for an online workload are mutually exclusive goals. Consequently, batch workloads and online workloads should generally not share the same storage access resource at the same time.

  • Online Workload I/O Profile

    Metric Name         | Name in MAR          | Value Description                                    | Normal Value | Bad Value | As Seen From
    I/O Rate            | IOPS                 | I/Os per second                                      | N/A          | N/A       | Host View
    Read Rate           | Disk Reads/sec       | I/Os per second                                      | N/A          | N/A       | Host & Port View
    Write Rate          | Disk Writes/sec      | I/Os per second                                      | N/A          | N/A       | Host & Port View
    Read Block Size     | Avg Disk Bytes/Read  | Bytes transferred per I/O operation                  | 4k to 27k    | N/A       | Host View
    Write Block Size    | Avg Disk Bytes/Write | Bytes transferred per I/O operation                  | 2k to 27k    | N/A       | Host View
    Read Response Time  | Avg Disk Sec/Read    | Time required to complete a Read I/O (milliseconds)  | 1 to 10      | > 10      | Host & Port View
    Write Response Time | Avg Disk Sec/Write   | Time required to complete a Write I/O (milliseconds) | 1 to 3       | > 3       | Host & Port View

    Values shown in the Normal Value column are planning estimates. You should baseline your I/O profile when all systems are in a good and normal running state.

  • Assessing Utilization Levels, Online or Batch

    Online workloads: assess on a minute-by-minute basis
      High utilization in any minute is a point of concern. Scope of concern:
        Immediate: likely to cause perceptible response time problems
        Contingent: inadequate reserves to support processing during failures

    Batch workloads: assess the average over the time remaining in the batch window
      High utilization by itself is not cause for concern; it is a design goal. Maximizing utilization per resource maximizes throughput per resource, the optimization goal for batch processing.
      High response time is not cause for concern either. High response time is a natural consequence of high utilization levels and moderate queuing, the keys to maximizing throughput per resource.
      The key question: Is there adequate capacity to complete processing within the batch window even after a component failure?
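
    The key batch question above can be sketched as a small calculation. This is only an illustration, not something from the slides: the function name, its parameters, and the assumption that all resources deliver equal throughput are hypothetical.

```python
def batch_fits_after_failure(remaining_gb, mb_per_sec_per_resource,
                             resources, hours_left, failed=1):
    """Return (fits, hours_needed) for the batch work still to be done.

    Hypothetical model: every resource delivers the same throughput and
    the work spreads evenly over the resources that survive a failure.
    """
    surviving = resources - failed
    if surviving <= 0:
        return False, float("inf")
    throughput_mb_s = surviving * mb_per_sec_per_resource
    hours_needed = (remaining_gb * 1024) / throughput_mb_s / 3600
    return hours_needed <= hours_left, hours_needed


# Example: 2700 GB left, 4 resources at 256 MB/s each, 2 hours remaining.
fits, hours_needed = batch_fits_after_failure(2700, 256, 4, 2)
```

    With one of the four resources lost, the remaining three need one hour, so this hypothetical batch still fits in its window.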

  • Design for Normal or Failure Operation

    Traffic Light System:

    Thresholds are being exceeded regularly. Corrective attention is recommended. Further load may cause severe degradation.

    Thresholds are being exceeded occasionally. Increased monitoring is appropriate. Bursts of load may cause noticeable performance degradation.

    Normal operation. No thresholds are being exceeded. Ability to accommodate bursts of load without noticeable impact.

    [Figure: utilization scale marked at 50% and 75%, labeled "Design/build for normal operation", "Design/build for failure operation", and "Allow for utilization during failure modes"]
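
    One way to turn the three states into a check is sketched below. The slide does not define "regularly" versus "occasionally", so the cut-off fraction, the color names, and the function name are all assumptions made for illustration.

```python
def traffic_light(samples_pct, threshold_pct, regular_fraction=0.10):
    """Classify a resource from its per-interval utilization samples.

    Assumptions (not from the slide): "regularly" means the threshold is
    exceeded in at least `regular_fraction` of the intervals; anything
    above zero but below that is "occasionally".
    """
    if not samples_pct:
        raise ValueError("no samples to classify")
    exceeded = sum(1 for s in samples_pct if s > threshold_pct)
    if exceeded == 0:
        return "green"    # normal operation, no thresholds exceeded
    if exceeded / len(samples_pct) < regular_fraction:
        return "yellow"   # exceeded occasionally, increase monitoring
    return "red"          # exceeded regularly, corrective attention


state = traffic_light([40, 45, 60, 70, 80], threshold_pct=50)
```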


  • Reporting Intervals

    Most performance issues are analyzed using 1-minute data intervals.
    Performance problems requiring shorter interval analysis are rare, but do occur.
    Analysis of 1-minute interval data is generally limited to 1 to 2 day durations.
    Short intervals avoid muting peaks by averaging.
    Longer 15-minute intervals are mostly useful for workload cycle and trend analysis.

  • Assess Storage Resource Utilization

    Utilization = percent busy or occupied.

    Most storage performance problems are attributable to excessive storage resource utilization:
      High MPB utilization
      High front end port utilization
      High write pending (high back end utilization)
      High array group utilization

    Storage resource throughput, utilization, and response time are reported by:
      Mainframe Analytics Recorder (MAR)
      SVP Performance Monitor (Export)
      Tuning Manager

  • Front End Port Utilization

    Front end port utilization: work to balance a system or predict new loads.
    High microprocessor utilization is an unambiguous indication of high port utilization.
    Low microprocessor utilization is not by itself a definitive indication of low port utilization.
    When port microprocessor utilization is low, throughput in MB/s must also be examined before concluding that port utilization is low.
    Throughput constraints for small block I/O traffic typically manifest themselves as high microprocessor utilization.
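
    The two-step check described above might look like the sketch below. The threshold values, the port limit parameter, and the function name are hypothetical and would need to come from your own configuration and baselines.

```python
def assess_front_end_port(mp_util_pct, throughput_mb_s, port_limit_mb_s,
                          high_mp_pct=70.0, high_throughput_fraction=0.7):
    """Classify front end port utilization as "high" or "low".

    All default thresholds are placeholders, not vendor guidance.
    """
    if port_limit_mb_s <= 0:
        raise ValueError("port_limit_mb_s must be positive")
    # High microprocessor utilization alone is enough to call the port busy.
    if mp_util_pct >= high_mp_pct:
        return "high"
    # Low microprocessor utilization is not conclusive, so the MB/s
    # throughput is compared with the port's limit before deciding.
    if throughput_mb_s / port_limit_mb_s >= high_throughput_fraction:
        return "high"
    return "low"


result = assess_front_end_port(mp_util_pct=20, throughput_mb_s=700,
                               port_limit_mb_s=800)
```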

  • Maximum Recommended Array Group Utilization

    For online workloads
      30% during normal operations for HDD
      For SSD/FMD you can go up to 80%
      Utilization reserve required to accommodate failure

    For batch workloads
      As high as possible, because the batch metric is normally elapsed time
      Expected maximums of 70%-80%
      Depends on the burst profile of the initiator
      Average utilization over time remaining in batch window should not exceed
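
    The limits on this slide can be kept in a small lookup so that measured array group utilization is checked consistently. The names below are hypothetical, and using the upper end of the 70%-80% batch range as the batch limit is an assumption.

```python
# Online limits in percent, as listed on the slide.
ONLINE_MAX_PCT = {"HDD": 30, "SSD/FMD": 80}

# The slide gives expected batch maximums of 70%-80%; the upper end is
# used here as an assumed limit.
BATCH_MAX_PCT = 80


def recommended_max(workload, media):
    """Return the recommended maximum array group utilization in percent."""
    if workload == "batch":
        return BATCH_MAX_PCT
    if workload == "online":
        return ONLINE_MAX_PCT[media]
    raise ValueError("workload must be 'online' or 'batch'")


def exceeds_recommendation(workload, media, util_pct):
    """True when measured utilization is above the recommended maximum."""
    return util_pct > recommended_max(workload, media)


too_busy = exceeds_recommendation("online", "HDD", 45)
```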

  • Response Time Monitoring

    Response time
      This threshold depends on the application needs and the Service Level Agreement (SLA) for the application.
      Since the Logical Unit (LU) response time has a direct impact on applications, this indicator should be monitored on key LUs to determine deltas as loads increase.

    Watch out for the worst performing LUs
      Use a performance monitor to look at the worst performing LUs by correlating to VOLSERs.
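
    A minimal sketch of ranking LUs by response time and correlating them to VOLSERs. The input dictionaries, the LU identifiers, and the VOLSER names are made up for illustration; real data would come from your performance monitor.

```python
def worst_lus(response_ms_by_lu, volser_by_lu, top_n=3):
    """Return the top_n LUs by response time as (lu, volser, ms) tuples."""
    ranked = sorted(response_ms_by_lu.items(),
                    key=lambda item: item[1], reverse=True)
    return [(lu, volser_by_lu.get(lu, "unknown"), ms)
            for lu, ms in ranked[:top_n]]


# Hypothetical sample data.
response_ms = {"00:10": 2.1, "00:11": 14.8, "00:12": 0.9, "00:13": 7.5}
volsers = {"00:10": "PROD01", "00:11": "PROD02", "00:13": "TEST01"}
worst = worst_lus(response_ms, volsers, top_n=2)
```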

  • Benefits and Use cases

  • Why do you need DASD internal metrics?

    Think of medicine before radiology.
    Choose your medicine.

  • Stg Analytics: X-Ray for your Storage

  • Capture Data with MXG

    All the data presented in these slides is captured with MXG.
    Uses the standard MXG members: TYPEMAR, VMACMAR, EXMARnn


  • Drilling Deep: MRI for your DASD

    RMF does not go away.
    It still provides the host view of performance.
    15-minute intervals can smooth out extremes and variance.
    Activity does not usually happen on a quarter-hour boundary.

  • Drilling Deep: MRI for your DASD

    RMF at 15 minutes with MAR at 1 minute.
    Not all sites can reduce RMF to 1-minute intervals.
    The MAR interval can be shorter than the RMF interval.
    You can still find problems below the surface, like an MRI.

  • What are the overheads? The observer effect

    Did it change because I looked?

  • A Tuning Example: Where RMF is not enough

  • A performance problem is diagnosed with RMF

    Barry Merrill (MXG) taught the value of the scatter plot.
    There are many observations of volumes with excessive response time at medium I/O rates.

  • A performance problem is diagnosed with RMF

    The anomaly is paralleled by Average Pend Time.
    All the usual suspects eliminated: Channel MP Usage Open Exchanges
    Next candidate: MPB % busy

  • Identify the volumes with significant pend time

    Select volumes where the 95th percentile of pend time across all intervals is greater than 10 ms.
    Assume that this identifies volumes with the most frequent incidence of high pending time.
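
    The 95th percentile filter could be sketched as follows. The nearest-rank percentile method, the input layout, and the sample VOLSERs are assumptions made for the example; the slides do not say how the percentile was computed.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def high_pend_volumes(pend_ms_by_volser, limit_ms=10.0, pct=95):
    """Return VOLSERs whose pct-th percentile pend time exceeds limit_ms."""
    return sorted(
        volser for volser, samples in pend_ms_by_volser.items()
        if samples and percentile_nearest_rank(samples, pct) > limit_ms
    )


# Hypothetical per-interval pend times in milliseconds.
pend_ms = {
    "VOL001": [1.0] * 18 + [15.0, 15.0],   # high pend in 10% of intervals
    "VOL002": [1.0] * 19 + [15.0],         # a single outlier interval
}
flagged = high_pend_volumes(pend_ms)
```

    Only VOL001 is flagged: a single bad interval out of twenty does not move the 95th percentile, which is why the percentile picks out volumes with frequent high pend time rather than one-off spikes.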

  • Now we start the X-Ray

  • Which MPBs are used by these volumes?

    Use device number and VOLSER to tie RMF and internal statistics together.
    Identify that all high pend time volumes are using the same MPB.
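
    Tying the two data sources together amounts to a join on device number and VOLSER. The sketch below uses plain dictionaries with invented keys and MPB numbers; it does not reflect the actual MXG dataset layout.

```python
from collections import defaultdict


def volumes_by_mpb(high_pend_keys, mpb_by_volume):
    """Group high pend volumes by the MPB that owns them.

    high_pend_keys: iterable of (device_number, volser) taken from RMF.
    mpb_by_volume: dict mapping the same key to an MPB id, taken from the
    storage system's internal statistics (hypothetical layout).
    """
    grouped = defaultdict(list)
    for key in high_pend_keys:
        mpb = mpb_by_volume.get(key)
        if mpb is not None:
            grouped[mpb].append(key)
    return dict(grouped)


# Hypothetical sample data.
high_pend = [("1A00", "VOL001"), ("1A07", "VOL008"), ("1B02", "VOL019")]
internal = {("1A00", "VOL001"): 3, ("1A07", "VOL008"): 3,
            ("1B02", "VOL019"): 3, ("1C00", "VOL030"): 1}
grouped = volumes_by_mpb(high_pend, internal)
```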

  • Identify the volumes with significant pend time

    MPB 3 is frequently overused.
    Other MPBs, except 0, have spare capacity.
    Assume a correlation between high pend time and MPB usage.
    Solution: reassign the high pend LDEVs to other MPBs.

  • So how good was the treatment?

  • Radiotherapy for DASD

    Tuning and optimization that RMF cannot provide
    Statistics based on the storage architecture: cache, channel ports, CLPR, PG, HDP/HDP activity
    Capacity planning
      Track MPB usage trends
      Preventative tuning
      Improve JIT upgrades

  • Building on RMF: Let's look at RMF data and see how additional MAR data can help us understand and investigate performance issues.

  • Building on RMF

    RMF has very useful performance and configuration data, but we sometimes need to supplement what is available with vendor-specific data.

    Let's look at some examples.

    Sample charts shown using IntelliMagic Vision.

  • Response Time. Based on RMF Data: The Host View

    A critical metric for application performance. Good performance, but we need other metrics to judge this.

  • Other Critical Metrics. Based on RMF Data: The Host View

    All critical to judging performance and potentially understanding disk utilization. But what is it like under the covers?

    I/O Rate

    Throughput (MB/s)

  • Back-end Drive Rate (ops/s). Based on RMF Data: The Disk View

    Yes, RMF will show us disk internal activity. We can use the Rank statistics (SMF 74.8) to see I/O activity at the internal disk


  • Back-end Read Response Time (ms). Based on RMF Data: The Disk View

    We can see that one of the tiers (7.2K 4 TB) has a significant response time peak.

  • Response Time for Reads from RAID Group. Based on RMF Data: The Disk View

    We can go deeper and get to the individual ranks (parity groups). But we don't have detailed inter

