1
Smarter z/OS Infrastructure Management with IntelliMagic Vision v8.8
Brent Phillips – Managing Director, Americas
Todd Havekost – Sr. IntelliMagic Consultant
August 24, 2016
2
Topics
• Modernizing how we process and understand RMF/SMF
• IntelliMagic Vision v8.8 new feature highlights
• How to identify unrealized MLC reduction opportunities
3
Commonly Held Assumptions
1. After 30+ years, the RMF/SMF reporting process is mature
Reality – the greatest potential value in the data is still unutilized
2. SMF is for forensics and trending, with no value for real-time
Reality – it can provide “sooner than real-time” visibility
3. CPUs should be run at 100% to maximize cost efficiency
Reality – you can optimize cost without compromising on quality
4
At IntelliMagic we create new intelligence out of your performance and configuration data.
“Any sufficiently advanced technology is indistinguishable from magic”
Arthur C. Clarke, 1962
5
Keep the z/OS infrastructure running the application workloads without service disruptions and as efficiently as possible:
• Prevent: See issues before they can disrupt service and availability
• Resolve: Quickly identify and solve underlying problems
• Optimize: Save cost without risk to performance or availability
• Elevate: Amplify the strengths of the IT team
Your goals?
6
• Realize these goals with intelligence from the SMF data, but only if interpreted using embedded expert knowledge:
‒ Derive new, meaningful metrics out of the raw SMF data
‒ Throughput limits for internal components
‒ Best practices for configuration and performance management
‒ Balance and redundancy identification
‒ Relationship and interaction of logical and physical resources
Modernizing with Embedded Expert Knowledge
7
• Far more powerful than statistical tools that look at anomalies:
‒ Lacking interpretation, such tools cannot make predictions
‒ They can suffer from many false positives
‒ They do not help improve efficiency
• Far more efficient than a human looking through all the data:
‒ Human intelligence is powerful, but there simply isn’t enough time
‒ Very ‘boring’ to review every metric/result to find the needle
The Difference
8
Performance Today: Either Good or Bad
Unconcerned
Strategic Focus
Panic - Hard to focus
What’s wrong with this picture??
9
Performance using Better Intelligence
Unconcerned
Strategic Focus
Panic - Hard to focus
10
© IntelliMagic 2014
[Chart labels: Time, Response Time, SLA Performance]
Your existing monitors look at symptoms here, only after users experience problems
Real-time Performance Monitoring
Easy metric to get, but it is an effect, not a cause
11
Availability Intelligence identifies risk here, before response time suffers
[Chart labels: Time, Response Time, Sub-component Saturation, SLA Performance]
Monitoring with Availability Intelligence
Requires evaluating every data point with expert domain knowledge about every component
12
[Chart labels: Time, Response Time, Sub-component Saturation, SLA Performance]
Most infrastructure “fires” can be prevented by intervening here
Avoiding Disruptions
13
I/O Performance Example
Lag Measure: Storage Array Response Times
Lead Measures:
Imbalance? (Within Array, Between Arrays)
Application Workloads
Config or Failure Changes?
Disk Device Loads
FW Bypass, etc.
Back-end, Cache
Adapter Utilization
Fibre Switch Errors
Front-end
14
Automation & the Power of Knowing
• Automatically identify risk in every interval, every device, every data center
• Like a “thousand pairs of eyes”, automated interpretation of what the data means is the only way to continuously achieve the ITIL v3 definition of capacity management:
– ensuring…the IT Infrastructure is able to deliver agreed Service Level Targets in a cost effective and timely manner…considers all Resources required to deliver the IT Service...
15
IntelliMagic Vision for z/OS Systems
Prevent, Resolve, Optimize, Elevate
z/OS System
• Processor
• CECs, LPARs
• Specialty Engines (zIIP, zAAP)
• Etc.
z/OS System
• Processor Cache
• Relative Nest Intensity
• SMF 113 Records
• Etc.
z/OS System
• Paging Reports
• Virtual Storage
• zEDC / PCI Express
• Etc.
z/OS System
• Workload Manager
• Several MIPS/MSU reports
• Trending/Comparison
• Etc.
Coupling Facility/XCF
• CF / XCF Health
• CF / XCF Analysis
• Trending
• Etc.
Jobs and Datasets
• Data Sets
• Address Spaces
• Trending
• Etc.
16
IntelliMagic for z/OS Disk & Replication
Prevent, Resolve, Optimize, Elevate
Disk Storage
• Front End
• Back End
• Channels & zHPF
• Etc.
Disk Replication
• Replication Status
• Rating Over Time
• Verify Balance
• Etc.
FICON Directors
• Director Health
• Service Statistics
• Channel Health
• Etc.
Jobs and Datasets
• Data Sets
• Address Spaces
• Trending
• Etc.
Intelligent Trending
• Selected Statistics
• Summarized Hourly
• Summarized Daily
• Etc.
Comparison
• All Reports
• Compare Daily
• Compare Weekly, Monthly
• Etc.
17
IntelliMagic Vision for z/OS Virtual Tape
Prevent, Resolve, Optimize, Elevate
Tape System
• Cache
• Throttling
• Balance
• Etc.
Tape Replication
• Send/Receive
• Grid Transfers/queues
• Balance
• Etc.
Intelligent Trending
• Selected Statistics
• Summarized Hourly
• Summarized Daily
• Etc.
Host Activity
• Systems/Jobs/Programs
• Volume Groups
• Device Groups
• Etc.
Front End
• Throughput
• Virtual Devices/Mounts
• Balance
• Etc.
Back End
• Pools
• Migration/Recall/Reclaim
• Balance
• Etc.
Detailed webinar overview on Virtual Tape Analytics: Tuesday, August 30: bit.ly/zvtape
18
• Good problem to solve with Software as a Service
• Easy access to intelligence relevant to different roles
• Access to IntelliMagic experts for knowledge transfer, analysis
• Solution infrastructure is managed for you, creating more focus
IntelliMagic Vision as a Service
19
IntelliMagic Vision Homepage
20
Embedded Expertise: Infrastructure KRIs
Focus area: Disk Subsystem, LPAR, CPU, WLM, Virtual Tape, etc.
Performance Metrics
Key Risk Indicators
Automatically identify and rate performance risks and efficiency opportunities with thousands of automated health checks
21
Embedded Expertise: Quantify good vs bad
Automatically rate existing and new metrics using embedded expert knowledge about z/OS and your infrastructure
to derive intelligence about performance threats and efficiency opportunities
No Border, Opinion N/A
Green Border, Good
Yellow Border, Early Warning
Red Border, Exceptions
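The border legend above can be read as a simple threshold rating. The following is a minimal sketch only, assuming a metric where higher values are worse; the function name `rate_metric` and the threshold values are hypothetical and are not taken from IntelliMagic Vision, whose actual rating logic is not described in these slides.

```python
from typing import Optional


def rate_metric(value: Optional[float], warning: float, exception: float) -> str:
    """Return a border colour for a metric where higher values are worse.

    The thresholds are hypothetical inputs; the slides do not state how
    the product derives them.
    """
    if value is None:
        # No data or no opinion: the legend shows no border.
        return "none"
    if value >= exception:
        return "red"      # Exception
    if value >= warning:
        return "yellow"   # Early warning
    return "green"        # Good


# Example with made-up thresholds for a utilization-style metric (%).
for sample in (None, 42.0, 75.0, 93.0):
    print(sample, rate_metric(sample, warning=70.0, exception=90.0))
```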
22
Embedded Expertise: Rate Exception Severity
A three-level rating system based on hardware capabilities
A three-level, dynamic rating based on both workload characteristics and hardware
23
IntelliMagic Vision v8.8 Highlights
24
8.8 z/OS Reporting Enhancements
• Application Groups
• Real and Virtual Storage
• Processor Reporting
25
Application Groups
26
Application Groups
27
Application Groups – Data Sources
• Data sets
• Coupling Facility structures
• Disk volumes
• XCF members
• XCF transmission groups
• Jobs
• Service Classes
• Report Classes
28
29
Real and Virtual Storage Reporting
• Virtual storage
• Real storage
• LFAREA / 1MB Pages
• High Virtual Common (HVCOMMON) and High Virtual Shared (HVSHARE) areas
• Storage Class Memory / Flash Storage
30
31
Processor Reporting – Guiding Principles
• Separate reporting of CPU on general purpose CPs from zIIPs and zAAPs
• Use MIPS to refer to the processor capacity rating metric (not in its classic sense of millions of instructions per second)
32
Processor Reporting – New Report Sets
• Tables and CP Usage
• 4 Hr Rolling Avg and Capping
• LPAR Mgmt and Capture Ratio
• Priority Raised
• Mobile
33
34
Tables and CP Usage
• Overview tables and multicharts from “All Processors” report set
• WLM Parms table from “WLM Constants” report set
• 2 sets of reports for CPU on general purpose CPs
‒ Sequenced by CEC, LPAR, Workload, Service Class
‒ First set in units of MIPS
‒ Second set in units of Processors
35
36
4 Hr Rolling Avg and Capping (1 of 5)
• New CECs table with data and drilldowns focused on Vertical CP configurations (“polarization”) and LPAR Topology (99.14)
• New LPAR Config table with key metrics for Vertical CP configuration and MLC analysis
‒ Physical & Logical CPs, Vertical CP configuration, LPAR % Weight, Capacity Group, Soft Cap
‒ Polarity drilldown shows MIPS by Polarity (VH, VM, VL)
‒ Changes drilldown lists changes in Vertical CP config
37
4 Hr Rolling Avg and Capping (2 of 5)
• Rolling 4 Hour Average charts - by CEC and LPAR
‒ Drilldown compares 4HRA vs. interval RMF CPU usage
‒ Further drilldowns by Workload, Service Class, Address Space
• % WLM Capping – critical aspect of MLC Reduction
‒ Capping Limited Processor Resources (%)
‒ Capping Processor Resources Considered by WLM (%)
• This indicates the time interval when the LPAR's access to CPU was limited and the vertical CP configuration was disrupted
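The comparison of the 4HRA against interval CPU usage can be illustrated with a short sketch. It assumes equal-length RMF intervals (15 minutes by default) and averages over whatever samples are available until a full four hours have accumulated; the function names, the sample data, and the defined-capacity value are hypothetical, and this is not presented as how WLM or IntelliMagic Vision computes the figure internally.

```python
from collections import deque
from typing import Iterable, List


def rolling_four_hour_average(interval_msu: Iterable[float],
                              interval_minutes: int = 15) -> List[float]:
    """Trailing 4-hour average of MSU consumption, one value per interval."""
    if interval_minutes <= 0 or 240 % interval_minutes != 0:
        raise ValueError("interval_minutes must divide 240 evenly")
    window_size = 240 // interval_minutes
    window = deque(maxlen=window_size)  # oldest sample drops off automatically
    averages = []
    for msu in interval_msu:
        window.append(msu)
        averages.append(sum(window) / len(window))
    return averages


def intervals_over_defined_capacity(averages: List[float],
                                    defined_capacity_msu: float) -> List[bool]:
    """Flag intervals where the 4HRA exceeds a (hypothetical) defined capacity."""
    return [avg > defined_capacity_msu for avg in averages]


# Four quiet hours at 100 MSU followed by four busy hours at 200 MSU.
usage = [100.0] * 16 + [200.0] * 16
four_hra = rolling_four_hour_average(usage)
flags = intervals_over_defined_capacity(four_hra, defined_capacity_msu=150.0)
print(four_hra[16], usage[16])  # the 4HRA lags well behind the interval value
print(flags.index(True))        # first interval where the 4HRA passes 150 MSU
```

In this made-up series the interval usage jumps immediately, while the 4HRA rises gradually, which is why the two views are worth comparing side by side.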
38
4 Hr Rolling Avg and Capping (3 of 5)
• % Logical CP Utilization
‒ Low utilizations can indicate surplus logical CPs
‒ Online Processor drilldown helpful to confirm LPARs where logical CPs may be over-specified
• CP Weight – CP Usage vs. LPAR Weight (rated)
‒ Low values can help identify LPARs that could “donate” weight to increase Vertical Highs for high-use LPARs
• Logical CP tuning can help optimize Vertical CP configuration and minimize PR/SM overhead
‒ 2:1 Logical/Physical is general Rule of Thumb
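The 2:1 rule of thumb can be checked with simple arithmetic. This sketch assumes the ratio is the total of logical CPs defined across all LPARs on a CEC divided by that CEC's physical general purpose CPs; the LPAR names and counts are made up for illustration.

```python
from typing import Dict


def logical_to_physical_ratio(logical_cps_by_lpar: Dict[str, int],
                              physical_cps: int) -> float:
    """Total logical CPs across LPARs divided by physical CPs on the CEC."""
    if physical_cps <= 0:
        raise ValueError("physical_cps must be positive")
    return sum(logical_cps_by_lpar.values()) / physical_cps


def exceeds_rule_of_thumb(ratio: float, limit: float = 2.0) -> bool:
    """True when the ratio is above the 2:1 guideline quoted on the slide."""
    return ratio > limit


# Hypothetical CEC with 8 physical CPs and three LPARs.
lpars = {"PRD1": 8, "PRD2": 6, "TST1": 6}
ratio = logical_to_physical_ratio(lpars, physical_cps=8)
print(ratio, exceeds_rule_of_thumb(ratio))
```

A ratio above the guideline is only a prompt to look at the Online Processor drilldown, not proof that any one LPAR is over-specified.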
39
4 Hr Rolling Avg and Capping (4 of 5)
• RNI by LPAR
‒ Also appears here (in addition to Processor Hardware focal point) because it is a critical metric for tuning
‒ Processor drilldown very helpful to show RNI impact of work executing on VMs & VLs (after filtering out zIIPs and parked VLs)
• % CPs Vert High – % Physical CPs Defined as Vertical Highs for CEC
‒ Workloads executing on Vertical Highs experience improved processor cache efficiency
40
4 Hr Rolling Avg and Capping (5 of 5)
• Polarity CEC - % CP Time Dispatched on Vertical Highs
‒ Work executing on Vertical Highs typically has a lower RNI and thus executes more efficiently
‒ Drilldowns by System and MIPS by Polarity
• Dispatch Pol. – Dispatched CP MIPS by LPAR by Polarity
‒ Shows MIPS executing on VHs, VMs and VLs for all LPARs
‒ Extremely helpful to identify opportunities for RNI tuning
‒ By Time and Changes drilldowns also very useful
• WLM Nodes – LPAR Topology from 99.14 records
41
42
LPAR Mgmt and Capture Ratio (1 of 2)
• Improves reporting on PR/SM LPAR overhead
• Phys CP % - Unattributed LPAR Overhead for Physical Partition
‒ RMF collects overhead that cannot be attributed to a specific LPAR and reports it in the *PHYSICAL* LPAR
‒ Expressed as % of entire CEC
43
LPAR Mgmt and Capture Ratio (2 of 2)
• CP LPAR Mgmt % - Overhead assigned to an LPAR
‒ Expressed as % of that LPAR's total utilization (e.g., attributed LPAR overhead of 0.5% of CEC for an LPAR that consumed 10% of CEC would be 5% on this report)
‒ Capture Ratio drilldown compares LPAR Mgmt % vs. RMF capture ratio (typically an inverse relationship: lower LPAR Mgmt % correlates to higher RMF capture ratio)
• CP Capt % - CP Capture Ratio
‒ General purpose CPU time captured in RMF 72.3 (and assigned to service classes) as % of total general purpose CPU consumption per RMF 70, by system
‒ Drilldown to compare against LPAR Mgmt %
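The two percentages on this slide reduce to simple ratios. The sketch below reproduces the slide's own worked example (0.5% of CEC overhead for an LPAR using 10% of CEC) and shows the capture ratio as captured CPU time over total CPU time; the function names and the CPU-seconds figures are hypothetical, and extracting the inputs from RMF 70 and 72.3 records is not shown.

```python
def lpar_mgmt_pct(overhead_pct_of_cec: float, lpar_usage_pct_of_cec: float) -> float:
    """Attributed LPAR overhead as a % of that LPAR's own total utilization."""
    if lpar_usage_pct_of_cec <= 0:
        raise ValueError("LPAR usage must be positive")
    return 100.0 * overhead_pct_of_cec / lpar_usage_pct_of_cec


def capture_ratio_pct(captured_cpu_seconds: float, total_cpu_seconds: float) -> float:
    """CPU time assigned to service classes as a % of total CPU consumed."""
    if total_cpu_seconds <= 0:
        raise ValueError("total CPU time must be positive")
    return 100.0 * captured_cpu_seconds / total_cpu_seconds


# Slide example: 0.5% of CEC overhead on an LPAR that consumed 10% of CEC.
print(lpar_mgmt_pct(0.5, 10.0))

# Made-up interval: 810 CPU seconds captured out of 900 consumed.
print(capture_ratio_pct(810.0, 900.0))
```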
44
Processor Reporting – New Report Sets
• Tables and CP Usage
• 4 Hr Rolling Avg and Capping
• LPAR Mgmt and Capture Ratio
• Priority Raised – WLM raising priority of tasks holding resources other tasks are waiting for
• Mobile – New WLM capability to classify activity originating from mobile devices
45
46
47
8.8 z/OS Reporting Enhancements
• Application Groups
• Real and Virtual Storage
• Processor Reporting
48
How to Identify Unrealized MLC Opportunities
49
• FTP RMF/SMF data to IntelliMagic
• IntelliMagic will analyze and identify opportunities
• You get access to log on and explore your data
MLC Assessment Service