Capacity Management for Large Virtual Server Estates
A Rationalized Approach
Copyright 2014, PerfCap Corporation
The Capacity Planning Dilemma
Computing style shift: complex distributed systems
“Real CP” too expensive/complex & “Cycles are free”
Vast estates of underutilized systems
Real capacity planning marginalized
Little or no capacity planning and little cost control
Reactive capacity management
Expensive “Black Swans”
[Figure labels: capacity, cost, workload]
CM for Large Server Estates 2
Applications and Infrastructure
Trade Finance
Retail Banking
Cash Management
Trust & Securities
Securities Origination, Sales, Trading
Corporate Advisory
FOREX Trading
Mutual Funds Investments
Alternative Investments (RREEF)
Institutional Asset Management
Insurance Asset Management
Online Banking
ETFs
…
A multitude of applications share the same IT infrastructure. Each application has its own capacity management needs.
IT managers struggle to balance costs and performance.
PM/CP Process
Reactive analysis
• Monitor Key Performance Indicators (KPIs)
• Select and define KPI thresholds which suggest performance problems
• Alert and trigger investigation when KPI thresholds are crossed
Proactive management
• Attempt to predict future behavior of KPIs based on past history
• Determine risk by predicted time to failure
• Trigger investigation and corrective action in a timely fashion
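The proactive steps (predict a KPI from past history, then judge risk by the predicted time to failure) can be sketched as follows. This is a minimal illustration and not the method used in the presentation: it assumes a straight-line least-squares trend, and the function names, the daily CPU samples and the 80% threshold are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_to_threshold(days: Sequence[float], values: Sequence[float],
                      threshold: float) -> Optional[float]:
    """Days from the last sample until the trend line reaches `threshold`.

    Returns None when the KPI is flat or falling (no crossing predicted),
    and 0.0 when the trend says the threshold has already been reached.
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical history: CPU utilization (%) sampled on five consecutive days.
history_days = [0, 1, 2, 3, 4]
cpu_percent = [50, 52, 54, 56, 58]
print(days_to_threshold(history_days, cpu_percent, threshold=80))
```

A shrinking result over successive runs is the signal to trigger investigation before the threshold is actually crossed.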
The Problem
How do you do capacity management for a large server estate?
• Automate monitoring of performance data
• Automate risk evaluation
• Automate timely triggers for capacity investigation
• Selectively perform in-depth capacity planning
The Challenges
• Visualize performance, capacity and risk status of all distributed application services in a single enterprise-wide view
• Go beyond simplistic trending to projections of actual system responsiveness reflecting end-user satisfaction
• Do realistic capacity planning with limited business forecasts
• A solution that scales from 10s to 10,000s of servers
Automated Solution
Uses new:
• Methodology - Risk Analysis
• Metric - Headroom
• Risk Visualization Format
  - Status Dashboards
  - Enterprise-wide rollup status (by service, business, etc.)
  - Transition Reports
Automated Collection and Analysis
[Diagram labels. Sources: Internet, CMDB, hypervisors, Physical Servers, VMs, Clusters, Storage Arrays, Array Console, Networks, Storage, Applications. Analysis: Events, Trending, Real Time. Outputs: Performance/Capacity Reports, Risk Dashboards, Notifications.]
Automated Risk Analysis Using Common KPIs
[Figure: transaction response time plotted over time (days/weeks/months), with Breakthrough and Maximum levels, lead times, and the current risk status color.]
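One way to turn a breakthrough level and a lead time into a risk status color is sketched below. The slide does not spell out the color rules, so the three colors and the conditions that select them are assumptions made for illustration; `risk_color` and its parameters are hypothetical names.

```python
from typing import Optional


def risk_color(current: float, breakthrough: float,
               days_to_breakthrough: Optional[float],
               lead_time_days: float) -> str:
    """Assumed color rules (not taken from the presentation):

    red    - the KPI is already at or above the breakthrough level
    yellow - the projected crossing falls inside the lead time needed to react
    green  - no crossing is projected, or it lies beyond the lead time
    """
    if current >= breakthrough:
        return "red"
    if days_to_breakthrough is not None and days_to_breakthrough <= lead_time_days:
        return "yellow"
    return "green"


# Hypothetical case: response time 450 ms, breakthrough at 600 ms,
# projected to cross in 20 days, and 30 days needed to add capacity.
print(risk_color(450, 600, days_to_breakthrough=20, lead_time_days=30))
```

The point of comparing against the lead time, rather than against zero, is that a status should change while there is still time to act.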
Application Performance
The key issue of application performance is responsiveness.
e.g. transaction response time, batch turnaround time, end-to-end
processing time, time to db update, trade execution time, etc.
Response Time vs KPI
Application Response Time Changes As Workload Changes
[Chart: response time (ms), 0 to 1000, versus transactions/second, 0 to 10.]
Using Trending to Determine Capacity
[Chart: response time (ms), 0 to 1000, versus transactions/second, 0 to 20, with a linear trend line.]
If acceptable response time should not exceed 600 ms, then application load capacity should not exceed 19 transactions / second.
Estimated application capacity is 19 trans/sec
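The trending estimate can be reproduced with a short sketch: fit a straight line to low-load measurements and read off the load at which the line reaches the response time limit. The sample points below are made up and were chosen so that the trend line lands on the slide's figure of 19 trans/sec; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def linear_trend(loads: Sequence[float],
                 times_ms: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (load, response time); returns (slope, intercept)."""
    n = len(loads)
    mean_load = sum(loads) / n
    mean_time = sum(times_ms) / n
    sxx = sum((x - mean_load) ** 2 for x in loads)
    sxy = sum((x - mean_load) * (y - mean_time) for x, y in zip(loads, times_ms))
    slope = sxy / sxx
    return slope, mean_time - slope * mean_load


def trended_capacity(loads: Sequence[float], times_ms: Sequence[float],
                     limit_ms: float) -> float:
    """Load (trans/sec) at which the trend line reaches `limit_ms`."""
    slope, intercept = linear_trend(loads, times_ms)
    if slope <= 0:
        raise ValueError("response time does not grow with load")
    return (limit_ms - intercept) / slope


# Made-up measurements taken at light load (trans/sec, ms).
loads = [1, 2, 3, 4, 5]
times_ms = [150, 175, 200, 225, 250]
print(trended_capacity(loads, times_ms, limit_ms=600))
```

The estimate is only as good as the assumption that response time keeps growing in a straight line, which is exactly what the following slides question.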
Application Performance: Reality vs Linear Trend
[Chart: response time (ms), 0 to 3000, versus transactions/second, 0 to 20, comparing the actual curve with the linear trend.]
This is the typical relationship between load and response time. After the “knee” of the curve is reached, response time degrades rapidly.
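The knee can be illustrated with the simplest queueing model, a single M/M/1 queue, where response time is R = S / (1 - load x S) for service time S. The presentation does not say which model lies behind its curve, so M/M/1 and the 100 ms service time are assumptions for illustration only.

```python
def mm1_response_time_ms(load_tps: float, service_time_ms: float) -> float:
    """Mean response time of an M/M/1 queue, in milliseconds.

    Utilization is load x service time; response time grows without bound
    as utilization approaches 1, which produces the knee in the curve.
    """
    utilization = load_tps * service_time_ms / 1000.0
    if utilization >= 1.0:
        raise ValueError("load is at or beyond saturation")
    return service_time_ms / (1.0 - utilization)


# Hypothetical service time of 100 ms, so saturation is at 10 trans/sec.
for load in range(0, 10):
    print(load, round(mm1_response_time_ms(load, 100.0)))
```

In this example the first five trans/sec only double the response time, while the step from 8 to 9 trans/sec doubles it again on its own. A straight line fitted to the light-load points misses that entirely.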
True Application Capacity
[Chart: response time (ms), 0 to 2000, versus transactions/second, 0 to 12.]
Actual application capacity is 9 trans/sec
True capacity is not maximum sustainable load but maximum load with acceptable performance.
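Finding the maximum load with acceptable performance amounts to searching a response time curve for the point where it meets the limit. The sketch below uses bisection over any curve that rises with load; the demo curve is an assumed M/M/1 model with a hypothetical 100 ms service time, so its result differs from the 9 trans/sec on the slide.

```python
from typing import Callable


def max_acceptable_load(response_ms: Callable[[float], float],
                        limit_ms: float, max_load: float,
                        tol: float = 1e-6) -> float:
    """Largest load in [0, max_load] whose response time stays within limit_ms.

    Assumes response_ms rises with load and is defined on the whole interval.
    """
    lo, hi = 0.0, max_load
    if response_ms(lo) > limit_ms:
        return 0.0          # unacceptable even when idle
    if response_ms(hi) <= limit_ms:
        return hi           # acceptable across the whole range
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if response_ms(mid) <= limit_ms:
            lo = mid
        else:
            hi = mid
    return lo


def demo_curve(load_tps: float) -> float:
    """Assumed M/M/1 curve: 100 ms service time, saturating at 10 trans/sec."""
    return 100.0 / (1.0 - load_tps * 0.1)


# Search just below the saturation point so the curve stays defined.
print(round(max_acceptable_load(demo_curve, limit_ms=600.0, max_load=9.99), 2))
```

Because the search takes the curve as a function, the same routine works for a measured or modeled curve of any shape, as long as it rises with load.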
Capacity Headroom
Where do you want to operate?
[Figure: response time versus workload, marking the current workload, headroom, operational capacity and saturation point.]
Response time is a function of CPU, disk, memory, adapters, etc.
Headroom is the portion of operational capacity remaining.
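Read as a formula, headroom is (operational capacity - current workload) / operational capacity. A minimal sketch, with hypothetical names and numbers, and with the choice to floor the result at zero being an assumption:

```python
def headroom_fraction(current_workload: float, operational_capacity: float) -> float:
    """Fraction of operational capacity still unused (0.0 when over capacity)."""
    if operational_capacity <= 0:
        raise ValueError("operational capacity must be positive")
    remaining = max(0.0, operational_capacity - current_workload)
    return remaining / operational_capacity


# Hypothetical service running at 6 trans/sec with an operational
# capacity of 9 trans/sec: one third of the capacity remains.
print(f"{headroom_fraction(6.0, 9.0):.0%}")
```

Both arguments must be in the same unit. Operational capacity here means the load at the acceptable response time limit, not the saturation point.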
Headroom Risk Analysis
Risk History Dashboard
Capacity Risk Monitoring
Automated Risk Analysis Computations
Risk Status History Dashboard
Risk Status Dashboards
Automated Color Transition Notification
A Tractable Solution
• Reduces capacity planner’s workload
• Closer to real user-perceived performance
• Capacity manage 10,000s of servers
VIRTUALIZED INFRASTRUCTURES
Same Issues, New Complexity
New Challenges
• New complexity
• Hierarchical views / service views
• What systems virtualized to save cost?
• Performance/capacity consequences
• “What-if” provisioning scenarios
New Level of Complexity
[Figure: a capacity management (CaM) cycle repeated for each host and for each VM. Every cycle predicts capacity demand as a metric over time, measured against a breakthrough threshold and a physical limit with lead times, and ends in an update step: update / rebalance host hardware in the host CaM cycle, update VM provisioning in the VM CaM cycle.]
Must do CM on both physical and virtual levels.
Key Principle
It is essential to provide capacity management from both the perspective of each virtual machine and the perspective of the host systems on which the virtual machines operate.
Capacity Risk (Two Perspectives)
Enterprise View
Host Views
Data Centre Views
Guest Views
Cluster Views
Service View - ERP
Service View - eMail
Service View - CRM
Service View - HR
Capacity Risk (Two Perspectives)
Projected Resource View (Any Level)
London Data Centre, CPU GHz Resource Projections, 31-Dec-2011
[Chart: GHz, 0 to 40,000. Series: Total GHz, GHz Available, Peak GHz Used, Average GHz Used, each with a projected counterpart.]
Underutilized Systems
Maximum 12-month CPU Utilizations
[Bar chart: CPU utilization (%), 0 to 100, for the following servers.]
PRDB-MP05-E00
PRDB-MP01-E00
PRDB-MP02-E00
PRDA-MP05-E00
PRDB-MP04-E00
PRDB-MP07-E00
PRDA-MP04-E00
PRDA-MP02-E00
PRDA-MP07-E00
PRDB-MP06-E00
PRDB-MP09-E00
PRDB-MP12-E00
PRDA-MP06-E00
PRDB-MP03-E00
PRDA-MP09-E00
PRDA-MP12-E00
PRDA-MP03-E00
PRDB-MP11-E00
PRDB-MP08-E00
PRDA-MP11-E00
PRDA-MP08-E00
PRDB-MP10-E00
PRDA-MP10-E00
Extract from CMDB
Underutilized Risk Color Status
[Figure: metric versus time, showing the physical limit, breakthrough threshold and lead time, plus an underutilized threshold that defines a new risk color.]
Use a new purple color status to identify virtualization candidates.
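Flagging purple candidates is a simple filter over peak utilization figures such as the 12-month maxima shown earlier. In the sketch below the 20% threshold, the server names and the utilization values are all hypothetical; the presentation does not give a threshold value.

```python
from typing import Dict, List


def virtualization_candidates(max_cpu_util_pct: Dict[str, float],
                              underutilized_threshold_pct: float = 20.0) -> List[str]:
    """Servers whose peak CPU utilization never reached the underutilized threshold.

    These would be shown with the purple status. The default threshold
    is an assumed value, not one taken from the presentation.
    """
    return sorted(name for name, peak in max_cpu_util_pct.items()
                  if peak < underutilized_threshold_pct)


# Hypothetical 12-month maximum CPU utilizations (%).
peaks = {"server-a": 8.0, "server-b": 35.0, "server-c": 12.5, "server-d": 71.0}
print(virtualization_candidates(peaks))
```

Using the maximum rather than the average keeps a server with occasional heavy peaks off the candidate list.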
Virtualization Consequences
Virtualization Consequences
What happens if I move VMs, re-provision VMs, clone VMs, change host hardware, etc.?
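A very small "what-if" for one of these questions, moving a VM between hosts, is sketched below. It is a deliberate simplification: host headroom is taken as CPU capacity minus the sum of VM peak demands, which overstates demand when peaks do not coincide. The `Host` class, the names and the GHz figures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Host:
    capacity_ghz: float
    vm_peak_ghz: Dict[str, float] = field(default_factory=dict)

    def headroom_ghz(self) -> float:
        """Capacity left after the peak demand of every resident VM."""
        return self.capacity_ghz - sum(self.vm_peak_ghz.values())


def what_if_move(hosts: Dict[str, Host], vm: str,
                 src: str, dst: str) -> Dict[str, float]:
    """Projected headroom per host if `vm` moved from src to dst.

    The hosts themselves are not modified; a negative value means the
    destination would be overcommitted at peak.
    """
    demand = hosts[src].vm_peak_ghz[vm]
    projected = {name: host.headroom_ghz() for name, host in hosts.items()}
    projected[src] += demand
    projected[dst] -= demand
    return projected


# Hypothetical two-host cluster.
cluster = {
    "host-1": Host(20.0, {"vm-a": 8.0, "vm-b": 6.0}),
    "host-2": Host(20.0, {"vm-c": 15.0}),
}
print(what_if_move(cluster, "vm-a", src="host-1", dst="host-2"))
```

Here the move frees capacity on host-1 but pushes host-2 below zero headroom, which is the kind of consequence a what-if run should expose before the change is made.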
Virtual Infrastructure CP Challenges
• Enterprise-to-host performance and capacity visibility
  - IT infrastructure servers
  - Distributed application services
• Automated performance analysis, advising and modeling
• Smooth scaling from 10s to 10,000s of servers
• “What if” modeling of vSphere clusters and services