cmg 101 - understanding performance
DESCRIPTION
Web performance is good, understanding performance is better. What you need to understand in order to have IT systems that perform well at a reasonable cost.
TRANSCRIPT
Performance is good, Understanding performance is better
Peter HJ van Eijk
Chairman NLCMG
A non-profit community of professionals
Feb 11, 2012
CMG 101
Computer Cloud Measurement Group
Understand:
• Definitions of availability and response time
• Psychological and business effect of delay/response time. User interfaces, cost of downtime
• Transactions, and their structure
• Waterfall diagrams for transactions and web page downloads
• Performance measures (seconds, bytes, bits per second, IOPS, etc.)
• Reporting measures / metrics
• Visualization of quantitative data, how to
• Resources (CPU, memory, disk, network, software)
• Elementary queuing theory
• Phases in development and how to incorporate performance and capacity (analysis, design, etc.), performance engineering
• Typical free and commercial tools, or at least their functionality
– monitoring, reporting, alerting, analysis, modelling
Availability and Response Time
• Availability: Ability of a Configuration Item or IT Service to perform its agreed Function when required. […] Availability is usually calculated as a percentage.
• Response Time: A measure of the time taken to complete an Operation or Transaction
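The definition says availability is usually calculated as a percentage but does not give the formula. A minimal sketch, assuming the common convention of (agreed service time minus downtime) divided by agreed service time; the function name and the example figures are illustrative and not taken from the slides:

```python
def availability_percent(agreed_service_minutes: float, downtime_minutes: float) -> float:
    """Availability as a percentage of the agreed service time.

    Assumed convention: 100 * (agreed service time - downtime) / agreed service time.
    """
    if agreed_service_minutes <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_minutes <= agreed_service_minutes:
        raise ValueError("downtime must lie between 0 and the agreed service time")
    return 100.0 * (agreed_service_minutes - downtime_minutes) / agreed_service_minutes


# Hypothetical example: a 30-day month of 24x7 service with 43.2 minutes of downtime.
minutes_in_month = 30 * 24 * 60
print(f"{availability_percent(minutes_in_month, 43.2):.2f}%")
```

Note that the result depends on what counts as agreed service time: downtime outside the agreed window does not lower the figure.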
Graphs of availability and response time
Psychological and business cost of downtime
€ + $ + £
Sudden surges can kill you
Chart: pageviews over time, 1 Jan 2008 to 19 May 2009; y-axis 0 to 700,000 pageviews; annotation: IceSave failure. Source: SiteStat
KNMI.nl: pageviews per hour
Chart: pageviews per hour by hour of day (1 to 23); y-axis 0 to 180,000; labels: 30-dec, 31-dec, ordinary day, weather alarm day
Transactions and their structure: waterfall diagrams
Diagram: message exchange between client and server (Query, Ack, Reply, Ack), showing network latency and server turnaround time
YSlow detail
A single user level transaction decomposes into multiple transactions on components
© Digital Infrastructures
Transactions: from visits to bandwidth
Visits > Pageviews > GET requests > Bandwidth
Ratios:
• 7.42 pageviews per visit (according to SiteStat), but lower during the crisis
• 79 GET requests per visit, according to log file and SiteStat
• 10.6 (= 79 / 7.42) GET requests per pageview, effective; 32 GET requests for the homepage (according to the browser)
• About 6,800 bytes per request on average
Measurement sources:
• SiteStat measurement
• Server logs, page composition via Firebug
• HTTP server logs
Rates:
• 1.7 visits/sec (6,380 per hour)
• 13 pageviews/sec (47,338 per hour)
• 140 requests/sec
• 0.95 Mbyte/sec = 7.6 Megabit/sec
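The chain from visits to bandwidth can be reproduced from the figures on this slide. The sketch below uses the slide's ratios (6,380 visits per hour, 7.42 pageviews per visit, 79 GET requests per visit, about 6,800 bytes per request); the function and key names are illustrative, and 1 Mbyte is assumed to mean 10^6 bytes:

```python
def traffic_chain(visits_per_hour: float,
                  pageviews_per_visit: float,
                  gets_per_visit: float,
                  bytes_per_request: float) -> dict:
    """Convert a visit rate into pageview, request, and bandwidth rates."""
    visits_per_sec = visits_per_hour / 3600
    pageviews_per_sec = visits_per_sec * pageviews_per_visit
    requests_per_sec = visits_per_sec * gets_per_visit
    bytes_per_sec = requests_per_sec * bytes_per_request
    return {
        "visits_per_sec": visits_per_sec,
        "pageviews_per_sec": pageviews_per_sec,
        "requests_per_sec": requests_per_sec,
        "mbyte_per_sec": bytes_per_sec / 1e6,      # assumes 1 Mbyte = 10^6 bytes
        "megabit_per_sec": bytes_per_sec * 8 / 1e6,
    }


# Figures taken from the slide.
rates = traffic_chain(visits_per_hour=6380,
                      pageviews_per_visit=7.42,
                      gets_per_visit=79,
                      bytes_per_request=6800)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Working from ratios like these makes it possible to translate a business forecast (expected visits) into the load on each resource (requests per second, link capacity).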
How to diagnose a problem, where to look? Resource = capacity
WAN Link
SAN
End to end
Router Switch (CPE)
NAS
(Test) client
Firewall, Proxy
LAN switches
Load Balancer
HTTP front end
MySQL DB
Users
Application
Network
Network lines
Server
Example breakdowns
After the employee numbers have been retrieved (there are 373 Janssens), the employment details are requested one by one (612 in total). On the GBO LAN this takes 30 seconds of elapsed time (measured).
Based on a 50 ms round trip on the WAN
Resource contribution to response time, modeling different resource allocations
Modelling the effect of different network bandwidths on response time
Chart: response time in seconds (axis 0 to 500) for bars labelled GBO, ICTRO 2Mb, 256K, 64K, each split into server time, client time, network delay time, and network bandwidth time
Excessive client/server chatter leads to a user interaction time of more than 7 minutes!
How much faster will this be with:
• Very fast network?
• Very fast client?
• Very fast server?
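The question above can be explored with a simple additive model of response time: server time, plus client time, plus round trips times round-trip time, plus payload divided by bandwidth. This is a sketch under stated assumptions, not the model used in the original project. The 612 requests and the 50 ms WAN round trip come from the slides (one round trip per request is assumed, which is a lower bound); the server time, client time, and payload size are hypothetical placeholders.

```python
def response_time(server_sec: float, client_sec: float,
                  round_trips: int, rtt_sec: float,
                  payload_bytes: float, bandwidth_bps: float) -> dict:
    """Break response time into the four components shown in the chart."""
    parts = {
        "server": server_sec,
        "client": client_sec,
        "network_delay": round_trips * rtt_sec,                 # latency cost of chattiness
        "network_bandwidth": payload_bytes * 8 / bandwidth_bps,  # transfer time
    }
    parts["total"] = sum(parts.values())
    return parts


def total_without(parts: dict, component: str) -> float:
    """Best case if one component became infinitely fast."""
    return parts["total"] - parts[component]


# 612 requests and 50 ms RTT are from the slides; the rest is hypothetical.
base = response_time(server_sec=10.0, client_sec=5.0,
                     round_trips=612, rtt_sec=0.050,
                     payload_bytes=2_000_000, bandwidth_bps=2_000_000)

print(f"total: {base['total']:.1f} s")
for component in ("server", "client", "network_delay", "network_bandwidth"):
    print(f"infinitely fast {component}: {total_without(base, component):.1f} s")
```

The point of the exercise is that speeding up a component helps only in proportion to its share of the total; with a chatty protocol, more bandwidth does not remove the round-trip delay.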
Queuing theory
Chart: delay factor (axis 0 to 12) versus utilisation (10% to 90%)
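The slide does not say which queuing model lies behind the delay factor chart. A minimal sketch, assuming the elementary single-server M/M/1 queue, where response time divided by service time equals 1 / (1 - utilisation); this gives a delay factor of about 10 at 90% utilisation, which fits the axis range of the chart. The function name is illustrative.

```python
def delay_factor(utilisation: float) -> float:
    """Response time as a multiple of service time, assuming an M/M/1 queue."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in the range [0, 1)")
    return 1.0 / (1.0 - utilisation)


# Tabulate the same utilisation range as the chart.
for pct in range(10, 100, 10):
    print(f"{pct}% utilisation -> delay factor {delay_factor(pct / 100):.2f}")
```

The curve is flat at low utilisation and steep near saturation, which is why a resource that looks comfortable at 50% can become a bottleneck well before it reaches 100%.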
Response depends on capacity. At higher loads, congestion can set in
Chart: actual throughput versus traffic load; curves labelled Perfect and Congestion, with the sweet spot marked
So what was the bottleneck?
• KNMI: static page served from database 1000/sec
• Ministry: very chatty client/server interaction
• DNB: JSP application server serves static content
• Anne Frank: many, large digital assets, no use of CDN
• Hospital information system: client (front-end) code
How to incorporate performance in development and operations
Typical free and commercial tools and their functionality
Functionality
• Monitoring
• Reporting
• Alerting
• Analysis
• Modelling
• Etc …
Example tools
• Nagios
• Cacti
• WatchMouse
• PDQ
• R
• YSlow
• …
CMG 101
• We want to develop a ‘standard’ body of knowledge
– To educate our people
– Speak more of the same language
– Enable tool vendors to more easily express their offerings
• Note: defining what is in the course is not the same as developing a course
Call for Action
• Want to know more?
• Want to collaborate, contribute?
• Want to get a course?
• Want to sponsor?
• Talk to me
Peter HJ van Eijk
@petersgriddle
[email protected] +31 2268 4939
www.nlcmg.nl
NLCMG is a chapter of CMG.org
Some of my performance projects
• KNMI (Weather service): website meltdown after weather emergency (“weeralarm”)
• DNB (Dutch Banks Authority): website meltdown during 2008 financial crisis
• Unnamed Ministry: information system with multi-minute response times
• Crisis.nl: ….
• Anne Frank website: … anticipated surge after major redesign
• Hospital information system: storage sizing
http://zoom.nl/foto/1713577/portret/cloudwatch.html
Achtung alles Lookenspeepers! Nur watchen das Cloud.
What does a financial IT crisis look like?
Fernando’s office (bank’s capacity planner)