the dark art of production alerting
The Dark Art of Building a Production Incident System
@AloisReitbauer www.ruxit.com
No broken cables
No data center fires
Other things can happen as well
Continuous deployments
Infrastructure changes
other “everyday” stuff
Scaling an incident system
How it feels to do what we do
Do you alert?
Typical error rate of 3 percent at 10,000 transactions/min
During the night we now have 5 errors in 100 requests.
Do you alert?
Typical response time has been around 300 ms.
Now we see response times up to 600 ms.
We are good at fixing problems, but not really good at detecting them.
How can we get better?
It's all about statistics
Statistics is about objectively lying to yourself in a meaningful way.
How to design an incident
How to calculate this value?
It looks really simple
Which metric to pick?
How to get this baseline?
How to define that this happened?
Which metrics to pick?
Three types of metrics
Capacity Metrics: Define how much of a resource is used.
Discrete Metrics: Simple countable things, like errors or users.
Continuous Metrics: Metrics represented by a range of values at any given time.
Capacity Metrics
Good for capacity planning, not so good for production alerting
Connection Pools
Better use: Connection acquisition time
Tells you whether anyone needed a connection and did not get it.
CPU Usage
Better use: Combination of Load Average and CPU usage
Even better: correlate them with the response times of applications
Discrete Metrics
Pretty easy to track and analyze.
Continuous Metrics
Require some extra work as they are not that easy to track.
Continuous Metrics – The hope
42
Continuous Metrics – The reality
What the average tells us
What the median tells us
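A minimal sketch of why the two numbers tell different stories. The response times are made up for illustration: most requests are fast and two are very slow.

```python
from statistics import mean, median

# Hypothetical response times in ms (not taken from the talk):
# seven ordinary requests and two extreme outliers.
response_times_ms = [280, 290, 300, 300, 310, 320, 330, 2500, 4000]

avg = mean(response_times_ms)    # pulled far upwards by the two outliers
med = median(response_times_ms)  # stays with the bulk of the requests

print(f"average: {avg:.0f} ms")
print(f"median:  {med:.0f} ms")
```

The average lands at a value that no typical request actually experienced, while the median still describes what most users saw.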
How to get a baseline?
A baseline is not a number
Baselines define the range of a value combined with a probability
Normal distribution as baseline
Mean: 500 ms, Std. Dev.: 100 ms
68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms
(Chart: bell curve over response times from 100 ms to 900 ms)
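The ranges on this slide can be reproduced with Python's standard library. This is only a sketch, and the helper names `band` and `coverage` are made up for illustration.

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)

def band(dist, k):
    """Range of mean +/- k standard deviations."""
    return dist.mean - k * dist.stdev, dist.mean + k * dist.stdev

def coverage(dist, k):
    """Probability that a value falls inside the k-sigma band."""
    low, high = band(dist, k)
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high = band(baseline, k)
    print(f"{low:.0f} ms - {high:.0f} ms covers {coverage(baseline, k):.1%}")
```

The three bands cover roughly 68 %, 95 % and 99.7 % of the values, but only if the response times really follow a normal distribution.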
This can go really wrong
“Why alerts suck and monitoring solutions need to become better”
How this leads to false alerts
Many false alerts
Aggressive Baseline
No alerts at all
Moderate Baseline
Find the right distribution model
However, this can be really hard to impossible
Your distribution might look like this
… or like this
or completely different
you never know …
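A sketch of what goes wrong when a normal-distribution baseline is applied to data that is not normal. The long-tailed sample is an assumption for illustration: evenly spaced quantiles of an exponential distribution with a 300 ms scale, not real measurements.

```python
from math import log
from statistics import mean, pstdev

# Hypothetical long-tailed response times in ms (deterministic, no randomness).
n = 1000
samples_ms = [-300 * log(1 - (i + 0.5) / n) for i in range(n)]

mu = mean(samples_ms)
sigma = pstdev(samples_ms)
lower, upper = mu - 2 * sigma, mu + 2 * sigma

# Share of perfectly ordinary samples that a "mean + 2 sigma" rule would flag.
share_above = sum(1 for s in samples_ms if s > upper) / n

print(f"2-sigma band: {lower:.0f} ms to {upper:.0f} ms")
print(f"share above the upper bound: {share_above:.1%}")
```

The lower bound comes out negative, which is meaningless for a response time, and the share of values above the upper bound is noticeably larger than the roughly 2.3 % a normal model predicts. Both effects translate directly into false alerts.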
How can we solve this problem?
Normal distribution - again
50 percent of values lie below μ
97.7 percent of values lie below μ + 2σ
Median
97th Percentile
The 50th and 90th percentile define normal behavior
without needing to know anything about the distribution model
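A percentile baseline of this kind can be sketched with the standard library. The function name `percentile_baseline` and the sample values are hypothetical, and the "inclusive" interpolation method is just one reasonable choice.

```python
from statistics import quantiles

def percentile_baseline(samples_ms):
    """Return the (50th, 90th) percentile of the observed response times."""
    # With n=100, quantiles() returns the 99 cut points P1..P99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[89]

# Hypothetical response times in ms, including one slow outlier.
history_ms = [150, 200, 200, 250, 300, 300, 350, 350, 400, 400, 500, 600, 2500]

p50, p90 = percentile_baseline(history_ms)
print(f"median: {p50:.0f} ms, 90th percentile: {p90:.0f} ms")
```

No assumption about the shape of the distribution is needed: the two values are read directly from the sorted observations.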
Median shows the real problem
How to define non-normal behavior?
Fortunately, this is not the problem we need to solve
We are only talking about missed expectations
Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?
The error rate scenario
We have a typical error rate of 3 percent at 10,000 transactions/minute
During the night we now have 5 errors in 100 requests. Should we alert – or not?
What can we learn
Statistics is everywhere
Binomial Distribution
Tells us how likely it is to see n successes in a certain number of trials
How many errors are ok?
(Chart: likeliness of at least n errors, for n = 1 to 19)
There is an 18 % probability of seeing 5 or more errors, which is within 2 standard deviations. We do not alert.
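The 18 % figure can be checked with a few lines of Python. This is a sketch: the function name `prob_at_least` and the 5 % alert threshold are assumptions for illustration, not part of the talk.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k errors in n requests."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Scenario from the slides: 3 percent typical error rate, 5 errors in 100 requests.
likelihood = prob_at_least(5, 100, 0.03)

ALERT_THRESHOLD = 0.05  # assumed cut-off: alert only on unlikely observations
should_alert = likelihood < ALERT_THRESHOLD

print(f"P(at least 5 errors in 100 requests) = {likelihood:.1%}")
print("alert" if should_alert else "no alert")
```

Seeing 5 errors in 100 requests is not unusual at a 3 percent baseline, so the check stays quiet. The same function can be run for other values of n to find how many errors it takes before the likelihood drops below the threshold.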
Response Time Example
Our median response time is 300 ms
and we measure
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
Percentile Drift Detection
Did the median drift significantly?
Check all values above 300 ms
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution
Applying the Binomial Distribution
We have a 50 percent likeliness to see values above the median.
How likely is it that at least 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
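The drift check can be sketched as follows, using the ten measurements from the slides. The function name `median_drift_probability` is made up for illustration.

```python
from math import comb

def median_drift_probability(samples_ms, baseline_median_ms):
    """Count samples above the baseline median and return that count together
    with the probability of seeing at least that many by pure chance."""
    n = len(samples_ms)
    above = sum(1 for s in samples_ms if s > baseline_median_ms)
    # Each sample lies above the true median with probability 0.5,
    # so every outcome of n samples has probability 1 / 2**n.
    probability = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
    return above, probability

measured_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above, probability = median_drift_probability(measured_ms, 300)

print(f"{above} of {len(measured_ms)} samples above the median")
print(f"probability of at least that many: {probability:.1%}")
```

The 17 percent refers to "7 or more" samples above the median. Exactly 7 out of 10 would be less likely, but the tail probability is the value that matters for the alerting decision.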
How to calculate this value?
… and we are done!
Which metric to pick?
How to get this baseline?
How to define that this happened?
This was just the beginning
There are many more useful things to learn about statistics, probabilities, testing, …
Alois Reitbauer
@AloisReitbauer
http://bit.ly/nycwebperferf
Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg