the dark art of production alerting

The Dark Art of Building a Production Incident System

@AloisReitbauer, www.ruxit.com

No broken cables

No data center fires

Other things can happen as well

Continuous deployments

Infrastructure changes

Other “everyday” stuff

Scaling an incident system

How it feels to do what we do

Do you alert?

Typical error rate of 3 percent at 10,000 transactions/min

During the night we now have 5 errors in 100 requests.

Do you alert?

Typical response time has been around 300 ms.

Now we see response times up to 600 ms.

We are good at fixing problems, but not really good at detecting them.

How can we get better?

It’s all about statistics

Statistics is about objectively lying to yourself in a meaningful way.

How to design an incident

How to calculate this value?

It looks really simple

Which metric to pick?

How to get this baseline?

How to define that this happened?

Which metrics to pick?

Three types of metrics

Capacity Metrics: Define how much of a resource is used.

Discrete Metrics: Simple countable things, like errors or users.

Continuous Metrics: Metrics represented by a range of values at any given time.

Capacity Metrics: Good for capacity planning, not so good for production alerting

Connection Pools

Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.
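A minimal sketch of what measuring connection acquisition time could look like. It assumes a hypothetical pool modeled as a `queue.Queue` of connection objects; `TimedPool`, `acquire`, `release`, and `wait_times_ms` are illustrative names, not the API of any real driver or pool library.

```python
import queue
import time


class TimedPool:
    """Hypothetical pool wrapper that records how long callers wait for a connection."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self.wait_times_ms = []  # one entry per acquire attempt

    def acquire(self, timeout_s=1.0):
        """Return a free connection, or None if none became free within the timeout."""
        start = time.perf_counter()
        try:
            conn = self._free.get(timeout=timeout_s)
        except queue.Empty:
            conn = None  # somebody needed a connection and did not get it
        self.wait_times_ms.append((time.perf_counter() - start) * 1000.0)
        return conn

    def release(self, conn):
        """Hand a connection back to the pool."""
        self._free.put(conn)
```

The recorded wait times are a continuous metric: unlike the pool's fill level, they stay near zero until callers actually start to queue up.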

CPU Usage

Better use: a combination of load average and CPU usage. Even better, correlate them with the response times of applications.

Discrete Metrics: Pretty easy to track and analyze.

Continuous Metrics: Require some extra work as they are not that easy to track.

Continuous Metrics – The hope

42

Continuous Metrics – The reality

What the average tells us

What the median tells us
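To make the difference between the two concrete, here is a small sketch using Python's standard `statistics` module. The response times are made-up illustration data, not measurements from the talk.

```python
from statistics import mean, median

# Hypothetical response times in ms: mostly fast, with two slow outliers.
response_times_ms = [200, 210, 220, 230, 250, 260, 280, 300, 2500, 4000]

avg = mean(response_times_ms)    # pulled upward by the two outliers
med = median(response_times_ms)  # half of the requests are at or below this

# How many requests were actually slower than the average?
slower_than_avg = sum(1 for t in response_times_ms if t > avg)

print(f"average: {avg} ms, median: {med} ms, slower than average: {slower_than_avg}")
```

In this sample the average is 845 ms although only 2 of 10 requests were that slow, while the median of 255 ms describes what a typical request experienced.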

How to get a baseline?

A baseline is not a number. Baselines define the range of a value combined with a probability.

Normal distribution as baseline

Mean: 500 ms, Std. Dev.: 100 ms

68 %: 400 ms – 600 ms

95 %: 300 ms – 700 ms

99.7 %: 200 ms – 800 ms
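The ranges for this baseline can be reproduced with `statistics.NormalDist` from the Python standard library. This is only a sketch of the arithmetic, assuming response times really are normally distributed; `coverage` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)


def coverage(dist, k):
    """Return (low, high, probability) for the range mean +/- k standard deviations."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    low, high, prob = coverage(baseline, k)
    print(f"{low:.0f} ms to {high:.0f} ms covers {prob:.1%} of expected values")
```

One, two, and three standard deviations cover roughly 68, 95, and 99.7 percent of the values, which is the "range combined with a probability" that a baseline expresses.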

This can go really wrong

“Why alerts suck and monitoring solutions need to become better”

How this leads to false alerts

Many false alerts

Aggressive Baseline

No alerts at all

Moderate Baseline

Find the right distribution model. However, this can be really hard to impossible.

Your distribution might look like this

… or like this

or completely different, you never know …

How can we solve this problem?

Normal distribution - again

50 percent of values are slower than μ

About 97.7 percent of values are faster than μ + 2σ

Median, 97th Percentile

The 50th and 90th percentiles define normal behavior without needing to know anything about the distribution model
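A percentile baseline can be computed directly from observed samples, with no distribution fitted at all. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the sample window and the `percentile` helper are assumptions made for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p percent of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical baseline window of response times in ms.
window_ms = [150, 200, 200, 250, 300, 300, 350, 400, 500, 600]

p50 = percentile(window_ms, 50)  # typical request
p90 = percentile(window_ms, 90)  # slow but still normal request

print(f"baseline: median {p50} ms, 90th percentile {p90} ms")
```

For this window the baseline is a median of 300 ms and a 90th percentile of 500 ms. Only sorting and counting are involved, so the result holds whatever shape the distribution has.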

Median shows the real problem

How to define non-normal behavior?

Fortunately, this is not the problem we need to solve

We are only talking about missed expectations

Let’s look at two scenarios

Errors

Is a certain error rate likely to happen or not?

Response Times

Is a certain increase in response time significant enough to trigger an incident?

The error rate scenario: We have a typical error rate of 3 percent at 10,000 transactions/minute

During the night we now have 5 errors in 100 requests. Should we alert – or not?

What can we learn

Statistics is everywhere

Binomial Distribution: Tells us how likely it is to see n successes in a certain number of trials

How many errors are ok?

[Chart: Likeliness of at least n errors, for n = 1 to 19]

There is an 18 % probability of seeing 5 or more errors, which is within 2 standard deviations. We do not alert.
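The 18 percent figure follows from the binomial distribution with 100 trials and an error probability of 3 percent. The sketch below reproduces it with the standard library only. The 5 percent alerting threshold is an assumption chosen for illustration, roughly in the spirit of the two-standard-deviation rule, and `prob_at_least` is a helper name introduced here.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more errors in n requests."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


typical_error_rate = 0.03  # 3 percent, from the scenario
requests = 100
observed_errors = 5

p_value = prob_at_least(observed_errors, requests, typical_error_rate)

# Assumed rule: alert only if the observation is less than 5 percent likely
# under the typical error rate.
should_alert = p_value < 0.05

print(f"P(at least {observed_errors} errors) = {p_value:.1%}, alert: {should_alert}")
```

With these inputs the probability comes out at roughly 0.18, so 5 errors in 100 requests is an unremarkable result for a system with a 3 percent error rate.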

Response Time Example: Our median response time is 300 ms

and we measure

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

Percentile Drift Detection

Did the median drift significantly?

Check all values above 300 ms: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

7 values are higher than the median. Is this normal?

We can again use the Binomial Distribution

Applying the Binomial Distribution

We have a 50 percent likelihood of seeing a value above the median.

How likely is it that at least 7 out of 10 samples are higher?

The probability is 17 percent, so we should not alert.
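The drift check can be sketched in a few lines using the ten measurements from the example. Each sample has a 50 percent chance of lying above the true median, so the count of samples above the baseline follows a binomial distribution. The 5 percent significance threshold and the helper name `prob_at_least` are assumptions for illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


baseline_median_ms = 300
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]

# Count the samples that are slower than the baseline median.
above = sum(1 for s in samples_ms if s > baseline_median_ms)

# Probability of seeing at least this many slower samples if the median did not move.
p_value = prob_at_least(above, len(samples_ms))

# Assumed rule: treat the drift as significant only below 5 percent.
drifted = p_value < 0.05

print(f"{above} of {len(samples_ms)} above median, p = {p_value:.3f}, drifted: {drifted}")
```

Seven of the ten samples lie above 300 ms, and the probability of at least seven is 176/1024, about 17 percent, so the median has not drifted significantly under this rule.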

How to calculate this value?

… and we are done!

Which metric to pick?

How to get this baseline?

How to define that this happened?

This was just the beginning. There are many more useful things to learn about statistics, probabilities, testing, …

Alois Reitbauer, [email protected], @AloisReitbauer

http://bit.ly/nycwebperferf

Image Credits

http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
