The Dark Art of Production Alerting

The Dark Art of Building a Production Incident System @AloisReitbauer www.ruxit.com

Upload: alois-reitbauer

Post on 19-Jun-2015


Category:

Technology



TRANSCRIPT

Page 1: The Dark Art of Production Alerting

The Dark Art of Building a Production Incident System

@AloisReitbauer www.ruxit.com

Page 2: The Dark Art of Production Alerting

No broken cables

Page 3: The Dark Art of Production Alerting

No data center fires

Page 4: The Dark Art of Production Alerting

Other things can happen as well

Continuous deployments

Infrastructure changes

Other “everyday” stuff

Page 5: The Dark Art of Production Alerting

Scaling an incident system

Page 6: The Dark Art of Production Alerting

How it feels to do what we do

Page 7: The Dark Art of Production Alerting

Do you alert?

Typical error rate of 3 percent at 10,000 transactions/min

During the night we now have 5 errors in 100 requests.

Page 8: The Dark Art of Production Alerting

Do you alert?

Typical response time has been around 300 ms.

Now we see response times up to 600 ms.

Page 9: The Dark Art of Production Alerting

We are good at fixing problems, but not really good at detecting them.

Page 10: The Dark Art of Production Alerting

How can we get better?

Page 11: The Dark Art of Production Alerting

It’s all about statistics

Page 12: The Dark Art of Production Alerting

Statistics is about objectively lying to yourself in a meaningful way.

Page 13: The Dark Art of Production Alerting

How to design an incident

Page 14: The Dark Art of Production Alerting

It looks really simple

How to calculate this value?

Which metric to pick?

How to get this baseline?

How to define that this happened?

Page 15: The Dark Art of Production Alerting

Which metrics to pick?

Page 16: The Dark Art of Production Alerting

Three types of metrics

Capacity Metrics: Define how much of a resource is used.

Discrete Metrics: Simple countable things, like errors or users.

Continuous Metrics: Metrics represented by a range of values at any given time.

Page 17: The Dark Art of Production Alerting

Capacity Metrics

Good for capacity planning, not so good for production alerting

Page 18: The Dark Art of Production Alerting

Connection Pools

Page 19: The Dark Art of Production Alerting

Better use

Connection acquisition time

Tells you whether anyone needed a connection and did not get it.
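As an illustration of what measuring connection acquisition time could look like, here is a minimal sketch. `TimedPool` is a hypothetical toy pool built on the standard `queue.Queue`; it is not the API of any real connection pool library, and the timeout values are assumptions chosen for the example.

```python
import queue
import time


class TimedPool:
    """Toy connection pool (hypothetical) that records how long callers wait."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self.wait_times = []  # seconds each acquire() call spent waiting
        self.timeouts = 0     # acquire() calls that did not get a connection

    def acquire(self, timeout=1.0):
        start = time.monotonic()
        try:
            conn = self._free.get(timeout=timeout)
        except queue.Empty:
            self.timeouts += 1
            raise
        finally:
            # Recorded for successful and failed attempts alike.
            self.wait_times.append(time.monotonic() - start)
        return conn

    def release(self, conn):
        self._free.put(conn)


# Usage: one connection, two callers. The second caller waits and gives up.
pool = TimedPool(["conn-1"])
first = pool.acquire()
try:
    pool.acquire(timeout=0.05)
except queue.Empty:
    pass
pool.release(first)
```

Pool utilization alone would read 100 percent here whether or not anybody was waiting. The recorded wait times and the timeout counter show that a caller actually needed a connection and did not get one.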

Page 20: The Dark Art of Production Alerting

CPU Usage

Page 21: The Dark Art of Production Alerting

Better use

Combination of Load Average and CPU usage

Even better: correlate them with the response times of applications.

Page 22: The Dark Art of Production Alerting

Discrete Metrics

Pretty easy to track and analyze.

Page 23: The Dark Art of Production Alerting

Continuous Metrics

Require some extra work as they are not that easy to track.

Page 24: The Dark Art of Production Alerting

Continuous Metrics – The hope

42

Page 25: The Dark Art of Production Alerting

Continuous Metrics – The reality

Page 26: The Dark Art of Production Alerting

What the average tells us

Page 27: The Dark Art of Production Alerting

What the median tells us

Page 28: The Dark Art of Production Alerting

How to get a baseline?

Page 29: The Dark Art of Production Alerting

A baseline is not a number

Baselines define the range of a value combined with a probability

Page 30: The Dark Art of Production Alerting

Normal distribution as baseline

Mean: 500 ms, Std. Dev.: 100 ms

68 %: 400 ms – 600 ms

95 %: 300 ms – 700 ms

99 %: 200 ms – 800 ms

[Chart: normal curve over an axis from 100 to 900]
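The ranges on this slide can be reproduced with a short sketch, assuming response times really do follow a normal distribution with the mean and standard deviation given above. The helper name `coverage` is made up for this example.

```python
from statistics import NormalDist

# Mean and standard deviation from the slide, in milliseconds.
baseline = NormalDist(mu=500, sigma=100)


def coverage(dist, k):
    """Range of mean +/- k standard deviations and the probability of landing in it."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)


ranges = [coverage(baseline, k) for k in (1, 2, 3)]
for low, high, probability in ranges:
    print(f"{low:.0f} ms to {high:.0f} ms: {probability:.1%}")
```

The exact probabilities are about 68.3, 95.4 and 99.7 percent, so the slide's figures are rounded down. The pairing of a range with a probability is what makes this a baseline rather than a single number.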

Page 31: The Dark Art of Production Alerting

This can go really wrong

“Why alerts suck and monitoring solutions need to become better”

Page 32: The Dark Art of Production Alerting

How this leads to false alerts

Page 33: The Dark Art of Production Alerting

Many false alerts

Aggressive Baseline

Page 34: The Dark Art of Production Alerting

No alerts at all

Moderate Baseline

Page 35: The Dark Art of Production Alerting

Find the right distribution model

However, this can be really hard or even impossible.

Page 36: The Dark Art of Production Alerting

Your distribution might look like this

Page 37: The Dark Art of Production Alerting

… or like this

Page 38: The Dark Art of Production Alerting

… or completely different

You never know …

Page 39: The Dark Art of Production Alerting

How can we solve this problem?

Page 40: The Dark Art of Production Alerting

Normal distribution, again

μ is the median: 50 percent of all values lie below it.

μ + 2σ is roughly the 97th percentile: about 97.7 percent of all values lie below it.

Page 41: The Dark Art of Production Alerting

The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model.
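A minimal sketch of a percentile-based baseline, using only the Python standard library. The sample values are invented for illustration (they are not from the talk), and the choice of the "inclusive" interpolation method is an assumption.

```python
from statistics import median, quantiles


def percentile_baseline(samples_ms):
    """Return the (50th, 90th) percentile of the observed response times."""
    # n=10 yields the nine decile cut points; the last one is the 90th percentile.
    deciles = quantiles(samples_ms, n=10, method="inclusive")
    return median(samples_ms), deciles[8]


# Hypothetical, heavily skewed response times in milliseconds.
samples_ms = [180, 210, 250, 260, 300, 310, 330, 400, 450, 2500]
p50, p90 = percentile_baseline(samples_ms)
print(f"median: {p50} ms, 90th percentile: {p90} ms")
```

Nothing in this calculation assumes a particular distribution shape. The single 2500 ms outlier barely moves the median, which is why the two percentiles describe typical behavior more robustly than a mean and standard deviation would.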

Page 42: The Dark Art of Production Alerting

Median shows the real problem

Page 43: The Dark Art of Production Alerting

How to define non-normal behavior?

Page 44: The Dark Art of Production Alerting

Fortunately, this is not the problem we need to solve

We are only talking about missed expectations

Page 45: The Dark Art of Production Alerting

Let’s look at two scenarios

Errors

Is a certain error rate likely to happen or not?

Response Times

Is a certain increase in response time significant enough to trigger an incident?

Page 46: The Dark Art of Production Alerting

The error rate scenario

We have a typical error rate of 3 percent at 10,000 transactions/minute

During the night we now have 5 errors in 100 requests. Should we alert – or not?

Page 47: The Dark Art of Production Alerting

What can we learn

Page 48: The Dark Art of Production Alerting

Statistics is everywhere

Page 49: The Dark Art of Production Alerting

Binomial Distribution

Tells us how likely it is to see n successes in a certain number of trials

Page 50: The Dark Art of Production Alerting

How many errors are ok?

[Chart: likeliness of at least n errors, for n = 1 to 19]

There is an 18 % probability of seeing 5 or more errors in 100 requests. This is within 2 standard deviations of the expected value, so we do not alert.
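The 18 % figure can be checked with a few lines of Python. This is a sketch that assumes each of the 100 requests fails independently with the typical error probability of 3 percent; the function name `prob_at_least` is made up for the example.

```python
from math import comb, sqrt


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


# Numbers from the slides: 100 requests, 3 percent typical error rate, 5 errors seen.
n, p, observed = 100, 0.03, 5

tail = prob_at_least(observed, n, p)
mean = n * p
std = sqrt(n * p * (1 - p))

# The rule used on the slide: alert only beyond two standard deviations.
alert = observed > mean + 2 * std

print(f"P(at least {observed} errors) = {tail:.1%}")
print(f"expected {mean:.1f} errors, std. dev. {std:.2f}, alert: {alert}")
```

With an expected value of 3 errors and a standard deviation of about 1.7, five errors sit well inside the two-sigma band, which matches the decision not to alert.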

Page 51: The Dark Art of Production Alerting

Response Time Example

Our median response time is 300 ms and we measure:

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

Page 52: The Dark Art of Production Alerting

Percentile Drift Detection

Page 53: The Dark Art of Production Alerting

Did the median drift significantly?

Check all values above 300 ms:

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

7 values are higher than the median. Is this normal?

We can again use the Binomial Distribution

Page 54: The Dark Art of Production Alerting

Applying the Binomial Distribution

We have a 50 percent likelihood of seeing a value above the median.

How likely is it that 7 or more out of 10 samples are higher?

The probability is 17 percent, so we should not alert.
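The drift check can be sketched as follows, using the ten measurements from the example. The 5 percent alerting threshold is an assumption for illustration; the slides do not state which cut-off is used.

```python
from math import comb


def drift_probability(samples, baseline_median):
    """Count samples above the baseline median and return P(at least that many).

    If the median has not moved, each sample lies above it with probability
    0.5, so the count follows a Binomial(n, 0.5) distribution.
    """
    n = len(samples)
    above = sum(1 for value in samples if value > baseline_median)
    probability = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
    return above, probability


measured_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above, probability = drift_probability(measured_ms, baseline_median=300)

ALERT_BELOW = 0.05  # hypothetical significance threshold, not from the talk
should_alert = probability < ALERT_BELOW

print(f"{above} of {len(measured_ms)} above the median, "
      f"probability {probability:.1%}, alert: {should_alert}")
```

Seven of the ten values lie above 300 ms, and the chance of seeing seven or more by pure coincidence is 176/1024, roughly 17 percent. That is far too likely to count as evidence of a drift.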

Page 55: The Dark Art of Production Alerting

… and we are done!

How to calculate this value?

Which metric to pick?

How to get this baseline?

How to define that this happened?

Page 56: The Dark Art of Production Alerting

This was just the beginning

There are many more useful things to learn about statistics, probabilities, testing, …

Page 57: The Dark Art of Production Alerting

Alois Reitbauer

[email protected]

@AloisReitbauer

http://bit.ly/nycwebperferf

Page 58: The Dark Art of Production Alerting

Image Credits

http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg