Better service monitoring through histograms Fred Moyer - @phredmoyer Silicon Valley Perl, 09-01-2016

Upload: fred-moyer

Post on 13-Apr-2017


Page 1: Better service monitoring through histograms sv perl 09012016

Better service monitoring through histograms

Fred Moyer - @phredmoyer

Silicon Valley Perl, 09-01-2016

Page 2: Better service monitoring through histograms sv perl 09012016

Who likes to wake up for false positives?

Page 3: Better service monitoring through histograms sv perl 09012016

Synthetics

Easy to set up, but not a real user

Page 4: Better service monitoring through histograms sv perl 09012016

Stephen Falken: Uh, uh, General, what you see on these screens up here is a fantasy; a computer-enhanced hallucination. Those blips are not real missiles. They're phantoms. (War Games, 1983)

Page 5: Better service monitoring through histograms sv perl 09012016

Real Users

Page 6: Better service monitoring through histograms sv perl 09012016

Real Users

Page 7: Better service monitoring through histograms sv perl 09012016

500 ms is really 2,000 ms

Spike Erosion
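A minimal sketch of spike erosion, assuming made-up one-minute latency samples that a graphing tool rolls up into five-minute averages. The data and the `rollup_mean` helper are hypothetical; they only illustrate how a 2,000 ms spike can show up as 500 ms once it is averaged with its neighbours.

```python
def rollup_mean(series, window):
    """Average consecutive, non-overlapping chunks of `window` samples."""
    chunks = [series[i:i + window] for i in range(0, len(series), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Hypothetical one-minute latency samples in ms with a single 2,000 ms spike.
per_minute = [125] * 7 + [2000] + [125] * 7

# What the graph shows after rolling up to five-minute averages.
per_five_minutes = rollup_mean(per_minute, 5)

print(max(per_minute))        # peak in the raw samples
print(max(per_five_minutes))  # peak after averaging
```

The wider the display window, the more of the spike is eroded.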

Page 8: Better service monitoring through histograms sv perl 09012016

Threshold Based Alerting

Page 9: Better service monitoring through histograms sv perl 09012016

“Alert if a request takes longer than 200 ms”

10,10,10,10,10,10,10,10,10,5000

Alerts on one outlier in 10

Threshold Alerting
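The per-request rule above can be sketched as follows. The function name is hypothetical; the threshold and samples come from the slide.

```python
THRESHOLD_MS = 200  # "longer than 200 ms" from the slide


def samples_over_threshold(latencies_ms, threshold_ms=THRESHOLD_MS):
    """Return every sample that breaks the per-request threshold."""
    return [ms for ms in latencies_ms if ms > threshold_ms]


latencies_ms = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
offenders = samples_over_threshold(latencies_ms)

# A single slow request out of ten is enough to fire the alert.
alert = bool(offenders)
print(alert, offenders)
```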

Page 10: Better service monitoring through histograms sv perl 09012016

“Alert if request average over one minute is longer than 200 ms”

avg(10,10,210,210,210,210) = 143 (860/6)

Does not alert on multiple high samples

Threshold Alerting
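A small sketch of the averaged rule, using the six samples from the slide. The variable names are illustrative only.

```python
THRESHOLD_MS = 200  # "longer than 200 ms" from the slide

latencies_ms = [10, 10, 210, 210, 210, 210]

# Arithmetic mean over the window: 860 / 6.
average_ms = sum(latencies_ms) / len(latencies_ms)
alert = average_ms > THRESHOLD_MS

# How many individual requests were actually slow.
slow_count = sum(1 for ms in latencies_ms if ms > THRESHOLD_MS)

print(round(average_ms), alert, slow_count)
```

Four of six requests are over the threshold, yet the two fast ones pull the mean under it and no alert fires.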

Page 11: Better service monitoring through histograms sv perl 09012016

‘average’ eq ‘arithmetic mean’

A = S / N

A = average

S = the sum of the samples in the set

N = the number of samples

Math Refresher

Page 12: Better service monitoring through histograms sv perl 09012016

median = midpoint of data set

The 50th percentile is 555 - q(0.5)

Value     111  222  333  444  555  666  777  888  999
Sample #    1    2    3    4    5    6    7    8    9

Math Refresher

Page 13: Better service monitoring through histograms sv perl 09012016

90th percentile - 90% of samples below it

The 90th percentile is 1,000 - q(0.9)

Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #    1    2    3    4    5    6    7    8    9     10     11

Math Refresher

Page 14: Better service monitoring through histograms sv perl 09012016

100th Percentile - the maximum value

The 100th percentile is 1,111 - q(1)

Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #    1    2    3    4    5    6    7    8    9     10     11

Math Refresher
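The three percentile slides can be checked with a short sketch. It assumes the nearest-rank definition of a percentile (the slides do not say which definition they use; other definitions interpolate between samples and can give different answers on other data). `nearest_rank` is a hypothetical helper, not a library function.

```python
import math


def nearest_rank(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * N), 1-based."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


nine = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

median = nearest_rank(nine, 50)    # q(0.5) on the nine-sample table
p90 = nearest_rank(eleven, 90)     # q(0.9) on the eleven-sample table
p100 = nearest_rank(eleven, 100)   # q(1), the maximum

print(median, p90, p100)
```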

Page 15: Better service monitoring through histograms sv perl 09012016

Histogram

[Figure: histogram, sample value on the x-axis, number of samples on the y-axis]
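A minimal sketch of building a histogram from raw samples, assuming fixed-width bins keyed by their lower edge. The bin width and the sample data are made up for illustration; production histogram libraries typically choose bins differently.

```python
from collections import Counter


def histogram(samples, bin_width):
    """Count samples per fixed-width bin, keyed by the bin's lower edge."""
    counts = Counter((s // bin_width) * bin_width for s in samples)
    return dict(sorted(counts.items()))


# Hypothetical latency samples in ms.
latencies_ms = [12, 18, 25, 31, 33, 38, 95, 210]
bins = histogram(latencies_ms, 10)

# Print a crude text rendering: sample value on the left, count as a bar.
for lower_edge, count in bins.items():
    print(f"{lower_edge:>4} ms  {'#' * count}")
```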

Page 16: Better service monitoring through histograms sv perl 09012016

Normal Distribution

[Figure: histogram, sample value on the x-axis, number of samples on the y-axis]

Page 17: Better service monitoring through histograms sv perl 09012016

Normal Distribution

[Figure: histogram, sample value on the x-axis, number of samples on the y-axis]

68% within one sigma (σ)
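The 68% figure can be checked with the Python standard library's `statistics.NormalDist` (available from Python 3.8). This is only a quick sanity check on the standard normal distribution, not something from the talk.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass between -1 sigma and +1 sigma.
within_one_sigma = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(f"{within_one_sigma:.1%}")
```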

Page 18: Better service monitoring through histograms sv perl 09012016

Non-Normal Distribution

[Figure: histogram, sample value on the x-axis, number of samples on the y-axis]

Page 19: Better service monitoring through histograms sv perl 09012016

Non-Normal Distribution

[Figure: histogram, sample value on the x-axis, number of samples on the y-axis]

Page 20: Better service monitoring through histograms sv perl 09012016

Non-Normal Distribution

Operations data groups at different points

Page 21: Better service monitoring through histograms sv perl 09012016

Non-Normal Distribution

Users to the right of the red line are gone

Page 22: Better service monitoring through histograms sv perl 09012016

Request latency

“We keep hearing from people that the website is slow. But it is fine when we test it, and the request latency graph is constant”

You are only looking at part of the picture.

Page 23: Better service monitoring through histograms sv perl 09012016

Heat Map

Histograms over time windows
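One way to picture the data behind a heat map is one histogram per time window. The sketch below assumes events arrive as `(timestamp_seconds, latency_ms)` pairs and uses fixed-width windows and bins; the event data, window size, and bin size are all hypothetical.

```python
from collections import Counter, defaultdict


def heat_map(events, window_s, bin_ms):
    """Group events into time windows, then histogram each window's latencies."""
    grid = defaultdict(Counter)
    for timestamp_s, latency_ms in events:
        window_start = (timestamp_s // window_s) * window_s
        bin_start = (latency_ms // bin_ms) * bin_ms
        grid[window_start][bin_start] += 1
    return {window: dict(counts) for window, counts in sorted(grid.items())}


# Hypothetical (timestamp in s, latency in ms) pairs.
events = [(0, 12), (5, 18), (30, 250), (61, 15), (75, 14), (119, 900)]
grid = heat_map(events, window_s=60, bin_ms=100)

# Each row is one column of the heat map; the counts set the cell colour.
for window_start, counts in grid.items():
    print(window_start, counts)
```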

Page 24: Better service monitoring through histograms sv perl 09012016

Percentiles

Page 25: Better service monitoring through histograms sv perl 09012016

Practical Percentiles

Bandwidth usage is often billed at 95th percentile usage

Record 5 minute data usage intervals

Sort samples by value of sample

Throw out the highest 5% of samples

Charge usage based on the remaining top sample, e.g. 300 MB transferred over 5 minutes = billing at a 1 MB/s rate
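The billing steps above can be sketched as follows. The usage figures are made up, and `billable_rate_mb_per_s` is a hypothetical helper; real providers may round the number of discarded samples differently.

```python
def billable_rate_mb_per_s(interval_mb, interval_seconds=300):
    """95th percentile billing: drop the top 5% of intervals, bill the next highest."""
    ordered = sorted(interval_mb)
    discard = len(ordered) * 5 // 100          # highest 5% of samples
    kept = ordered[:len(ordered) - discard]
    return kept[-1] / interval_seconds         # MB per interval -> MB/s


# Hypothetical MB transferred in each of twenty 5-minute intervals.
usage_mb = [100] * 18 + [300, 900]

rate = billable_rate_mb_per_s(usage_mb)
print(rate)  # the 900 MB burst is thrown out; 300 MB over 300 s is billed
```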

Page 26: Better service monitoring through histograms sv perl 09012016

Practical Percentiles

If I measure 95th percentile per 5 minutes all month long,

I CANNOT calculate 95th percentile over the month.
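A small sketch of why stored percentiles cannot be combined, while stored histograms can. It assumes the nearest-rank percentile definition and two made-up windows of different sizes; the helper names are hypothetical.

```python
import math
from collections import Counter


def nearest_rank(samples, pct):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_from_counts(counts, pct):
    """Nearest-rank percentile over a {value: count} histogram."""
    rank = max(1, math.ceil(pct * sum(counts.values()) / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value


# Hypothetical windows: a busy quiet one and a small slow one.
window_a = [10] * 100
window_b = [10] * 15 + [1000] * 5

averaged_p95 = (nearest_rank(window_a, 95) + nearest_rank(window_b, 95)) / 2
true_p95 = nearest_rank(window_a + window_b, 95)

# Histograms merge by adding counts, so the percentile survives aggregation.
merged = Counter(window_a) + Counter(window_b)
merged_p95 = percentile_from_counts(merged, 95)

print(averaged_p95, true_p95, merged_p95)
```

Averaging the per-window percentiles gives 505 here, while the percentile of the combined data is 10; the merged histogram recovers the correct value.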

Page 27: Better service monitoring through histograms sv perl 09012016

Angry users

How many users are you pissing off?
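With a histogram on hand, the question becomes a count of samples past a cutoff. The sketch below assumes a `{bin lower edge in ms: count}` histogram and a 500 ms cutoff; both are invented for illustration.

```python
def slow_fraction(counts, cutoff_ms):
    """Fraction of requests whose latency bin starts at or above the cutoff."""
    total = sum(counts.values())
    slow = sum(count for lower_edge, count in counts.items() if lower_edge >= cutoff_ms)
    return slow / total


# Hypothetical histogram: bin lower edge in ms -> number of requests.
latency_histogram = {0: 900, 100: 60, 500: 30, 2000: 10}

angry = slow_fraction(latency_histogram, cutoff_ms=500)
print(f"{angry:.0%} of requests were slower than the cutoff")
```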

Page 28: Better service monitoring through histograms sv perl 09012016

Angry users

Page 29: Better service monitoring through histograms sv perl 09012016

“Alert me if request latency 90th percentile over one minute is exceeded”

Percentile based alerting

q(0.9)[10,10,10,10,10,10,10,10,5000] == 10

Alert IS NOT triggered

Do you want to be woken up for this? NO!

Page 30: Better service monitoring through histograms sv perl 09012016

“Alert me if request latency 90th percentile over one minute is exceeded”

Percentile based alerting

q(0.9)[10,10,10,10,10,10,250,300] = ~270

Alert IS triggered

Do you want to be woken up for this? YES!
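A sketch of percentile-based alerting under two assumptions that the slides leave open: the threshold is the 200 ms used in the earlier threshold slides, and the percentile uses the nearest-rank definition. The quiet window is the ten-sample set from the earlier per-request slide. Interpolating definitions give different exact values (the slide shows about 270 for the second window), so treat the numbers as illustrative.

```python
import math

THRESHOLD_MS = 200  # assumed; carried over from the earlier threshold slides


def nearest_rank(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * N), 1-based."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p90_alert(latencies_ms, threshold_ms=THRESHOLD_MS):
    """Fire only when the window's 90th percentile is over the threshold."""
    return nearest_rank(latencies_ms, 90) > threshold_ms


one_outlier = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
several_slow = [10, 10, 10, 10, 10, 10, 250, 300]

print(p90_alert(one_outlier))   # lone outlier: stay asleep
print(p90_alert(several_slow))  # a real share of requests is slow: wake up
```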

Page 31: Better service monitoring through histograms sv perl 09012016

Percentile based alerting

Page 32: Better service monitoring through histograms sv perl 09012016

Who’s using this approach?

Google.com - in house monitoring systems

Circonus.com - hosted histogram monitoring

You? (I’ve written my own histograms but use Circonus for production systems)

Page 33: Better service monitoring through histograms sv perl 09012016

Questions?

Thanks to Circonus for tools and help with math

http://www.circonus.com/free-account/

Look for future monitoring talks here soon

http://meetup.com/monitorSF