do you really know your response times? · dropwizard metrics i use dropwizard and you get i...

32
Do You Really Know Your Response Times? Daniel Rolls March 2017

Upload: others

Post on 06-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Do You Really Know Your Response Times?

Daniel Rolls

March 2017

Page 2: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Sky Over The Top Delivery

I Web Services

I Over The Top Asset Delivery

I NowTV/Sky Go

I Always up

I High traffic

I Highly concurrent

Page 3: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

OTT Endpoints

GET /stuff

{“foo”: “bar”}

Our App

I How much traffic is hitting that endpoint?

I How quickly are we responding to a typical customer?

I One customer complained we respond slowly. How slow do we get?

I What’s the difference between the fastest and slowest responses?

I I don’t care about anomalies but how slow are the slowest 1%?

Page 4: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Collecting Response Times

My App Response Time Collection System

I Large volumes of network traffic

I Risk of losing data (network may fail)

I Affects application performance

I Needs measuring itself!

Page 5: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Our Setup

Graphite Grafana

Application instance 1

Application instance 2

Page 6: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Dropwizard Metrics Library: Types of Metric

I Counter

I Gauge — ‘instantaneous measurement of a value’

I Meter (counts, rates)

I Histogram — min, max, mean, stddev, percentiles

I Timer — Meter + Histogram

Page 7: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Example Dashboard

Page 8: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Dropwizard Metrics

I Use Dropwizard and you getI Metrics infrastructure for freeI Metrics from Cassandra and Dropwizard bundles for freeI You can easily add timers to metrics just by adding annotations

I Ports exist for other languages

I Developers, architects, managers everybody loves graphs

I We trust and depend on them

I We rarely understand them

I We lie to ourselves and to our managers with them

Page 9: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Goals of this talk

I Understand how we can measure service time latencies

I Ensure meaningful statistics are given back to managers

I Learn how to use appropriate dashboards for monitoring and alerting

Page 10: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

What is the 99th Percentile Response Time?

?

Page 11: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

What is the 99th Percentile?

Page 12: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Our Setup

Graphite Grafana

Application instance 1

Application instance 2

Reservoir

Reservoir

Page 13: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Reservoirs

Reservoir(1000 elements)

Page 14: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Types of Reservoir

I Sliding window

I Time-base sliding window

I Exponentially decaying

Page 15: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Forward Decay

Time

v 3=45ms

v 2=38ms

v 1=25ms

v 4=57ms

x3x2

x1x4La

ndmark

wi = eαxi

Page 16: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

w1 w2 w3 w4 w5 w6 w7 w8

v1 v2 v3 v4 v5 v6 v7 v8

Sorted by value

Getting at the percentiles

I Normalise weights:∑

i wi = 1

I Lookup by normalised weight

Data retention

I Sorted Map indexed by w .random number

I Smaller indices removed first

Page 17: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Response Time Jumps for 4 Minutes

Page 18: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

One Percent Rise from 20ms to 500ms

Page 19: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

One Percent Rise from 20ms to 500ms

Page 20: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Trade-off

I Autonomous teamsI Know one app wellI Feel responsible for app performance

I But. . .I Can’t know everythingI Will make mistakes with numbersI We might even ignore mistakes

Page 21: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

One Long Request Blocks New Requests

Page 22: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

One Long Request Blocks New Requests

Page 23: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Spikes and Tower Blocks

Page 24: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Splitting Things Up

My App

IOS

Android

Web

Brand A

Brand B

Brand A

Brand B

Brand A

Brand B

Page 25: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Metric Imbalance Visualised

Reservoir 1100 ms10 ms

Reservoir 2100 ms

Reservoir 3100 ms

Max 100 ms100 ms

Page 26: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Metric Imbalance

I One pool gives more accurate resultsI Multiple pools allow drilling down, but. . .

I Some pools may have inaccurate performance measurementsI Only those with sufficient rates should be analysedI How can we narrow down on just those?

I Simpson’s Paradox

Page 27: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Simpson’s Paradox

Explanation

I Two variables have a positive correlation

I Grouped data shows a negative correlation

I There’s a lurking third variable

Page 28: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Simpson’s Paradox

X

Y

X

Y

I Increasing traffic =⇒ X gets slower

I Increasing traffic =⇒ Y gets faster

I We move % traffic to System Y

I We wait for prime time peak

I System gets slower???

I 100% of brand B traffic still goes to X

I Results are pooled by client and brand

I Classic example: UC Berkeley genderbias

Page 29: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Lessons Learnt

I Want fast alerting?I Use max

I If you don’t graph the max you are hiding the bad

I Don’t just look at fixed percentiles.I Understand the distribution of the data (HdrHistogram)I A few fixed percentiles tells you very little as a test

I Monitor one metric per endpointI When aggregating response times

I Use maxSeries

Page 30: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

So We’re Living a Lie, Does it Matter?

Page 31: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Conclusions and Thoughts

I Don’t immediately assume numbers on dashboards are meaningful

I Understand what you are graphing

I Test assumptionsI Provide these tools and developers will confidently use them

I Although maybe not correctly!I Most developers are not mathematicians

I Keep it simple!

I Know which numbers are real and which are lies!

Page 32: Do You Really Know Your Response Times? · Dropwizard Metrics I Use Dropwizard and you get I Metrics infrastructure for free I Metrics from Cassandra and Dropwizard bundles for free

Thank you