winning the metrics battle

Post on 15-Jan-2015

Category:

Technology

DESCRIPTION

Slides from a talk at Velocity Europe 2012 about how the Guardian does metrics and monitoring. The original proposal is at http://velocityconf.com/velocityeu2012/public/schedule/detail/26576 and there is also an article about it at http://www.guardian.co.uk/info/developer-blog/2012/oct/04/winning-the-metrics-battle

TRANSCRIPT

Winning the metrics battle (finally)

Simon Hildrew

Infrastructure Developer

The Guardian

Nick Satterly

Monitoring Engineer

The Guardian

The metrics battlefield

1,400 2,800

50,000

180,000

Total metrics

5 minutes

every 15 seconds

http://www.flickr.com/photos/ghostsigns/6676069121

http://www.flickr.com/photos/millynet/134071210

developer dashboards

[Chart: physical screens and screensaver hacks; axis 0 to 20; labels: dev, hack]

business dashboards

metrics + dashboards = culture change

http://www.flickr.com/photos/chrisjames_taylor/5454315456

Side project

Incremental upgrade

Use off the shelf tool

Pragmatic solution

Done in a year

our approach

➡ Prioritise

➡ Understand the real problem

➡ Question the tools

➡ Be ambitious

➡ Keep learning

Prioritise

drowning in work

http://www.flickr.com/photos/iampeas/246738971

a dedicated monitoring and metrics engineer

Understand the real problem

Urgent issue - current tool end of life

The story so far...

metrics were not helping us solve production outages

ballooning number of applications

but... difficult to instrument applications

T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
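The time-to-fix breakdown on this slide can be sketched as a small calculation. This is only an illustration: the `Incident` class, its field names, and the minute values are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One production incident, with each phase timed in minutes (hypothetical fields)."""
    detect_minutes: float
    diagnose_minutes: float
    resolve_minutes: float

    def time_to_fix(self) -> float:
        # T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
        return self.detect_minutes + self.diagnose_minutes + self.resolve_minutes


if __name__ == "__main__":
    # Made-up numbers, chosen only to show the sum.
    incident = Incident(detect_minutes=10, diagnose_minutes=45, resolve_minutes=5)
    print(incident.time_to_fix())
```

Splitting the total this way means each term can be measured and shortened on its own.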

inaccessible tools

http://www.flickr.com/photos/kdashy/2678539087

inconsistent data

http://www.flickr.com/photos/sybrenstuvel/2468506922

hypothesising & arguing easier than measuring

http://www.flickr.com/photos/nouqraz/200049988

The ‘right’ thing

• measure everything

• measure frequently

• measure each data point once

• input and output must be open

Question the tools

Brute force?

http://www.flickr.com/photos/epublicist/3546059144

The safe option?

http://www.flickr.com/photos/alicebartlett/2361209195

Unintuitive?

http://www.flickr.com/photos/merlijnhoek/2841785343

http://www.flickr.com/photos/evansville/8953838/

Imposing a flawed model?

Too difficult / no progress?

http://www.flickr.com/photos/ginja_andy/4165849136/

Nagios

• the “IBM” of monitoring tools

• compromise over quantity and frequency of checks

• < insert your criticism of nagios here >

Zabbix

• metric collection tightly coupled to monitoring tool

• confusing UI with poor visualisation

• needed brute force to make limited API work

The ‘right’ thing

• measure everything

• measure frequently

• measure each data point once

• input and output must be open

don’t compromise

Be ambitious

Throw work away

http://www.flickr.com/photos/mugley/2961131550

Draw your dream

Get as far as you can

http://www.flickr.com/photos/sk8geek/7358702704

graphite

Etsy dashboard

FITB ganglia

network applications hosts

db?

api?

SNMP? syslog?

alerting?

message queue

screens users

Develop missing pieces

http://www.flickr.com/photos/kalexanderson/5969012589

graphite

Etsy dashboard

FITB ganglia

network applications hosts

mongodb elastic search

ganglia alerts

ganglia-api

syslog alerts

SNMP alerts

alerta

message queue

screens users

Guardian Management https://github.com/guardian/guardian-management

Ganglia API https://github.com/guardian/ganglia-api

Alerta https://github.com/guardian/alerta

• Ganglia

• FITB

• Graphite

• Etsy dashboards

• Guardian management https://github.com/guardian/guardian-management

• Guardian ganglia-api https://github.com/guardian/ganglia-api

• Guardian alerta https://github.com/guardian/alerta

Current stack
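As an illustration of the "input and output must be open" principle, Graphite accepts data points over a plain-text line protocol (`path value timestamp`, one per line, by default on TCP port 2003). The sketch below is not taken from the talk: the metric path, the host name `graphite.example.com`, and both function names are assumptions made for the example.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="graphite.example.com", port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener.

    The host is a placeholder; 2003 is Carbon's default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric name and value; only formats and prints, no network call.
    print(format_graphite_line("apps.frontend.requests.count", 42, 1349344800), end="")
```

Because the format is one line of text per data point, any application or script that can open a socket can publish a metric.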

Keep learning

we are not there yet

Watch the cultural changes

detecting

diagnosis

diagnosis

performance testing

confirmation

#monitoringsucks

➡ Prioritise

➡ Understand the real problem

➡ Question the tools

➡ Be ambitious

➡ Keep learning

tools can change culture

Thank you

Simon Hildrew @sihil

simon.hildrew@guardian.co.uk

Nick Satterly @nicksatterly

nick.satterly@guardian.co.uk

http://github.com/guardian

http://gu.com/p/3ap5f
