OSMC 2014: From #MonitoringSucks to #MonitoringLove (and back) | Kris Buytaert

From #MonitoringSucks to

#MonitoringLove

(and back)

@KrisBuytaert OSMC 2014, Nuremberg, Germany

Kris Buytaert ● I used to be a Dev

● Then became an Op

● Chief Trolling Officer and Open Source Consultant @inuits.eu

● Everything is an effing DNS Problem

● Building Clouds since before the bookstore

● Organising Conferences

● Evangelizing devops

An opinionated talk about the Open Source Monitoring tooling landscape

In which I hope to learn from YOU

#devops=~C(L)AMS ● Culture

● (Lean)

● Automation

● Monitoring and Measurement

● Sharing

● Damon Edwards and John Willis

Gene Kim

Monitoring is usually an afterthought: ENOBUDGET, ENOTIME

A 2008 OLS Paper ● We have bloated Java tools

● Some open Core stuff

● DIY folks want traditional Nagios

● DBA Required

#monitoringsucks ● John Vincent (@lusis), June 2011

● A sub #devops movement

● https://github.com/monitoringsucks/


Why #monitoringsucks ● Manual config (gui)

● Not in sync with reality

● Hosts only

● Services sometimes

● Application never

● Chaos or out of sync with reality

● Alert Fatigue

Let's forget about ● Tools with no (stable) API

● Tools with strong focus on GUI

● Unless you are an SME with < 100 nodes

● Zenoss, Hyperic, GroundWork, ....

● P.S.: don't even mention proprietary software to me

What we want

● Small, well-suited components

• Collect

• Transport / Mangle

• Store

• Analyse

• Act / Alert

• Visualize
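The stages above can be sketched as small, separately replaceable functions. This is only an illustration of the "small components" idea, not any particular tool: all function names, the `web01` prefix and the hard-coded readings are hypothetical, and the Visualize stage is left out.

```python
from statistics import mean
from typing import Dict, List

Metric = Dict[str, float]


def collect() -> Metric:
    """Collect: return raw readings (hard-coded here; a real collector would probe the host)."""
    return {"load": 0.7, "disk_used_pct": 91.0}


def mangle(metric: Metric, prefix: str) -> Metric:
    """Transport / Mangle: namespace the keys before passing them on."""
    return {f"{prefix}.{key}": value for key, value in metric.items()}


def store(history: List[Metric], metric: Metric) -> None:
    """Store: append the sample to an in-memory history."""
    history.append(metric)


def analyse(history: List[Metric], key: str) -> float:
    """Analyse: average one metric over everything stored so far."""
    return mean(sample[key] for sample in history if key in sample)


def act(value: float, threshold: float) -> str:
    """Act / Alert: compare the analysed value against a threshold."""
    return "ALERT" if value > threshold else "OK"


history: List[Metric] = []
store(history, mangle(collect(), "web01"))
print(act(analyse(history, "web01.disk_used_pct"), 90.0))
```

Because each stage only depends on the data passed to it, any one of them could be swapped for a dedicated tool without touching the others.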

#monitoringlove

•Ulf Mansson #devopsdays Rome 2011

•A new era of tooling

•#monitoringlove hacksessions @inuits

•#monitorama

Icinga •2009 Fork

•I consider Nagios dead

•Vibrant Community (or they stalk me)

•Throw great parties in Nuremberg

•Nobody can pronounce it anyhow

•https://github.com/Inuits/puppet-icinga/


Stored Configs

#monitoringlove But the love was about :

Sensu ● Awesome for non static environments

● Scaling a clustered RabbitMQ ?

● This is Europe, U no do cloud

Automation of #monitoring brought back

the #love

●Autodetection

●Multiplexing

●Trend Forecasting

I love CheckMK

•Autodetection ?

•Service,

•Business Functionalities

•e.g. vhosts etc.

•Single Source of Truth

I hate CheckMK

Monitoring a service vs

Monitoring a Service

definition of done:

monitored and in production

A software project is not done until your last end user is dead

Culture,

Automation,

Measurement : measure all the things

Sharing

Deploy Statistics ● Time To Deploy

● Deploy Frequency

● Lifecycle frequency

● Map to other metrics
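As a rough sketch of how the first two statistics could be computed, assuming each deploy is recorded as a (started, finished) pair of timestamps. The record format and function names are assumptions made for this example; the slides do not say how the numbers are gathered.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Deploy = Tuple[datetime, datetime]  # (started, finished), a hypothetical record format


def time_to_deploy(deploys: List[Deploy]) -> timedelta:
    """Average duration of a deploy."""
    total = sum((end - start for start, end in deploys), timedelta())
    return total / len(deploys)


def deploy_frequency(deploys: List[Deploy], now: datetime, window: timedelta,
                     per: timedelta = timedelta(days=1)) -> float:
    """Deploys finished inside the trailing window, expressed per `per` (default: per day)."""
    recent = [d for d in deploys if now - window <= d[1] <= now]
    return len(recent) / (window / per)
```

Both results are plain numbers, so they can be shipped to the same metric store as everything else and mapped against other metrics.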

CollectD all the metrics, at high intervals

Oldschool graphite
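For reference, Graphite's plaintext listener accepts one `path value timestamp` line per metric, by default on TCP port 2003. A minimal sketch of writing such a line by hand follows; the host, port default and metric path are assumptions for this example, and collectd's own Graphite output does this for you.

```python
import socket
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"


def send_to_graphite(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Ship a formatted line to a carbon listener (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling `send_to_graphite(graphite_line("servers.web01.load", 0.7))` would need a reachable carbon listener, so it is left out of the sketch itself.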

Self Service Gdash based pipelines

Puppetized Templates (wip)

Grafana

Graphite++ ● Dashboards

• Grafana

● Engines :

• InfluxDB

• Cyanite

Triggers on Graphs ● Export Java Metrics

● JMXTrans

● Export JMXConfigs

● Configure NRPE Check

● Export NagiosCheck

● Collect JMX Exports on JMXTransNode

● Graph Em

● Collect Icinga Configs on Icinga
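One way to read "Triggers on Graphs" is a check that pulls its value from the graph store instead of probing the host. A hedged sketch: Graphite's render API with `format=json` returns a list of series, each with `datapoints` as `[value, timestamp]` pairs, and the function below picks the most recent non-null value from such a response. The function name and the sample metric are made up for illustration; the slides do not show the actual check used.

```python
import json
from typing import Optional


def latest_value(render_json: str) -> Optional[float]:
    """Return the newest non-null datapoint of the first series, or None if there is none."""
    series = json.loads(render_json)
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None
```

Skipping trailing nulls matters because the newest interval is often still empty when the check runs.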

Aggregation ● Alert on streams

● Alert on aggregated metrics
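A small sketch of alerting on an aggregate instead of on single samples: the alert fires only when the mean of the last N values crosses the threshold, so one spike does not page anyone. The class name, window size and threshold are assumptions for this example, not something the slides prescribe.

```python
from collections import deque
from statistics import mean


class WindowAlert:
    """Alert when the mean of the last `size` samples exceeds `threshold`."""

    def __init__(self, size: int, threshold: float) -> None:
        self.window = deque(maxlen=size)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Add a sample from the stream; return True if the aggregate is over the threshold."""
        self.window.append(value)
        full = len(self.window) == self.window.maxlen
        return full and mean(self.window) > self.threshold
```

With `WindowAlert(3, 200)`, a lone sample of 350 between values of 100 stays quiet, while a sustained run of 300s triggers.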

Riemann ● I still don't get it ?

● Distributed Top

● Do you like Clojure ?

● Riemann Health plugin ?

● s/riemann-health/collectd/g;

● Output to graphite

Graphs to Knowledge

Skyline

•Oculus

•Creating Information out of this data

•Big data

•Machine Learning

But I have log files..

Logs and Metrics ● Graylog2

● ELSA (Enterprise Log Search and Archive)

● ELK Stack

● Collect from anywhere

● Filter

● Send anywhere

● Queueing
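The collect, filter, send and queue steps can be sketched in a few lines. This is a toy stand-in for what a log shipper does, not the configuration of any of the tools listed; the `<LEVEL> <message>` line format and all function names are assumptions.

```python
import queue
import re
from typing import Callable, Iterable, List, Optional

Event = dict


def parse(line: str) -> Optional[Event]:
    """Filter: turn '<LEVEL> <message>' into a structured event; drop anything else."""
    match = re.match(r"^(?P<level>[A-Z]+) (?P<message>.+)$", line)
    return match.groupdict() if match else None


def collect(lines: Iterable[str], buffer: "queue.Queue[str]") -> None:
    """Collect: push raw lines onto a queue so producers and consumers are decoupled."""
    for line in lines:
        buffer.put(line.rstrip("\n"))


def drain(buffer: "queue.Queue[str]", outputs: List[Callable[[Event], None]]) -> int:
    """Send: parse queued lines and hand each event to every output; return the number shipped."""
    shipped = 0
    while not buffer.empty():
        event = parse(buffer.get())
        if event is None:
            continue
        for out in outputs:
            out(event)
        shipped += 1
    return shipped
```

The queue in the middle is what lets the collecting side keep up when an output is slow or temporarily down.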

Black on White ?

APM But what about my apps ?

Half the world cheers about SAAS tools :(

Packetbeat ● Traffic Flow through network

● Transactions causing errors

● SQL per HTTP

● API call usage

PacketBeat

This new “D” hype

Containers are the new black

● 1 process per container

● Metric collection ?

● Service health ?

So you want service registration of your healthy (containerized) applications ?

Enter Consul.io ● Service discovery

● Failure detection

● Uses a gossip protocol built on top of Serf

● Random node-to-node communication

● A HashiCorp project

Consul ● Uses monitoring_plugins for health

● Creates unhealthy dns setups

● Sensu-like

● Key-Value store

● Consul_template => fills your templates
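A hedged sketch of what registering a service with a plugin-based health check might look like. The field names (`Name`, `Port`, `Check`, `Args`, `Interval`) follow Consul's agent service registration API as far as I know it; older releases used a `Script` string instead of `Args`, so check the documentation for your version. The service name, plugin path and interval are hypothetical.

```python
import json
from typing import List


def service_definition(name: str, port: int, check_command: List[str],
                       interval: str = "10s") -> dict:
    """Build a service registration payload with one script-style health check."""
    return {
        "Name": name,
        "Port": port,
        "Check": {"Args": check_command, "Interval": interval},
    }


def service_dns_name(name: str, domain: str = "consul") -> str:
    """Name under which Consul's DNS interface answers for a service (default domain assumed)."""
    return f"{name}.service.{domain}"


payload = service_definition(
    "web", 80, ["/usr/lib/nagios/plugins/check_http", "-H", "localhost"]  # hypothetical path
)
print(json.dumps(payload, indent=2))
print(service_dns_name("web"))
```

The payload would typically be sent with an HTTP PUT to the local agent's `/v1/agent/service/register` endpoint; only instances whose check passes are returned when the service name is resolved.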

Everything is a freaking dns problem

Self Healing ● Pacemaker Corosync (ocf resource that monitors your service)

● Mesos

● Kubernetes

● Scale changes, Consensus Models change

So your DC fails

Whom to alert when ?

'New' kids on the block ● Flapjack

● flapjack.io

● monitoring notification routing + event processing system

● OpenDuty

● github.com/szechuen/OpenDuty

● Duty management

My Alerting Strategy

Is still in beta

And back :(

In 2014 I'm still running the same check for

- service registration (consul)

- high availability (pacemaker/corosync)

- monitoring (icinga)
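What makes a check reusable like this is probably the monitoring-plugins convention: one line of output plus an exit status of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of that convention follows; the thresholds and function names are assumptions, and a Pacemaker OCF resource agent would still need to map the result to its own return codes.

```python
from typing import Optional, Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value: Optional[float], warn: float, crit: float) -> Tuple[int, str]:
    """Map a measured value to a plugin exit status and a one-line message."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no data"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value={value}"
    if value >= warn:
        return WARNING, f"WARNING - value={value}"
    return OK, f"OK - value={value}"


def main(value: Optional[float], warn: float, crit: float) -> int:
    """Print the status line and return the exit status for the caller to pass to sys.exit()."""
    code, message = evaluate(value, warn, crit)
    print(message)
    return code
```

A script would end with `sys.exit(main(measured_value, 80, 90))`, so whichever tool runs it sees the status in the exit code.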

But I love where Monitoring is heading

We have far fewer false positives

And we have a Maintainable Monitoring Infra

Kinda

Your next trip to Gent !

CfgMgmtcamp.eu February 2 and 3, 2015

CFP is Open !

Contact

[email protected]

Further Reading

@krisbuytaert

http://www.krisbuytaert.be/blog/

http://www.inuits.eu/

Inuits Duboistraat 50 2060 Antwerpen Belgium 891.514.231 +32 475 961221

