OSMC 2014: From MonitoringSucks to MonitoringLove (and back) | Kris Buytaert

Post on 02-Jul-2015

Category:

Software

DESCRIPTION

Back in June 2011 John Vincent ranted on Twitter that #monitoringsucks, and for a lot of us he was absolutely right. At #devopsdays Rome 2012, in November, Ulf Mansson proclaimed his newfound love for monitoring and we changed the hashtag into #monitoringlove. Based on a new era of open source tools, Ulf started loving monitoring again. And for a lot of us he was absolutely right. Over the past 5 years an enormous number of new tools and new patterns have come out of the community, sometimes tagged with #devops, pretty much all of them open source. Do you still know what you should be using for what? And what the differences are? An opinionated overview of the open source monitoring landscape to clear up the confusion about what you should use, or to make the decision even more difficult for you :)

TRANSCRIPT

From #MonitoringSucks to

#MonitoringLove

(and back)

@KrisBuytaert OSMC 2014, Nuremberg, Germany

Kris Buytaert ● I used to be a Dev,

● Then became an Op

● Chief Trolling Officer and Open Source Consultant @inuits.eu

● Everything is an effing DNS Problem

● Building Clouds since before the bookstore

● Organising Conferences

● Evangelizing devops

An opinionated talk about the Open Source Monitoring tooling landscape

In which I hope to learn from YOU

#devops=~C(L)AMS ● Culture

● (Lean)

● Automation

● Monitoring and Measurement

● Sharing

● Damon Edwards and John Willis

Gene Kim

Monitoring is usually an afterthought: ENOBUDGET, ENOTIME

A 2008 OLS Paper ● We have bloated Java tools

● Some open Core stuff

● DIY folks want traditional Nagios

● DBA Required

#monitoringsucks ● John Vincent (@lusis), june 2011

● A sub #devops movement

● https://github.com/monitoringsucks/

Why #monitoringsucks ● Manual config (gui)

● Not in sync with reality

● Hosts only

● Services sometimes

● Application never

● Chaos or out of sync with reality

● Alert Fatigue

Let's forget about ● Tools with no (stable) API

● Tools with strong focus on GUI

● Unless you are an SME with < 100 nodes

● Zenoss, Hyperic, GroundWork, ....

● P.S. : don't even mention proprietary software to me

What we want

● Small, well-suited components

• Collect

• Transport / Mangle

• Store

• Analyse

• Act / Alert

• Visualize
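The stages above can be sketched as small functions that each do one job and hand their output to the next. This is only an illustration of the idea: the metric names, values, thresholds and the in-memory "store" are all made up, and the Visualize stage is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Metric:
    name: str
    value: float


def collect() -> List[Metric]:
    # Collect: stand-in collector with made-up values.
    return [Metric("web01.load", 0.7), Metric("web01.disk_used_pct", 93.0)]


def mangle(metrics: List[Metric], prefix: str) -> List[Metric]:
    # Transport / Mangle: here we only add an environment prefix to each name.
    return [Metric(f"{prefix}.{m.name}", m.value) for m in metrics]


def store(metrics: List[Metric], db: Dict[str, List[float]]) -> None:
    # Store: append each value to an in-memory series keyed by metric name.
    for m in metrics:
        db.setdefault(m.name, []).append(m.value)


def analyse(db: Dict[str, List[float]], thresholds: Dict[str, float]) -> List[str]:
    # Analyse: names whose latest stored value exceeds their threshold.
    return [n for n, limit in thresholds.items() if n in db and db[n][-1] > limit]


def act(breaches: List[str]) -> List[str]:
    # Act / Alert: turn breaches into alert messages (a real setup would notify someone).
    return [f"ALERT {name} over threshold" for name in breaches]


def run_once(db: Dict[str, List[float]], thresholds: Dict[str, float], prefix: str = "prod") -> List[str]:
    store(mangle(collect(), prefix), db)
    return act(analyse(db, thresholds))
```

Because every stage only depends on the data it receives, any one of them can be swapped for another tool without touching the rest.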

#monitoringlove

•Ulf Mansson #devopsdays Rome 2012

•A new era of tooling

•#monitoringlove hacksessions @inuits

•#monitorama

Icinga •2009 Fork

•I consider Nagios dead

•Vibrant Community (or they stalk me)

•Throw great parties in Nürnberg

•Nobody can pronounce it anyhow

•https://github.com/Inuits/puppet-icinga/

Stored Configs

#monitoringlove But the love was about :

Sensu ● Awesome for non static environments

● Scaling a clustered RabbitMQ ?

● This is Europe, U no do cloud

Automation of #monitoring brought back

the #love

●Autodetection

●Multiplexing

●Trend Forecasting

I love CheckMK

•Autodetection ?

•Service,

•Business Functionalities

•eg. vhosts etc

•Single Source of Truth

I hate CheckMK

Monitoring a service vs

Monitoring a Service

definition of done:

monitored and in production

A software project is not done until your last end user is dead

Culture,

Automation,

Measurement : measure all the things

Sharing

Deploy Statistics ● Time To Deploy

● Deploy Frequency

● Lifecycle frequency

● Map to other metrics
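As a rough sketch of the first two deploy statistics, assuming each deploy is recorded as a (started, finished) pair of timestamps. The function name and the record layout are hypothetical, not taken from any particular tool.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def deploy_stats(deploys):
    """deploys: list of (started, finished) datetime pairs, one per deploy."""
    # Time To Deploy: how long each deploy took, in seconds.
    durations = [(finished - started).total_seconds() for started, finished in deploys]
    # Deploy Frequency: number of deploys started per calendar day.
    per_day = Counter(started.date().isoformat() for started, _ in deploys)
    return {"mean_time_to_deploy_s": mean(durations), "deploys_per_day": dict(per_day)}


example = [
    (datetime(2014, 11, 25, 10, 0), datetime(2014, 11, 25, 10, 5)),
    (datetime(2014, 11, 25, 15, 0), datetime(2014, 11, 25, 15, 3)),
    (datetime(2014, 11, 26, 9, 0), datetime(2014, 11, 26, 9, 4)),
]
stats = deploy_stats(example)
```

Once these numbers are emitted as ordinary metrics they can be graphed next to error rates or load to see what a deploy actually did.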

CollectD all the metrics, at high intervals

Oldschool graphite
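Part of why Graphite is easy to feed is its plaintext protocol: one line per datapoint, "path value timestamp", sent to the Carbon listener (port 2003 by default). A minimal sketch, where the metric path and the host name are placeholders:

```python
import socket
import time


def graphite_line(path, value, timestamp):
    # One datapoint in Carbon's plaintext format: "<path> <value> <unix timestamp>\n".
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    # host is a placeholder; 2003 is Carbon's default plaintext port.
    line = graphite_line(path, value, time.time() if timestamp is None else timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, send_metric("servers.web01.load", 0.42) pushes a single load value, assuming a Carbon daemon is listening on that host.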

Self Service Gdash based pipelines

Puppetized Templates (wip)

Gdash

Grafana

Graphite++ ● Dashboards

• Grafana

● Engines :

• InfluxDB

• Cyanite

Triggers on Graphs ● Export Java Metrics

● JMXTrans

● Export JMXConfigs

● Configure NRPE Check

● Export NagiosCheck

● Collect JMX Exports on JMXTransNode

● Graph Em

● Collect Icinga Configs on Icinga
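The "trigger on a graph" step can be sketched as a check that turns graphed datapoints into a Nagios/Icinga plugin status. It assumes the datapoints arrive as [value, timestamp] pairs, the shape Graphite's JSON render output uses, with None for gaps. The thresholds and the choice to average the window are illustrative assumptions.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_datapoints(datapoints, warn, crit):
    """Return (exit_code, message) for a list of [value, timestamp] pairs."""
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        # No data is not the same as healthy, so report UNKNOWN.
        return UNKNOWN, "UNKNOWN - no datapoints"
    average = sum(values) / len(values)
    if average >= crit:
        return CRITICAL, f"CRITICAL - average {average:.2f}"
    if average >= warn:
        return WARNING, f"WARNING - average {average:.2f}"
    return OK, f"OK - average {average:.2f}"
```

A wrapper script would print the message and exit with the code, which is all NRPE or Icinga need to treat it as a regular check.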

Aggregation ● Alert on streams

● Alert on aggregated metrics
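Alerting on an aggregate rather than on every host could look roughly like this: page only when more than a given fraction of a pool is failing. The host names and the 50% default are made up for the example.

```python
def aggregate_alert(statuses, max_failed_fraction=0.5):
    """statuses maps host name -> True (healthy) or False (failing).

    Returns (should_alert, failed_hosts).
    """
    if not statuses:
        # An empty pool usually means discovery is broken, so alert.
        return True, []
    failed = sorted(host for host, healthy in statuses.items() if not healthy)
    fraction = len(failed) / len(statuses)
    return fraction > max_failed_fraction, failed
```

One dead web server out of four then stays a ticket for the morning, while three out of four wakes somebody up.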

Riemann ● I still don't get it ?

● Distributed Top

● Do you like Clojure ?

● Riemann Health plugin ?

● s/riemann-health/collectd/g;

● Output to graphite

Graphs to Knowledge

Skyline

•Oculus

•Creating Information out of this data

•Big data

•Machine Learning
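To give an idea of what "creating information out of this data" can mean, here is a generic three-sigma check on a metric series. It is a textbook technique shown for illustration only, not the actual code of Skyline or Oculus.

```python
from statistics import mean, stdev


def is_anomalous(series, sigmas=3.0):
    """Flag the latest point if it is more than `sigmas` standard deviations
    away from the mean of the points before it."""
    *history, latest = series
    if len(history) < 2:
        # Not enough history to estimate a spread.
        return False
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) > sigmas * sd
```

Real tools tend to combine several such tests and only flag a series when most of them agree, which cuts down on noise.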

But I have log files..

Logs and Metrics ● Graylog2

● ELSA (Enterprise Log Search and Archive)

● ELK Stack

● Collect from anywhere

● Filter

● Send anywhere

● Queueing
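The collect / filter / send idea behind these log tools can be sketched in a few lines. The log line format ("timestamp LEVEL message") and the decision to drop DEBUG lines are assumptions made for the example, and "sending" is reduced to producing JSON strings.

```python
import json
import re

# Hypothetical line format: "<timestamp> <LEVEL> <free text message>".
LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def parse(line):
    # Collect: turn one raw line into a structured event, or None if it does not match.
    match = LINE.match(line.strip())
    return match.groupdict() if match else None


def ship(lines, drop_levels=("DEBUG",)):
    # Filter unwanted events, then serialise the rest as JSON ready to send anywhere.
    shipped = []
    for line in lines:
        event = parse(line)
        if event is None or event["level"] in drop_levels:
            continue
        shipped.append(json.dumps(event, sort_keys=True))
    return shipped
```

Once events are structured like this, counting them per level or per service gives you metrics out of logs for free.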

Black on White ?

APM But what about my apps ?

Half the world cheers about SAAS tools :(

Packetbeat ● Traffic Flow through network

● Transactions causing errors

● SQL per HTTP

● API call usage

PacketBeat

This new “D” hype

Containers are the new black

● 1 process per container

● Metric collection ?

● Service health ?

So you want service registration of your healthy (containerized) applications ?

Enter Consul.io ● Service discovery

● Failure detection

● Using Gossip built on top of Serf

● Random node 2 node communication

● A HashiCorp project

Consul ● Uses monitoring_plugins for health

● Creates unhealthy dns setups

● Sensu alike

● Key-Value store

● Consul_template => fills your templates
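A sketch of registering a service together with a plugin-style health check through the Consul agent's HTTP API (PUT /v1/agent/service/register on port 8500 by default). The service name, the plugin path and the interval are placeholders, the "Args" check form assumes a reasonably recent Consul with script checks enabled, and the request is only built here, not sent.

```python
import json
import urllib.request


def registration_request(name, port, check_args, interval="10s",
                         agent="http://127.0.0.1:8500"):
    """Build (but do not send) a service registration for the local Consul agent."""
    payload = {
        "Name": name,
        "Port": port,
        # Script-style check: exit code 0 is passing, 1 is warning, anything else
        # is failing, the same convention the monitoring plugins use.
        "Check": {"Args": check_args, "Interval": interval},
    }
    return urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical plugin path; adjust to wherever your distribution installs the plugins.
request = registration_request(
    "web", 80, ["/usr/lib64/nagios/plugins/check_http", "-H", "localhost"]
)
```

Passing the request to urllib.request.urlopen would perform the registration against a running agent.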

Everything is a freaking dns problem

Self Healing ● Pacemaker Corosync (ocf resource that monitors your service)

● Mesos

● Kubernetes

● Scale changes, Consensus Models change

So your DC fails

Whom to alert when ?

'New' kids on the block ● Flapjack

● flapjack.io

● monitoring notification routing + event processing system

● OpenDuty

● github.com/szechuen/OpenDuty

● Duty management

My Alerting Strategy

Is still in beta

And back :(

In 2014 I'm still running the same check for

- service registration (consul)

- high availability (pacemaker/corosync)

- monitoring (icinga)

But I love where Monitoring is heading

We have far fewer false positives

And we have a Maintainable Monitoring Infra

Kinda

Your next trip to Gent !

CfgMgmtcamp.eu February 2 and 3, 2015

CFP is Open !

Contact ● Kris.Buytaert@inuits.eu

Further Reading ● @krisbuytaert

● http://www.krisbuytaert.be/blog/

● http://www.inuits.eu/

Inuits Duboistraat 50 2060 Antwerpen Belgium 891.514.231 +32 475 961221
