OSMC 2014: From MonitoringSucks to MonitoringLove (and Back) | Kris Buytaert
DESCRIPTION
Back in June 2011 John Vincent ranted on Twitter that #monitoringsucks, and for a lot of us he was absolutely right. At #devopsdays Rome 2012, in November, Ulf Mansson proclaimed his newfound love for monitoring and we changed the hashtag into #monitoringlove. Based on a new era of open source tools, Ulf started loving monitoring again. And for a lot of us he was absolutely right. Over the past 5 years an enormous number of new tools and new patterns have come out of the community, sometimes tagged with #devops, pretty much all of them open source. Do you still know what you should be using for what? And what the differences are? An opinionated overview of the open source monitoring landscape to clear up the confusion about what you should use, or to make the decision even more difficult for you :)
TRANSCRIPT
From #MonitoringSucks to
#MonitoringLove
(and back)
@KrisBuytaert OSMC 2014, Nuremberg, Germany
Kris Buytaert ● I used to be a Dev,
● Then became an Op
● Chief Trolling Officer and Open Source Consultant @inuits.eu
● Everything is an effing DNS Problem
● Building Clouds since before the bookstore
● Organising Conferences
● Evangelizing devops
An opinionated talk about the Open Source Monitoring tooling landscape
In which I hope to learn from YOU
#devops=~C(L)AMS ● Culture
● (Lean)
● Automation
● Monitoring and Measurement
● Sharing
● Damon Edwards and John Willis
Gene Kim
Monitoring is usually an afterthought: ENOBUDGET, ENOTIME
A 2008 OLS Paper ● We have bloated Java tools
● Some open Core stuff
● DIY folks want traditional Nagios
● DBA Required
#monitoringsucks ● John Vincent (@lusis), June 2011
● A sub #devops movement
● https://github.com/monitoringsucks/
Why #monitoringsucks ● Manual config (gui)
● Not in sync with reality
● Hosts only
● Services sometimes
● Application never
● Chaos or out of sync with reality
● Alert Fatigue
Let's forget about ● Tools with no (stable) API
● Tools with strong focus on GUI
● Unless you are an SME with < 100 nodes
● Zenoss, Hyperic, GroundWork, ....
● P.S. : don't even mention proprietary software to me
What we want
● Small, well-suited components
• Collect
• Transport / Mangle
• Store
• Analyse
• Act / Alert
• Visualize
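The chain of small components above can be sketched as a handful of functions, each doing one job and passing its output on to the next. This is only an illustration: the function names, the hard-coded samples and the threshold are all made up, and in real life every stage would be a separate tool.

```python
from statistics import mean


def collect():
    """Collect: return raw samples (hard-coded here, hypothetical hosts)."""
    return [
        {"host": "web1", "metric": "load", "value": 0.5},
        {"host": "web2", "metric": "load", "value": 3.5},
    ]


def mangle(samples):
    """Transport / Mangle: normalise samples into (key, value) pairs."""
    return [(f"{s['host']}.{s['metric']}", float(s["value"])) for s in samples]


def store(db, points):
    """Store: append each value to an in-memory series keyed by metric name."""
    for key, value in points:
        db.setdefault(key, []).append(value)
    return db


def analyse(db, threshold):
    """Analyse: return the keys whose mean value exceeds the threshold."""
    return sorted(key for key, values in db.items() if mean(values) > threshold)


def act(offenders):
    """Act / Alert: turn offending keys into alert messages."""
    return [f"ALERT {key} above threshold" for key in offenders]


def visualize(db):
    """Visualize: one plain-text line per series with its latest value."""
    return [f"{key}: {values[-1]}" for key, values in sorted(db.items())]


db = store({}, mangle(collect()))
alerts = act(analyse(db, threshold=2.0))
```

Because each stage only agrees on the data it hands over, any one of them can be swapped out without touching the others.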
#monitoringlove
•Ulf Mansson #devopsdays Rome 2012
•A new era of tooling
•#monitoringlove hacksessions @inuits
•#monitorama
Icinga •2009 Fork
•I consider Nagios dead
•Vibrant Community (or they stalk me)
•Throw great parties in Nürnberg
•Nobody can pronounce it anyhow
•https://github.com/Inuits/puppet-icinga/
Stored Configs
#monitoringlove But the love was about :
Sensu ● Awesome for non static environments
● Scaling a clustered RabbitMQ ?
● This is Europe, U no do cloud
Automation of #monitoring brought back
the #love
●Autodetection
●Multiplexing
●Trend Forecasting
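As an illustration of trend forecasting, here is a minimal sketch that fits a straight line through recent samples and estimates when a limit (say, a full disk) will be reached. It is not taken from any of the tools in this talk; the function name and the sample data are hypothetical, and a plain least-squares line is an assumption that only holds for roughly linear growth.

```python
def forecast_crossing(samples, limit):
    """Fit a least-squares line through (time, value) samples.

    Returns the time at which the fitted line reaches `limit`,
    or None when the trend is flat or falling.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # usage is not growing
    intercept = mean_v - slope * mean_t
    return (limit - intercept) / slope


# Disk usage in percent, sampled once a day (made-up numbers).
usage = [(0, 50.0), (1, 60.0), (2, 70.0)]
day_full = forecast_crossing(usage, limit=100.0)
```

Alerting on "full in N days" rather than on "above 90%" is the kind of check this enables.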
I love CheckMK
•Autodetection ?
•Service,
•Business Functionalities
•eg. vhosts etc
•Single Source of Truth
I hate CheckMK
Monitoring a service vs
Monitoring a Service
definition of done:
monitored and in production
A software project is not done until your last end user is dead
Culture,
Automation,
Measurement : measure all the things
Sharing
Deploy Statistics ● Time To Deploy
● Deploy Frequency
● Lifecycle frequency
● Map to other metrics
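A minimal sketch of how the first two statistics could be derived from a list of deploy start and end times. The function name, the output keys and the timestamps are hypothetical; where the timestamps come from (CI server, deploy tool, chat bot) is left open.

```python
from datetime import datetime


def deploy_stats(deploys):
    """Compute deploy statistics.

    `deploys` is a list of (started, finished) datetime pairs, oldest first.
    """
    durations = [(end - start).total_seconds() for start, end in deploys]
    span_days = (deploys[-1][1] - deploys[0][0]).total_seconds() / 86400
    return {
        # Time To Deploy: average wall-clock duration of one deploy.
        "mean_time_to_deploy_s": sum(durations) / len(durations),
        # Deploy Frequency: deploys per day over the observed span.
        "deploys_per_day": len(deploys) / span_days if span_days > 0 else float(len(deploys)),
    }


history = [
    (datetime(2014, 11, 1, 0, 0), datetime(2014, 11, 1, 0, 10)),
    (datetime(2014, 11, 2, 0, 0), datetime(2014, 11, 2, 0, 20)),
    (datetime(2014, 11, 2, 23, 30), datetime(2014, 11, 3, 0, 0)),
]
stats = deploy_stats(history)
```

Once these numbers are emitted as metrics themselves, they can be graphed next to error rates or response times, which is what "map to other metrics" asks for.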
CollectD all the metrics, at high intervals
Oldschool graphite
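For reference, Graphite's carbon daemon accepts metrics over a plaintext protocol: one line per datapoint in the form "path value timestamp", by default on TCP port 2003. The sketch below formats such lines and sends them; the metric path, host name and helper names are hypothetical, and in practice collectd's Graphite output does this for you.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one datapoint in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send preformatted lines to a carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


line = graphite_line("servers.web1.load.shortterm", 0.42, 1416873600)
# send_to_graphite([line], host="graphite.example.com")  # needs a reachable carbon
```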
Self Service Gdash based pipelines
Puppetized Templates (wip)
Gdash
Grafana
Graphite++ ● Dashboards
• Grafana
● Engines :
• InfluxDB
• Cyanite
Triggers on Graphs ● Export Java Metrics
● JMXTrans
● Export JMXConfigs
● Configure NRPE Check
● Export NagiosCheck
● Collect JMX Exports on JMXTransNode
● Graph Em
● Collect Icinga Configs on Icinga
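The last step of that chain, a check that alerts on what is in the graph, might look like the sketch below. It assumes the data arrives as JSON in the shape Graphite's render API returns with format=json (a list of series, each with "datapoints" as [value, timestamp] pairs, value possibly null) and maps the average onto the usual Nagios plugin exit codes. The function name, metric name and thresholds are made up; fetching the JSON over HTTP is left out.

```python
import json

# Nagios / Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(render_json, warn, crit):
    """Return (exit_code, message) for the average of all non-null datapoints."""
    series = json.loads(render_json)
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN - no datapoints"
    avg = sum(values) / len(values)
    if avg >= crit:
        return CRITICAL, f"CRITICAL - average {avg:.2f}"
    if avg >= warn:
        return WARNING, f"WARNING - average {avg:.2f}"
    return OK, f"OK - average {avg:.2f}"


# Hand-written sample payload, hypothetical metric name.
sample = '[{"target": "jvm.heap.used_pct", "datapoints": [[80.0, 1], [null, 2], [90.0, 3]]}]'
code, message = check_series(sample, warn=70, crit=85)
```

A real plugin would print the message and exit with the code, so Icinga can pick it up through NRPE like any other check.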
Aggregation ● Alert on streams
● Alert on aggregated metrics
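To make "alert on aggregated metrics" concrete, here is a minimal sketch that only fires when the mean over a sliding window of the stream crosses a threshold, so a single spike does not page anybody. The class name, window size and threshold are hypothetical; stream processors such as Riemann offer this kind of windowing out of the box.

```python
from collections import deque


class WindowedAlert:
    """Alert when the mean of the last `size` values exceeds `threshold`."""

    def __init__(self, size, threshold):
        self.window = deque(maxlen=size)
        self.threshold = threshold

    def observe(self, value):
        """Add one value from the stream; return True when an alert should fire."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        return sum(self.window) / len(self.window) > self.threshold


# Response times in ms: one spike (300), then a sustained rise (250).
alert = WindowedAlert(size=3, threshold=200)
fired = [alert.observe(v) for v in [50, 300, 50, 250, 250, 250]]
```

With these made-up numbers the lone spike never fires; only the third sustained high value does.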
Riemann ● I still don't get it ?
● Distributed Top
● Do you like Clojure ?
● Riemann Health plugin ?
● s/riemann-health/collectd/g;
● Output to graphite
Graphs to Knowledge
Skyline
•Oculus
•Creating Information out of this data
•Big data
•Machine Learning
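The simplest version of "creating information out of this data" is flagging a datapoint that sits far from the recent average. The sketch below is a plain three-sigma test, not Skyline's actual implementation; the function name, the sigma count and the numbers are assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, sigmas=3.0):
    """Flag `latest` when it is more than `sigmas` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return latest != mu  # perfectly flat series: any change is news
    return abs(latest - mu) > sigmas * sd


# Requests per second over the last few intervals (made-up numbers).
recent = [10, 11, 9, 10, 10, 11, 9, 10]
normal = is_anomalous(recent, 10.5)
spike = is_anomalous(recent, 20)
```

Real anomaly detectors combine several such tests and let them vote, because any single one is easy to fool with seasonal data.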
But I have log files..
Logs and Metrics ● Graylog2
● ELSA (Enterprise Log Search and Archive)
● ELK Stack
● Collect from anywhere
● Filter
● Send anywhere
● Queueing
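The collect, filter and send steps boil down to turning a raw line into a structured event that any backend can index. A minimal sketch of the filter step, assuming a common-log-format access log; the regular expression, field names and sample line are hypothetical and far cruder than what Logstash or Graylog2 ship with.

```python
import json
import re

# Rough pattern for a common-log-format access line (an assumption for this sketch).
LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def filter_line(line):
    """Filter: parse one raw line into a dict, or None when it does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    return event


def serialise(event):
    """Send anywhere: JSON is what most shippers and queues expect."""
    return json.dumps(event, sort_keys=True)


raw = '192.0.2.1 - - [25/Nov/2014:10:00:00 +0100] "GET /index.html HTTP/1.1" 500 1234'
event = filter_line(raw)
```

Once status is a number instead of text in a line, counting 5xx responses per minute becomes a metric, which is where logs and metrics meet.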
Black on White ?
APM But what about my apps ?
Half the world cheers about SAAS tools :(
Packetbeat ● Traffic Flow through network
● Transactions causing errors
● SQL per HTTP
● API call usage
PacketBeat
This new “D” hype
Containers are the new black
● 1 process per container
● Metric collection ?
● Service health ?
So you want service registration of your healthy (containerized) applications ?
Enter Consul.io ● Service discovery
● Failure detection
● Using Gossip built on top of Serf
● Random node-to-node communication
● A HashiCorp project
Consul ● Uses monitoring_plugins for health
● Creates unhealthy dns setups
● Sensu-like
● Key-Value store
● Consul_template => fills your templates
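What "service registration of healthy applications" gives you in practice is a list of endpoints whose checks all pass. The sketch below filters a payload shaped like the response of Consul's /v1/health/service/<name> HTTP endpoint (a list of entries with Node, Service and Checks). The sample payload is hand-written and the function name is hypothetical; fetching it from a running agent is left out.

```python
import json


def healthy_endpoints(health_json):
    """Return (address, port) for every instance whose checks are all passing."""
    endpoints = []
    for entry in json.loads(health_json):
        if all(check["Status"] == "passing" for check in entry["Checks"]):
            # An empty service address means: use the node's address.
            address = entry["Service"].get("Address") or entry["Node"]["Address"]
            endpoints.append((address, entry["Service"]["Port"]))
    return endpoints


# Hand-written sample: two instances of a hypothetical "web" service.
sample_health = json.dumps([
    {"Node": {"Node": "n1", "Address": "10.0.0.1"},
     "Service": {"Service": "web", "Address": "", "Port": 8080},
     "Checks": [{"Status": "passing"}]},
    {"Node": {"Node": "n2", "Address": "10.0.0.2"},
     "Service": {"Service": "web", "Address": "10.0.0.22", "Port": 8080},
     "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
])
backends = healthy_endpoints(sample_health)
```

This is roughly the list a template renderer would write into a load balancer config, and the same filtering is why an unhealthy instance drops out of DNS answers.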
Everything is a freaking dns problem
Self Healing ● Pacemaker Corosync (ocf resource that monitors your service)
● Mesos
● Kubernetes
● Scale changes, Consensus Models change
So your DC fails
Whom to alert when ?
'New' kids on the block ● Flapjack
● flapjack.io
● monitoring notification routing + event processing system
● OpenDuty
● github.com/szechuen/OpenDuty
● Duty management
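Notification routing answers "whom to alert when" with rules instead of hard-coded contacts in every check. A minimal sketch of the idea; it is not Flapjack's or OpenDuty's API, and the rule format, entity names and contact names are all made up.

```python
from fnmatch import fnmatch

# Hypothetical routing rules, evaluated in order; every match gets notified.
RULES = [
    {"entity": "db*", "states": {"critical"}, "contact": "dba-oncall"},
    {"entity": "*", "states": {"critical"}, "contact": "ops-oncall"},
    {"entity": "*", "states": {"warning"}, "contact": "ops-email"},
]


def route(event, rules=RULES):
    """Return the contacts to notify for one check event."""
    return [
        rule["contact"]
        for rule in rules
        if fnmatch(event["entity"], rule["entity"]) and event["state"] in rule["states"]
    ]


db_down = route({"entity": "db01", "state": "critical"})
web_slow = route({"entity": "web01", "state": "warning"})
```

Keeping the rules outside the monitoring config means the on-call rota can change without touching a single check.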
My Alerting Strategy
Is still in beta
And back :(
In 2014 I'm still running the same check for
- service registration (consul)
- high availability (pacemaker/corosync)
- monitoring (icinga)
But I love where Monitoring is heading
We have far fewer false positives
And we have a Maintainable Monitoring Infra
Kinda
Your next trip to Gent !
CfgMgmtcamp.eu February 2 and 3, 2015
CFP is Open !
Contact [email protected]
Further Reading
@krisbuytaert
http://www.krisbuytaert.be/blog/
http://www.inuits.eu/
Inuits
Duboistraat 50
2060 Antwerpen
Belgium
891.514.231
+32 475 961221