Scaling Nagios 4
Daniel Wittenberg
About Me
Unix/Linux admin since mid 90's
Nagios/Netsaint user since early 2000's
Owned/operated consulting business for almost 10 years that provided distributed monitoring using Nagios
Previously employed by Fortune 50 Insurance company
Currently Monitoring Platform Manager at IPsoft Inc.
About IPsoft
Provider of Remote Infrastructure Management and automation services
ITIL and 6 Sigma compliance management framework
Automation that resolves 56% of all incidents, and 90% of Level 1 (L1) incidents
Monitoring, Automation, Event Correlation, Management....
Offices around the world in ten countries
http://www.ipsoft.com
Last year...
What is BIG ?
My Configuration
~700 Nagios Servers
~130,000 Monitored Devices
~3,000,000 Service Checks
Mix of customized Nagios 3.2.3 and 4.0.0
Scientific Linux 6.2/6.4
Managed by Puppet 3.x
2/3 on VMware ESX rest are bare metal
Adding new Nagios servers almost daily
What's different with Nagios 4
SPEED!
Current testing shows on average 500% faster than 3.2.3
What's different with Nagios 4
Some things that would impact performance/stability
http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
Embedded Perl - Gone
external_command_buffer_slots - Gone
-x option to not verify circular paths no longer needed in rc scripts
Configuration Verification algorithm changes, massive startup speed increase
Event Queue algorithm changes, helps with CPU utilization (see Andreas' 2012 presentation)
Disk I/O reduced to virtually 0
NEW query handler interface, better communication with core
NEW core workers reduces I/O, memory, CPU
Completely re-written spec file for better installs, debug modes
Perf Testing Lab Setup
Servers are all ESX 5 based VMs on the same cluster
Variable CPU cores, 4GB memory
Metrics used to consider a test failure:
CPU Block Queue > 3
CPU I/O Wait > 3
CPU Idle < 10%
Service Check Latency > 1s
Host Check Latency > 1s
30 minute run time, > 3% failure rate failed the test
Fully automated increasing work load, consistent results
Add 1 host + 1 service check at a time, try to get best case numbers without check latency
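The pass/fail rule on this slide can be sketched in a few lines. This is only an illustration, not the lab's actual harness: the `Sample` class and function names are hypothetical, the slide does not give units for block queue or I/O wait (percent is assumed for I/O wait), and it is assumed that the 3% failure rate is counted over periodic samples taken during the 30 minute run.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement taken during a test run (hypothetical structure)."""
    block_queue: float      # blocked processes, e.g. the vmstat "b" column (assumed source)
    io_wait: float          # CPU I/O wait, assumed to be a percentage
    cpu_idle: float         # CPU idle percentage
    service_latency: float  # service check latency in seconds
    host_latency: float     # host check latency in seconds


def sample_failed(s: Sample) -> bool:
    """Apply the per-sample thresholds listed on the slide."""
    return (
        s.block_queue > 3
        or s.io_wait > 3
        or s.cpu_idle < 10
        or s.service_latency > 1.0
        or s.host_latency > 1.0
    )


def run_failed(samples, max_failure_rate=0.03) -> bool:
    """A run fails when more than 3% of its samples breach a threshold."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(1 for s in samples if sample_failed(s))
    return failures / len(samples) > max_failure_rate
```

An automated driver could add one host and one service check, collect samples for the run, and stop increasing the load the first time `run_failed` returns True.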
Test Lab Architecture
Test Results
CPU Cores | Service Checks, Version 3.2.3 | Service Checks, Version 4.0.0rc1 | Difference
1 | 1,700 | 10,500 | 617%
2 | 3,300 | 20,800 | 630%
4 | 6,500 | 35,300 | 543%
8 | 11,700 | 45,100 | 385%
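As a quick check on how the Difference column relates to the two throughput columns, the arithmetic can be reproduced as below. The figures are copied from the results table; the function names are made up for illustration. The published percentages match the 4.0.0rc1 figure expressed as a truncated percentage of the 3.2.3 figure, so read strictly as "percent faster" each value would be 100 points lower.

```python
# cores -> (service checks on 3.2.3, service checks on 4.0.0rc1), from the table
RESULTS = {
    1: (1700, 10500),
    2: (3300, 20800),
    4: (6500, 35300),
    8: (11700, 45100),
}


def ratio_percent(old: int, new: int) -> int:
    """New throughput as a whole percentage of the old one (truncated)."""
    return new * 100 // old


def percent_faster(old: int, new: int) -> int:
    """Throughput gain over the old figure as a whole percentage (truncated)."""
    return (new - old) * 100 // old


for cores, (old, new) in sorted(RESULTS.items()):
    print(cores, ratio_percent(old, new), percent_faster(old, new))
```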
Other software used
Customized livestatus based on Andreas' updates for Nagios 4
https://github.com/ageric/livestatus
Developing custom single pane interface to replace CGI/Check_mk Multisite
Developing full REST API to talk to QH, livestatus and config files
nagios-qh.rb Query Handler interface to gather loadctl metrics
https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
Custom load control daemon that talks to QH
Custom Event Broker to send perf data directly to ActiveMQ for post-processing
Custom agent, like NRPE on steroids without limitations like buffer size
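For readers who have not used the Query Handler (QH), here is a rough sketch of a client that asks the core for its loadctl metrics over the QH Unix socket. It is not the nagios-qh.rb script linked above. The framing used here (a `#` prefix for a one-shot query, a NUL byte as terminator, and a reply made of `key=value` pairs separated by semicolons) is an assumption based on my reading of the Nagios 4 QH documentation, and the socket path varies by install, so verify both against your own system.

```python
import socket


def build_query(handler: str, command: str, keep_open: bool = False) -> bytes:
    """Frame a QH request. Assumed framing: '#' = close after the reply,
    '@' = keep the connection open; every request ends with a NUL byte."""
    prefix = "@" if keep_open else "#"
    return f"{prefix}{handler} {command}".encode("ascii") + b"\0"


def parse_kv(reply: str) -> dict:
    """Parse a 'key=value;key=value;' reply into a dict of strings."""
    result = {}
    for item in reply.strip("\0\n ").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def query_handler(socket_path: str, handler: str, command: str,
                  timeout: float = 5.0) -> str:
    """Send a one-shot query and read until the core closes the socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(build_query(handler, command))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("ascii", errors="replace")
```

A call such as `parse_kv(query_handler("/usr/local/nagios/var/rw/nagios.qh", "core", "loadctl"))` would then return the metrics as a dictionary; that path is only a common source-install default and is an assumption here.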
Other performance tweaks
Sysctl Changes
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_reuse
No longer need RAMDISK, but still in the default sysconfig/RC script for now
Keep logging levels as low as possible
Disable CGIs whenever possible
Disable Environment Macros
Don't use resource macros when you don't need to, they are not cached
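The slide names the sysctl keys but not the values used, so no values are suggested here. As a starting point for auditing a host, the sketch below only reads the current settings from `/proc/sys`. The helper names are hypothetical, and some keys (for example `net.ipv4.tcp_tw_recycle`) do not exist on newer kernels, in which case `None` is returned.

```python
from pathlib import Path

# TCP tunables named on the slide; values are deliberately not set here.
TUNABLES = [
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_keepalive_probes",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_tw_reuse",
]


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / Path(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in TUNABLES:
        print(f"{key} = {read_sysctl(key)}")
```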
Other performance tweaks
/etc/security/limits.d/nagios.conf
ipmon soft nofile 131072
ipmon hard nofile 131072
ipmon soft nproc 131072
ipmon hard nproc 131072
Nearly disable OOM killer for the nagios process, saves it until last
echo '-16' > /proc//oom_adj
Re-nice puppet to run at 10 so less impacting (true for any extra services)
/etc/sysconfig/puppet NICELEVEL=10
This should apply to any other running services that might take resources
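Limits set in `limits.d` only take effect for new sessions, so it helps to confirm that a process actually received them. The sketch below parses entries in the four-column format shown on this slide and compares a required value with the current process's open-file limit. The helper names are hypothetical, and checking the Nagios daemon itself would mean running this as that user or reading the daemon's own limits instead.

```python
import resource


def parse_limits(text: str) -> dict:
    """Parse 'domain type item value' lines into {(domain, type, item): value}."""
    limits = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) != 4:
            continue
        domain, kind, item, value = fields
        limits[(domain, kind, item)] = int(value) if value.isdigit() else value
    return limits


def meets(current: int, required: int) -> bool:
    """True when the current limit is unlimited or at least the required value."""
    return current == resource.RLIM_INFINITY or current >= required


def current_nofile():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    wanted = parse_limits("ipmon soft nofile 131072\nipmon hard nofile 131072\n")
    soft, hard = current_nofile()
    print("soft ok:", meets(soft, wanted[("ipmon", "soft", "nofile")]))
    print("hard ok:", meets(hard, wanted[("ipmon", "hard", "nofile")]))
```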
Common Perf Tools
vmstat / top - cpu/memory
iostat / iotop - disk usage
iptraf - network
sar - cpu/memory/disk
strace - immediate debugging, also debugging QA
esxtop - VM stats
tuned - can dynamically tune system
perf record -p / perf list / perf top -u nagios
How to keep it running well
Monitor everything...you can never have too much info!
CPU load and CPU stats (idle/wait/user/system)
Disk space, inodes free
All application/system logs (apache, syslog, nagios.log, etc.)
Hardware status
Swap / Physical Memory Usage
Puppet state (state.yaml)
Apache Stats (if you have a GUI/API)
Network performance and stats (errors, throughput, etc.)
NTP time and drift (more important on VMs)
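As one example of collecting the CPU stats listed above (idle/wait/user/system), the sketch below turns two readings of the aggregate `cpu` line in `/proc/stat` into percentages. The function names are hypothetical, and a real check plugin would also need thresholds and Nagios exit codes, which are left out here.

```python
import time

# Leading counters on the aggregate "cpu" line of /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu ...' line into a dict of jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Percentage of time spent in each state between two readings."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("readings are identical or out of order")
    return {key: 100.0 * delta / total for key, delta in deltas.items()}


def read_cpu(path: str = "/proc/stat") -> dict:
    """Read the first line of /proc/stat, which holds the aggregate counters."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


if __name__ == "__main__":
    first = read_cpu()
    time.sleep(1)
    print(cpu_percentages(first, read_cpu()))
```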
Our Platform Architecture (simplified)
Known Issues (and complaints)
Smaller (1-2 core) systems are easily overloaded by the number of workers
No remote workers (yet)
Still have to restart to add new hosts/services
No REST API natively
Livestatus (or similar) not native
Questions ?
@dwittenberg2008
www.linkedin.com/in/dwittenberg
nagios and nagios-devel IRC
Nagios Users and Devel mailing lists
Always looking to hire new people so contact me!