03.05.2018, Sven Panne, Check_MK Conference #4
Event Console Integration
Two Kinds of Monitoring
● Status-based: pushing/pulling states, “classic” Check_MK
● Event-based: pushing of messages, independent of state, Event Console
Both kinds are useful!
Design Considerations
● High frequency of incoming messages
● Human-manageable amount of actual events
● Not a high-performance syslog archive
Architectural Overview I: Basic Pipeline
[Diagram labels: SNMP, syslog, spool, local, TrapParser, syslogParser, RuleEngine, Actions, Notifications, Shell Script, EC]
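The basic pipeline can be pictured as parse, match, act. The following is a minimal sketch of that idea, not the Event Console's actual code: the names `Event`, `Rule`, `parse_syslog_line` and `process` are hypothetical, the parser only understands a simplified `<PRI>host app: text` line, and "first matching rule wins" is an assumption made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    host: str
    application: str
    text: str
    priority: int  # syslog severity, 0 (emergency) .. 7 (debug)


@dataclass
class Rule:
    pattern: str                     # regex searched in the message text
    action: Callable[[Event], None]  # stand-in for a notification or shell script


# Simplified line format "<PRI>host app: text"; real syslog carries more fields.
_LINE = re.compile(r"^<(\d+)>(\S+) ([^:]+): (.*)$")


def parse_syslog_line(line: str) -> Optional[Event]:
    """Parser stage: turn a raw line into an Event, or None if unparsable."""
    m = _LINE.match(line)
    if m is None:
        return None
    pri = int(m.group(1))
    # In syslog, PRI = facility * 8 + severity.
    return Event(m.group(2), m.group(3), m.group(4), priority=pri % 8)


def process(line: str, rules: List[Rule]) -> Optional[Event]:
    """Rule engine stage: run the action of the first rule that matches."""
    event = parse_syslog_line(line)
    if event is None:
        return None
    for rule in rules:
        if re.search(rule.pattern, event.text):
            rule.action(event)
            return event
    return None  # no rule matched: the message does not become an event


fired: List[Event] = []
rules = [Rule(pattern=r"link down", action=fired.append)]
process("<11>switch01 kernel: eth0 link down", rules)
process("<14>switch01 kernel: eth0 link up", rules)
```

Only the first message matches a rule, so only it ends up in `fired`. This mirrors the design goal of many incoming messages but few actual events.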
Architectural Overview II: Persistence & GUI
[Diagram labels: SNMP, syslog, spool, local, TrapParser, syslogParser, RuleEngine, Actions, Notifications, Shell Script, History, Events, Views (Events, History, Details), Monitoring Core, Livestatus, GUI, EC]
Architectural Overview III: There and Back Again
[Diagram labels: SNMP, syslog, spool, local, Logwatch, TrapParser, syslogParser, RuleEngine, Actions, Notifications, Shell Script, History, Events, Event Service, Check, Views (Events, History, Details), Monitoring Core, Livestatus, GUI, EC]
Architectural Overview IV: Host Information
[Diagram labels: SNMP, syslog, spool, local, Logwatch, TrapParser, syslogParser, RuleEngine, Actions, Notifications, Shell Script, History, Events, Event Service, Check, Views (Events, History, Details), Monitoring Core, Livestatus, Hosts, Downtimes, GUI, EC]
Events are useful, even during host downtimes
Three chunks yet to be completed
Still generated, but marked specially
Automatically archived after downtime
Tactical overview considers them “handled”
Actions can be skipped
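The downtime behaviour listed above can be sketched as follows. This is an illustration under assumptions, not the Event Console implementation: `EventStore`, `in_downtime`, `skip_actions_in_downtime` and `downtime_ended` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    host: str
    text: str
    in_downtime: bool = False  # the "marked specially" flag


@dataclass
class EventStore:
    open_events: List[Event] = field(default_factory=list)
    history: List[Event] = field(default_factory=list)

    def add(self, host: str, text: str, host_in_downtime: bool,
            action: Callable[[Event], None],
            skip_actions_in_downtime: bool = True) -> Event:
        # The event is still generated, only flagged.
        event = Event(host, text, in_downtime=host_in_downtime)
        self.open_events.append(event)
        # Actions can be skipped for events created during a downtime.
        if not (host_in_downtime and skip_actions_in_downtime):
            action(event)
        return event

    def unhandled(self) -> List[Event]:
        # An overview would count downtime events as "handled".
        return [e for e in self.open_events if not e.in_downtime]

    def downtime_ended(self, host: str) -> None:
        # Archive the host's downtime events once the downtime is over.
        keep: List[Event] = []
        for e in self.open_events:
            if e.host == host and e.in_downtime:
                self.history.append(e)
            else:
                keep.append(e)
        self.open_events = keep


store = EventStore()
notified: List[Event] = []
store.add("db01", "disk full", False, notified.append)
store.add("db01", "backup running", True, notified.append)
open_before = len(store.open_events)
unhandled_before = len(store.unhandled())
store.downtime_ended("db01")
```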
Further Improvements
Finer configuration of limits
More powerful rewriting
Export of rule packs via MKPs
03.05.2018, Tom Bärwinkel, Check_MK Conference #4
Background jobs
Examples of long running processes
● Host renaming
● Bake agents
● Create report
Problem – Apache process timeouts
The usual workaround
Let’s just increase the timeout:
● Locate files
● Increase timeout
● Restart
● Update
Still a problem - What’s the status?
● No user feedback
● Blocked processes
● Apache restarts
Solution - separation of concerns
● GUI: configuration, start jobs, show progress
● Common interface
● Background jobs: do actual work, send updates
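The split between GUI and background job can be sketched as below. This rests on assumptions not stated on the slide: a JSON status file serves as the common interface, `JobStatus` and `rename_host_job` are hypothetical names, and a thread stands in for the separate process so that the example stays self-contained.

```python
import json
import tempfile
import threading
from pathlib import Path


class JobStatus:
    """Hypothetical common interface: the job writes, the GUI reads."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def update(self, state: str, progress: int) -> None:
        # Write to a temporary file and rename it, so a reader never
        # sees a half-written status.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"state": state, "progress": progress}))
        tmp.replace(self.path)

    def read(self) -> dict:
        if not self.path.exists():
            return {"state": "initialized", "progress": 0}
        return json.loads(self.path.read_text())


def rename_host_job(status: JobStatus, steps: int = 4) -> None:
    """Background side: do the actual work and send updates."""
    for step in range(1, steps + 1):
        # ... one chunk of the long-running work would go here ...
        status.update("running", 100 * step // steps)
    status.update("finished", 100)


# GUI side: start the job, return immediately, poll for progress later.
status = JobStatus(Path(tempfile.mkdtemp()) / "jobstatus.json")
worker = threading.Thread(target=rename_host_job, args=(status,))
worker.start()
snapshot = status.read()  # what a page reload would show at this moment
worker.join()
final = status.read()
```

Because the web request only starts the job and reads its status, it no longer has to stay open for the duration of the work, which is what caused the Apache timeouts.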
How it looks in practice - Reports
The Background jobs overview
Special rules for critical jobs
● Affected jobs: bake agents, rename host
● Only one job allowed
● Error page with details
● Require acknowledgement
Upcoming background jobs
Download of agent output
Service discovery
SLA calculations
03.05.2018, Konstantin Büttner, Check_MK Conference #4
Time-specific check parameters
Use case
● Recurring events lead to conditions that would be abnormal at other times
● E.g. backup jobs
Predictive Levels?
Solution – Time-specific parameters
● Define parameters for time periods
● Generic solution – works for all check plugins
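One way to picture time-specific parameters is as default parameters plus a list of overrides bound to time periods. The sketch below is an illustration only and does not show Check_MK's actual rule format: `is_active`, `effective_parameters`, the `levels` key and the daily backup window are assumptions made for the example.

```python
from datetime import datetime, time
from typing import Dict, List, Tuple

TimePeriod = Tuple[time, time]  # daily window, start inclusive, end exclusive


def is_active(period: TimePeriod, now: datetime) -> bool:
    start, end = period
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window that crosses midnight


def effective_parameters(
    default: Dict[str, object],
    time_specific: List[Tuple[TimePeriod, Dict[str, object]]],
    now: datetime,
) -> Dict[str, object]:
    """Start from the defaults and apply the first active period's overrides."""
    params = dict(default)
    for period, overrides in time_specific:
        if is_active(period, now):
            params.update(overrides)
            break
    return params


default = {"levels": (80.0, 90.0)}        # normal warn/crit thresholds
backup_window = (time(23, 0), time(2, 0))  # nightly backup, hypothetical
time_specific = [(backup_window, {"levels": (95.0, 99.0)})]

at_night = effective_parameters(default, time_specific, datetime(2018, 5, 3, 23, 30))
at_noon = effective_parameters(default, time_specific, datetime(2018, 5, 3, 12, 0))
```

Since the selection happens before the check plugin sees its parameters, the plugin itself needs no changes, which is why such a scheme can work for all check plugins.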
How does it work?
03.05.2018, Konstantin Büttner, Check_MK Conference #4
Customizable graph layouts
Our objective in graphing
● Traditional graphing systems are built for customizing graphs
● But: day-to-day, we’d rather just have graphs. Tweaking graphs is fun, but time-consuming
● Our objective: Make our data useful out of the box
A bit of history - PNP4Nagios
[Image: PNP graph]
In the Olden Days
● Graphs were defined for each check type
● Templates in PHP (ノಠ益ಠ)ノ彡┻━┻
● No semantics, no uniformity*
● No customizability**
* except for checks sharing templates
** unless you wrote your own templates
A bit of history - Metrics
[Image: Metrics graph]
A new dawn
● Most visible: New look
● Most important: Semantics
● Checks no longer have graphs…
● …but values have semantics
→ Automatic graphs
→ Custom graphs
Now: More customizability
● Unified options for dashlets, views, reports (Werk #5569)
● Vertical scaling and mirroring customizable (Werk #3920)
03.05.2018, Konstantin Büttner, Check_MK Conference #4
Performance improvements
Why is a performance focus critical?
● The typical Check_MK setup keeps growing – more services, more sites, more users
● Better scalability makes new things possible
How to make things go faster
Avoid unnecessary work
Optimize
Make things feel faster
Avoid unnecessary work - Tabs
● A typical browser: 50 open tabs
● You don’t use them all… but they all make regular Livestatus queries
● New: Only update active tabs/windows (Werk #4753)
● Benefit: No more wasted data transfers
Avoid unnecessary work – Emails
● Graphs can make emails reasonably large
● Multiply that by the number of recipients ...
● What’s a mail server for? (Werk #4813)
● Benefit: Notification system becomes more scalable
Avoid unnecessary work – Sites
● Frequent scenario: multi-site setup
● For many users, only few sites are actually relevant
● Yet, GUI in the past contacted all sites for every user
● No more: Now configurable for each user (Werk #4921)
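The idea amounts to filtering the site list per user before any connection is opened. A minimal sketch, assuming a hypothetical per-user set of disabled sites (the function name, the site IDs and the data layout are made up for this example and are not taken from Werk #4921):

```python
from typing import Dict, List, Set


def sites_for_user(all_sites: List[str],
                   disabled_by_user: Dict[str, Set[str]],
                   user: str) -> List[str]:
    """Return only the sites the GUI should contact for this user."""
    disabled = disabled_by_user.get(user, set())
    return [site for site in all_sites if site not in disabled]


all_sites = ["muc", "ber", "nyc", "sgp"]
disabled_by_user = {"alice": {"nyc", "sgp"}}  # hypothetical user setting

alice_sites = sites_for_user(all_sites, disabled_by_user, "alice")
bob_sites = sites_for_user(all_sites, disabled_by_user, "bob")
```

A user without any setting still sees all sites, so the default behaviour stays unchanged.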
Optimize – BI
● Large aggregations now compile faster (Werk #5142)
● Magic - “it’s the algorithm, stupid”
● Further reworking already planned
Optimize – liveproxyd
● Previously: # The main loop of the daemon goes here
● One python process – GIL limits scalability
● Now what?
Optimize – liveproxyd
● Now: One process per site, master process to manage
● Process view:
● More resilient, more performant through use of multiple CPUs (cf. Werk #4901)
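The process layout can be illustrated with a master that launches one child interpreter per site. This is a sketch under assumptions and not liveproxyd's code: the worker body, the site IDs and the function names are hypothetical, and the real daemon keeps its children running instead of waiting for them to exit.

```python
import subprocess
import sys
from typing import Dict, List, Tuple

# Hypothetical per-site worker: a real one would proxy Livestatus traffic.
WORKER_CODE = "import sys; print('proxy for site ' + sys.argv[1] + ' ready')"


def start_site_process(site_id: str) -> subprocess.Popen:
    """Start a separate Python process for a single site."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_CODE, site_id],
        stdout=subprocess.PIPE,
        text=True,
    )


def run_master(site_ids: List[str]) -> Dict[str, Tuple[int, str]]:
    """Master: start all children first, then collect their results."""
    children = {site: start_site_process(site) for site in site_ids}
    results: Dict[str, Tuple[int, str]] = {}
    for site, proc in children.items():
        out, _ = proc.communicate()
        results[site] = (proc.returncode, out.strip())
    return results


results = run_master(["muc", "ber"])
```

Each child is its own interpreter with its own GIL, so the sites can use several CPUs, and a crash in one child does not take down the others.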
Make things feel faster - Graphs
● Previously: View with many Graphs? Better wait for all the data
● Now: Load the view with placeholders, update graphs asynchronously
Take-home message
Attention to detail pays off
Continuously improving
If you have lots of sites and lots of users, Werk #4921 may help you