OSMC 2014: Why we do monitoring wrong | Michael Medin

Upload: netways

Post on 02-Jul-2015

DESCRIPTION

Over the past decade many IT fields have seen a true revolution, whereas others have been slow to change (I am looking at you, leading open source monitoring tool). I think it is high time for a revolutionary change and, more importantly, I think it is time for monitoring to become a tool not only for the IT support team but for the entire enterprise. This will NOT be about some fancy new monitoring tool or even about NSClient++. Instead I will show how we can change the tools we use today to make monitoring a bit more modern and take it into the new millennium (yes, the one we are already in). ...But since I have added what I think is the coolest feature ever to NSClient++, I might (if I have time) spend 5 minutes (at the end) showing the new web UI for NSClient++ :)

TRANSCRIPT

Page 1: OSMC 2014: Why we do monitoring wrong | Michael Medin

Why we do monitoring wrong

Page 2:

…frustration…

dev not ops

Page 3:
Page 4:
Page 5:
Page 6:
Page 7:
Page 8:
Page 9:
Page 10:
Page 11:

Please don’t be angry!

Sometimes I am busy

Page 12:
Page 13:
Page 14:
Page 15:

TAKE:1

Page 16:

check_disk -w 80 -c 90

Page 17:

Slack

-w 80 -c 90

Disk size: 1gb 1tb 1pb
Slack: 0.2g 219g 225 179g

Page 18:

Better?

-w $ARG1$

Disk size: 1gb 1tb 1pb
Threshold: 80% 98% 99.8%
Slack: 0.2g 22g 2 251g

Magic?
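
The slack figures above are plain arithmetic: the space still free when the warning fires is the disk size times (100% minus the threshold). A minimal sketch in shell, assuming sizes in whole decimal gigabytes and thresholds given in tenths of a percent (so 998 means 99.8%). The function name `slack_gb` is hypothetical and not part of any plugin, and the round numbers differ slightly from the binary sizes on the slide.

```shell
# slack_gb SIZE_GB THRESHOLD_PERMILLE
# Prints the free space (in GB, rounded down) left when usage hits the threshold.
slack_gb() {
    local size_gb=$1 threshold_permille=$2
    echo $(( size_gb * (1000 - threshold_permille) / 1000 ))
}

slack_gb 1000 800      # 1 TB disk, warn at 80%: 200 GB still free
slack_gb 1000000 800   # 1 PB disk, warn at 80%: 200000 GB still free
slack_gb 1000000 998   # 1 PB disk, warn at 99.8%: 2000 GB still free
```

Even with a hand-tuned threshold per disk, the slack keeps growing with the disk size, which is the point the slide makes.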

Page 19:

[Chart: Value, Warning, Critical; axis from 0 to 3000]

The problem

The first alert

On call staff alerted

Lost time

Things went bad!

Page 20:

[Chart: Value, Warning, Critical; axis from 0 to 3000]

The problem

The first alert

On call staff alerted

Lost time

Page 21:

No Slack

-w trend-line

Disk size: 1gb 1tb 1pb
Slack: 0g 0g 0g

Page 22:

Works With Everything!

Magic?

Page 23:

TAKE:2

Page 24:

What about Capacity planning

Bounds?

Page 25:

Alarm clock

Page 26:

[Chart: Warning, Critical, HDD 1, HDD 2; axis from 0 to 2500]

Full

How long?

> 80%

> 90%

Page 27:

[Chart: Warning, Critical, HDD 1, HDD 2; axis from 0 to 2500]

Full

warn=full in less than x weeks
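
The rule "warn if the disk will be full in less than x weeks" can be sketched as a linear extrapolation from two samples. This is a hedged illustration in shell, not NSClient++ or Nagios syntax: the function names, the two-sample trend and the units (GB and days) are all assumptions made for the example.

```shell
# days_until_full CAPACITY_GB USED_THEN_GB USED_NOW_GB DAYS_BETWEEN
# Extrapolates linear growth between two samples and prints the whole number
# of days until the disk is full, or "never" if usage is flat or shrinking.
days_until_full() {
    local capacity=$1 used_then=$2 used_now=$3 days_between=$4
    local growth=$(( used_now - used_then ))
    if [ "$growth" -le 0 ]; then
        echo "never"
        return 0
    fi
    # Remaining space divided by growth per day, in integer arithmetic
    echo $(( (capacity - used_now) * days_between / growth ))
}

# check_time_to_full WARN_DAYS CAPACITY_GB USED_THEN_GB USED_NOW_GB DAYS_BETWEEN
# Prints WARNING when the forecast is below WARN_DAYS, otherwise OK.
check_time_to_full() {
    local warn_days=$1
    shift
    local days
    days=$(days_until_full "$@")
    if [ "$days" != "never" ] && [ "$days" -lt "$warn_days" ]; then
        echo "WARNING: full in about $days days"
    else
        echo "OK"
    fi
}

# 2000 GB disk that grew from 1400 GB to 1500 GB in 7 days; warn below 6 weeks
check_time_to_full 42 2000 1400 1500 7
```

The threshold is now expressed in time rather than percent, so the same setting fits a small disk that fills quickly and a large one that fills slowly.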

Page 28:

Photo Credit Howard Dickins

Alarm clock

2 hours before work

Page 29:

[Chart: Value, Warning, Critical; axis from 0 to 3000]

The first alert

On call staff alerted

Page 30:

Magic?

No, basic math!

Page 31:
Page 32:

check_disk -w 80 -c 90

Page 33:

[Chart: Value, Warning, Critical; axis from 0 to 2500]

Backup

check_disk check_disk_backup

Page 34:

[Chart: Value, Warning, Critical; axis from 0 to 2500; the "Backup" period is marked]

check_disk warn=usage>80% and not_backup
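
The combined rule `warn=usage>80% and not_backup` can be illustrated in plain shell. This is a sketch only: `not_backup` is the tag from the slide, but the slide does not show how it is set, so a hypothetical second argument (`yes` or `no`) stands in for it here, and the function name is made up for the example.

```shell
# check_disk_tagged USAGE_PERCENT BACKUP_RUNNING(yes|no)
# Warns on usage above 80% only when no backup is running, so the expected
# spike during the backup window does not raise an alert.
check_disk_tagged() {
    local usage=$1 backup_running=$2
    if [ "$usage" -gt 80 ] && [ "$backup_running" != "yes" ]; then
        echo "WARNING: disk usage ${usage}%"
    else
        echo "OK: disk usage ${usage}%"
    fi
}

check_disk_tagged 85 yes   # spike caused by the backup: stays OK
check_disk_tagged 85 no    # same usage outside the backup window: WARNING
```

One check with a condition replaces the two separate checks (`check_disk` and `check_disk_backup`) from the previous slide.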

Page 35:

Magic?

No, it is tags

Page 36:

Other

TAKE:1

Page 37:

check_load -w 1 -c 2

Page 38:

Bad CPU load?

0% 80% 90% 100%

Page 39:

[Chart: Value, Yesterday, Last Week; axis from 0 to 100]

Page 40:

Magic?

No, still math
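
The comparison of Value against Yesterday and Last Week suggests judging a reading against its own history rather than against a fixed limit. One way to sketch that in shell, assuming integer percentages and a hypothetical tolerance in percentage points (the function name and the rule itself are illustrative and do not come from the talk):

```shell
# check_against_baseline VALUE YESTERDAY LAST_WEEK TOLERANCE
# Warns only when VALUE exceeds both historical readings by more than TOLERANCE.
check_against_baseline() {
    local value=$1 yesterday=$2 last_week=$3 tolerance=$4
    if [ "$value" -gt $(( yesterday + tolerance )) ] &&
       [ "$value" -gt $(( last_week + tolerance )) ]; then
        echo "WARNING: $value is above the usual level"
    else
        echo "OK: $value is in line with history"
    fi
}

check_against_baseline 90 88 85 10   # busy, but it always is at this time: OK
check_against_baseline 90 40 45 10   # unusual for this time: WARNING
```

The same 90% reading is fine on a host that is always busy at this hour and suspicious on one that normally idles.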

Page 41:
Page 42:

check_load -w 1 -c 2

Page 43:

High Load???

GOOD BAD

DO WE CARE?

Page 44:
Page 45:

Magic?

No, still math

Page 46:

TAKE:2

Page 47:

check_mem -w 80 -c 90

Page 48:

Bad Memory?

0% 80% 90% 100%

Page 49:

Managed…

Java JVM

.net CLR

Page 50:

check_mem

check_jmx check_counter

check_wmi

Page 51:

check_disk -w 80 -c 90

Page 52:

FULL DISK???

GOOD BAD

DO WE CARE?

Page 53:

Why do we monitor?

Because we can?

Because we do?

Because…

Page 54:

Business!

NOT

Technology

Page 55:

IT

BUSINESS

Page 56:

Magic?

No, common sense

Page 57:

TAKE:1

Page 58:

Nagios™ is Old

Easy

Simple

What we always do

Page 59:

Other solutions

“the new stuff”

bischeck

Addons

forks

Page 60:

Why a tool?

fast forward 15 years

Nagios™ Naemon™ could do this!

Why an addon?

Page 61:

cron
*/5 * * * * wrap.sh mycheck

#!/bin/bash
# wrap.sh: run the check given on the command line
"$@"
# Exit status 1 triggers an email
if [ $? -eq 1 ]; then
    send-email.sh
fi

Page 62:
Page 63:
Page 64:

TAKE:2

Page 65:
Page 66:
Page 67:
Page 68:
Page 69:

TAKE:1

Page 70:
Page 71:
Page 72:

TAKE:2

Page 73:

Photo by Olga Berrios

Page 74:

Information about NSClient++ http://nsclient.org

facebook.com/nsclient

Slides and examples http://nsclient.org/nscp/conferances

My Blog http://blog.medin.name