A Deep Dive into Nagios Analytics

Post on 29-Jun-2015

Category:

Technology

DESCRIPTION

Performance metrics + Nagios traffic + other sources + Datadog in the cloud = real time graphs + analytics

TRANSCRIPT

A Deep Dive into Nagios Analytics

Alexis Lê-Quôc (@alq)

http://datadoghq.com

@alq

Dev & Ops

Nagios user since 2008

Datadog co-founder

A little survey

Top 3 failed checks

That I responded to last week

That woke me up

That most of my team responded to at least once

That impact our business the most?

That I responded to 5 weeks ago

Using memory to prioritize remediation...

At best, finding local optimums

At worst, Brownian motion

Analytics

Performance metrics + Nagios traffic + other sources

Aggregation in the “Cloud”

Real-time graphs + analytics (Nagios et al.)

Nagios: a “chatty” source, one of the 40+ sources Datadog supports

One example

Almost 13000 Nagios “events” over the past week

Constant stream

86 notifications!
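For scale, 86 notifications out of almost 13000 events is under 1% of events. A minimal sketch of how such counts could be pulled out of a Nagios log, assuming the usual `[epoch] TYPE: field;field;...` line shape; the sample lines, host names and helper names below are made up for illustration and are not from the talk:

```python
import re
from collections import Counter

# Assumed Nagios log line shape: "[<epoch>] <TYPE>: <semicolon-separated fields>"
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")


def count_event_types(lines):
    """Count log entries per type, e.g. 'SERVICE ALERT' or 'SERVICE NOTIFICATION'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


def notification_ratio(counts):
    """Fraction of alerts that led to a notification (0.0 if there are no alerts)."""
    alerts = counts["SERVICE ALERT"] + counts["HOST ALERT"]
    notifications = counts["SERVICE NOTIFICATION"] + counts["HOST NOTIFICATION"]
    return notifications / alerts if alerts else 0.0


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    "[1348000000] SERVICE ALERT: web01;HTTP;CRITICAL;SOFT;1;Connection refused",
    "[1348000060] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1348000061] SERVICE NOTIFICATION: oncall;web01;HTTP;CRITICAL;notify-by-email;Connection refused",
    "[1348000300] SERVICE ALERT: web01;HTTP;OK;HARD;3;HTTP OK",
    "[1348000400] HOST ALERT: db01;DOWN;SOFT;1;PING CRITICAL",
]

counts = count_event_types(SAMPLE_LOG)
print(counts["SERVICE ALERT"], counts["SERVICE NOTIFICATION"])
print(notification_ratio(counts))
```

Lines that do not match the assumed shape are skipped, so free-form log messages do not inflate the counts.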

Pattern

More data? More questions.

A dialog with data

Not a scientific study

[Figure: Nagios samples. Population plotted against host count (x axis: host count, 0 to 750; y axis: population), colored by quartile (1 to 4).]

Population quartiles:

25%    50%    75%    100%
20     93     322    904
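A minimal sketch of how a population could be split into quartiles by host count, using Python's standard `statistics.quantiles`. The host counts are made up and were chosen so that the cut-offs land on the slide's numbers; they are not the talk's data, and the helper names are hypothetical:

```python
import statistics


def quartile_cutoffs(host_counts):
    """Return the 25%, 50%, 75% and 100% cut-offs of a list of host counts."""
    q1, q2, q3 = statistics.quantiles(host_counts, n=4, method="inclusive")
    return [q1, q2, q3, max(host_counts)]


def quartile_of(value, cutoffs):
    """Return 1..4: the first quartile whose upper cut-off is >= value."""
    for index, cutoff in enumerate(cutoffs, start=1):
        if value <= cutoff:
            return index
    return len(cutoffs)


# Made-up host counts per Nagios installation, for illustration only.
host_counts = [5, 12, 20, 40, 93, 150, 322, 600, 904]
cutoffs = quartile_cutoffs(host_counts)
print(cutoffs)
print([quartile_of(count, cutoffs) for count in host_counts])
```

Each installation then carries a quartile label from 1 to 4, which is what the per-quartile panels in the following plots are split on.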

Does size matter?

[Figure: Weekly count per host split by quartile. x axis: Nagios alert per host (0 to 1000); y axis: count per week; one panel per quartile (1 to 4).]

Outliers: sick hosts, silenced checks

Notifications

1-3% of alerts notify

Little difference per quartile

Does time of day matter?

[Figure: Alerts per hour by hour of day. x axis: hour of day (UTC, 0 to 20); y axis: alerts per hour; one panel per quartile (1 to 4).]

Mean about the same across quartiles

Time-based deviation?
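One way to look for a time-based deviation is to bucket alert timestamps by UTC hour of day and compare the spread of the buckets to their mean. This is a sketch under the assumption that alerts are available as epoch seconds; the timestamps and function names are invented for illustration:

```python
import statistics
from collections import Counter
from datetime import datetime, timezone


def alerts_per_hour_of_day(epoch_seconds):
    """Count alerts in each UTC hour-of-day bucket (index 0..23)."""
    counts = Counter()
    for ts in epoch_seconds:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        counts[hour] += 1
    return [counts[hour] for hour in range(24)]


def mean_and_deviation(buckets):
    """Mean and population standard deviation of the hourly buckets."""
    return statistics.mean(buckets), statistics.pstdev(buckets)


# Made-up alert timestamps (epoch seconds), for illustration only.
timestamps = [0, 3600, 3700, 86400, 90000]
buckets = alerts_per_hour_of_day(timestamps)
mean, deviation = mean_and_deviation(buckets)
print(buckets[:3], mean, deviation)
```

A deviation that is large relative to the mean would suggest that some hours are noisier than others; swapping `.hour` for `.weekday()` gives the same view per day of week.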

Does the day of week matter?

[Figure: Notifying Alerts per Day. x axis: day of week (Sun to Sat); y axis: alerts per hour; one panel per quartile (1 to 4).]

Not really

Squeaky wheels? (checks)

[Figure: Noisiest checks (overall). x axis: checks ranked by noise (0 to 250); y axis: alerts per hour; one panel per quartile (1 to 4).]

Outlier

[Figure: Noisiest checks (outlier). x axis: checks ranked by noise (0 to 40); y axis: alerts per hour.]

Outlier in more detail

[Figure: Noisiest checks (without outlier). x axis: checks ranked by noise (0 to 200); y axis: alerts per hour; one panel per quartile (1 to 4).]

Long Tail
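A long tail of this kind shows up as soon as checks are ranked by alert count. The sketch below assumes alerts are already parsed into (host, check) pairs; the data and helper names are made up for illustration:

```python
from collections import Counter


def rank_by_noise(alerts):
    """Rank check names by alert count, noisiest first."""
    return Counter(check for _host, check in alerts).most_common()


def share_of_top(ranked, k):
    """Fraction of all alerts produced by the k noisiest checks."""
    total = sum(count for _name, count in ranked)
    top = sum(count for _name, count in ranked[:k])
    return top / total if total else 0.0


# Made-up (host, check) alert pairs, for illustration only.
alerts = (
    [("web01", "HTTP")] * 6
    + [("db01", "disk")] * 2
    + [("db01", "load")]
    + [("web02", "ntp")]
)

ranked = rank_by_noise(alerts)
print(ranked)
print(share_of_top(ranked, 1))
```

Keying the counter on the host instead of the check name gives the equivalent ranking for noisy hosts.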

Squeaky wheel? (hosts)

[Figure: Noisiest hosts (overall). x axis: hosts ranked by noise (0 to 200); y axis: alerts per hour; one panel per quartile (1 to 4).]

Same outlier

[Figure: Noisiest hosts (outlier). x axis: hosts ranked by noise (0 to 60); y axis: alerts per hour; quartile 3 only.]

Similar pattern to the checks

[Figure: Noisiest checks (without outlier). x axis: checks ranked by noise (0 to 200); y axis: alerts per hour; one panel per quartile (1 to 4).]

Long Tail

Recurring alerts

[Figure: Alert age & frequency of occurrence. x axis: age between earliest and latest occurrence (0 to 300), from young to old; y axis: number of days occurring (0 to 150), from seldom happens to happens often; points colored by quartile (1 to 4).]

[Figure: Alert age & frequency of occurrence (x axis: age between earliest and latest occurrence; y axis: number of days occurring; points colored by quartile), with two regions annotated:]

Happen once in a while

Occur often, for a long time: tolerated
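The two axes of that plot can be derived per alert from its earliest and latest occurrence and the number of distinct days on which it fired. A minimal sketch, assuming alerts are (epoch seconds, host, check) tuples; the data and the function name are hypothetical and not taken from the talk:

```python
from collections import defaultdict
from datetime import datetime, timezone


def age_and_frequency(alerts):
    """Map (host, check) -> (age in days, number of distinct days with an alert)."""
    days_seen = defaultdict(set)
    for ts, host, check in alerts:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        days_seen[(host, check)].add(day)
    result = {}
    for key, days in days_seen.items():
        age = (max(days) - min(days)).days  # span between first and last occurrence
        result[key] = (age, len(days))
    return result


DAY = 86400  # seconds per day

# Made-up alerts, for illustration only.
alerts = [
    (0, "web01", "HTTP"),
    (1 * DAY, "web01", "HTTP"),
    (1 * DAY + 10, "web01", "HTTP"),
    (30 * DAY, "web01", "HTTP"),
    (5 * DAY, "db01", "disk"),
]

stats = age_and_frequency(alerts)
print(stats)
```

Alerts with a large age and a high day count are the ones the slide labels as tolerated; those with a low day count only happen once in a while.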

More data? More questions.

HOWTO?

Find out tomorrow!

Awk

Postgres

R

d3

ggplot2

Presentation matters

Take-away?

Take-aways

• Don’t rely on your memory to prioritize

• Your Nagios logs are a treasure trove

• Have a dialog with your data

• Presentation matters

http://dtdg.co/nagios2012

Curious about Datadog?

Like cute logos?
