lemon monitoring and lemon alarm system (sensors, exception, alarm)

Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/ CF Lemon monitoring and Lemon Alarm System (sensors, exception, alarm) Ivan Fedorko 22/11/2010



TRANSCRIPT

Page 1: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Ivan Fedorko, 22/11/2010

Computing Facilities (CF), CERN IT Department, CH-1211 Geneva 23, Switzerland
www.cern.ch/it

Page 2: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Overview

• Lemon overview
• Lemon agent and sensors
• How to write new sensor
• Exception sensor
• LAS

Page 3: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Lemon components

[Architecture diagram; recoverable labels:]
• Node monitoring: Sensors, Monitoring Agent, Local Cache
• Measurement Repository: Application Server, Repository Backend, Oracle Database
• Lemon-web: RRD tool / Python, Apache / PHP
• User interfaces: Web Browser, Lemon CLI (command line tool to access data), Lemon-host-check (command line tool for node exceptions)
• Connection labels: TCP/UDP, SQL, HTTP

Page 4: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Lemon agent and sensors

[Diagram: a Monitoring Agent connected to Sensors; each Sensor holds Metric Classes, and each Metric Class has Class Instances.]

Sensor: A process or script that is connected to the lemon-agent via a bi-directional pipe and collects information on behalf of the agent. Sensors implement metric classes.

Metric Class: The equivalent of a class in OOP (Object-Oriented Programming).

Metric Instance: An instance (an object) of a metric class, which has its own configuration data.

Metric ID: A unique identifier associated with a particular metric instance of a particular metric class.
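The relationship between these four terms can be sketched as a small data model. This is only an illustration in Python, not Lemon code: the class and method names and the `timing` configuration key are assumptions, while the metric ID 9104 and the class name system.partitionInfo are taken from the exception sensor slide; pairing them with a sensor called "linux" is also assumed.

```python
from dataclasses import dataclass, field


@dataclass
class MetricClass:
    """A metric class, the OOP-style 'class' implemented by a sensor."""
    name: str


@dataclass
class MetricInstance:
    """One configured instance of a metric class, identified by its metric ID."""
    metric_id: int
    metric_class: MetricClass
    config: dict = field(default_factory=dict)


@dataclass
class Sensor:
    """A sensor groups the metric classes it implements and their instances."""
    name: str
    classes: dict = field(default_factory=dict)    # class name -> MetricClass
    instances: dict = field(default_factory=dict)  # metric ID -> MetricInstance

    def add_class(self, name):
        self.classes[name] = MetricClass(name)
        return self.classes[name]

    def add_instance(self, metric_id, class_name, **config):
        # A metric ID must stay unique, so refuse to register it twice.
        if metric_id in self.instances:
            raise ValueError(f"metric ID {metric_id} is already in use")
        instance = MetricInstance(metric_id, self.classes[class_name], config)
        self.instances[metric_id] = instance
        return instance


linux = Sensor("linux")
linux.add_class("system.partitionInfo")
# "timing" is a hypothetical configuration key used only for illustration.
partition_info = linux.add_instance(9104, "system.partitionInfo", timing=300)
```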

Page 5: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Lemon agent and sensors

MSA (Monitoring Sensor Agent)
• forks sensors and communicates with them using a custom protocol over bi-directional pipes
• configures metric instances of the metric classes of a sensor and pulls metrics from them
  – to configure: ncm-ncd --configure fmonagent
  – configuration: /etc/lemon/agent/
  – log: /var/log/lemon-agent.log
• checks on the status of sensors
• caches data locally (e.g. /var/spool/lemon-agent/)
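The fork-and-pipe arrangement can be illustrated with a short Python sketch. It does not reproduce the real agent-sensor protocol, which the slides only call "custom": the `GET`/`QUIT` commands, the reply format and the value 42 are all made up for the example.

```python
import subprocess
import sys

# Hypothetical sensor: reads one request per line on stdin, answers on stdout.
SENSOR_CODE = r'''
import sys
for line in sys.stdin:
    command, _, metric_id = line.strip().partition(" ")
    if command == "GET":
        print(f"{metric_id} 42", flush=True)   # made-up metric value
    elif command == "QUIT":
        break
'''


def start_sensor():
    """Fork the sensor as a child process connected by stdin/stdout pipes."""
    return subprocess.Popen(
        [sys.executable, "-c", SENSOR_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line buffered
    )


def pull_metric(sensor, metric_id):
    """Send one request down the pipe and read the one-line reply."""
    sensor.stdin.write(f"GET {metric_id}\n")
    sensor.stdin.flush()
    return sensor.stdout.readline().strip()


def stop_sensor(sensor):
    """Ask the sensor to quit, close both pipes and return its exit code."""
    sensor.stdin.write("QUIT\n")
    sensor.stdin.flush()
    sensor.stdin.close()
    exit_code = sensor.wait(timeout=5)
    sensor.stdout.close()
    return exit_code


sensor = start_sensor()
reply = pull_metric(sensor, 9104)
exit_code = stop_sensor(sensor)
```

Because the agent owns both ends of the pipe, it can also notice a dead sensor (a closed pipe or a non-zero exit code), which is one plausible way to implement the status check mentioned above.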


Supported Lemon sensors: http://lemon.web.cern.ch/lemon/doc/sensors.shtml

Page 6: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Lemon agent and sensors

Example sensor configuration file: /etc/lemon/agent/sensors/linux_CDB.conf

Page 7: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Lemon agent and sensors

What we can store:
• StoreSample01:
  – agent: <metric_id> <timestamp>
  – user: <value>
• StoreSample02:
  – agent: <metric_id>
  – user: <timestamp> <node> <value>
• StoreSample03:
  – agent: (nothing)
  – user: <node> <metric_id> <timestamp> <value>
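The three forms differ only in how much of the sample the agent fills in and how much the sensor code (the "user") must supply. The Python sketch below makes that division explicit. The function names mirror the slide, but the signatures, the tuple layout and the `AGENT_NODE` name are assumptions, not the Lemon API; in particular, the assumption that StoreSample01 always refers to the local node is inferred from the missing <node> field.

```python
import time

AGENT_NODE = "node01.example.org"  # hypothetical name of the local node


def store_sample01(metric_id, value, now=None):
    """Agent supplies metric ID and timestamp; user code supplies only the value."""
    timestamp = int(time.time()) if now is None else now
    return (AGENT_NODE, metric_id, timestamp, value)


def store_sample02(metric_id, timestamp, node, value):
    """Agent supplies the metric ID; user code supplies timestamp, node and value."""
    return (node, metric_id, timestamp, value)


def store_sample03(node, metric_id, timestamp, value):
    """User code supplies every field, including the node and the metric ID."""
    return (node, metric_id, timestamp, value)


local = store_sample01(9104, "/tmp 91", now=1290384000)
remote = store_sample02(9104, 1290384000, "node02.example.org", "/tmp 12")
```

Since StoreSample02 and StoreSample03 take the node from the user code, they are the forms that would allow a sensor to report on behalf of another entity.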

[Figure: example of a linux sensor; annotations: Lemon API, Reporting on behalf]

Page 8: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Create new sensor

• Check the instructions and example code
  – http://lemon.web.cern.ch/lemon/doc/howto/sensor_tutorial.shtml
• Prepare your code and test it
  – Testing can be done locally
• Prepare templates
  – For the sensor: will be transformed into a DB table definition
  – For metrics: create table/metric, metadata check on server
• Ask for a metric ID at lemon.support
• Commit the templates with the ID to CDB and inform lemon.support
• If a new exception is introduced, should it raise an alarm?

Page 9: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Sensor template example

[Figure: template listings with the labels "Somewhere in your node template", "Sensor template", "For configuration" and "For db backend"]

Page 10: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Exception sensor

• Objective:
  – To run a corrective action when the occupancy of the /tmp partition is greater than 80%.
• Involved metrics
  – With ID 9104 (system.partitionInfo)
  – Field 1 = mountname, field 5 = percentage occupancy
• Correlation
  Correlation ((9104:1 eq '/tmp') && (9104:5 > 80))
  Actuator /usr/local/sbin/clean-tmp-partition -o 75
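Read as plain logic, the correlation above is a predicate over one sample of metric 9104. A minimal Python sketch of that predicate follows; the dictionary layout (1-based field position to value) and the function name are assumptions made for illustration.

```python
def tmp_is_too_full(sample, limit=80):
    """Mirror of ((9104:1 eq '/tmp') && (9104:5 > 80)) for one 9104 sample.

    `sample` maps 1-based field positions to values; only fields 1 and 5
    (mountname and percentage occupancy) are described in the slides.
    """
    return sample[1] == "/tmp" and float(sample[5]) > limit


samples = [
    {1: "/tmp", 5: 91},   # /tmp above the limit: actuator should run
    {1: "/tmp", 5: 80},   # exactly 80 is not "greater than 80"
    {1: "/var", 5: 95},   # full, but not the /tmp partition
]
needs_cleanup = [tmp_is_too_full(s) for s in samples]
```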

Page 11: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Exception sensor

The exception sensor:
– is an officially supported Lemon sensor coded in C++
– was developed in collaboration between CERN and BARC
– implements the Lemon alarm protocol
– has a correlation engine which allows it to evaluate one or more metrics to determine whether a problem exists on a machine
– supports reporting on behalf of other monitored entities
– allows corrective actions (actuators) to run up to n times or within a given time window
– is the primary interface for inserting alarms into the Lemon framework
– provides one and only one metric class, "alarm.exception"

Full documentation at:
– http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml

Page 12: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Exception sensor

[Figure: Alarm evolution]

Page 13: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Exception metric

"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist(
        "execve", "/usr/local/sbin/clean-tmp-partition -o 75",
        "maxruns", 3,
        "timeout", 300,
        "window", 900,
        "active", true
    )
);

• name: name of the exception, to be used on the web and later for the operator's GUI
• descr: short description, also passed to the GUI
• active: if false, the ncm component will not include the exception in the config
• latestonly: if false, values are stored in the Lemon DB archive
• importance: what is the alarm's importance? Not used now!
  – 0 - informative: to be handled at convenience
  – 1 - low: 9/5 support, to be handled within working hours, e-mail outside working hours
  – 2 - high: 24/24 support, requires immediate action, PK or expert call

Page 14: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Correlation

• Basic format of a correlation is:
  [entity_name]:<metric_id>:<field_position> <operator> <reference_value> ...
• Where:
  – entity_name
    • An optional parameter, used for reporting on behalf of other entities
    • The name of the entity (wildcards '*' are supported)
  – metric_id
    • The ID of the metric to check
  – field_position
    • The field to use within the metric
    • Allows the correlation to extract a single value from a multi-valued metric
  – operator
    • E.g. ==, !=, >, <, eq, ne, regex, !regex …
  – reference_value
    • A string or number against which metric_id:field_position is compared
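A single term in this format can be parsed and evaluated with a few lines of Python. The sketch below is not the exception sensor's correlation engine (which is written in C++): it handles one term only, parses but ignores entity_name, and assumes that the symbolic operators compare numbers while eq, ne, regex and !regex compare strings. The `samples` layout is likewise an assumption.

```python
import re

# Assumed semantics: symbolic operators are numeric, word operators are string based.
OPERATORS = {
    "==": lambda value, ref: float(value) == float(ref),
    "!=": lambda value, ref: float(value) != float(ref),
    ">": lambda value, ref: float(value) > float(ref),
    "<": lambda value, ref: float(value) < float(ref),
    "eq": lambda value, ref: str(value) == ref,
    "ne": lambda value, ref: str(value) != ref,
    "regex": lambda value, ref: re.search(ref, str(value)) is not None,
    "!regex": lambda value, ref: re.search(ref, str(value)) is None,
}

# [entity_name]:<metric_id>:<field_position> <operator> <reference_value>
TERM = re.compile(
    r"^\s*(?:(?P<entity>[^\s:]+):)?(?P<metric>\d+):(?P<field>\d+)"
    r"\s+(?P<op>\S+)\s+(?P<ref>.+?)\s*$"
)


def evaluate_term(term, samples):
    """Evaluate one term against `samples`: {metric_id: {field_position: value}}."""
    match = TERM.match(term)
    if match is None:
        raise ValueError(f"not a correlation term: {term!r}")
    compare = OPERATORS[match["op"]]
    reference = match["ref"].strip("'")  # drop the quotes around string values
    value = samples[int(match["metric"])][int(match["field"])]
    return compare(value, reference)


samples = {9104: {1: "/tmp", 5: 91}}
is_tmp = evaluate_term("9104:1 eq '/tmp'", samples)
is_full = evaluate_term("9104:5 > 80", samples)
```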

Page 15: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Correlation

1. Use the basic object for a correlation:
   <metric_id>:<field_position>
2. Combine exceptions:
   10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
3. You can collect information on behalf of other entities, and you can define exceptions on their behalf:
   [entity_name]:<metric_id>:<field_position>
   e.g. (*:9501:5 != 200) && (*:9501:5 != 301)
4. Join the metrics:
   (9200:1 == 9208:1)
5. "correlation", "4109:2 ne 'symlink('/system/kernel/version')'"
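When a correlation combines several terms, it can be useful to list which metrics and fields it depends on. The following Python sketch extracts those references with a regular expression. It is a helper invented for illustration, not part of Lemon, and it assumes that reference values never contain text shaped like `<digits>:<digits>`.

```python
import re

# Optional entity name, then <metric_id>:<field_position>.
METRIC_REF = re.compile(
    r"(?:(?P<entity>[\w.*-]+):)?(?P<metric>\d+):(?P<field>\d+)"
)


def referenced_metrics(correlation):
    """Return the sorted (entity, metric_id, field_position) references found.

    The entity is an empty string when the term refers to the local node.
    """
    refs = {
        (m["entity"] or "", int(m["metric"]), int(m["field"]))
        for m in METRIC_REF.finditer(correlation)
    }
    return sorted(refs)


joined = referenced_metrics("(9200:1 == 9208:1)")
on_behalf = referenced_metrics("(*:9501:5 != 200) && (*:9501:5 != 301)")
```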

Page 16: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Actuator

• Information:
  – Actuators run as forked processes.
  – They are connected to the sensor via a pipe.
  – All information written to stdout or stderr by the actuator is caught and recorded in the agent's log file.
  – All actuator attempts are logged centrally and recorded locally in the agent's log file.
• Running shell-style actuators:
  – The system call used to run the actuator doesn't provide shell-style conveniences.
  – To use shell-style syntax like *, &&, | etc. you must define your actuator like this:
    Actuator /bin/sh -c \\" /bin/echo 'This is a demo message from $HOSTNAME' \\"
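The difference between running a command directly and running it through /bin/sh -c can be reproduced with Python's subprocess module, which also executes an argument list without a shell. This is an analogy on a POSIX system with /bin/echo and /bin/sh, not the sensor's own code.

```python
import subprocess

# Run directly (no shell): "&&" is passed to echo as an ordinary argument.
direct = subprocess.run(
    ["/bin/echo", "one", "&&", "/bin/echo", "two"],
    capture_output=True, text=True, check=True,
)

# Run through /bin/sh -c: the shell interprets "&&" and runs both commands.
via_shell = subprocess.run(
    ["/bin/sh", "-c", "/bin/echo one && /bin/echo two"],
    capture_output=True, text=True, check=True,
)

direct_output = direct.stdout        # a single line containing the literal "&&"
shell_output = via_shell.stdout      # two lines, one per echo command
```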

Page 17: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Actuator

"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist(
        "execve", "/usr/local/sbin/clean-tmp-partition -o 75",
        "maxruns", 3,
        "timeout", 300,
        "window", 900,
        "active", true
    )
);

• maxruns: the maximum number of times an actuator can run consecutively before a final alarm is generated
• timeout: the maximum number of seconds that an actuator is allowed to run before being terminated by the sensor
• window: the time window in which all maxruns of the actuator are executed

The actual values of the correlation objects are accessible to the actuator:
Actuator /bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\"
Actuator /bin/sh -c \\"/bin/echo 'Died lemonmrd daemon $act_value_01 ' | /bin/mail -s 'Lemon RRD Daemon problem' [email protected]\\"
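One possible reading of how maxruns, timeout and window interact is sketched below in Python. The class, its method names and the returned strings are hypothetical, and the interpretation (at most maxruns attempts inside any window of that many seconds, after which a final alarm is reported) is an assumption; the exception sensor documentation is the authority on the real behaviour.

```python
import subprocess
import sys


class ActuatorRunner:
    """Hypothetical bookkeeping for maxruns, timeout and window."""

    def __init__(self, argv, maxruns=3, timeout=300, window=900):
        self.argv = argv
        self.maxruns = maxruns
        self.timeout = timeout
        self.window = window
        self.run_times = []  # start times of recent attempts

    def may_run(self, now):
        """True while fewer than maxruns attempts fall inside the window."""
        self.run_times = [t for t in self.run_times if now - t < self.window]
        return len(self.run_times) < self.maxruns

    def run(self, now):
        """Run the actuator once; return 'ok', 'failed', 'timeout' or 'final_alarm'."""
        if not self.may_run(now):
            return "final_alarm"
        self.run_times.append(now)
        try:
            # subprocess.run kills the child if it exceeds the timeout.
            done = subprocess.run(self.argv, timeout=self.timeout)
        except subprocess.TimeoutExpired:
            return "timeout"
        return "ok" if done.returncode == 0 else "failed"


# A do-nothing command stands in for /usr/local/sbin/clean-tmp-partition.
runner = ActuatorRunner([sys.executable, "-c", "pass"], maxruns=3, timeout=300, window=900)
outcomes = [runner.run(now=t) for t in (0, 100, 200, 300)]
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would read the clock itself.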

Page 18: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Exception config

"minoccurs", 5
Specifies how many times the exception should occur before the exception is raised.

"local", yes
Technically this value has no effect within the sensor, but it is an instruction to the lemon-agent not to transmit data for this exception to the remote application servers. As remote transmission does not occur, the outcome of the exception can never appear on LAS (Lemon Alarm System) and is only visible locally on the machine using lemon-host-check.

"silent", yes
An exception which is considered silent effectively sets the exception state to the value 2 for all transitions. The exception is disabled: no actuators will run and no alarms will be displayed on the LAS console.
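The effect of minoccurs can be sketched as a small counter in Python. The class below is hypothetical, and it assumes that the occurrences must be consecutive evaluations; the slide only says how many times the exception should occur.

```python
class OccurrenceFilter:
    """Raise only after the correlation has been true `minoccurs` times in a row."""

    def __init__(self, minoccurs=5):
        self.minoccurs = minoccurs
        self.count = 0

    def update(self, correlation_is_true):
        """Record one evaluation and report whether the exception is raised."""
        # Assumption: a single false evaluation resets the counter.
        self.count = self.count + 1 if correlation_is_true else 0
        return self.count >= self.minoccurs


occurrence_filter = OccurrenceFilter(minoccurs=3)
evaluations = (True, True, False, True, True, True, True)
raised = [occurrence_filter.update(result) for result in evaluations]
```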

Page 19: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

LAS

[Architecture diagram; labels: Exception Metrics, Event Management System, Lemon-web, LAS GUI, Lemon Oracle DB, LAS Business Logic (PL/SQL), Operator, Administrator, CDB, SMS]

Page 20: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

LAS

Raw alarm
• represents a processed exception on one of the monitored entities (reported by Lemon). Not all exceptions become an alarm (only those with code 000, 005 or 135), and only those from hosts not in maintenance.

L-alarm
• can represent one or more alarms. It is an item visible on the operator's screen in the LAS GUI.
• every L-alarm must be acknowledged, with a ticket created in ITCM
• states: active, inactive, acknowledge, inhibited
• alarms may be grouped into an L-alarm (by entities, exceptions, cluster rules…)

What is a no_contact alarm?
• The exception is not evaluated on the host! The alarm means that data are not arriving from the host to the DB.
• One of the heartbeat metrics (6335, 6336, 9500, 10005) is reported every 5 min (usually).
• A procedure in the DB checks that the last entry is not older than 10 min (usually).
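The no_contact check amounts to comparing the newest heartbeat timestamp with a maximum age. The Python sketch below shows that logic only; the real check is a procedure inside the database, and the `last_seen` layout as well as the rule that any one of the four heartbeat metrics is sufficient are assumptions.

```python
HEARTBEAT_METRICS = (6335, 6336, 9500, 10005)  # IDs listed on the slide
MAX_AGE = 10 * 60  # seconds; "usually 10 min" on the slide


def no_contact(last_seen, now, max_age=MAX_AGE):
    """Return True when no heartbeat metric has arrived within max_age seconds.

    `last_seen` maps metric ID -> timestamp of the newest sample for one host.
    """
    newest = max(
        (last_seen[m] for m in HEARTBEAT_METRICS if m in last_seen),
        default=None,
    )
    return newest is None or now - newest > max_age


recent = no_contact({9500: 1000}, now=1300)   # heartbeat is 5 minutes old
stale = no_contact({9500: 1000}, now=1700)    # heartbeat is more than 10 minutes old
never = no_contact({}, now=0)                 # host has never reported
```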

Page 21: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

LAS

Page 22: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

LAS

How to avoid an alarm
• disable the exception (e.g. with lemon-host-check)
• make the exception local
• make the exception silent
• set the host to maintenance (actuator will run)

/etc/lemon/exceptions/state.conf

Page 23: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Backup

From now on: backup slides

Page 24: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Sensor parameters

Page 25: Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Smoothing