EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 1 WP4 demonstration Fabric Monitoring and Fault Tolerance Sylvain Chapeland Lord Hess


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 1

WP4 demonstration: Fabric Monitoring and Fault Tolerance

Sylvain Chapeland
Lord Hess


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 2

Fabric Monitoring and Fault Tolerance in the global picture

[Diagram: the project work packages]
Workload Management (WP1)
Data Management (WP2)
Storage Element (WP5)
Fabric Management (WP4)
Networking (WP7)
Information Service (WP3)


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 3

Outline

System architecture
  Fabric Monitoring
  Fault Tolerance
Demonstration
  Hardware setup
  Use case
Summary
Questions


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 4

Fabric Monitoring architecture

[Diagram. Monitored nodes: several sensors attached to a Monitoring Sensor Agent (MSA), with a cache and a consumer on the node. Measurement Repository (MR): a database with consumers attached. Parts of the figure are numbered 1, 2 and 3.]
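The data flow in this architecture can be sketched roughly as follows. This is only an illustration, not the WP4 implementation: the class names, the `Sample` fields and the `daemon_status` metric name are all assumptions made for the example. Sensors are plain callables, the agent keeps the latest value in a local cache, and the repository stores each sample and notifies subscribed consumers.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Sample(NamedTuple):
    """One measurement (field names are assumptions for this sketch)."""
    node: str
    metric: str
    timestamp: float
    value: str


class MeasurementRepository:
    """Hypothetical stand-in for the MR: stores samples, notifies consumers."""

    def __init__(self) -> None:
        self.database: List[Sample] = []
        self.consumers: Dict[str, List[Callable[[Sample], None]]] = defaultdict(list)

    def subscribe(self, metric: str, callback: Callable[[Sample], None]) -> None:
        self.consumers[metric].append(callback)

    def store(self, sample: Sample) -> None:
        self.database.append(sample)
        for callback in self.consumers[sample.metric]:
            callback(sample)


class MonitoringSensorAgent:
    """Hypothetical stand-in for the MSA: runs sensors, caches, forwards."""

    def __init__(self, node: str, repository: MeasurementRepository) -> None:
        self.node = node
        self.repository = repository
        self.sensors: Dict[str, Callable[[], str]] = {}
        self.cache: Dict[str, Sample] = {}  # latest sample per metric, readable on the node

    def register(self, metric: str, sensor: Callable[[], str]) -> None:
        self.sensors[metric] = sensor

    def collect(self, now: float) -> None:
        """Run every sensor once, update the cache, forward to the repository."""
        for metric, sensor in self.sensors.items():
            sample = Sample(self.node, metric, now, sensor())
            self.cache[metric] = sample
            self.repository.store(sample)


# Usage: one node, one sensor, one remote consumer collecting notifications.
mr = MeasurementRepository()
seen: List[Sample] = []
mr.subscribe("daemon_status", seen.append)

msa = MonitoringSensorAgent("node01", mr)
msa.register("daemon_status", lambda: "ok")
msa.collect(now=1.0)
```

Keeping the cache on the agent side means a local consumer can read the latest value without a round trip to the repository, which is one plausible reading of the cache box in the figure.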


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 5

Fault Tolerance architecture

[Diagram. Local node: sensors attached to the MSA and its cache. Fault Tolerance daemon (FTd): a decision unit, fed by monitoring data and by rules, and an actuator agent that drives an actuator.]
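A minimal sketch of how a decision unit might match monitoring values against rules and hand work to an actuator. The `Rule` fields, the `DecisionUnit` class and the `restart_daemon` actuator are hypothetical names chosen for the example; the real FTd rule format is not shown in these slides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical rule: when `metric` takes `bad_value`, run `actuator`."""
    metric: str
    bad_value: str
    actuator: Callable[[], str]  # recovery action, returns a short outcome message


class DecisionUnit:
    """Hypothetical decision unit: evaluates rules on each incoming metric value."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules

    def on_metric(self, metric: str, value: str) -> List[str]:
        """Run the actuator of every rule that matches; return their outcomes."""
        return [
            rule.actuator()
            for rule in self.rules
            if rule.metric == metric and rule.bad_value == value
        ]


# Usage: a fake daemon that the actuator "restarts" by flipping a flag.
daemon = {"running": False}


def restart_daemon() -> str:
    daemon["running"] = True
    return "daemon restarted"


ftd = DecisionUnit([Rule("daemon_status", "dead", restart_daemon)])
outcomes_ok = ftd.on_metric("daemon_status", "ok")      # no rule matches
outcomes_dead = ftd.on_metric("daemon_status", "dead")  # the restart rule fires
```

Separating the rule (what to watch) from the actuator (what to do) mirrors the split between decision unit and actuator agent in the figure.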


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 6

Demonstration setup

[Diagram. A laptop drives Beamer 1 and Beamer 2, which show the slides, the monitoring data and the shells. Monitored node: MSA, FTd. Server node: MR. Also shown: FT rule editor.]


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 7

Demonstration

Use case based on daemon monitoring

Fabric Monitoring
  Check a daemon's status with the monitoring system while killing and restarting it
Fault Tolerance
  Edit a rule to restart the daemon automatically
  Kill the daemon while following its status in monitoring


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 8

The MSA monitors a daemon's status. The information is propagated to the repository and to consumers.

[Diagram. Monitored node: the MSA checks the daemon (check: ok). Transport to the server node, where the MR stores the "daemon status" metric and notifies consumers ("daemon ok"). Status display: consumer application connected to the repository.]
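As an illustration of what a daemon-status check could look like, the sketch below tests whether a process ID still exists. How the real sensor performs its check is not stated in the slides; the PID-based approach, the function names and the "ok"/"dead" values are assumptions for the example, and the code is POSIX-only.

```python
import os
import subprocess
import sys
from typing import List


def daemon_status(pid: int) -> str:
    """Return "ok" if a process with this PID exists, otherwise "dead"."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return "dead"
    except PermissionError:
        return "ok"  # the process exists but belongs to another user
    return "ok"


def demo() -> List[str]:
    """Start a throwaway process, kill it, and sample the status before and after."""
    daemon = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    statuses = [daemon_status(daemon.pid)]
    daemon.kill()
    daemon.wait()  # reap the child so its PID really disappears
    statuses.append(daemon_status(daemon.pid))
    return statuses
```

Calling `demo()` reproduces the first part of the use case in miniature: one sample while the process runs, one after it has been killed.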


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 9

When the daemon is killed, the MSA updates the daemon status in the repository. Consumers are notified of the new metric value.

[Diagram. Monitored node: a shell kills the daemon and the MSA check returns "not ok". Transport to the server node, where the MR stores the "daemon status" metric and notifies consumers ("daemon dead"). Status display: consumer application connected to the repository.]


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 10

A manual operation is required to get back to normal status.

[Diagram. Monitored node: a shell relaunches the daemon and the MSA check returns "ok". Transport to the server node, where the MR stores the "daemon status" metric and notifies consumers ("daemon ok"). Status display: consumer application connected to the repository.]


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 11

A rule is added to restart the daemon automatically when it is dead.

[Diagram. Monitored node and server node. A web browser reaches the rule editor over HTTP; the rule is transported to the FTd. Rule editor accessed by web browser.]


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 12

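When the daemon is killed, the FTd is notified and triggers the recovery action specified in the rule.

[Diagram. Monitored node: a shell kills the daemon; the MSA check reports "daemon dead", the FTd is notified, applies the rule and relaunches the daemon, after which the status is "daemon ok" again. Server node: each value is transported to the MR, stored, and consumers are notified. Status display: consumer application connected to the repository.]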


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 13

Recovery actions are also fed back to the monitoring.

[Diagram. Monitored node: MSA, FTd and the daemon. The FTd logs its action; the entries "daemon dead" and "daemon restarted" are transported to the MR on the server node, stored, and consumers are notified. Log viewer: consumer application connected to the repository.]
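One way to picture this feedback loop: the outcome of each recovery action is itself reported as a record, so a log viewer sees it next to the ordinary metrics. The function name, the record layout and the message strings below are assumptions for this sketch, not the FTd's actual log format.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[float, str, str]  # (timestamp, node, message), a hypothetical layout


def recover_and_report(
    node: str,
    now: float,
    actuator: Callable[[], None],
    report: Callable[[LogEntry], None],
) -> bool:
    """Run a recovery action and report its outcome, whether it worked or not."""
    try:
        actuator()
        message, succeeded = "daemon restarted", True
    except Exception as error:  # a failed recovery is reported too
        message, succeeded = f"restart failed: {error}", False
    report((now, node, message))
    return succeeded


# Usage: a plain list stands in for the repository that the log viewer reads.
log: List[LogEntry] = []


def failing_restart() -> None:
    raise RuntimeError("no such service")


recover_and_report("node01", 2.0, lambda: None, log.append)
recover_and_report("node01", 3.0, failing_restart, log.append)
```

Reporting failures as well as successes keeps the log useful for the cases where automatic recovery does not work.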


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 14

Metric history is available in the measurement repository.

[Diagram. Monitored node: MSA. Server node: MR. A web browser reads the history from the MR over HTTP. History on web browser.]
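A history lookup of this kind amounts to a time-range query over stored samples. The sketch below shows one simple way to do it in memory; the `MetricHistory` class and its methods are hypothetical, and the real repository's storage and query interface are not described in the slides.

```python
import bisect
from typing import List, Tuple


class MetricHistory:
    """Hypothetical time-ordered store for one metric on one node."""

    def __init__(self) -> None:
        self._times: List[float] = []
        self._values: List[str] = []

    def append(self, timestamp: float, value: str) -> None:
        # Assumes samples arrive in time order, as a periodic sensor would produce them.
        self._times.append(timestamp)
        self._values.append(value)

    def between(self, start: float, end: float) -> List[Tuple[float, str]]:
        """Return the samples with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


# Usage: the daemon dies at t=10 and is back at t=20.
history = MetricHistory()
for timestamp, value in [(0.0, "ok"), (10.0, "dead"), (20.0, "ok")]:
    history.append(timestamp, value)

window = history.between(5.0, 20.0)
```

Because the timestamps stay sorted, both ends of the window are found by binary search rather than by scanning every sample.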


EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 15

Summary

Monitoring system to get the live status of a node
  Centralization of data
  Measurements available remotely
Fault Tolerance as a consumer of monitoring data
  Rule editing for recovery actions
  Automatic actions taken according to monitoring status
Deployment status
  Monitoring agent runs in production mode on ~1000 nodes in the CERN computer center
  Will be available in the next EDG release