EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 1
WP4 demonstration
Fabric Monitoring and Fault Tolerance
Sylvain Chapeland
Lord Hess
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 2
Workload Management (WP1)
Data Management (WP2)
Storage Element (WP5)
Fabric Management (WP4)
Networking (WP7)
Information Service (WP3)
Fabric Monitoring and Fault Tolerance in the global picture
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 3
Outline
System architecture
  Fabric Monitoring
  Fault Tolerance
Demonstration
  Hardware setup
  Use case
Summary
Questions
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 4
Fabric Monitoring architecture
[Diagram: Sensors on the monitored nodes feed the Monitoring Sensor Agent (MSA), which has a local cache. The Measurement Repository (MR) stores the data in a database. Consumers read the data. Steps are numbered 1 to 3.]
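The data path named on this slide (sensor, MSA with cache, repository, consumer) can be illustrated with a minimal Python sketch. All class and method names here are hypothetical and chosen only for illustration; they are not the EDG WP4 interfaces, and the real system transports data between nodes rather than calling objects in one process.

```python
import time
from collections import defaultdict

# A sample is a (timestamp, metric name, value) tuple.

class MeasurementRepository:
    """Hypothetical stand-in for the MR: stores samples and notifies consumers."""

    def __init__(self):
        self.history = defaultdict(list)  # metric name -> list of samples
        self.consumers = []               # callables invoked on each new sample

    def subscribe(self, consumer):
        self.consumers.append(consumer)

    def store(self, sample):
        self.history[sample[1]].append(sample)
        for consumer in self.consumers:
            consumer(sample)


class MonitoringSensorAgent:
    """Hypothetical stand-in for the MSA: runs sensors, caches, forwards."""

    def __init__(self, repository):
        self.repository = repository
        self.sensors = {}  # metric name -> zero-argument callable
        self.cache = []    # local copy of every sample taken on this node

    def register(self, metric, sensor):
        self.sensors[metric] = sensor

    def sample_once(self, now=None):
        timestamp = time.time() if now is None else now
        for metric, sensor in self.sensors.items():
            sample = (timestamp, metric, sensor())
            self.cache.append(sample)      # keep a local copy
            self.repository.store(sample)  # forward to the repository
```

A consumer in this sketch is any callable passed to `subscribe`, for example `print` for a crude status display.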
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 5
Fault Tolerance architecture
[Diagram, local node: Sensors and the MSA with its cache provide the monitoring data. The Fault Tolerance daemon (FTd) contains a decision unit, rules, and an actuator agent that drives the actuators.]
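As a rough illustration of how a decision unit, rules, and actuators could fit together, here is a small Python sketch. The rule structure and every name in it are assumptions made for this example; the slides do not show the FTd rule format.

```python
class Rule:
    """Hypothetical rule: when `condition(value)` holds for `metric`, run `actuator`."""

    def __init__(self, metric, condition, actuator):
        self.metric = metric
        self.condition = condition
        self.actuator = actuator  # name of an entry in the actuator table


class FaultToleranceDaemon:
    """Hypothetical stand-in for the FTd decision unit and actuator agent."""

    def __init__(self, actuators):
        self.actuators = actuators  # actuator name -> zero-argument callable
        self.rules = []

    def add_rule(self, rule):
        self.rules.append(rule)

    def on_sample(self, metric, value):
        """Evaluate all rules against one monitoring value; return actuators fired."""
        fired = []
        for rule in self.rules:
            if rule.metric == metric and rule.condition(value):
                self.actuators[rule.actuator]()
                fired.append(rule.actuator)
        return fired
```

In this sketch the FTd is simply another consumer of monitoring data: `on_sample` would be called for each new metric value.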
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 6
Demonstration setup
[Diagram: a laptop and two beamers (Beamer 1, Beamer 2) show the slides, the monitoring data, and the shells. The monitored node and the server node run the MSA, FTd, MR, and FT rule editor.]
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 7
Demonstration
Use case based on daemon monitoring
Fabric Monitoring
  Check a daemon status with the monitoring system while killing and restarting it
Fault Tolerance
  Edit a rule to restart the daemon automatically
  Kill the daemon while following its status in monitoring
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 8
The MSA monitors the status of a daemon. The information is propagated to the repository and to consumers.
[Diagram: Monitored node (daemon, MSA) and Server node (MR). Flow: Check (daemon status ok), Transport, Store, Notify (daemon ok). Status display: consumer application connected to the repository.]
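A daemon status check of this kind could look like the following Python sketch. It assumes a POSIX system and that the daemon's process ID is already known; the function name and the "ok"/"dead" values are illustrative, not taken from the actual WP4 sensor.

```python
import os

def daemon_status(pid, probe=os.kill):
    """Return "ok" if a process with this PID exists, else "dead".

    POSIX only: sending signal 0 performs the existence check without
    delivering a signal. `probe` is injectable so the function can be
    exercised without a real process.
    """
    try:
        probe(pid, 0)
    except ProcessLookupError:
        return "dead"  # no such process
    except PermissionError:
        return "ok"    # process exists but belongs to another user
    return "ok"
```

Calling `daemon_status(os.getpid())` reports the current process as alive.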
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 9
When the daemon is killed, the MSA updates the daemon status in the repository. Consumers are notified of the new metric value.
[Diagram: Monitored node (daemon, MSA) and Server node (MR). Flow: Kill (from a shell), Check (daemon status not ok), Transport, Store, Notify (daemon dead). Status display: consumer application connected to the repository.]
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 10
A manual operation is required to return to normal status.
[Diagram: Monitored node (daemon, MSA) and Server node (MR). Flow: Relaunch (from a shell), Check (daemon status ok), Transport, Store, Notify (daemon ok). Status display: consumer application connected to the repository.]
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 11
A rule is added to restart the daemon automatically when it is dead.
[Diagram: Monitored node and Server node, with web browser, rule editor, and FTd. The rule editor is accessed by web browser (HTTP); the rule is transported to the FTd.]
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 12
When the daemon is killed, the FTd is notified and triggers the recovery action specified in the rule.
[Diagram: Monitored node (daemon, MSA, FTd with rule) and Server node (MR). Flow: Kill (from a shell), Check, Transport, Store, Notify (daemon dead), Notify FTd, rule, Relaunch (daemon ok). Status display: consumer application connected to the repository.]
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 13
Recovery actions are also fed back into the monitoring system.
[Diagram: Monitored node (daemon, MSA, FTd) and Server node (MR). Flow: Log (from the FTd), Transport, Store, Notify. Messages shown: daemon dead, daemon restarted, daemon ok. Log viewer: consumer application connected to the repository.]
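The automatic recovery sequence (daemon dead, recovery action, action logged, daemon ok) can be simulated in a few lines of Python. This is a toy model under stated assumptions: `FakeDaemon`, `check`, and `monitoring_cycle` are invented for the example, and a plain list stands in for the repository that a log viewer would read.

```python
class FakeDaemon:
    """Toy daemon whose state can be toggled, standing in for a real process."""

    def __init__(self):
        self.running = True

    def kill(self):
        self.running = False

    def relaunch(self):
        self.running = True


def check(daemon):
    """Sensor-style check returning the metric value."""
    return "ok" if daemon.running else "dead"


def monitoring_cycle(daemon, rules, log):
    """Run one check; if a rule matches the status, act and log the action.

    `rules` maps a status value to a recovery callable (hypothetical format).
    """
    status = check(daemon)
    log.append(("status", status))
    action = rules.get(status)
    if action is not None:
        action()
        log.append(("action", action.__name__))  # recovery action fed back
        log.append(("status", check(daemon)))    # status after recovery
    return log
```

With the rule `{"dead": daemon.relaunch}`, one cycle after a kill records the dead status, the relaunch action, and the recovered status in order.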
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 14
Metric history is available in the measurement repository.
[Diagram: Monitored node (MSA) and Server node (MR). The history is viewed in a web browser over HTTP.]
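To make the idea of a metric history concrete, here is a minimal Python sketch of a time-range query over stored samples. The sample layout (timestamp, metric name, value) and the function name are assumptions for illustration; the slides do not describe the repository's query interface.

```python
def metric_history(samples, metric, start, end):
    """Return the samples for `metric` with start <= timestamp <= end, oldest first.

    Each sample is a (timestamp, metric name, value) tuple.
    """
    selected = [s for s in samples if s[1] == metric and start <= s[0] <= end]
    return sorted(selected, key=lambda s: s[0])
```

A web front end could call such a function and render the result as a table or plot.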
EU 2nd Year Review – 04-05 Feb. 2003 – WP4 demo – n° 15
Summary
Monitoring system to get live status of a node
  Centralization of data
  Measurements available remotely
Fault Tolerance as monitoring data consumer
  Rule editing for recovery actions
  Automatic actions taken according to monitoring status
Deployment status
  Monitoring agent runs in production mode on ~1000 nodes in the CERN computer center
  Will be available in the next EDG release