monitoring and fault tolerance
![Page 1: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/1.jpg)
Monitoring and Fault Tolerance
Helge Meinhard / CERN-IT
OpenLab workshop
08 July 2003
![Page 2: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/2.jpg)
Monitoring and Fault Tolerance: Context
[Diagram: Node, Configuration System, Monitoring System, Installation System, Fault Mgmt System]
![Page 3: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/3.jpg)
History (1)
In the 1990s, “massive” deployments of Unix boxes required automated monitoring of system state
Answer: SURE
- Pure exception/alarm system
- No archiving of values, hence not useful for performance monitoring
- Not scalable to O(1000) nodes
![Page 4: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/4.jpg)
History (2)
PEM project at CERN (1999/2000) took fresh look at fabric mgmt, in particular monitoring
PEM tool survey: Commercial tools found not flexible enough and too expensive; free solutions not appropriate
Architecture, design and implementation from scratch
![Page 5: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/5.jpg)
History (3)
2001 - 2003: European DataGrid project with work package on Fabric Management
Subtasks: configuration, installation, monitoring, fault tolerance, resource management, gridification
Profited from PEM work, developed ideas further
![Page 6: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/6.jpg)
History (4)
In 2001, some doubts about ‘do-it-all-ourselves’ approach of EDG WP4
Parallel to EDG WP4, project launched to investigate whether commercial SCADA system could be used
Architecture deliberately kept similar to WP4
![Page 7: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/7.jpg)
Monitoring and FT architecture (1)
Monitoring: Captures non-intrusively actual state of a system (supposed not to change its state)
Fault Tolerance: Reads and correlates data from monitoring system, triggers corrective actions (state-changing)
![Page 8: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/8.jpg)
Monitoring and FT architecture (2)
[Diagram labels: Monitoring Sensor Agent (MSA); Sensor (x3); Local cache; Local consumers (x3); API; MR; DB; API]
MR – Monitoring Repository
WP4: MR code with lower layer as flat file archive, or using Oracle
CCS: PVSS system
![Page 9: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/9.jpg)
Monitoring and FT architecture (3)
MSA controls communication with Monitoring Repository, configures sensors, requests samples, listens to sensors
Sensors send metrics on request or spontaneously to MSA
Communication MSA – MR: UDP or TCP based
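
The UDP variant of the MSA-to-repository link can be illustrated with a minimal sketch. The slides do not describe the wire format, so the JSON payload, the field names, and the helper functions `encode_sample`, `send_sample` and `demo_roundtrip` below are all hypothetical; a localhost socket stands in for the Monitoring Repository.

```python
import json
import socket
import time


def encode_sample(node: str, metric: str, value: float, timestamp: float) -> bytes:
    """Serialise one metric sample as a JSON datagram (hypothetical wire format)."""
    return json.dumps(
        {"node": node, "metric": metric, "value": value, "ts": timestamp}
    ).encode("utf-8")


def decode_sample(payload: bytes) -> dict:
    """Inverse of encode_sample, as a repository-side receiver might use it."""
    return json.loads(payload.decode("utf-8"))


def send_sample(sock, repo_addr, node, metric, value, timestamp=None) -> None:
    """Agent side: push one sample to the repository as a single UDP datagram."""
    ts = time.time() if timestamp is None else timestamp
    sock.sendto(encode_sample(node, metric, value, ts), repo_addr)


def demo_roundtrip() -> dict:
    """Send one sample to a stand-in repository socket bound on localhost."""
    repo = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    repo.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    repo.settimeout(2.0)
    agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        send_sample(agent, repo.getsockname(), "node001", "load_avg", 0.42,
                    timestamp=1057622400.0)
        payload, _ = repo.recvfrom(4096)
        return decode_sample(payload)
    finally:
        agent.close()
        repo.close()


if __name__ == "__main__":
    print(demo_roundtrip())
```

UDP keeps the agent cheap and non-blocking but gives no delivery guarantee; a TCP transport would trade some overhead for reliable delivery.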
![Page 10: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/10.jpg)
Monitoring and FT architecture (4)
FT system subscribing to metrics from monitoring subsystem
Rule-based correlation engine takes decisions on firing actuators
Actuators controlled by Actuator Agent, all actions logged by monitoring system
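
The subscribe, correlate and act cycle described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the WP4 implementation: the `Rule` and `CorrelationEngine` classes, the metric names and the actuator name are hypothetical, and the in-memory `action_log` merely stands in for logging actions through the monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    # Condition evaluated over the latest known value of every metric.
    condition: Callable[[Dict[str, float]], bool]
    actuator: str  # name of the actuator to fire when the condition holds


class CorrelationEngine:
    """Keeps the latest metric values and fires actuators when rules match."""

    def __init__(self, rules: List[Rule], actuators: Dict[str, Callable[[], None]]):
        self.rules = rules
        self.actuators = actuators
        self.latest: Dict[str, float] = {}
        self.action_log: List[str] = []  # stand-in for the monitoring system log

    def on_metric(self, name: str, value: float) -> List[str]:
        """Called for each subscribed metric update; returns names of fired rules."""
        self.latest[name] = value
        fired = []
        for rule in self.rules:
            # Rules are re-evaluated on every update, so a rule fires again
            # for as long as its condition keeps holding.
            if rule.condition(self.latest):
                self.actuators[rule.actuator]()
                self.action_log.append(f"{rule.name} -> {rule.actuator}")
                fired.append(rule.name)
        return fired
```

A rule that correlates two metrics could then be declared as `Rule("daemon_down", lambda m: m.get("daemon_up", 1) == 0 and m.get("load_avg", 0) < 5, "restart_daemon")`, with `restart_daemon` registered as a callable in the `actuators` dictionary.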
![Page 11: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/11.jpg)
Deployment (1)
End 2001: Put early versions of MSA and sensors on big clusters (~800 Linux machines), sending data (~100 metrics per machine, 1/min…1/day) to a PVSS-based repository
At the same time, ~300 machines started sending performance metrics into flat file WP4 repository
![Page 12: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/12.jpg)
Deployment (2)
Sensors more refined over time (metrics added according to operational needs)
Both exception and performance oriented sensors now deployed in parallel (some 150 metrics per node)
More special machines added, currently ~1500 machines being monitored
Test in May 2003: some 500 metric changes per second into the repository (~150 changes/s after “smoothing”)
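
The slide does not define "smoothing". One common interpretation, assumed here, is a deadband filter that forwards a new value only when it differs from the last stored value by at least a threshold. The `DeadbandFilter` class and the threshold below are hypothetical and only illustrate how such a filter can cut the rate of changes reaching a repository.

```python
from typing import Dict


class DeadbandFilter:
    """Suppress metric updates that stay within a threshold of the last stored value."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_stored: Dict[str, float] = {}

    def accept(self, metric: str, value: float) -> bool:
        """Return True if the value should be written to the repository."""
        previous = self.last_stored.get(metric)
        if previous is None or abs(value - previous) >= self.threshold:
            self.last_stored[metric] = value  # remember only values that were stored
            return True
        return False


if __name__ == "__main__":
    smoothing = DeadbandFilter(threshold=0.5)
    readings = [1.0, 1.1, 1.2, 2.0, 2.1, 1.4]
    stored = [v for v in readings if smoothing.accept("load_avg", v)]
    print(stored)
```

Comparing against the last stored value, rather than the last seen value, prevents a slow drift from going unrecorded.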
![Page 13: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/13.jpg)
Deployment (3)
Repository requirements:
- Repository API implementation
- Oracle based
- Fully functional alarm display for operators
Currently using both an Oracle-MR based repository, and a PVSS based one
Operators using PVSS based alarm screen as alternative to SURE display
![Page 14: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/14.jpg)
Deployment (4)
Interfaces: C API available, simple command line interface by end July, prototype Web access to time series of a metric available
Fault tolerance: Just starting to look at WP4 prototype
Configuration of monitoring: ad-hoc, to be migrated to CDB
![Page 15: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/15.jpg)
Outlook
Near term:
- Production services for LCG-1
- Add more machines (e.g. network), metrics
- Software and service monitoring
Medium term (end 2003): Monitoring for Solaris and Windows, …
2004 or 2005: Review of chosen solution for monitoring and FT
- Some of 1999 arguments no longer valid
- Will look at commercial and freeware solutions
![Page 16: Monitoring and Fault Tolerance](https://reader035.vdocuments.site/reader035/viewer/2022072111/56812b67550346895d8f8822/html5/thumbnails/16.jpg)
Machine control
High level: interplay of State Management System, Configuration Management, Monitoring, Fault Tolerance, …
Low level:
- Past: CPU boxes didn’t have anything (5 rolling tables with monitors and keyboards per 500…1000 machines), disk and tape servers with analog KVM switches
- Future: Have investigated various options, benefit/cost analysis. Will go to serial consoles on all machines, 1 head node per 50…100 machines with serial multiplexers