A Service-Based SLA Model
Overview
Facility operations is a manpower-intensive activity at the RACF.
Sub-groups responsible for systems within the facility (tape storage, disk storage, Linux farm, grid computing, network, etc.):
Software upgrades
Hardware lifecycle management
Integrity of facility services
User account lifecycle management
Cyber-security
Experience with RHIC operations for the past 9 years.
Support for ATLAS Tier 1 facility operations.
Experience with RHIC Operations
24x7 year-round operations since 2000.
Facility systems classified into 3 categories: non-essential, essential and critical.
Response to system failure depends on component classification:
Critical components are covered 24x7 year-round. Immediate response is expected from on-call staff.
Essential components have built-in redundancy/duplication and are addressed the next business day. Escalated to "critical" if a large number of essential components fail and compromise service availability.
Non-essential components are addressed the next business day.
Staff provides primary coverage during normal business hours.
Operators contact on-call person during off-hours and weekends.
Experience with RHIC Operations (cont.)
Users report problems via ticket system, pagers and/or phone.
Monitoring software instrumented with alarm system.
Alarm system connected to selected pagers and cell phones.
Limited alarm escalation procedure (i.e., contact back-up if primary is not available) during off-hours and weekends.
Periodic rotation of primary and back-up on-call list for each subsystem.
Automatic response to alarm conditions in certain cases (e.g., shutdown of Linux Farm cluster in case of cooling failure).
Facility operations at RHIC have worked well over the past 8 years.
Service Level Agreement

Service                Server             Rank  Comments
Network to Ring                           1
Internal Network                          1
External Network                          1     ITD handles
RCF firewall                              1     ITD handles
HPSS                   rmdsXX             1
AFS Server             rafsXX             1
AFS File systems                          1
NFS Server                                1
NFS home directories   rmineXX            1
CRS Management         rcrsfm, rcras      1     rcrsfm is 1, rcras is 2
Web server (internet)  www.rhic.bnl.gov   1
Web server (intranet)  www.rcf.bnl.gov    1
NFS data disks         rmineXX            1
Instrumentation                           2
SAMBA                  rsmb00
DNS                    rnisXX             2     Should fail over
NIS                    rnisXX             2     Should fail over
NTP                    rnisXX             2     Should fail over
RCF gateways                              2     Multiple gateway machines
ADSM backup                               2
Wincenter              rnts00             2/3
CRS Farm                                  2
LSF                    rlsf00             2
CAS Farm                                  2
rftp                                      2
Oracle                                    2
Objectivity                               2
MySQL                                     2
Email                                     2/3
Printers                                  3
A New Operational Model for the RACF
RHIC facility operations follows a system-based approach.
Some systems support more than one service, and some services depend on multiple systems, leading to unclear lines of responsibility.
Service-based operational approach better suited for distributed computing environment in ATLAS.
Tighter integration of monitoring, alarm mechanism and problem tracking – automate where possible.
Define a system and service dependency matrix.
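A dependency matrix like the one above can be sketched as a simple mapping from each service to the systems it relies on. The service and system names below are illustrative placeholders, not the RACF's actual inventory:

```python
# Hypothetical service-to-system dependency matrix.
# Names are invented for illustration only.
DEPENDENCIES = {
    "web":   {"nfs_server", "dns", "network"},
    "batch": {"nfs_server", "lsf", "network"},
    "email": {"dns", "network"},
}

def affected_services(failed_system, matrix=DEPENDENCIES):
    """Return the services impacted when a given system fails."""
    return sorted(s for s, deps in matrix.items() if failed_system in deps)

print(affected_services("dns"))  # ['email', 'web']
```

Inverting the matrix this way is what gives a service-based model its value: a single system failure immediately identifies every affected service and, with it, the responsible Service Coordinator.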
Monitoring in the new SLA
Monitor service and system availability, system performance and facility infrastructure (power, cooling, network).
Mixture of open-source and RACF-written components:
Nagios
Infrastructure
Condor
RT
Choices guided by desired features: historical logs, ease of integration with other software, support from open-source community, ease of configuration, etc.
Nagios
Monitor service availability.
Host-based daemons configured to use externally-supplied “plugins” to obtain service status.
Host-based alarm response customized (e-mail notification, system reboot, etc).
Connected to RT ticketing system for alarm logging and escalation.
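Nagios plugins follow a simple convention: print one status line and exit with code 0, 1, 2 or 3 (OK, WARNING, CRITICAL, UNKNOWN). A minimal sketch of an externally-supplied plugin that checks a TCP service might look like the following; the host and port are placeholders, not an actual RACF check:

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios-style plugin: print one status line,
# exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN.
import socket
import sys

def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, status_line) for a TCP connectivity check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 0, "OK - %s:%d is accepting connections" % (host, port)
    except OSError as err:
        return 2, "CRITICAL - %s:%d unreachable (%s)" % (host, port, err)

if __name__ == "__main__":
    status, message = check_tcp("localhost", 22)  # placeholder target
    print(message)
    sys.exit(status)
```

Because the plugin interface is just exit codes and text, host-based daemons can run the same check unchanged whether the response is an e-mail, a reboot, or an RT ticket.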
Infrastructure (Cooling)
The growth of the RACF has put considerable strain on power and cooling.
UPS back-up power for RACF equipment.
Custom RACF-written script to monitor power and cooling issues.
Alarm logging and escalation through RT ticketing system.
Controlled automatic shutdown of Linux Farm during cooling or power failures.
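The automatic-response idea can be sketched as a monitoring pass that maps a temperature reading to one of three actions. This is an illustration of the logic only, not the actual RACF script, and the threshold values are invented:

```python
# Sketch of automatic response to a cooling failure.
# Threshold values are illustrative, not the RACF's settings.
SHUTDOWN_THRESHOLD_C = 35.0

def respond_to_cooling(temp_c, threshold=SHUTDOWN_THRESHOLD_C):
    """Map a machine-room temperature reading to a response action."""
    if temp_c >= threshold:
        return "shutdown"   # controlled Linux Farm shutdown
    elif temp_c >= threshold - 5:
        return "alarm"      # log and escalate via the ticketing system
    return "ok"
```

Keeping an "alarm" band below the shutdown threshold gives on-call staff a window to intervene before the automatic shutdown fires.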
Infrastructure (Network)
Use of Cacti to monitor network traffic and performance.
Can be used at switch or system level.
Historical information and logs.
To be instrumented with alarms and integrated into alarm logging and escalation.
Condor
Condor does not have a native monitoring interface.
RACF created its own web-based monitoring interface.
Interface used by staff for performance tuning.
Connected to RT for alarm logging and escalation.
Monitoring functions:
Throughput
Service availability
Configuration optimization
RT
Flexible ticketing system.
Historical records available.
Coupled to monitoring software for alarm logging and escalation.
Integrated in service-based SLA.
Implementing new SLA
Create Alarm Management Layer (AML) to interface monitoring to RT.
Alarm conditions configurable via custom-written rule engine.
Clearer lines of responsibility for creating, maintaining and responding to alarms.
AML creates RT ticket in appropriate category and keeps track of responses.
AML escalates alarm when RT ticket is not addressed within (configurable) amount of time.
Service Coordinators oversee management of service alarms.
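The escalation rule described above reduces to a simple time check: a ticket that has received no response within its configured window gets escalated. The sketch below illustrates that logic; the field names and signature are assumptions, not the AML's actual API:

```python
# Illustrative sketch of the AML escalation check.
# Field names and signature are assumptions for illustration.
from datetime import datetime, timedelta

def needs_escalation(ticket_opened, last_update, response_minutes, now=None):
    """True if the RT ticket is unanswered past its configured window."""
    now = now or datetime.utcnow()
    deadline = ticket_opened + timedelta(minutes=response_minutes)
    unanswered = last_update is None or last_update <= ticket_opened
    return unanswered and now > deadline
```

Making `response_minutes` a rule-file parameter (as in the example configuration later in this talk) lets each service set its own work-hours and after-hours windows without code changes.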
What data is logged?
Host, service, host group, and service group
Alarm timestamp
NRPE (Nagios) message content
Alarm status
Notification status
RT ticket status (new, open, resolved)
Timestamp of latest RT update
Due date
RT ticket information (number, queue, owner, priority, etc)
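The fields above suggest one logged record per alarm. A sketch of such a record as a Python dataclass follows; the types and defaults are assumptions for illustration, not the AML's actual schema:

```python
# Sketch of one logged alarm record; types/defaults are assumptions.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AlarmRecord:
    host: str
    service: str
    host_group: str
    service_group: str
    alarm_time: datetime            # alarm timestamp
    nrpe_message: str               # NRPE (Nagios) message content
    alarm_status: str               # e.g. CRITICAL / WARNING / OK
    notified: bool                  # notification status
    ticket_status: str              # new / open / resolved
    last_rt_update: Optional[datetime] = None
    due_date: Optional[datetime] = None
    rt_ticket: Optional[dict] = None  # number, queue, owner, priority, ...
```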
Example Configuration (rule) File
[linuxfarm-testrule]
host: testhost(\d) (Regular expression compatible)
service: condorq, condor
hostgroup: any
queue: Test
after_hours_PageTime: 30
work_hours_PageTime: 60
work_hours_response_time: 120 (When does the problem need to be resolved by)
after_hours_response_time: 720 (When does the problem need to be resolved by)
auto_up: 1 (Page people)
down_hosts: 2 (Number of down hosts to be a real problem)
firstContact: test-person@pager
secondContact: [email protected]
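A rule file in this `[section]` / `key: value` shape can be loaded with Python's standard `configparser`, which accepts both ":" and "=" as delimiters, while the regular-expression `host` field is matched with `re`. The sketch below is one plausible way a rule engine might read it, not the AML's actual implementation; the embedded rule mirrors the example above, minus the inline annotations:

```python
# Sketch of loading and matching a rule file; not the actual rule engine.
import configparser
import io
import re

RULE_TEXT = """\
[linuxfarm-testrule]
host: testhost(\\d)
service: condorq, condor
queue: Test
work_hours_response_time: 120
down_hosts: 2
firstContact: test-person@pager
"""

parser = configparser.ConfigParser()
parser.read_file(io.StringIO(RULE_TEXT))
rule = parser["linuxfarm-testrule"]

def rule_matches(rule, host):
    """True if the rule's host pattern matches the given host name."""
    return re.fullmatch(rule["host"], host) is not None

print(rule_matches(rule, "testhost3"))  # True
print(rule.getint("down_hosts"))        # 2
```

Keeping the patterns and thresholds in a plain text file means alarm behavior can be retuned per service without touching the monitoring code.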
Summary
Well-established procedures from RHIC operational experience.
Need service-based SLA for distributed computing environment.
Create Alarm Management Layer (AML) to integrate RT with monitoring tools and create clearer lines of responsibilities for staff.
Some features already functional.
Expect full implementation by late summer 2008.