Lesson learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions
INFN-T1 on-call procedure
On-call service
• CNAF staff on-call on a weekly basis
– 2-3 times per year each
– Must live within 30 min of CNAF
– Service phone receiving alarm SMSes (a minimal alerting sketch follows this list)
– Periodic training on security and intervention procedures
• 3 incidents in the last three years
– only this last one required the site to be totally powered off
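As an illustration of the alarm path, here is a minimal threshold-alert sketch; the temperature source, the threshold value and the 'sms-gateway' command are assumptions for the example, not CNAF's actual setup:

    import subprocess

    TEMP_LIMIT_C = 30.0        # assumed alarm threshold, not CNAF's real value
    ONCALL_PHONE = "+39..."    # placeholder for the service phone number

    def read_room_temp():
        # Placeholder: a real setup would query the facility monitoring
        # system; a fixed value keeps the sketch self-contained.
        return 31.2

    def send_sms(number, text):
        # Placeholder: 'sms-gateway' stands in for whatever SMS gateway
        # command or API the site actually uses.
        subprocess.run(["sms-gateway", "send", number, text], check=True)

    temp = read_room_temp()
    if temp > TEMP_LIMIT_C:
        send_sms(ONCALL_PHONE, f"ALARM: computing room at {temp:.1f} degC")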
Service Dashboard
Incident
What happened on the 9th of March
• 1.08am: fire alarm
– On-call person intervenes and calls the firefighters
• 2.45am: fire extinguished
• 3.18am: high temperature warning
– Air conditioning blocked
– On-call person calls for help
• 4.40am: decision taken to shut down the center
• 12.00pm: chiller under maintenance
• 5.00pm: chiller fixed, center can be turned back on
• 9.00pm: farm back on-line, waiting for storage
10th of March
• 9.00am: support call to switch storage back on
• 6.00pm: center open again for the LHC experiments
• Next day: center fully open again
Chiller power supply
Incident representation
[Diagram: the six chillers and the control-system head, with the chillers' control logic fed by two power supplies (Ctrl sys Pow 1 and Ctrl sys Pow 2)]
Incident examination
• 6 chillers for the computing room
• 5 share the same power supply for the control logic (we did not know that! see the audit sketch below)
• Fire in one of the control logics cut power to 5 chillers out of 6
– 1 chiller was still working and we weren't aware of that!
– Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
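The shared control-logic feed was exactly the kind of hidden single point of failure a simple inventory audit can surface. A minimal sketch, assuming a hand-maintained mapping from each chiller's control logic to its power feed (names are hypothetical):

    from collections import defaultdict

    # Hypothetical inventory: chiller -> power feed of its control logic.
    CONTROL_POWER_FEED = {
        "chiller1": "pow1", "chiller2": "pow1", "chiller3": "pow1",
        "chiller4": "pow1", "chiller5": "pow1", "chiller6": "pow2",
    }

    def single_points_of_failure(feed_map, max_share=1):
        """Return feeds whose failure would take out more than max_share chillers."""
        by_feed = defaultdict(list)
        for chiller, feed in feed_map.items():
            by_feed[feed].append(chiller)
        return {feed: sorted(chillers)
                for feed, chillers in by_feed.items()
                if len(chillers) > max_share}

    for feed, chillers in single_points_of_failure(CONTROL_POWER_FEED).items():
        print(f"WARNING: feed '{feed}' powers {len(chillers)} control logics: {chillers}")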
Facility monitoring app
Chiller n.4
[Plot legend: black = electric power in (kW); blue = water temp. in (°C); yellow = water temp. out (°C); cyan = chiller room temp. (°C)]
Incident seen from inside
Incident seen from outside
Recovery Procedure
Recovery procedure
• Facility: support call for an emergency intervention on the chillers
– recovered the burned bus and the control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
– Switch-on order: LSF server, CEs, UIs (a power-on sketch follows this list)
– For a moment we were thinking about upgrading to LSF 9
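The switch-on order can be captured in a script so it is reproducible under pressure. A minimal sketch, assuming hypothetical hostnames and an IPMI soft power-on as the mechanism (the BMC naming, user and password file are placeholders, not the site's documented setup):

    import subprocess

    # Hypothetical host groups; the order mirrors the slide:
    # LSF server first, then Computing Elements, then User Interfaces.
    POWER_ON_ORDER = [
        ("lsf-server", ["lsf01.example"]),
        ("computing-elements", ["ce01.example", "ce02.example"]),
        ("user-interfaces", ["ui01.example", "ui02.example"]),
    ]

    def power_on(host):
        # Assumed mechanism: soft power-on via the node's BMC with ipmitool.
        subprocess.run(["ipmitool", "-I", "lanplus", "-H", f"{host}-ipmi",
                        "-U", "admin", "-f", "/etc/ipmi/passfile",
                        "chassis", "power", "on"], check=True)

    for group, hosts in POWER_ON_ORDER:
        print(f"powering on group: {group}")
        for host in hosts:
            power_on(host)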
Failures (1)
• Old WNs
– BIOS battery exhausted, configuration reset
• PXE boot, hyper-threading, disk configuration (AHCI)
– lost IPMI configuration (30% broken; an audit sketch follows)
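Since 30% of the old WNs came back with broken IPMI, a quick farm-wide audit helps scope the damage. A minimal sketch using ipmitool (a real tool; the BMC hostnames, user and password file are hypothetical placeholders):

    import subprocess

    # Hypothetical worker-node BMC addresses to audit after the power event.
    WN_BMCS = ["wn001-ipmi.example", "wn002-ipmi.example"]

    def ipmi_reachable(bmc):
        """Return True if the BMC answers a basic LAN configuration query."""
        # 'ipmitool lan print 1' dumps the LAN channel settings; a timeout or
        # non-zero exit suggests the node's IPMI configuration was lost.
        try:
            result = subprocess.run(
                ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", "admin",
                 "-f", "/etc/ipmi/passfile", "lan", "print", "1"],
                capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    broken = [bmc for bmc in WN_BMCS if not ipmi_reachable(bmc)]
    print(f"{len(broken)}/{len(WN_BMCS)} nodes with unreachable IPMI: {broken}")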
Failures (2)
• Some storage controllers were replaced
• 1% of PCI cards (mainly 10Gbit network) were replaced
• Disks, power supplies and network switches were almost entirely undamaged
What we learned
We fixed our weak point
[Diagram: the fixed layout, with each of the six chillers' control logic now fed by its own power supply (Ctrl sys Pow 1 through Pow 6) under the control-system head]
We lack an emergency button
• Shutting the center down is not easy: a real "emergency shutdown" procedure is missing
– We could have avoided switching off the whole center if we had had more control
– Depending on the incident, some services may be left on-line
• The person on-call can't know all the site details
Hosted services
• Our computing room hosts services and nodes outside our direct supervision, over which it's difficult to gain full control
– We need an emergency procedure for those too
– We need a better understanding of the SLAs
Conclusions
We benchmarked ourselves
• It took 2 days to get the center back on-line
– less than one to reopen to the LHC experiments
– everyone was aware of what to do
– all working nodes rebooted with a solid configuration
– a few nodes were reinstalled and put back on-line in a few minutes
Lesson learned
• We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now)
– The new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure (a sketch follows this list)
– Establish a shutdown order
• WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
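A controlled shutdown can be encoded the same way as the power-on order, so the on-call person runs a script instead of recalling the sequence under pressure. A minimal sketch with a dry-run switch; the host groups are hypothetical and the ssh power-off is an assumed mechanism, not the site's documented one:

    import subprocess

    # Hypothetical host groups, in the shutdown order from the slide: WNs first,
    # then disk-servers, grid and non-grid services, bastions, network switches.
    SHUTDOWN_ORDER = [
        ("worker-nodes", ["wn001.example", "wn002.example"]),
        ("disk-servers", ["ds01.example"]),
        ("grid-services", ["ce01.example", "ui01.example"]),
        ("non-grid-services", ["svc01.example"]),
        ("bastions", ["bastion01.example"]),
        ("network-switches", ["sw01.example"]),
    ]

    def shutdown_center(dry_run=True):
        for group, hosts in SHUTDOWN_ORDER:
            for host in hosts:
                if dry_run:
                    print(f"[dry-run] would power off {host} ({group})")
                else:
                    # Assumed mechanism: remote poweroff over ssh; an IPMI
                    # soft power-off would work as well for IPMI-capable gear.
                    subprocess.run(["ssh", f"root@{host}", "poweroff"], check=True)
            print(f"group '{group}' completed")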
Testing shutdown procedure
• The shutdown procedure we are implementing can't be easily tested
• How to perform a "simulation"?
– It doesn't sound right to switch the center off just to prove we can do it safely (a low-risk rehearsal is sketched below)
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?
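One low-risk rehearsal is to run the sketch above in dry-run mode: it walks the full shutdown order and prints every action without powering anything off. It is only a partial test, since it never proves the hosts actually honour the commands, but it does validate the ordering and the host lists:

    shutdown_center(dry_run=True)   # prints the planned sequence, touches nothing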