Lesson learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions
INFN-T1 on-call procedure
On-call service
• CNAF staff on-call on a weekly basis
– 2-3 times per year each
– Must live within 30 min of CNAF
– Service phone receiving alarm SMSes (a minimal alerting sketch follows this list)
– Periodic training on security and intervention procedures
• 3 incidents in the last three years
– only this last one required the site to be totally powered off
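As an illustration of the alarm path, here is a minimal threshold-alert sketch; the temperature source, the threshold value and the 'sms-gateway' command are assumptions for the example, not CNAF's actual setup:

    import subprocess

    TEMP_LIMIT_C = 30.0        # assumed alarm threshold, not CNAF's real value
    ONCALL_PHONE = "+39..."    # placeholder for the service phone number

    def read_room_temp():
        # Placeholder: a real setup would query the facility monitoring
        # system; a fixed value keeps the sketch self-contained.
        return 31.2

    def send_sms(number, text):
        # Placeholder: 'sms-gateway' stands in for whatever SMS gateway
        # command or API the site actually uses.
        subprocess.run(["sms-gateway", "send", number, text], check=True)

    temp = read_room_temp()
    if temp > TEMP_LIMIT_C:
        send_sms(ONCALL_PHONE, f"ALARM: computing room at {temp:.1f} degC")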
Service Dashboard
Incident
What happened on the 9th of March
• 1.08am: fire alarm
– On-call person intervenes and calls the firefighters
• 2.45am: fire extinguished
• 3.18am: high temperature warning
– Air conditioning blocked
– On-call person calls for help
• 4.40am: decision taken to shut down the center
• 12.00pm: chiller under maintenance
• 5.00pm: chiller fixed, center can be turned back on
• 9.00pm: farm back on-line, waiting for storage
10th of March
• 9.00am: support call to switch storage back on
• 6.00pm: center open again for the LHC experiments
• Next day: center fully open again
Chiller power supply
Incident representation
[Diagram: the six chillers and the control-system head, with the chillers' control logic fed by two power supplies (Ctrl sys Pow 1 and Ctrl sys Pow 2)]
Incident examination
• 6 chillers for the computing room
• 5 share the same power supply for the control logic (we did not know that! see the audit sketch below)
• Fire in one of the control logics cut power to 5 chillers out of 6
– 1 chiller was still working and we weren't aware of that!
– Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
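The shared control-logic feed was exactly the kind of hidden single point of failure a simple inventory audit can surface. A minimal sketch, assuming a hand-maintained mapping from each chiller's control logic to its power feed (names are hypothetical):

    from collections import defaultdict

    # Hypothetical inventory: chiller -> power feed of its control logic.
    CONTROL_POWER_FEED = {
        "chiller1": "pow1", "chiller2": "pow1", "chiller3": "pow1",
        "chiller4": "pow1", "chiller5": "pow1", "chiller6": "pow2",
    }

    def single_points_of_failure(feed_map, max_share=1):
        """Return feeds whose failure would take out more than max_share chillers."""
        by_feed = defaultdict(list)
        for chiller, feed in feed_map.items():
            by_feed[feed].append(chiller)
        return {feed: sorted(chillers)
                for feed, chillers in by_feed.items()
                if len(chillers) > max_share}

    for feed, chillers in single_points_of_failure(CONTROL_POWER_FEED).items():
        print(f"WARNING: feed '{feed}' powers {len(chillers)} control logics: {chillers}")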
Facility monitoring app
Chiller n.4
[Plot legend: black = electric power in (kW); blue = water temp. in (°C); yellow = water temp. out (°C); cyan = chiller room temp. (°C)]
Incident seen from inside
Incident seen from outside
Recovery Procedure
Recovery procedure
• Facility: support call for an emergency intervention on the chillers
– recovered the burned bus and the control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
– Switch-on order: LSF server, CEs, UIs (a power-on sketch follows this list)
– For a moment we were thinking about upgrading to LSF 9
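The switch-on order can be captured in a script so it is reproducible under pressure. A minimal sketch, assuming hypothetical hostnames and an IPMI soft power-on as the mechanism (the BMC naming, user and password file are placeholders, not the site's documented setup):

    import subprocess

    # Hypothetical host groups; the order mirrors the slide:
    # LSF server first, then Computing Elements, then User Interfaces.
    POWER_ON_ORDER = [
        ("lsf-server", ["lsf01.example"]),
        ("computing-elements", ["ce01.example", "ce02.example"]),
        ("user-interfaces", ["ui01.example", "ui02.example"]),
    ]

    def power_on(host):
        # Assumed mechanism: soft power-on via the node's BMC with ipmitool.
        subprocess.run(["ipmitool", "-I", "lanplus", "-H", f"{host}-ipmi",
                        "-U", "admin", "-f", "/etc/ipmi/passfile",
                        "chassis", "power", "on"], check=True)

    for group, hosts in POWER_ON_ORDER:
        print(f"powering on group: {group}")
        for host in hosts:
            power_on(host)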
Failures (1)
• Old WNs
– BIOS battery exhausted, configuration reset
• PXE boot, hyper-threading, disk configuration (AHCI)
– lost IPMI configuration (30% broken; an audit sketch follows)
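Since 30% of the old WNs came back with broken IPMI, a quick farm-wide audit helps scope the damage. A minimal sketch using ipmitool (a real tool; the BMC hostnames, user and password file are hypothetical placeholders):

    import subprocess

    # Hypothetical worker-node BMC addresses to audit after the power event.
    WN_BMCS = ["wn001-ipmi.example", "wn002-ipmi.example"]

    def ipmi_reachable(bmc):
        """Return True if the BMC answers a basic LAN configuration query."""
        # 'ipmitool lan print 1' dumps the LAN channel settings; a timeout or
        # non-zero exit suggests the node's IPMI configuration was lost.
        try:
            result = subprocess.run(
                ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", "admin",
                 "-f", "/etc/ipmi/passfile", "lan", "print", "1"],
                capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    broken = [bmc for bmc in WN_BMCS if not ipmi_reachable(bmc)]
    print(f"{len(broken)}/{len(WN_BMCS)} nodes with unreachable IPMI: {broken}")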
Failures (2)
• Some storage controllers were replaced
• 1% of PCI cards (mainly 10Gbit network) were replaced
• Disks, power supplies and network switches were almost entirely undamaged
What we learned
We fixed our weak point
[Diagram: the fixed layout, with each of the six chillers' control logic now fed by its own power supply (Ctrl sys Pow 1 through Pow 6) under the control-system head]
We lack an emergency button
• Shutting the center down is not easy: a real "emergency shutdown" procedure is missing
– We could have avoided switching off the whole center if we had had more control
– Depending on the incident, some services may be left on-line
• The person on-call can't know all the site details
Hosted services
• Our computing room hosts services and nodes outside our direct supervision, over which it's difficult to gain full control
– We need an emergency procedure for those too
– We need a better understanding of the SLAs
Conclusions
We benchmarked ourselves
• It took 2 days to get the center back on-line
– less than one to reopen to the LHC experiments
– everyone was aware of what to do
– all working nodes rebooted with a solid configuration
– a few nodes were reinstalled and put back on-line in a few minutes
Lesson learned
• We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now)
– The new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure (a sketch follows this list)
– Establish a shutdown order
• WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
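A controlled shutdown can be encoded the same way as the power-on order, so the on-call person runs a script instead of recalling the sequence under pressure. A minimal sketch with a dry-run switch; the host groups are hypothetical and the ssh power-off is an assumed mechanism, not the site's documented one:

    import subprocess

    # Hypothetical host groups, in the shutdown order from the slide: WNs first,
    # then disk-servers, grid and non-grid services, bastions, network switches.
    SHUTDOWN_ORDER = [
        ("worker-nodes", ["wn001.example", "wn002.example"]),
        ("disk-servers", ["ds01.example"]),
        ("grid-services", ["ce01.example", "ui01.example"]),
        ("non-grid-services", ["svc01.example"]),
        ("bastions", ["bastion01.example"]),
        ("network-switches", ["sw01.example"]),
    ]

    def shutdown_center(dry_run=True):
        for group, hosts in SHUTDOWN_ORDER:
            for host in hosts:
                if dry_run:
                    print(f"[dry-run] would power off {host} ({group})")
                else:
                    # Assumed mechanism: remote poweroff over ssh; an IPMI
                    # soft power-off would work as well for IPMI-capable gear.
                    subprocess.run(["ssh", f"root@{host}", "poweroff"], check=True)
            print(f"group '{group}' completed")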
Testing shutdown procedure
• The shutdown procedure we are implementing can't be easily tested
• How to perform a "simulation"?
– It doesn't sound right to switch the center off just to prove we can do it safely (a low-risk rehearsal is sketched below)
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?
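One low-risk rehearsal is to run the sketch above in dry-run mode: it walks the full shutdown order and prints every action without powering anything off. It is only a partial test, since it never proves the hosts actually honour the commands, but it does validate the ordering and the host lists:

    shutdown_center(dry_run=True)   # prints the planned sequence, touches nothing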