lesson learned after our recent cooling problem

Lesson learned after our recent cooling problem Michele Onofri, Stefano Zani, Andrea Chierici HEPiX Spring 2014


Page 1: Lesson learned after our recent cooling problem

Lesson learned after our recent cooling problem

Michele Onofri, Stefano Zani, Andrea Chierici

HEPiX Spring 2014


Andrea Chierici 2

Outline

• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions

21/05/2013


INFN-T1 on-call procedure


On-call service

• CNAF staff on-call on a weekly basis
  – 2/3 times per year
  – Must live within 30 min from CNAF
  – Service phone receiving alarm SMSes
  – Periodic training on security and intervention procedures
• 3 incidents in last three years
  – only this last one required the site to be totally powered off


Service Dashboard


Incident


What happened on the 9th of March

• 1.08am: fire alarm
  – On-call person intervenes and calls the firefighters
• 2.45am: fire extinguished
• 3.18am: high temp warning
  – Air conditioning blocked
  – On-call person calls for help
• 4.40am: decision is taken to shut down the center
• 12.00pm: chiller under maintenance
• 5.00pm: chiller fixed, center can be turned back on
• 9.00pm: farm back on-line, waiting for storage
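To put the timeline in perspective, the elapsed time between events can be worked out with a few lines of Python. This is only a sketch: the year is an assumption (the slide gives just the day and month), and the event labels are shortened from the bullets above.

```python
from datetime import datetime

# Assumption: the slide only says "9th of March"; the year is not stated.
ASSUMED_YEAR = 2014

# (24-hour time, event) pairs transcribed from the timeline above.
EVENTS = [
    ("01:08", "fire alarm"),
    ("02:45", "fire extinguished"),
    ("03:18", "high temperature warning"),
    ("04:40", "decision to shut down the center"),
    ("12:00", "chiller under maintenance"),
    ("17:00", "chiller fixed"),
    ("21:00", "farm back on-line"),
]


def parse(hhmm):
    """Turn an HH:MM string into a datetime on the day of the incident."""
    t = datetime.strptime(hhmm, "%H:%M")
    return t.replace(year=ASSUMED_YEAR, month=3, day=9)


def minutes_between(start, end):
    """Whole minutes elapsed between two HH:MM times on the same day."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


def elapsed_since_alarm():
    """List (event, minutes since the fire alarm) for every event."""
    first = EVENTS[0][0]
    return [(label, minutes_between(first, hhmm)) for hhmm, label in EVENTS]


if __name__ == "__main__":
    for label, minutes in elapsed_since_alarm():
        print(f"+{minutes // 60:02d}h{minutes % 60:02d}m  {label}")
```

For instance, `minutes_between("03:18", "04:40")` gives the time between the high temperature warning and the shutdown decision.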


10th of March

• 9.00am: support call to switch storage back on
• 6.00pm: center open again for LHC experiments
• Next day: center fully open again


Chiller power supply


Incident representation


[Diagram: Chiller 1 to Chiller 6 under the Control System Head; the control systems (Ctrl sys) are fed by only two power supplies, Pow 1 and Pow 2]


Incident examination

• 6 chillers for the computing room
• 5 share the same power supply for the control logic (we did not know that!)
• A fire in one of the control logic units cut power to 5 chillers out of 6
  – 1 chiller was still working and we weren’t aware of that!
  – Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
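The hidden dependency described above is the kind of thing a small audit over a wiring inventory could surface. The sketch below is illustrative only: the slides say that 5 of the 6 chillers shared one control-logic power supply but not which ones, so the mapping and the names (`pow1`, `pow2`, `chiller1`, ...) are hypothetical.

```python
from collections import defaultdict

# Hypothetical wiring: which chillers sat on the shared supply is not stated
# in the slides, so this assignment is for illustration only.
CONTROL_POWER = {
    "chiller1": "pow1",
    "chiller2": "pow1",
    "chiller3": "pow1",
    "chiller4": "pow1",
    "chiller5": "pow1",
    "chiller6": "pow2",
}


def chillers_by_supply(wiring):
    """Group chillers by the supply feeding their control logic."""
    groups = defaultdict(list)
    for chiller, supply in wiring.items():
        groups[supply].append(chiller)
    return dict(groups)


def survivors_if_lost(wiring, failed_supply):
    """Chillers whose control logic keeps power if one supply fails."""
    return sorted(c for c, s in wiring.items() if s != failed_supply)


def worst_case(wiring):
    """Return (supply, survivors) for the failure leaving the fewest chillers."""
    supplies = sorted(set(wiring.values()))
    outcomes = ((s, survivors_if_lost(wiring, s)) for s in supplies)
    return min(outcomes, key=lambda pair: len(pair[1]))


if __name__ == "__main__":
    supply, left = worst_case(CONTROL_POWER)
    print(f"Losing {supply} leaves {len(left)} chiller(s): {left}")
```

With one supply per chiller, the same check would report five survivors for any single supply failure.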


Facility monitoring app


Chiller n.4


[Plot legend] BLACK: electric power in (kW); BLUE: water temp. in (°C); YELLOW: water temp. out (°C); CYAN: Ch. room temp. (°C)
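Series like the ones in this plot, electric power in particular, could be used to derive a simple "working / not working" status per chiller. The following is a minimal sketch under stated assumptions: the threshold and the sample readings are made-up values for illustration, not measurements from the incident, and the function names are hypothetical.

```python
# Assumption: a chiller drawing at least this much power is considered running.
# The value is invented for the example; a real one would come from the
# chiller's datasheet or from historical readings.
RUNNING_POWER_KW = 5.0


def running_chillers(power_kw, threshold=RUNNING_POWER_KW):
    """Names of chillers whose electric power draw is at or above the threshold."""
    return sorted(name for name, kw in power_kw.items() if kw >= threshold)


def summary(power_kw, threshold=RUNNING_POWER_KW):
    """One-line status suitable for a dashboard or an alarm SMS."""
    up = running_chillers(power_kw, threshold)
    names = ", ".join(up) or "none"
    return f"{len(up)}/{len(power_kw)} chillers running: {names}"


if __name__ == "__main__":
    # Illustrative readings only (kW), not data from the incident.
    sample = {"chiller1": 0.0, "chiller2": 0.0, "chiller3": 0.0,
              "chiller4": 0.0, "chiller5": 0.0, "chiller6": 42.0}
    print(summary(sample))
```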


Incident seen from inside


Incident seen from outside


Recovery Procedure


Recovery procedure

• Facility: support call for an emergency intervention on the chiller
  – restored the burned bus and the control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to nodes
  – Switch-on order: LSF server, CEs, UIs
  – For a moment we were thinking about upgrading to LSF 9


Failures (1)

• Old WNs
  – BIOS battery exhausted, configuration reset
    • PXE boot, hyper-threading, disk configuration (AHCI)
  – lost IPMI configuration (30% broken)
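Nodes whose BIOS fell back to defaults can be spotted by comparing reported settings against the intended ones. The sketch below assumes the settings are already collected into dictionaries; the key names, the desired values and the node names are hypothetical, since the slide only names the affected settings.

```python
# Assumption: desired values are invented for the example. The slide lists the
# settings that were reset (PXE boot, hyper-threading, AHCI disk mode) but not
# what each one should be.
EXPECTED = {"pxe_boot": True, "hyper_threading": True, "disk_mode": "AHCI"}


def config_drift(reported, expected=EXPECTED):
    """Map each drifted setting to (expected, reported); missing keys count as drift."""
    return {
        key: (want, reported.get(key))
        for key, want in expected.items()
        if reported.get(key) != want
    }


def nodes_needing_fix(inventory, expected=EXPECTED):
    """Return {node: drift} for every node with at least one drifted setting."""
    result = {}
    for node, reported in inventory.items():
        drift = config_drift(reported, expected)
        if drift:
            result[node] = drift
    return result


if __name__ == "__main__":
    # Hypothetical inventory: wn002 looks like a node reset to factory defaults.
    inventory = {
        "wn001": {"pxe_boot": True, "hyper_threading": True, "disk_mode": "AHCI"},
        "wn002": {"pxe_boot": False, "hyper_threading": True, "disk_mode": "IDE"},
    }
    for node, drift in nodes_needing_fix(inventory).items():
        print(node, drift)
```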


Failures (2)

• Some storage controllers were replaced

• 1% of PCI cards (mainly 10Gbit network) replaced

• Disks, power supplies and network switches suffered almost no damage


What we learned


We fixed our weak point


[Diagram: Chiller 1 to Chiller 6 under the Control System Head; six control systems (Ctrl sys), each with its own power supply, Pow 1 to Pow 6]


We are missing an emergency button

• Shutting the center down is not easy: a real “emergency shutdown” procedure is missing
  – We could have avoided switching off the whole center if we had had more control
  – Depending on the incident, some services may be left on-line
• The person on-call can’t know all the site details


Hosted services

• Our computing room hosts services and nodes outside our direct supervision, over which it’s difficult to have full control
  – We need an emergency procedure for those too
  – We need a better understanding of the SLAs


Conclusions


We benchmarked ourselves


• It took 2 days to get the center back on-line
  – less than one to reopen for LHC experiments
  – everyone knew what to do
  – All working nodes rebooted with a solid configuration
  – A few nodes were reinstalled and put back on-line in a few minutes


Lesson learned

• We must have clearer evidence of which chillers are working at any moment (the on-call person does not have it right now)
  – The new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure
  – Establish a shutdown order
    • WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
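The shutdown order above can be written down as data, so that a plan is generated rather than remembered. This is a sketch only, not the task force's implementation: the role labels, host names and the `shutdown_plan` function are hypothetical, and grid and non-grid services are merged into a single `service` stage because the slide lists them together.

```python
# Hypothetical role labels for the groups named in the slide, in shutdown order.
SHUTDOWN_ORDER = ["wn", "disk-server", "service", "bastion", "network-switch"]


def shutdown_plan(hosts):
    """Group hosts into ordered stages.

    `hosts` maps a host name to its role. Hosts with an unknown role raise an
    error instead of being placed in a guessed stage.
    """
    rank = {role: i for i, role in enumerate(SHUTDOWN_ORDER)}
    unknown = sorted(h for h, role in hosts.items() if role not in rank)
    if unknown:
        raise ValueError(f"no shutdown stage for: {', '.join(unknown)}")
    stages = [[] for _ in SHUTDOWN_ORDER]
    for host, role in sorted(hosts.items()):
        stages[rank[role]].append(host)
    # Drop empty stages so the plan lists only what actually needs doing.
    return [(role, names) for role, names in zip(SHUTDOWN_ORDER, stages) if names]


if __name__ == "__main__":
    # Illustrative inventory; real host names would come from the site database.
    hosts = {
        "sw1": "network-switch",
        "wn2": "wn",
        "wn1": "wn",
        "ds1": "disk-server",
        "ce1": "service",
        "bastion1": "bastion",
    }
    for step, (role, names) in enumerate(shutdown_plan(hosts), start=1):
        print(f"{step}. {role}: {', '.join(names)}")
```

Refusing unknown roles is a deliberate choice here: hosted nodes outside direct supervision would show up as errors instead of being silently skipped.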


Testing shutdown procedure

• The shutdown procedure we are implementing can’t be easily tested
• How to perform a “simulation”?
  – It doesn’t sound right to switch the center off just to prove we can do it safely
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?
