data centre incident nov 2010 v3

Disaster and Recovery

By Alan Davies

Gregynog Colloquium 17th June 2011

TOPICS

Before the Flood

The “Disaster” !

The Recovery

Future

BEFORE SERVER VIRTUALISATION

HOW THE ROOM LOOKED IN 2009

SERVERS

Over 200 standalone

Virtualisation – 200 into 20 will go !

9 new Host Servers, holding 155 Virtual Servers

Power Savings

Space Savings

Resilience ??

STORAGE

60TB of data (100,000 CDs)

10GB per staff

Resilience ??

DATA BACKUP

Disk-to-Disk-to-Tape

40TB Disk capacity

Tape cartridges 1.6TB

48 Cartridge Tape Library

Secure Fireproof Safes
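The storage and backup figures above can be cross-checked with some simple arithmetic. The sketch below uses only the numbers quoted on the slides and assumes decimal units (1 TB = 1000 GB); the slides do not say whether the 1.6TB cartridge figure is native or compressed capacity, so the results are indicative only.

```python
import math

# Figures quoted on the slides (decimal units are an assumption).
DATA_TB = 60            # total data held
CARTRIDGE_GB = 1600     # 1.6TB per tape cartridge
LIBRARY_SLOTS = 48      # cartridges in the tape library
CD_COUNT = 100_000      # the "100,000 CDs" comparison


def library_capacity_tb():
    """Total capacity of a fully loaded tape library, in TB."""
    return CARTRIDGE_GB * LIBRARY_SLOTS / 1000


def cartridges_for_full_copy():
    """Cartridges needed to hold one full copy of the data."""
    return math.ceil(DATA_TB * 1000 / CARTRIDGE_GB)


def implied_cd_size_mb():
    """CD size implied by equating 60TB with 100,000 CDs, in MB."""
    return DATA_TB * 1_000_000 / CD_COUNT


if __name__ == "__main__":
    print("Library capacity (TB):", library_capacity_tb())
    print("Cartridges for one full copy:", cartridges_for_full_copy())
    print("Implied CD size (MB):", implied_cd_size_mb())
```

On these assumptions one full copy of the data needs most of the library's slots, and the CD comparison implies roughly 600MB per disc.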

ENVIRONMENT CONTROL

Power: UPS, Diesel Generator

Cooling

Humidity !!

SECONDARY DATA CENTRE

THE DISASTER

SUNDAY 28 NOVEMBER

Freezing Temperatures

Rooftop Air Handler

Water, Water, Everywhere !!

WATER TRASHED OUR LOVELY SERVER ROOM !

Backup Device survived!! But not the overnight tapes

WATER TRASHED OUR LOVELY SERVER ROOM !

Library Servers

LET’S BUILD ANOTHER ONE..!

LET’S BUILD ANOTHER ONE..!

Boxes x 300

LET’S BUILD ANOTHER ONE..!

Production Line .. bit by bit .... Luverly !

NOW TO RESTORE SERVICES !

University Gold Team (Chaired by the VC)
Business Continuity and Recovery
Prioritising Services
Tracking Progress
Communicating
Regular meetings, 29 Nov to 15 Dec

ISD Contingency Team
Recovery and Business Continuity
Mapping Service Dependencies
Managing Resources (people, procurement, time)
Directing operations
Dealing with Insurance Claim

Lots of staff involved
Everyone in the department had a part to play.

NOW TO RESTORE SERVICES !

Scale of Operation

165 Servers destroyed

121 Live Services
Core Services – 39 (Telephone, Web Site, Email, VLE...)
Non Core Services – 82 (Tills, HR, Invoicing...)

20 Test & Development Environments

Process
Cleaning the room and salvaging equipment
Limiting further risk by removing the cause
Identifying which services were working (and which were not)
Recovering services by alternative means (where we could)
Procuring equipment prior to the rebuild
Building a new server infrastructure
Recovering services by priority
Keeping the Gold Team informed

NOW TO RESTORE SERVICES !

Timeline

WHAT NEXT ?

Options Paper DISAG

Independent Review Prof David Baker

Secondary Server Room

External Services?

LESSONS LEARNT – MANAGEMENT PERSPECTIVE.

People
Successful recovery is based on staff goodwill, commitment and professionalism.
Having and maintaining good relationships with suppliers.
Having a strong recovery team with management, operational and administration experience.
Having the Gold Team to agree priorities.
Everyone wants to help!

Communications
Having a contacts list to get hold of key staff and key suppliers.
People are patient and will wait for their systems if they understand the situation.
The value of having a staff and student portal (especially when you don’t have it!)
The value of Facebook to get messages out to staff and students.
Sharing personal emails and mobile phone numbers to ease communication.
Communicating ‘what is happening with the recovery process’ is important for your own department staff.
Tempering expectations by communicating the right message to the organisation and customers.

LESSONS LEARNT – MANAGEMENT PERSPECTIVE.

Inventory

Keeping an itemised list of parts of equipment held in your Data Centre will allow you to replace equipment quickly.

Having a list of core services and their dependencies so that you can agree priorities for restoring.
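A list of services and their dependencies can be turned into a restore order mechanically. The sketch below is one possible approach, not the method used in the recovery: it uses Python's standard `graphlib` module (Python 3.9+) to restore services in dependency order, taking core services first within each group whose dependencies are already met. The dependency map is entirely hypothetical; Email, Web Site, VLE and HR are service names from the slides, while Network, Storage and Directory are illustrative stand-ins.

```python
from graphlib import TopologicalSorter

# Hypothetical map: service -> services that must be restored first.
DEPENDENCIES = {
    "Network": set(),
    "Storage": set(),
    "Directory": {"Network"},
    "Web Site": {"Network", "Storage"},
    "Email": {"Directory", "Storage"},
    "VLE": {"Directory", "Web Site"},
    "HR": {"Directory", "Storage"},
}

# Services treated as core for this illustration.
CORE = {"Email", "Web Site", "VLE"}


def restore_order(dependencies, core):
    """Return a restore order that respects dependencies.

    Services become eligible once everything they depend on is restored.
    Within each eligible group, core services come first, then the rest,
    alphabetically in both cases.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda s: (s not in core, s))
        for service in ready:
            order.append(service)
            sorter.done(service)
    return order


if __name__ == "__main__":
    for step, service in enumerate(restore_order(DEPENDENCIES, CORE), start=1):
        print(step, service)
```

Keeping the map as plain data means it can be agreed with the priority-setting group in advance and re-run whenever a service or dependency changes.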

Resilience
Don’t put all your eggs in one basket.
Don’t keep your backup/restore device in the same building as your data centre.
Never put equipment in front of a room cooling system which has a fan that is capable of blowing water across the room.
Never assume that because there is no water in the data centre, water cannot find a way into the building.

Procurement
Having the ability to raise orders quickly.
Using existing framework agreements to reduce the time needed for procurement and European competition.

LESSONS LEARNT – MANAGEMENT PERSPECTIVE.

Operations
Keep a log of all decisions and actions taken.
If there is a risk, don’t delay in dealing with it.
Ensure that every system is backed up.

THE FUTURE - HOW IT LOOKS TODAY.

HOW IT LOOKS TODAY.

AN IT INFRASTRUCTURE INCIDENT

Any Questions?
