data centre incident nov 2010 v3
DESCRIPTION
University of Glamorgan's data centre incident.
TRANSCRIPT
Disaster and Recovery
By Alan Davies
Gregynog Colloquium 17th June 2011
TOPICS
Before the Flood
The “Disaster” !
The Recovery
Future
BEFORE SERVER VIRTUALISATION
HOW THE ROOM LOOKED IN 2009
SERVERS
Over 200 standalone
Virtualisation – 200 into 20 will go !
9 new Host Servers, holding 155 Virtual Servers
Power Savings
Space Savings
Resilience ??
STORAGE
60TB of data (100,000 CDs)
10GB per staff
Resilience ??
DATA BACKUP
Disk-to-Disk-to-Tape
40TB Disk capacity
Tape cartridges 1.6TB
48 Cartridge Tape Library
Secure Fireproof Safes
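As a rough check on the figures above, the nominal capacity of the tape library can be worked out from the cartridge size and slot count. This is only a sketch that takes the slide figures at face value (1.6TB per cartridge, 48 cartridges, 60TB of data, 40TB of backup disk). The variable names are illustrative, and real usable capacity depends on compression and retention policy, which the slides do not give.

```python
# Figures taken from the slides; the names are illustrative only.
CARTRIDGE_GB = 1600     # 1.6TB per tape cartridge
LIBRARY_SLOTS = 48      # 48 cartridge tape library
DATA_TB = 60            # total data held on storage
BACKUP_DISK_TB = 40     # disk stage of disk-to-disk-to-tape

# Nominal capacity of one fully loaded library, in TB.
library_tb = CARTRIDGE_GB * LIBRARY_SLOTS / 1000

# How many complete copies of the data one library load could hold.
full_copies_on_tape = library_tb / DATA_TB

# Share of the data that fits on the backup disk stage at once.
disk_fraction = BACKUP_DISK_TB / DATA_TB

print(f"Tape library nominal capacity: {library_tb:.1f} TB")
print(f"Full copies of the data per library load: {full_copies_on_tape:.2f}")
print(f"Share of data that fits on backup disk: {disk_fraction:.0%}")
```

On these numbers a single library load is a little larger than the full data set, so keeping several generations of backup would rely on cartridges being rotated out to the safes.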
ENVIRONMENT CONTROL
Power: UPS, Diesel Generator
Cooling
Humidity !!
SECONDARY DATA CENTRE
THE DISASTER
SUNDAY 28 NOVEMBER
Freezing Temperatures
Rooftop Air Handler
Water, Water, Everywhere !!
WATER TRASHED OUR LOVELY SERVER ROOM !
Backup device survived!! But not the overnight tapes
Library Servers
LET'S BUILD ANOTHER ONE..!
Boxes x 300
Production Line .. bit by bit .... Luverly !
NOW TO RESTORE SERVICES !
University Gold Team (Chaired by the VC)
Business Continuity and Recovery
Prioritising Services
Tracking Progress
Communicating
Regular meetings, 29 Nov to 15 Dec
ISD Contingency Team
Recovery and Business Continuity
Mapping Service Dependencies
Managing Resources (people, procurement, time)
Directing operations
Dealing with Insurance Claim
Lots of staff involved
Everyone in the department had a part to play.
NOW TO RESTORE SERVICES !
Scale of Operation
165 Servers destroyed
121 Live Services
Core Services – 39 (Telephone, Web Site, Email, VLE...)
Non Core Services – 82 (Tills, HR, Invoicing...)
20 Test & Development Environments
Process
Cleaning the room and salvaging equipment
Limiting further risk by removing the cause
Identifying what services were working (not working)
Recovering services by alternative means (where we could)
Procuring equipment prior to the rebuild
Building a new server infrastructure
Recovering services by priority
Keeping the Gold Team informed
NOW TO RESTORE SERVICES !
Timeline
WHAT NEXT ?
Options Paper DISAG
Independent Review Prof David Baker
Secondary Server Room
External Services?
LESSONS LEARNT – MANAGEMENT PERSPECTIVE.
People
Successful recovery is based on staff goodwill, commitment, professionalism.
Having and maintaining good relationships with suppliers.
Having a strong recovery team with management, operational and administration experience.
Having the Gold team to agree priorities.
Everyone wants to help!
Communications
Having a contacts list to get hold of key staff, and key suppliers.
People are patient and will wait for their systems if they understand the situation.
The value of having a staff and student portal (especially when you don’t have it!)
The value of Facebook to get messages out to staff and students.
Sharing personal emails and mobile phone numbers to ease communication.
Communicating ‘what is happening with the recovery process’ is important for your own department staff.
Tempering expectations by communicating the right message to the organisation and customers.
LESSONS LEARNT – MANAGEMENT PERSPECTIVE.
Inventory
Keeping an itemised list of parts of equipment held in your Data Centre will allow you to replace equipment quickly.
Having a list of core services and their dependencies so that you can agree priorities for restoring.
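One way to turn a list of services and their dependencies into an agreed restore order is a priority-aware topological sort. The sketch below is an illustration only and not the process the team used. The service names come from the slides, but the "Network" entry, every dependency, and the priority numbers (0 for infrastructure, 1 for core, 2 for non core) are assumptions made up for the example. It uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> services that must be restored first.
# "Network" and all of the dependencies are invented for illustration.
DEPENDS_ON = {
    "Network": set(),
    "Telephone": {"Network"},
    "Email": {"Network"},
    "Web Site": {"Network"},
    "VLE": {"Network", "Web Site"},
    "HR": {"Network"},
    "Invoicing": {"HR"},
    "Tills": {"Network"},
}

# Assumed priorities: lower number means restore sooner.
PRIORITY = {
    "Network": 0,
    "Telephone": 1, "Email": 1, "Web Site": 1, "VLE": 1,   # core
    "HR": 2, "Invoicing": 2, "Tills": 2,                    # non core
}


def restore_order(depends_on, priority):
    """Return a restore order that respects dependencies and, among the
    services that are currently unblocked, picks the highest priority first
    (ties broken alphabetically)."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()                      # raises CycleError on circular dependencies
    order, ready = [], []
    while sorter.is_active():
        ready.extend(sorter.get_ready())  # services whose dependencies are all restored
        ready.sort(key=lambda name: (priority[name], name))
        service = ready.pop(0)
        order.append(service)
        sorter.done(service)              # unblocks anything waiting on this service
    return order


order = restore_order(DEPENDS_ON, PRIORITY)
print(" -> ".join(order))
```

Keeping the dependency map and the priorities as separate tables means a group such as the Gold Team could change priorities without anyone having to re-check the technical dependencies.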
Resilience
Don’t put all your eggs in one basket.
Don’t keep your backup/restore device in the same building.
Never put equipment in front of a room cooling system which has a fan that is capable of blowing water across the room.
Never assume that, because there is no water in the data centre, water cannot find a way into the building.
Procurement
Having the ability to raise orders quickly.
Using existing framework agreements to reduce time for procurements and European competition.
LESSONS LEARNT – MANAGEMENT PERSPECTIVE.
Operations
Keep a log of all decisions and actions taken.
If there is a risk, don’t delay in dealing with it.
Ensure that every system is backed up.
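The last point can be checked mechanically by comparing the inventory of systems against the list of systems that have a backup job. The following is a minimal sketch under assumed inputs: the function name and both lists are hypothetical, and in practice the lists would come from an asset register and from the backup software.

```python
def systems_without_backup(inventory, backed_up):
    """Return the systems in the inventory that no backup job covers, sorted by name."""
    return sorted(set(inventory) - set(backed_up))


# Hypothetical example data.
inventory = ["email", "hr", "tills", "vle", "web-site"]
backup_jobs = ["email", "vle", "web-site"]

missing = systems_without_backup(inventory, backup_jobs)
if missing:
    print("No backup configured for:", ", ".join(missing))
else:
    print("Every system in the inventory has a backup job.")
```

Running a check like this on a schedule, rather than once, would catch new systems that are added to the inventory without a backup job.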
THE FUTURE - HOW IT LOOKS TODAY.
AN IT INFRASTRUCTURE INCIDENT
Any Questions?