disaster recovery exercise report - globalnoc at...
Post on 26-Apr-2018
249 Views
Preview:
TRANSCRIPT
Disaster Recovery Exercise Report:
Exercise conducted November 9, 2013
Video wall monitoring network health - Service Desk, Indiana University Bloomington
Background The Global Network Operations Center (GlobalNOC) at Indiana University
supports some of the world’s most advanced research and education networks.
This includes campus networks, statewide research and education networks,
regional optical networks, national research and education networks, and
international interconnects – and the projects and research that traverse them.
Our mission is to "Advance the future of research, education, and the public
interest through the support and evolution of networks."
The GlobalNOC Service Desk operates fully redundant facilities located in
Indiana on the Indiana University-Purdue University
Indianapolis (IUPUI) and Indiana University
Bloomington campuses. Located 52 miles apart,
these facilities run 24 hours per day, 7 days per
week, and 365 days per year.
The GlobalNOC Systems Engineering group configures
and maintains the hardware and software systems the Service Desk uses to
proactively monitor, identify, triage, track, escalate, and resolve any service or
performance impacting issues.
The mission critical nature of the supported networks requires that the Service
Desk and its network health monitoring systems be available 100% of the time
and sustain continuity of operations in the event one of the sites is lost.
Service Desk - Indianapolis
Disaster Preparedness Summary
The GlobalNOC Service Desk’s fully redundant facilities in Indianapolis and
Bloomington oversee all mission critical systems – including, but not limited to,
phone, email, network health monitoring, and ticketing systems.
In addition to working onsite at the two Operation Centers, Service Desk and
Engineering team members regularly work remotely. Because they reside around
the state of Indiana, this further increases our
geographic dispersion. Remote team members
have access to the full suite of tools and are
continuously video connected to one of the
Operation Centers. Using two way video and
audio, the team collaborates as if remote staff
were physically in the Operation Center. The
Service Desk also has an agreement to access and operate in an alternative
building, dubbed the Contingency Network Operations Center (CNOC), should
one of the Operation Centers be inaccessible for an extended period of time.
The GlobalNOC network management systems are housed in state of the art
secure data centers on the Indiana University campuses in Bloomington and
Indianapolis. All systems have power protection, including generator protection.
A highly resilient network interconnects the data centers and our network
operations centers, and all critical systems are replicated in Bloomington and
Indianapolis. Systems are configured in either hot-spare or warm-spare
configurations, and the GlobalNOC also maintains a live, monitored spare
hardware pool to mitigate problems due to hardware failure. Ultimately, the
GlobalNOC network management system has been designed for transparent
Remote staff connected by video
failover for many systems, and manual failover procedures for the remaining
critical systems are well documented and regularly exercised.
The GlobalNOC is bound by Indiana University policy that mandates the creation,
annual maintenance, and testing of departmental disaster recovery plans. In
addition to the disaster recovery plan, the Service Desk and Engineering groups
maintain step-by-step tactical plans in case of disaster. The entire staff is trained
to access and execute the disaster recovery/business continuity plan.
Scenario and Objectives The November 9th event testing disaster preparedness was carefully planned.
The goal was to simulate the complete loss and inaccessibility of the Informatics
& Communications Technology Complex (ICTC) building on the IUPUI campus
due to a force majeure event. This included the Network Operation Center, data
center, and all supporting technology and systems located in the ICTC building.
Further, the test needed to roll operations and systems to redundant facilities.
Objectives involved testing:
1. Tactical plans and staff ability to execute disaster recovery/business
continuity plan without management/supervisor instruction
2. Ability to operate from CNOC located at IUPUI library
3. Required time to rollover operations from IUPUI NOC to CNOC
4. Ability of Bloomington NOC and remote staff to assume full operational
responsibility
5. Potential impact to customers via random reports of network issues
6. Readiness of the technology systems to cope with a significant disaster
event, including failover of critical systems in Indianapolis facility to their
backup systems at other facilities
Exercise Summary
The exercise started at 10am ET on November 9th, when the Systems
Engineering Team began shutting down ICTC systems to simulate failure of the
facility. The Service Desk began seeing critical alarms, at which time an event
was declared. The Service Desk manager transferred operational responsibility
to the Bloomington NOC, then initiated an evacuation of the GlobalNOC ICTC
facility and asked participants to relocate to the CNOC facility located in the
IUPUI library. All Service Desk personnel gathered critical items and departed to
set up contingency operations pursuant to the Disaster Recovery Plan. The
Bloomington NOC successfully assumed responsibility, and at no time
experienced an interruption to production operations.
At the IUPUI library, it took approximately 10 minutes to set up the CNOC with
full internet and phone service.
Calls routed and the CNOC
operated at a diminished capacity
while Systems Engineering
continued the failover process.
Over the next three hours, staff
conducted a comprehensive test of
systems and process, logging any
resulting issues for review.
At 1pm ET, the event was declared over and systems were restored to their
original production configurations. Service Desk staff packed up the CNOC and
moved back to the ICTC building to resume normal operations.
Contingency Network Operation Center
Exercise Result
All objectives were successfully tested. A dedicated observer recorded detailed
notes of any areas of difficulty or inefficiency, identifying potential for
improvement. Additionally, all actions were logged and time stamped in a central
chat room used for the exercise. Immediately following the exercise, a debrief of
all participants took place to ensure the accuracy and completeness of the notes
and event timeline.
A comprehensive list of issues and areas for improvement circulated among the
participants and the GlobalNOC directors. The list included suggested fixes and
the budget required to address them, if applicable. The GlobalNOC directors
gave carte blanche budget approval to address each of the identified issues or
areas of improvement. Due to its proprietary nature and confidentiality and
security concerns, the list of issues will not be publicly released.
Closing
All objectives were successfully tested, and an appropriate budget is in place to
resolve any issues identified. The GlobalNOC project management team has
created a project to drive and monitor the progress of issue resolution, with a
projected close date of May 30, 2014. The GlobalNOC will conduct another
Disaster Recovery Test during the fourth quarter of 2014.
Contact Information GlobalNOC 535 West Michigan Street Indianapolis, IN 46202 globalnoc@iu.edu 1 + 317-274-7788
top related