IT Business Continuity Briefing

March 3, 2011

Incident Overview

Improving the power posture of the Primary Data Center

STAGEnet Redundancy

Telephone Redundancy

Secondary Data Center and Recovery Point Objectives (RPO)

Secondary Data Center and Recovery Time Objectives (RTO)

Customer communications during outage incidents

Agenda

SYSTEMS & DATA

NETWORK SERVICES

POWER & ENVIRONMENTALS

FACILITIES & STAFF

IT Business Continuity Dependencies

SYSTEMS & DATA

NETWORK SERVICES

POWER & ENVIRONMENTALS

FACILITIES & STAFF

Incident Impact

ITD powered down servers and equipment in the Primary Data Center to minimize data loss.

ITD started to provision equipment to allow the Secondary Data Center to assume the role of the primary data center.

Initial time estimates projected power being restored to the Primary Data Center by 6:00 pm.

Power was restored at 5:50 pm, email and core network services at 6:30 pm, and the final systems/applications by 11:30 pm.

January 18th Incident Response

Primary Data Center and Secondary Data Center both have generators to provide backup power.

ITD is working with Facilities Management and Sirius Computer Solutions to identify and implement solutions that will provide a second redundant power source to the Primary Data Center.

Targeted for completion by the end of 2011.

Power Posture Improvements

The Four Quadrant RPR Ring provides redundancy on the statewide ring by allowing traffic to fail over automatically if a core node fails.

The Network Point of Presence in each quadrant has equipment architected for High Availability and backup power generation.

Internet Gateways in Bismarck and Fargo are load balanced and architected to provide failover if one of the Internet Gateways fails.

Agencies should coordinate with ITD if they require redundancy (network diversity) at individual endpoint locations.

STAGEnet Redundancy
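
To illustrate the ring failover idea above, here is a minimal sketch of how traffic between two nodes on a four-node ring can take the opposite direction when a core node fails. The node names (NW, NE, SE, SW) and the function `ring_path` are hypothetical and chosen only for illustration; this is not a description of how the RPR equipment itself is implemented.

```python
def ring_path(nodes, src, dst, failed=frozenset()):
    """Return a path from src to dst around the ring that avoids failed nodes.

    nodes  -- ring members listed in clockwise order (hypothetical names)
    failed -- set of nodes that are currently down
    Returns a list of nodes, or None if both directions are blocked.
    """
    n = len(nodes)
    i, j = nodes.index(src), nodes.index(dst)
    # Walk the ring in each direction from src until dst is reached.
    clockwise = [nodes[(i + k) % n] for k in range((j - i) % n + 1)]
    counter = [nodes[(i - k) % n] for k in range((i - j) % n + 1)]
    # Prefer the shorter direction; fall back to the other if a node on it failed.
    for path in sorted([clockwise, counter], key=len):
        if not any(node in failed for node in path):
            return path
    return None


ring = ["NW", "NE", "SE", "SW"]
normal = ring_path(ring, "NW", "SE")
rerouted = ring_path(ring, "NW", "SE", failed={"NE"})
```

With all nodes up, traffic takes one direction around the ring; when a node on that direction fails, the same call returns the path through the opposite side of the ring.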

Current Design is a Standard Digital Design

Dependent on the PBX serving the endpoint

The PBX has high availability components

Does not provide redundant service if the PBX fails

There is a service agencies can purchase to re-route critical numbers (e.g. Crisis Hotlines) in the event of a disaster.

Telephone Redundancy - Current

A new Voice over IP (VoIP) design is planned over the next two years.

As part of the standard VoIP design we will have four redundant Call Managers on STAGEnet which provide failover if the primary Call Manager serving a site fails.

Provides the ability to relocate telephone numbers to other sites with network connectivity.

Provides redundant core services for dial tone, call center and automatic call distribution (ACD).

Will not initially provide redundancy for voice mail, mobility and Interactive Voice Response (IVR).

Telephone Redundancy - VoIP
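
The Call Manager failover described above can be pictured as a priority list: a site uses its primary Call Manager and moves to the next available one if it fails. The sketch below assumes hypothetical names (`cm1` to `cm4`) and a simple health map; the actual selection logic in the VoIP platform may differ.

```python
from typing import Mapping, Optional, Sequence


def select_call_manager(priority: Sequence[str],
                        healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first healthy Call Manager in priority order, or None."""
    for name in priority:
        # A Call Manager missing from the health map is treated as down.
        if healthy.get(name, False):
            return name
    return None


# Hypothetical site configuration: cm1 is the primary for this site.
site_priority = ["cm1", "cm2", "cm3", "cm4"]
status = {"cm1": False, "cm2": True, "cm3": True, "cm4": True}
active = select_call_manager(site_priority, status)
```

Here the primary (`cm1`) is marked down, so the site would register with `cm2`.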

Recovery Point Objective (RPO)

Recovery Time Objective (RTO)

The Recovery Point Objective (RPO) – the point in time to which you must go back to recover data when a loss incident occurs.

RPO focuses on data and is independent of the time it takes to get a non-functional system back on-line (the Recovery Time Objective or RTO).

Generally a definition of what an agency determines is an “acceptable loss” in a disaster situation.

The value of the data in the “acceptable loss” window can then be weighed against the cost of the additional loss-prevention measures that would be necessary to narrow the window.


Generally speaking, backups are performed nightly to tape at our Secondary Data Center. Databases have full weekly backups and nightly incremental backups. For other data, only items that have changed during the day are backed up.

Generally speaking, the RPO or potential loss window for most data is one day – a Tuesday 4 pm disaster would require you to restore the Monday night backup, and the activity for Tuesday would be lost.

Agencies whose business requirements don’t allow for this potential data loss implement data replication.
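
The Tuesday 4 pm example can be worked through as simple date arithmetic. The sketch below assumes the nightly backup completes at 11:00 pm; the briefing only says "nightly", so that time is a placeholder. The dates are illustrative (March 1, 2011 fell on a Tuesday).

```python
from datetime import datetime, time, timedelta


def last_nightly_backup(disaster: datetime, backup_time: time) -> datetime:
    """Return the most recent nightly backup taken at or before the disaster."""
    candidate = datetime.combine(disaster.date(), backup_time)
    if candidate > disaster:
        # Tonight's backup has not run yet, so fall back to last night's.
        candidate -= timedelta(days=1)
    return candidate


def loss_window(disaster: datetime, backup_time: time) -> timedelta:
    """Return how much activity would be lost by restoring the last backup."""
    return disaster - last_nightly_backup(disaster, backup_time)


ASSUMED_BACKUP_TIME = time(23, 0)          # assumption: 11:00 pm nightly backup
disaster = datetime(2011, 3, 1, 16, 0)     # Tuesday, 4:00 pm
restore_point = last_nightly_backup(disaster, ASSUMED_BACKUP_TIME)
lost = loss_window(disaster, ASSUMED_BACKUP_TIME)
```

Under that assumption the restore point is Monday night and the lost activity spans 17 hours, which is within the one-day window described above.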


Recovery Time Objective (RTO) – a measure of how long it takes for a system to resume normal operations to avoid unacceptable business impacts.

Prior to 2006, ITD contracted for an out-of-state disaster recovery hot site with a best-case mainframe RTO of 72 hours.

With the deployment of online applications and multiple platforms, a contracted hot site with adequate network bandwidth and processing capacity became unaffordable.

ITD invested in a second data center to improve the State’s RPO and moved to a four-hour RTO for core network services.


ITD is now looking to improve the RTO of the second data center from four hours to a matter of minutes for core network services.

Base services that will be up within the first hour:
E-Mail
File and print services
AS/400 platform and applications
Current replicated hardware
Disaster Recovery Web Site – basic information


Base services that will be up within four to twelve hours:
Mainframe (must IPL) / DELA
ConnectND

Selected shared services and some agencies have development and/or test environments residing at the second data center. These environments will be converted to assume the role of production servers in a disaster scenario.


Agencies that do not invest in replicated data solutions and backup processing capacity will need to wait for additional storage and servers to be shipped and provisioned. Estimated RTO of 3 weeks to 8 weeks for production systems depending on hardware availability, staffing priorities and the amount of data to restore.

Agencies that invest in replicated data solutions but no backup processing capacity will need to wait for servers to be shipped and provisioned. Estimated RTO of 2 weeks to 4 weeks depending on hardware availability and staffing priorities.
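
The two estimates above can be summarized as a small lookup keyed on an agency's investments. This is only a sketch of the figures stated in this briefing; the function name is hypothetical, and it returns `None` for combinations that include backup processing capacity because no week-based estimate is given for them here.

```python
from typing import Optional, Tuple


def estimated_rto_weeks(replicated_data: bool,
                        backup_capacity: bool) -> Optional[Tuple[int, int]]:
    """Return the (min, max) estimated RTO in weeks, or None if not stated."""
    if backup_capacity:
        # The briefing gives no week-based estimate when backup processing
        # capacity is in place, so nothing is assumed here.
        return None
    if replicated_data:
        # Data is replicated, but servers must be shipped and provisioned.
        return (2, 4)
    # Storage and servers must be shipped and provisioned, then data restored.
    return (3, 8)


no_investment = estimated_rto_weeks(replicated_data=False, backup_capacity=False)
replication_only = estimated_rto_weeks(replicated_data=True, backup_capacity=False)
```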


We feel we can improve our communications process during any future disaster events.

Planned communication avenues:
DR Website
E-mail
Customer Service Desk
Notifind – currently used to communicate with our staff

We may be asking for emergency contacts for critical applications.

Disaster Recovery Communications

Questions

ITD Contingency Planning Contact

Larry Lee [email protected] 328-2721
