how the disaster cluster recovered · i/os kaboom:: alpha es40 quorum:: integrity rx2620 sdboom::...


Post on 07-Jul-2020


© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

How the Disaster Proof OpenVMS Cluster Recovered So Fast, and How Yours Can, Too

Keith Parris, Systems/Software Engineer, HP

Monday, May 19 and Wednesday, May 21


Story of the OpenVMS Cluster in the Disaster Proof Video


4 30 July 2015

Disaster Proof Demonstration and Video


Camden Arkansas NTS


The Failover Datacenter


The original “green” datacenter


Nature gets in on the act!


KABOOM! Arkansas on the ground


OpenVMS Disaster-Proof configuration & application

[Diagram: a stream of I/Os to a shadow set (XP12000, XP24000); cluster nodes KABOOM:: (Alpha ES40), QUORUM:: (Integrity rx2620) and SDBOOM:: (Integrity Superdome)]

• All I/Os need to complete to all spindles before they are considered done.

• When a spindle drops out, the shadow set is reduced.

• I/Os “in flight” wait for the shadow set to be reduced.

• The longest outstanding request for an I/O during the DP demo was 13.71 seconds.


GQB ready for a ride!


Disaster Proof Demo OpenVMS Cluster


How the Disaster Proof OpenVMS Cluster Recovered So Fast, and How Yours Can, Too


OpenVMS Cluster Failure Detection Mechanisms and Cluster State Transitions


OpenVMS Cluster Connection Manager and Transient Failures

• Some failures are temporary and transient

− Especially in a LAN environment

• To prevent the disruption of unnecessary removal of a node from the cluster, when a communications failure is detected, the Connection Manager waits for a time in hopes of the problem going away by itself

− This time is called the Reconnection Interval

• SYSGEN parameter RECNXINTERVAL

− RECNXINTERVAL is dynamic and may thus be temporarily raised if needed for something like a scheduled LAN outage


OpenVMS Cluster Connection Manager and Communications or Node Failures

• If the Reconnection Interval passes without connectivity being restored, or if the node has “gone away”, the cluster cannot continue without a reconfiguration

• This reconfiguration is called a State Transition, and one or more nodes will be removed from the cluster


Failure and Repair/Recovery within Reconnection Interval

[Timeline, left to right: Failure occurs → Failure detected (virtual circuit broken) → Problem fixed → Fixed state detected (virtual circuit re-opened). The RECNXINTERVAL span is marked along the Time axis.]


Hard Failure

[Timeline, left to right: Failure occurs → Failure detected (virtual circuit broken) → State transition (node removed from cluster). The RECNXINTERVAL span is marked along the Time axis.]


Late Recovery

[Timeline, left to right: Failure occurs → Failure detected (virtual circuit broken) → State transition (node removed from cluster) → Problem fixed → Fix detected → Node learns it has been removed from cluster → Node does CLUEXIT bugcheck. The RECNXINTERVAL span is marked along the Time axis.]
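The three timelines (recovery within the Reconnection Interval, hard failure, late recovery) differ only in when, if ever, connectivity returns relative to RECNXINTERVAL. The following is a minimal Python sketch of that decision, not Connection Manager code. The function name and the handling of the exact boundary case are assumptions made for illustration.

```python
def classify_failure(fix_detected_after, recnxinterval=20.0):
    """Classify a communications failure by when connectivity returns.

    fix_detected_after: seconds from failure detection (virtual circuit
        broken) until the fix is detected, or None if it is never fixed.
    recnxinterval: Reconnection Interval in seconds (the deck quotes a
        default of 20 for RECNXINTERVAL).
    """
    if fix_detected_after is None:
        # Hard failure: the interval expires and a state transition
        # removes the node from the cluster.
        return "hard failure"
    if fix_detected_after < recnxinterval:
        # Virtual circuit re-opened in time; no state transition needed.
        return "recovered within reconnection interval"
    # Assumption: a fix arriving exactly at expiry is treated as too late.
    # The removed node learns of its removal and does a CLUEXIT bugcheck.
    return "late recovery"


if __name__ == "__main__":
    for delay in (5.0, None, 45.0):
        print(delay, "->", classify_failure(delay))
```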


Failure Detection Mechanisms

• Mechanisms to detect a node or communications failure

− Last-Gasp Datagram

− Periodic checking

• Multicast Hello packets on LANs

• Polling on CI and DSSI

• TIMVCFAIL check


PEDRIVER Hello Packet Timing

• Hello packet Transmit Interval

− Default is 3 seconds

− Dithered by reducing it by as much as half, to avoid forming “packet trains”

• so Hellos could be spaced as close as 1.5 seconds, or as far apart as 3 seconds

• Hello packet Listen Timeout

− Default is 8 seconds

− Allows detection of failure in between 8 and 9 seconds
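The arithmetic behind these numbers can be sketched in Python. This is an illustration only, not PEDRIVER code. The function names are hypothetical, and the one-second timer granularity used for the detection window is an assumption inferred from the “between 8 and 9 seconds” figure.

```python
import math


def hello_spacing_bounds(transmit_interval=3.0):
    """Dithering shortens the interval by up to half and never lengthens it."""
    return (transmit_interval / 2.0, transmit_interval)


def hellos_missed_before_timeout(listen_timeout=8.0, spacing=3.0):
    """Consecutive Hellos that must be lost before the listen timer expires.

    Assumes a Hello arriving exactly at the timeout instant still counts
    as arriving in time.
    """
    return math.ceil(listen_timeout / spacing) - 1


def detection_window(listen_timeout=8.0, tick=1.0):
    """Detection falls between the timeout and one timer tick later
    (assumed one-second tick)."""
    return (listen_timeout, listen_timeout + tick)


if __name__ == "__main__":
    print(hello_spacing_bounds())
    print(hellos_missed_before_timeout(8.0, 3.0))
    print(hellos_missed_before_timeout(8.0, 1.5))
    print(detection_window())
```

With the default values, two consecutive Hellos must be lost at the widest spacing, and five at the tightest spacing, before the virtual circuit is declared broken.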


Failure Detection on LAN interconnects

[Diagram: the remote node sends a Hello packet to the local node at t=0, 3, 6 and 9; the t=6 packet is lost. The local node's Listen Timer counts clock ticks, is reset to 0 by each Hello received, and reaches 6 before the t=9 Hello resets it.]


Failure Detection on LAN interconnects

[Diagram: the remote node's Hello packet at t=0 is received, but the Hello packets at t=3 and t=6 are lost. The local node's Listen Timer counts clock ticks from 0 up to 8, at which point the virtual circuit is broken.]
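The two Listen Timer scenarios can be reproduced with a small Python simulation. This is a simplified model under stated assumptions, not the PEDRIVER implementation: the timer ticks once per second, a Hello is taken to have just arrived at t=0, and a Hello arriving on a tick resets the timer before the timeout check. The function name is hypothetical.

```python
def simulate_listen_timer(hello_arrivals, listen_timeout=8, horizon=12):
    """Tick a one-second listen timer and report what happens.

    hello_arrivals: iterable of whole-second times (t >= 1) at which a
        Hello packet is actually received.
    Returns (break_time, peak): the time the virtual circuit is broken
    (None if it survives until `horizon`) and the highest timer value seen.
    """
    arrivals = set(hello_arrivals)
    timer = 0  # assumed: a Hello was just received at t=0
    peak = 0
    for t in range(1, horizon + 1):
        timer += 1
        peak = max(peak, timer)
        if t in arrivals:
            timer = 0  # a received Hello resets the timer
        elif timer >= listen_timeout:
            return t, peak  # listen timeout expired: virtual circuit broken
    return None, peak


if __name__ == "__main__":
    # One Hello lost at t=6: the timer peaks at 6, then the t=9 Hello resets it.
    print(simulate_listen_timer({3, 9}))
    # Every Hello after t=0 lost: the virtual circuit breaks at t=8.
    print(simulate_listen_timer(set()))
```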


TIMVCFAIL Mechanism

[Diagram: Local node and Remote node. The local node sends a Request and receives a Response, and the exchange is then repeated. Time marks at t=0, t=1/3 of TIMVCFAIL and t=2/3 of TIMVCFAIL.]


TIMVCFAIL Mechanism

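[Diagram: Local node and Remote node. A Request and its Response complete; the node then fails some time during the following period, so the next Request receives no Response. Time marks at t=0, t=1/3 of TIMVCFAIL and t=2/3 of TIMVCFAIL; at t=TIMVCFAIL the virtual circuit is broken. Two points on the diagram are labeled 1 and 2.]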


Sequence of events During a State Transition

• Determine new cluster configuration

• If quorum is lost:

− QUORUM capability bit removed from all CPUs

• no process can be scheduled to run

− Disks all put into mount verification

• If quorum is not lost, continue…

• Rebuild lock database

− Stall lock requests

• I/O synchronization

− Do rebuild work

− Resume lock handling


Measuring State Transition Effects

• Determine the type of the last lock rebuild:

$ ANALYZE/SYSTEM

SDA> READ SYS$LOADABLE_IMAGES:SCSDEF

SDA> EVALUATE @(@CLU$GL_CLUB + CLUB$B_NEWRBLD_REQ) & FF

Hex = 00000002 Decimal = 2 ACP$V_SWAPPRV

• Rebuild type values:

1. Merge (locking not disabled)

2. Partial

3. Directory

4. Full


Measuring State Transition Effects

• Determine the duration of the last lock request stall period:

SDA> DEFINE TOFF = @(@CLU$GL_CLUB+CLUB$L_TOFF)

SDA> DEFINE TON = @(@CLU$GL_CLUB+CLUB$L_TON)

SDA> EVALUATE TON-TOFF

Hex = 0000026B Decimal = 619 PDT$Q_COMQH+00003
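When collecting these measurements from several nodes, the SDA EVALUATE output line can be parsed with a short script. The Python sketch below extracts the value and cross-checks that the hex and decimal fields agree. The conversion to seconds assumes the counters are in 10-millisecond ticks; that unit is an assumption here, not something the slide states. Function names are hypothetical.

```python
import re

_EVAL_RE = re.compile(r"Hex\s*=\s*([0-9A-Fa-f]+)\s+Decimal\s*=\s*(\d+)")


def parse_sda_evaluate(line):
    """Return the integer value from an SDA EVALUATE output line."""
    match = _EVAL_RE.search(line)
    if match is None:
        raise ValueError("not an SDA EVALUATE output line: %r" % line)
    hex_value = int(match.group(1), 16)
    dec_value = int(match.group(2))
    if hex_value != dec_value:
        raise ValueError("hex and decimal fields disagree")
    return dec_value


def stall_seconds(ticks, ticks_per_second=100):
    """Convert a TON-TOFF difference to seconds.

    Assumption: 100 ticks per second (10 ms units); verify on your system.
    """
    return ticks / ticks_per_second


if __name__ == "__main__":
    ticks = parse_sda_evaluate("Hex = 0000026B Decimal = 619 PDT$Q_COMQH+00003")
    print(ticks, "ticks =", stall_seconds(ticks), "seconds (if 10 ms ticks)")
```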


Minimizing Impact of State Transitions

• Configuration issues:

− Few (e.g. exactly 3) nodes

− Quorum node; no quorum disk

− Set up LAN cluster interconnect to minimize length of time packet-forwarding is blocked

• Original IEEE 802.1d Spanning Tree algorithm could take 35-40 seconds to converge and start forwarding packets again

− Two completely-independent spanning trees could help avoid communications being blocked on both at once

• Newer IEEE 802.1w Rapid Spanning Tree (and IEEE 802.1s Multiple Spanning Tree) protocols can be configured to recover in less than 1 second


Disaster Proof Demonstration Settings and Behavior


OpenVMS System Parameter Settings for the Disaster Proof Demonstration

• SHADOW_MBR_TMO lowered from default of 120 down to 8 seconds

• RECNXINTERVAL lowered from default of 20 down to 10 seconds

• TIMVCFAIL lowered from default of 1600 to 400 (4 seconds, in 10-millisecond clock units) to detect node failure at the SYSAP level in 4 seconds, worst case

• LAN_FLAGS bit 12 set to enable Fast LAN Transmit Timeout (give up on a failed packet transmit in 1.25 seconds, worst case, instead of an order of magnitude more in some cases)

• PE4 set to hexadecimal 0703 (Hello transmit interval of 0.7 seconds, nominal; Listen Timeout of 3 seconds), to detect node failure in 3-4 seconds at the PEDRIVER level
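The encoded values above can be decoded with a few lines of Python. This is a sketch for checking the arithmetic, not a reference for the parameter formats. The PE4 byte layout used here (high byte is the Hello transmit interval in tenths of a second, low byte is the Listen Timeout in seconds) is inferred only from the 0703 example on this slide and should be treated as an assumption. Function and constant names are hypothetical.

```python
def decode_pe4(value):
    """Split PE4 into (hello_interval_seconds, listen_timeout_seconds).

    Assumed layout, inferred from the slide's 0703 example:
    high byte = Hello interval in tenths of a second,
    low byte  = Listen Timeout in seconds.
    """
    hello_tenths = (value >> 8) & 0xFF
    listen_seconds = value & 0xFF
    return hello_tenths / 10.0, listen_seconds


def timvcfail_seconds(value):
    """TIMVCFAIL is expressed in 10-millisecond units (per the slide)."""
    return value / 100.0


# Bit 12 of LAN_FLAGS enables Fast LAN Transmit Timeout (per the slide).
LAN_FLAGS_FAST_TRANSMIT_TIMEOUT = 1 << 12


if __name__ == "__main__":
    print(decode_pe4(0x0703))
    print(timvcfail_seconds(1600), timvcfail_seconds(400))
    print(hex(LAN_FLAGS_FAST_TRANSMIT_TIMEOUT))
```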


Disaster Proof Demo Timeline

• Time = 0: Explosion occurs

• Time around 3.5 seconds: Node failure detected, via either PEDRIVER Hello Listen Timeout or TIMVCFAIL mechanism. VC closed; Reconnection Interval starts.

• Time = 8 seconds: Shadow Member Timeout expires; shadowset members removed.

• Time around 13.5 seconds: Reconnection Interval expires; State Transition begins.

• Time = 13.71 seconds: Recovery complete; Application processing resumes.
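The timeline follows directly from the parameter settings: failure detection plus the Reconnection Interval gives the start of the State Transition, and the remainder is the transition itself. A minimal Python sketch of that arithmetic, using the approximate figures quoted for the demo (the function name and dictionary keys are hypothetical):

```python
def demo_timeline(detect=3.5, recnxinterval=10.0, shadow_mbr_tmo=8.0,
                  resumed=13.71):
    """Reconstruct the demo timeline (seconds after the explosion)."""
    transition_start = detect + recnxinterval
    return {
        "failure_detected": detect,
        "shadow_members_removed": shadow_mbr_tmo,
        "state_transition_begins": transition_start,
        "application_resumes": resumed,
        # Time spent in the state transition itself, including the lock
        # database rebuild.
        "state_transition_duration": round(resumed - transition_start, 2),
    }


if __name__ == "__main__":
    for name, seconds in demo_timeline().items():
        print("%-28s %6.2f s" % (name, seconds))
```

On these figures, almost all of the 13.71 seconds is spent waiting for timers to expire; the state transition itself accounts for only about 0.21 seconds.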


Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; Failure Detection Time (PEDRIVER Hello Listen Timeout or TIMVCFAIL Timeout) ends at T = about 3.5 seconds.]


Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; Shadow Member Timeout runs until T = 8 seconds, when the failed shadowset members are removed.]


Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; PEDRIVER Hello Listen Timeout or TIMVCFAIL Timeout at T = about 3.5 seconds; Reconnection Interval runs until T = about 13.5 seconds, when the State Transition begins.]


Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; State Transition begins at T = about 13.5 seconds; Cluster State Transition (node removed from cluster, lock database rebuild); Application resumes at T = 13.71 seconds.]


Simulation and Testing of Long Distance DR/DT Configurations


Trends


Trends

• Increase in disasters

• Longer inter-site distances for better protection

• Business pressures for shorter distances for performance

• Increasing pressure not to bridge LANs between sites


Trends

• Increase in disasters


“Natural disasters have quadrupled over the last two decades, from an average of 120 a year in the early 1980s to as many as 500 today.”

Continuity Insights Magazine

Nov./Dec. 2007 issue, page 10


“There has been a six-fold increase in floods since 1980. The number of floods and wind-storms have increased from 60 in 1980 to 240 last year.”

Continuity Insights Magazine

Nov./Dec. 2007 issue, page 10


Increase in Disasters

http://www.oxfam.org/en/files/bp108_climate_change_alarm_0711.pdf/download


Trends

• Longer inter-site distances for better protection


“Some CIOs are imagining potential disasters that go well beyond the everyday hiccups that can disrupt applications and networks. Others, recognizing how integral IT is to business today, are focusing on the need to recover instantaneously from any unforeseen event.” …“It's a different world. There are so many more things to consider than the traditional fire, flood and theft.”

“Redefining Disaster”

Mary K. Pratt, Computerworld, June 20, 2005

http://www.computerworld.com/hardwaretopics/storage/story/0,10801,102576,00.html


Northeast US Before Blackout

Source: NOAA/DMSP


Northeast US After Blackout

Source: NOAA/DMSP


“The blackout has pushed many companies to expand their data center infrastructures to support data replication between two or even three IT facilities -- one of which may be located on a separate power grid.”

Computerworld, August 2, 2004

http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,94944,00.html


“You have to be far enough apart to make sure that conditions in one place are not likely to be duplicated in the other.” … “A useful rule of thumb might be a minimum of about 50 km, the length of a MAN, though the other side of the continent might be necessary to play it safe.”

“Disaster Recovery Sites: How Far Away is Far Enough?”

Drew Robb, Datamation, October 4, 2005

http://www.enterprisestorageforum.com/continuity/features/article.php/3552971


Trends: Longer inter-site distances for better protection

• In the past, protection was focused against risks like fires, floods, tornadoes. 1 to 5 miles was fine between sites.

• Right after 9/11, 60 to 100 miles looked much better.

• After the Northeast Blackout of 2003, and with increasing awareness of the threat, the possibility of a terrorist group obtaining a nuclear device and wiping out an entire metropolitan area is no longer inconceivable.

− Resulting pressure is for inter-site distances of 1,000 to 1,500 miles

• Challenges:

− Telecommunications links

− Latency due to speed of light adversely affects performance
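To get a feel for the speed-of-light penalty at these distances, the round-trip delay can be estimated in Python. This sketch assumes light travels through optical fiber at roughly 200,000 km per second (about two-thirds of its speed in vacuum), a commonly used approximation. It ignores equipment delays, and the `path_factor` parameter for non-straight cable routes is a hypothetical knob added for illustration.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_MS = 200.0  # assumed: ~200,000 km/s in fiber, i.e. 200 km per ms


def round_trip_ms(distance_miles, path_factor=1.0):
    """Estimate round-trip propagation delay in milliseconds.

    path_factor > 1.0 models a fiber route longer than the straight-line
    distance between sites.
    """
    one_way_km = distance_miles * KM_PER_MILE * path_factor
    return 2.0 * one_way_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    for miles in (5, 100, 1000, 1500):
        print("%5d miles: %6.2f ms round trip" % (miles, round_trip_ms(miles)))
```

Under these assumptions each synchronous remote operation pays roughly 16 ms of propagation delay at 1,000 miles and about 24 ms at 1,500 miles, compared with well under 0.1 ms at 5 miles.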


Trends

• Business pressures for shorter distances for performance


“A 1-millisecond advantage in trading applications can be worth $100 million a year to a major brokerage firm, by one estimate.”

Richard Martin, InformationWeek,

April 23, 2007


“The fastest systems, running from traders' desks to exchange data centers, can execute transactions in a few milliseconds -- so fast, in fact, that the physical distance between two computers processing a transaction can slow down how fast it happens.”

Richard Martin, InformationWeek,

April 23, 2007


“This problem is called data latency -- delays measured in split seconds. To overcome it, many high-frequency algorithmic traders are moving their systems as close to the Wall Street exchanges as possible.”

Richard Martin, InformationWeek,

April 23, 2007


Trends

• Increasing pressure not to bridge LANs between sites


Trends: Increasing Resistance to LAN Bridging

• In the past, setting up a VLAN spanning sites for an OpenVMS disaster-tolerant cluster was common

• Networks are now IP-centric

• IP network mindset sees LAN bridging as “bad,” sometimes even “totally unacceptable”

• Alternatives:

− Separate, private link for OpenVMS Multi-site Cluster

− Metropolitan Area Networks (MANs) using MPLS

− Ethernet-over-IP (EoIP)

− SCS-over-IP support planned for OpenVMS 8.4


Site Selection and Inter-Site Distance


Planning for DT: Site Selection

Sites must be carefully selected:

• Avoid hazards

− Especially hazards common to both (and the loss of both datacenters at once which might result from that)

• Make them a “safe” distance apart

• Select site separation in a “safe” direction


Planning for DT: What is a “Safe Distance”

Analyze likely hazards of proposed sites:

• Natural hazards

− Fire (building, forest, gas leak, explosive materials)

− Storms (Tornado, Hurricane, Lightning, Hail, Ice)

− Flooding (excess rainfall, dam breakage, storm surge, broken water pipe)

− Earthquakes, Tsunamis


Planning for DT: What is a “Safe Distance”

Analyze likely hazards of proposed sites:

• Man-made hazards

− Nearby transportation of hazardous materials (highway, rail)

− Terrorist with a bomb

− Disgruntled customer with a weapon

− Enemy attack in war (nearby military or industrial targets)

− Civil unrest (riots, vandalism)


Former Atlas E Missile Silo Site in Kimball, Nebraska


Planning for DT: Site Separation Distance

• Make sites a “safe” distance apart

• This must be a compromise. Factors:

− Risks

− Performance (inter-site latency)

− Interconnect costs

− Ease of travel between sites

− Availability of workforce


Planning for DT: Site Separation Distance

• Select site separation distance:

− 1-3 miles: protects against most building fires, natural gas leaks, armed intruders, terrorist bombs

− 10-30 miles: protects against most tornadoes, floods, hazardous material spills, release of poisonous gas, non-nuclear military bomb strike

− 100-300 miles: protects against most hurricanes, earthquakes, tsunamis, forest fires, most biological weapons, most power outages, suitcase-sized nuclear bomb

− 1,000-3,000 miles: protects against “dirty” bombs, major region-wide power outages, and possibly military nuclear attacks

Threat Radius


“You have to be far enough away to be beyond the immediate threat you are planning for.” … “At the same time, you have to be close enough for it to be practical to get to the remote facility rapidly.”

“Disaster Recovery Sites: How Far Away is Far Enough?” By Drew Robb

Enterprise Storage Forum, September 30, 2005

http://www.enterprisestorageforum.com/continuity/features/article.php/3552971


“A Watertight Plan” By Penny Lunt Crosman, IT Architect, Sept. 1, 2005

http://www.itarchitect.com/showArticle.jhtml?articleID=169400810

“Survivors of hurricanes, floods, and the London terrorist bombings offer best practices and advice on disaster recovery planning.”


Source: “A Watertight Plan” By Penny Lunt Crosman, IT Architect, Sept. 1, 2005


Planning for DT: Site Separation Direction

• Select site separation direction:

− Not along same earthquake fault-line

− Not along likely storm tracks

− Not in same floodplain or downstream of same dam

− Not on the same coastline

− Not in line with prevailing winds (that might carry hazardous materials or radioactive fallout)


Long-Distance Disaster Tolerance Using OpenVMS Clusters


Background


Historical Context

Example: New York City, USA

• 1993 World Trade Center bombing raised awareness of DR and prompted some improvements

• Sept. 11, 2001 has had dramatic and far-reaching effects

− Scramble to find replacement office space

− Many datacenters moved off Manhattan Island, some out of NYC entirely

− Increased distances to DR sites

− Induced regulatory responses (in USA & abroad)


Trends and Driving Forces in the US

• BC, DR and DT in a post-9/11 world:

− Recognition of greater risk to datacenters

• Particularly in major metropolitan areas

− Push toward greater distances between redundant datacenters

• It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area


Trends and Driving Forces in the US

• “Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System”

− http://www.sec.gov/news/studies/34-47638.htm

• Agencies involved:

Federal Reserve System

Department of the Treasury

Securities & Exchange Commission (SEC)

• Applies to:

Financial institutions critical to the US economy


US Draft Interagency White Paper

The early “concept release” inviting input made mention of a 200-300 mile limit (only as part of an example when asking for feedback as to whether any minimum distance value should be specified or not):

“Sound practices. Have the agencies sufficiently described expectations regarding out-of-region back-up resources? Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200-300 miles between primary and back-up sites)? What factors should be used to identify such a minimum distance?”


US Draft Interagency White Paper

This induced panic in several quarters:

• NYC feared additional economic damage of companies moving out

• Some pointed out the technology limitations of some synchronous mirroring products and of Fibre Channel at the time which typically limited them to a distance of 100 miles or 100 km

Revised draft contained no specific distance numbers; just cautionary wording

Ironically, that same non-specific wording now often results in DR datacenters 1,000 to 1,500 miles away


US Draft Interagency White Paper

“Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.”

“Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location.”


US Draft Interagency White Paper

“Organizations should establish back-up facilities a significant distance away from their primary sites.”

“The agencies expect that, as technology and business processes … continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites.”


Ripple effect of Regulatory Activity Within the USA

• National Association of Securities Dealers (NASD):

−Rule 3510 & 3520

• New York Stock Exchange (NYSE):

−Rule 446


Ripple effect of Regulatory Activity Outside the USA

• United Kingdom: Financial Services Authority:

−Consultation Paper 142 – Operational Risk and Systems Control

• Europe:

−Basel II Accord

• Australian Prudential Regulation Authority:

−Prudential Standard for business continuity management APS 232 and guidance note AGN 232.1

• Monetary Authority of Singapore (MAS):

−“Guidelines on Risk Management Practices – Business Continuity Management” affecting “Significantly Important Institutions” (SIIs)


Resiliency Maturity Model project

• The Financial Services Technology Consortium (FSTC) has begun work on a Resiliency Maturity Model

−Taking inspiration from the Carnegie Mellon Software Engineering Institute’s Capability Maturity Model (CMM) and Networked Systems Survivability Program

− Intent is to develop industry standard metrics to evaluate an institution’s business continuity, disaster recovery, and crisis management capabilities

Long-distance Effects: Inter-site Latency

Long-distance Cluster Issues

• Latency due to the speed of light becomes significant at longer distances. Rules of thumb:

− About 1 ms per 100 miles, one-way

− About 1 ms per 50 miles round-trip latency

• Actual circuit path length can be longer than highway mileage between sites

• Latency can adversely affect performance of

− Remote I/O operations

− Remote locking operations
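The rules of thumb above reduce to simple arithmetic. A minimal Python sketch, assuming latency scales linearly with circuit path length (the function and constant names are illustrative, not from the presentation):

```python
# Rule of thumb from the slide: about 1 ms per 100 miles one-way,
# which is the same as about 1 ms per 50 miles round-trip.
MS_PER_MILE_ONE_WAY = 1.0 / 100.0


def one_way_latency_ms(circuit_miles: float) -> float:
    """Estimate one-way speed-of-light latency for a circuit path length."""
    return circuit_miles * MS_PER_MILE_ONE_WAY


def round_trip_latency_ms(circuit_miles: float) -> float:
    """Estimate round-trip latency as twice the one-way figure."""
    return 2.0 * one_way_latency_ms(circuit_miles)


if __name__ == "__main__":
    for miles in (50, 100, 1000):
        print(miles, one_way_latency_ms(miles), round_trip_latency_ms(miles))
```

Use the circuit path length, not the highway mileage, as the input: a remote lock request or remote I/O pays at least one round trip, so the estimate is a lower bound on the added delay per operation.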

OpenVMS Lock Request Latencies

[Bar chart: lock request latency in microseconds, axis 0 to 25,000]

Link                               Latency (microseconds)
Gigabit Ethernet, zero distance    200
Fast Ethernet, zero distance       240
ATM, 30 miles                      400
DS-3, 250 miles                    4,400
OC-3, 1,400 miles                  23,000

Inter-site Latency: Actual Customer Measurements

Highway Mileage                 Latency (ms)   Est. Circuit Path Length
5 miles, ATM OC-3               0.5            30 miles
35 miles                        1.5            95 miles
25 to 35 miles, IP DLSW link    3 to 4         190-250 miles (effective)
130 miles, DS-3                 4.4            275 miles
“Over 150” miles                5.5            350 miles
1,250 miles, DS-3               30             1,875 miles
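As a cross-check on the measurements above, the following Python sketch divides each estimated circuit path length by the measured latency. The slide does not say how the estimates were derived or whether the latencies are one-way or round-trip, so this only reports the ratio implied by the numbers themselves; the names are illustrative.

```python
# (latency in ms, estimated circuit path length in miles), copied from the
# table. The "3 to 4 ms" row is omitted because it gives ranges.
MEASUREMENTS = [
    (0.5, 30),
    (1.5, 95),
    (4.4, 275),
    (5.5, 350),
    (30.0, 1875),
]


def miles_per_ms(latency_ms: float, circuit_miles: float) -> float:
    """Circuit miles attributed to each millisecond of measured latency."""
    return circuit_miles / latency_ms


RATIOS = [miles_per_ms(ms, miles) for ms, miles in MEASUREMENTS]

if __name__ == "__main__":
    for (ms, miles), ratio in zip(MEASUREMENTS, RATIOS):
        print(f"{ms:5.1f} ms  {miles:5d} miles  {ratio:5.1f} miles/ms")
```

Every row works out to roughly 60 to 64 miles per millisecond, and in every row the estimated circuit path is considerably longer than the highway mileage.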


Differentiate between latency and bandwidth

• Can’t get around the speed of light and its latency effects over long distances

− Higher-bandwidth link doesn’t mean lower latency
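A small Python sketch of why bandwidth does not buy back latency, assuming a simple model in which one transfer costs one round trip plus the time to serialize the data onto the link. The link speeds used (about 45 Mbit/s for DS-3, about 155 Mbit/s for OC-3) are approximate nominal figures, and the function name is illustrative.

```python
def transfer_time_ms(size_bytes: int, bandwidth_mbps: float, round_trip_ms: float) -> float:
    """One round trip of latency plus serialization time for the payload."""
    serialization_ms = size_bytes * 8 / (bandwidth_mbps * 1_000_000) * 1000
    return round_trip_ms + serialization_ms


if __name__ == "__main__":
    # An 8 KB write over a circuit with a 20 ms round trip
    # (about 1,000 circuit miles by the rule of thumb).
    for name, mbps in (("DS-3", 45), ("OC-3", 155)):
        print(name, round(transfer_time_ms(8192, mbps, 20.0), 2), "ms")
```

More than tripling the bandwidth removes only about a millisecond of serialization time; the 20 ms round trip is untouched.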

Long-distance Techniques: SAN Extension

SAN Extension

SAN Extension

• Fibre Channel distance over fiber is limited to about 100 kilometers

−Shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers
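To illustrate why credits run short with distance: a sender may only have as many frames in flight as it holds buffer-to-buffer credits, so keeping a long link full takes roughly (round-trip time) / (time to send one frame) credits. The Python sketch below rests on assumed values that are not from the presentation: about 5 microseconds per kilometer of fiber one-way, a full-size frame of about 2,148 bytes, and about 200 MB/s of payload rate for a 2 Gbit/s link.

```python
import math

US_PER_KM_ONE_WAY = 5.0  # assumed propagation delay in fiber, microseconds per km


def credits_needed(distance_km: float, data_rate_mbytes_per_s: float,
                   frame_bytes: int = 2148) -> int:
    """Estimate credits needed to keep the link busy with full-size frames."""
    round_trip_us = 2 * distance_km * US_PER_KM_ONE_WAY
    # bytes divided by (megabytes per second) gives microseconds per frame
    frame_time_us = frame_bytes / data_rate_mbytes_per_s
    return math.ceil(round_trip_us / frame_time_us)


if __name__ == "__main__":
    for km in (10, 50, 100):
        print(km, "km:", credits_needed(km, 200.0), "credits")
```

Under these assumptions the requirement is close to one credit per kilometer, so a port with only a few dozen credits starts to stall well before the 100 km limit.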

• Various vendors provide “SAN Extension” boxes to connect Fibre Channel SANs over an inter-site link

• See SAN Design Reference Guide Vol. 4 “SAN extension and bridging”:

−http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf

Long-distance Data Replication

Disk Data Replication

• Data mirroring schemes

− Synchronous

• Slower, but no chance of data loss in conjunction with a site loss

− Asynchronous

• Faster, and works for longer distances, but can lose seconds’ or minutes’ worth of data (more under high loads) in a site disaster

Continuous Access Synchronous Replication

[Diagram, built up over five slides: a Node and FC Switch at each site, an EVA at each site, a Mirrorset spanning the two EVAs, and one controller in charge of the mirrorset. Labels are added in this order:]

• Write

• Write

• Success status

• Success status

• Application continues

Continuous Access Asynchronous Replication

[Diagram, built up over four slides: a Node and FC Switch at each site, an EVA at each site, a Mirrorset spanning the two EVAs, and one controller in charge of the mirrorset. Labels are added in this order:]

• Write

• Success status

• Application continues

• Write

Synchronous versus Asynchronous Replication and Link Bandwidth

[Chart: MB/sec versus time of day (0, 8 am, 12 noon, 5 pm, 12 pm). Series: Application write bandwidth; Synchronous – RPO = 0; Asynchronous – RPO 2 hrs. max; Asynchronous – RPO many hrs.]
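The relationship the chart suggests can be sketched in Python. Synchronous replication needs a link at least as fast as the peak write rate, while asynchronous replication over a slower link accumulates a backlog of un-replicated data during busy hours. The model below is a simplification under stated assumptions: constant write rate within each interval, a fixed link rate, and illustrative names and numbers that are not from the presentation.

```python
def replication_backlog(write_mb_per_s, link_mb_per_s, interval_s=3600):
    """Return the un-replicated backlog (MB) at the end of each interval.

    write_mb_per_s: application write rate for each interval (MB/sec)
    link_mb_per_s:  inter-site link rate available for replication (MB/sec)
    """
    backlog = 0.0
    history = []
    for rate in write_mb_per_s:
        # Backlog grows when writes outpace the link and drains otherwise.
        backlog = max(0.0, backlog + (rate - link_mb_per_s) * interval_s)
        history.append(backlog)
    return history


if __name__ == "__main__":
    hourly_writes = [10, 10, 0, 0]  # hypothetical MB/sec for four hours
    print(replication_backlog(hourly_writes, link_mb_per_s=10))  # sized for peak
    print(replication_backlog(hourly_writes, link_mb_per_s=5))   # sized below peak
```

Whatever sits in the backlog at the moment of a site disaster is lost, which is why a smaller link corresponds to a longer recovery point objective.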


Data Replication and Long Distances

• Some vendors claim synchronous mirroring is impossible at a distance over 100 kilometers, 100 miles, or 200 miles, because their product cannot support synchronous mirroring over greater distances

• OpenVMS Volume Shadowing does synchronous mirroring

−Acceptable application performance is the only limit found so far on inter-site distance for HBVS

Long-distance Synchronous Host-based Mirroring Software Tests

• OpenVMS Host-Based Volume Shadowing (HBVS) software (host-based mirroring software)

• SAN Extension used to extend SAN using FCIP boxes

• AdTech box used to simulate distance via introduced packet latency

• No OpenVMS Cluster involved across this distance (no OpenVMS node at the remote end; just “data vaulting” to a “distant” disk controller)

Long-distance HBVS Test Results

Delay, 1-way      Throughput        Distance        Distance
(milliseconds)    (per second)      (kilometers)    (miles)
0 ms              11 megabytes      0 km            0 miles
10 ms             226 kilobytes     2,000 km        1,250 miles
50 ms             45 kilobytes      10,000 km       6,250 miles
100 ms            24 kilobytes      20,000 km       12,500 miles
200 ms            15 kilobytes      40,000 km       25,000 miles
300 ms            9 kilobytes       60,000 km       37,500 miles
400 ms            8 kilobytes       80,000 km       50,000 miles
485 ms            6.5 kilobytes     97,000 km       60,625 miles
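One way to read these results is to multiply each throughput by its delay. If each write has to wait for a fixed number of inter-site round trips, throughput should fall roughly in inverse proportion to delay, and the product should stay roughly constant. The Python sketch below checks that against the table; it is an observation about the published numbers, not a description of how HBVS works internally, and the names are illustrative.

```python
# (one-way delay in ms, throughput in kilobytes/second) from the table,
# excluding the 0 ms baseline row.
RESULTS = [(10, 226), (50, 45), (100, 24), (200, 15), (300, 9), (400, 8), (485, 6.5)]


def kb_per_delay(delay_ms: float, throughput_kb_s: float) -> float:
    """Kilobytes transferred during one one-way delay period."""
    return throughput_kb_s * delay_ms / 1000.0


PRODUCTS = [kb_per_delay(ms, kb) for ms, kb in RESULTS]

if __name__ == "__main__":
    for (ms, kb), product in zip(RESULTS, PRODUCTS):
        print(f"{ms:4d} ms  {kb:6.1f} KB/s  {product:4.2f} KB per delay period")
```

The product stays between about 2.2 and 3.2 kilobytes across a nearly fifty-fold range of delay, consistent with throughput being limited by latency rather than by link bandwidth.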

Mitigating the Effects of Long Inter-site Distances

Minimizing Round Trips Between Sites

• Some vendors have Fibre Channel SCSI-3 protocol tricks to do writes in 1 round trip vs. 2

−e.g. Brocade’s “FastWrite” or Cisco’s “Write Acceleration”

• Application design can also affect number of round-trips required between sites
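The effect of halving the round-trip count is easy to quantify. A minimal Python sketch, assuming the remote part of a write costs a whole number of inter-site round trips plus a fixed local service time (the function name and the example figures are illustrative):

```python
def remote_write_ms(round_trip_ms: float, round_trips: int, local_ms: float = 0.0) -> float:
    """Write completion time: local service time plus inter-site round trips."""
    return local_ms + round_trips * round_trip_ms


if __name__ == "__main__":
    rtt = 10.0  # about 500 circuit miles at 1 ms per 50 miles round-trip
    print("2 round trips:", remote_write_ms(rtt, 2, local_ms=1.0), "ms")
    print("1 round trip: ", remote_write_ms(rtt, 1, local_ms=1.0), "ms")
```

The same arithmetic applies to application design: every lock request or I/O that must cross the inter-site link adds another multiple of the round-trip time.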


Mitigating Impact of Inter-Site Latency

How applications are distributed across a multi-site OpenVMS cluster can affect performance

This represents a trade-off among performance, availability, and resource utilization

Application Scheme 1: Hot Primary/Cold Standby

• All applications normally run at the primary site

− Second site is idle, except for data replication work, until primary site fails, then it takes over processing

• Performance will be good (all-local locking)

• Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)

• Wastes computing capacity at the remote site

Application Scheme 2: Hot/Hot but Alternate Workloads

• All applications normally run at one site or the other, but not both; data is mirrored between sites, and the opposite site takes over upon a failure

• Performance will be good (all-local locking)

• Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)

• Second site’s computing capacity is actively used

Application Scheme 3: Uniform Workload Across Sites

• All applications normally run at both sites simultaneously. (This would be considered the “norm” for most OpenVMS clusters.)

• Surviving site takes all load upon failure

• Performance may be impacted (some remote locking) if inter-site distance is large

• “Fail-over” time will be excellent, and risk low (all systems are already in use running the same applications, thus constantly being tested)

• Both sites’ computing capacity is actively used


Work-arounds being used today

• Multi-hop replication

−Synchronous to nearby site

−Asynchronous to far-away site

• Transaction-based replication

−e.g. replicate transaction (a few hundred bytes) with Reliable Transaction Router instead of having to replicate all the database page updates (often 8 kilobytes or 64 kilobytes per page) and journal log file writes behind a database

Data Replication over Long Distances: Multi-Hop Replication

• It may be desirable to synchronously replicate data to a nearby “short-haul” site, and asynchronously replicate from there to a more-distant site

− This is sometimes called “cascaded” data replication

[Diagram: Primary → Secondary (Synch, Short-Haul, 100 miles) → Tertiary (Asynch, Long-Haul, 1,000 miles)]

Testing & Simulation of Long Distances

Testing / Simulation

• Before incurring the risk and expense of site selection, datacenter construction, and inter-site link procurement:

• Test within a single-datacenter test environment, with distance simulated by introducing packet latency, and bandwidth simulated by throttling traffic flow

• Techniques for simulating distance with latency:

−Hardware Network Emulators

−Software Network Emulators


Hardware Network Emulators

• A couple of vendors / products:

−Shunra STORM Network Emulator

−Spirent AdTech


Software Network Emulators

• A couple of examples:

−NIST Net from the National Institute of Standards and Technology

• http://snad.ncsl.nist.gov/nistnet/

−D4 (Dick’s Dynamic Delay Device) in OpenVMS


D4

• Capability added to OpenVMS Gigabit Ethernet LAN drivers

• Packets can be:

−Delayed

− Lost

• Bandwidth can be throttled/limited


D4

• Controlled by LAN SDA Extension:

−SDA> LAN DELAY PARAM /qualifiers

−SDA> LAN DELAY STATUS /qualifiers


D4

• LAN packets are handled / affected between a pair of Gigabit Ethernet NICs

• One non-Primary CPU recommended per pair of NICs

−Use Fast_Path to move interrupts off of Primary CPU onto a non-Primary CPU for both NICs

• So a quad-CPU OpenVMS system with 6 Gigabit Ethernet NICs can handle 3 LAN traffic streams

D4

• OpenVMS 8.3 or later, plus a LAN patch kit:

−8.3 on Alpha: VMS83A_LAN-V0300 (or later)

−8.3 on Integrity: VMS83I_LAN-V0700 (or later)

−8.3-1H1: VMS831I_LAN-V0100 (or later)

• Functionality is contained in _MON images. Set SYSTEM_CHECK to 1 or:

−Copy SYS$LOADABLE_IMAGES:SYS$EI1000_MON.EXE to SYS$LOADABLE_IMAGES:SYS$EI1000.EXE

−Copy SYS$LOADABLE_IMAGES:SYS$EW5700_MON.EXE to SYS$LOADABLE_IMAGES:SYS$EW5700.EXE

Example D4_SETUP.COM

$ !
$ ! Configure RX4640 system for LAN Delay Function using EIC/EID, EIE/EIF, EWA/EWB
$ !
$ set noon
$ !
$ ! Set preferred CPU of other devices
$ !
$ set dev fga0/pref=0
$ set dev fgb0/pref=0
$ set dev eia/pref=0
$ set dev eib/pref=0
$ set dev eig/pref=0
$ set dev eih/pref=0
$ set dev ewc/pref=0
$ !
$ ! Devices to use are the AB465A Broadcom ports (Ruchba combo)
$ !
$ set dev ewa/pref=1
$ set dev ewb/pref=1
$ !
$ ! Devices to use are the A7012A Intel ports
$ !
$ set dev eic/pref=2
$ set dev eid/pref=2
$ !
$ ! Devices to use are the AB545A Intel ports (quad card)
$ !
$ set dev eie/pref=3
$ set dev eif/pref=3
$ !
$ ! Turn off LAN driver tracing on all devices
$ !
$ mc lancp set dev/notrace/all
$ ! Turn on LAN driver tracing on interesting devices, excluding fork begin/end entries
$ !
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) ewa
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) ewb
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eic
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eid
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eie
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eif


SDA> LAN commands

• SDA> LAN DELAY PARAM /DEVICE=(device1,device2) /AGE=value /BANDWIDTH=value /BUFFER=value /DELAY=value /LOSS=value /TLOSS=value


SDA> LAN commands

−/DEVICE=(device1,device2) specifies the two LAN devices to use. They must both be assigned to the same secondary CPU.

−/DELAY=value specifies the amount of delay in microseconds to be imposed on each received packet before it is transmitted on the other device. Zero is the default.

−/BANDWIDTH=value specifies the maximum bandwidth allowed in megabits per second. Zero (default) means there is no bandwidth limit.


SDA> LAN commands

−/AGE=value specifies the packet age limit to be imposed, in microseconds. Packets older than this age are discarded. Zero (default) means there is no age limit.

−/BUFFER=value specifies the maximum amount of data in bytes to be buffered. Incoming packets that would cause this limit to be exceeded are discarded. Zero (default) means there is no buffering limit.

−/LOSS=value specifies the packet loss rate to be imposed, as the number of packets to be discarded each second. Zero (default) is no intentional packet loss.

−/TLOSS=value specifies the total number of packets to be discarded. Zero (default) means there is no limit to the number of packets that will be discarded.
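To tie the qualifiers to the distance rule of thumb, here is a hedged Python sketch that builds a LAN DELAY PARAM command string for a simulated one-way distance. It uses only the /DEVICE, /DELAY and /BANDWIDTH qualifiers described above and assumes 1 ms per 100 miles one-way (10 microseconds per mile). The helper name is hypothetical, and the exact syntax should be checked against the SDA extension on the target system before use.

```python
def lan_delay_param_command(device1: str, device2: str,
                            one_way_miles: float, bandwidth_mbps: int = 0) -> str:
    """Build an SDA command string that simulates a given one-way distance."""
    # /DELAY is in microseconds and applies to each received packet,
    # so it models the one-way delay: 1 ms per 100 miles = 10 us per mile.
    delay_us = round(one_way_miles * 10)
    command = f"LAN DELAY PARAM /DEVICE=({device1},{device2}) /DELAY={delay_us}"
    if bandwidth_mbps:
        # /BANDWIDTH is in megabits per second; zero means no limit.
        command += f" /BANDWIDTH={bandwidth_mbps}"
    return command


if __name__ == "__main__":
    print("SDA>", lan_delay_param_command("EIC", "EID", 500, bandwidth_mbps=50))
```

A 500-mile circuit limited to 50 megabits per second yields a delay of 5000 microseconds and a bandwidth of 50.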


SDA> LAN commands

• SDA> LAN DELAY STATUS /DEVICE=(device1,device2) /CONTINUOUS=value /HISTOGRAM /RESET


SDA> LAN commands

−/DEVICE=(device1,device2) specifies the two LAN devices to use. They must both be assigned to the same secondary CPU. If no devices are specified, status will be displayed for all device pairs.

−/CONTINUOUS=value specifies that the status display is to be repeated every value seconds. The default is no repetitions.

SDA> LAN commands

−/HISTOGRAM specifies that histogram data should be displayed, which includes:

• Delay Variance (not a true statistical variance) – the difference between the expected time that a transmit was to be issued and the time it actually was issued. For example, if the specified delay was 50 microseconds and a packet was transmitted 55 microseconds after the packet was received, the histogram bucket incremented is for 5 microseconds. This gives you an idea how accurate the delay function is. There are 64 buckets of 1024 CPU cycles each, so for a 1000 MHz processor, each bucket is 1.024 microseconds wide. Note that this does not include any additional delay that arises later, for example because the transmit queue on the device is backing up under load or because of flow control.

• Packets Outstanding – the number of packets outstanding to the other device for transmit. There are 16 buckets of 64 packets each, so the first bucket is for 0-63 packets outstanding, etc.

• Bytes Outstanding – the number of bytes outstanding to the other device for transmit. There are 16 buckets of 64k bytes each, so the first bucket is for 0-65535 bytes, etc.

• Packet Length – the length of each received packet, in 16 buckets shown in the display as 64..127, 128..191, etc.
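The Delay Variance bucket arithmetic can be worked through in a few lines of Python. The bucket size (1024 CPU cycles) and count (64) come from the description above; clamping out-of-range values into the last bucket is an assumption, and the function names are illustrative.

```python
CYCLES_PER_BUCKET = 1024
BUCKET_COUNT = 64


def bucket_width_us(cpu_mhz: float) -> float:
    """Width of one Delay Variance bucket in microseconds."""
    return CYCLES_PER_BUCKET / cpu_mhz


def bucket_index(variance_us: float, cpu_mhz: float) -> int:
    """Bucket that a given delay variance falls into.

    Assumption: values beyond the last bucket are counted in the last bucket.
    """
    index = int(variance_us / bucket_width_us(cpu_mhz))
    return min(index, BUCKET_COUNT - 1)


if __name__ == "__main__":
    print(bucket_width_us(1000))        # 1000 MHz processor
    print(bucket_index(5, 1000))        # the 5-microsecond example
    print(63 * bucket_width_us(1300))   # start of the last bucket at 1.30 GHz
```

At 1.30 GHz each bucket is about 0.79 microseconds wide and the last bucket starts just under 50 microseconds.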


SDA> LAN commands

−/RESET – clears the counters before the display (you can also use LAN DELAY PARAM /DEVICE=(device1,device2) to clear the counters).

LAN DELAY STATUS Example

WAN$SDA(X-1) Extension on VLAN4 (HP rx4640 (1.30GHz/3.0MB)) at 9-JUL-2006 13:02:10.96

---------------------------------------------------------------------------------------

Device 1: EIC (Active) Device 2: EID (Active) CPU affinity: 2

Delay (usec): 5000 Max packet age (usecs): 0 Loss rate (pk/sec): 0

Bandwidth (mbits/sec): 50 Max buffering (bytes): 0 Total loss (pks): 0

EIC Xmt (pk) 1668495 (by) 13668246768 (mpk) 8 (mby) 1264 Lost (age) 0

EIC Rcv (pk) 1668228 (by) 13666059504 (mpk) 8 (mby) 1264 Lost (buffering) 0

EIC MBits/sec (128 pk) Xmt 0.00 Rcv 0.00 X+R 0.00 Lost (intentional) 0

EIC MBits/sec (512 pk) Xmt 0.00 Rcv 0.00 X+R 0.01 Lost (pool) 0

EIC MBits/sec (4096 pk) Xmt 0.04 Rcv 0.04 X+R 0.08 Current xmt (pk) 0/8

EIC MBits/sec (All pk) Xmt 11.91 Rcv 11.91 X+R 23.83 Current xmt (by) 0/57344

EIC Failures: Link 1 Xmt 0 Rcv 0 Elapsed time (sec) 9178

EID Xmt (pk) 1668228 (by) 13666059504 (mpk) 8 (mby) 1264 Lost (age) 0

EID Rcv (pk) 1668594 (by) 13669057776 (mpk) 8 (mby) 1264 Lost (buffering) 0

EID MBits/sec (128 pk) Xmt 0.00 Rcv 0.00 X+R 0.00 Lost (intentional) 0

EID MBits/sec (512 pk) Xmt 0.00 Rcv 0.00 X+R 0.01 Lost (pool) 0

EID MBits/sec (4096 pk) Xmt 0.04 Rcv 0.04 X+R 0.08 Current xmt (pk) 100/483

EID MBits/sec (All pk) Xmt 11.91 Rcv 11.91 X+R 23.83 Current xmt (by) 819200/3956736

EID Failures: Link 1 Xmt 0 Rcv 0 Elapsed time (sec) 9178

SDA>
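The counters in this display can be cross-checked against each other. A short Python sketch, assuming "MBits/sec (All pk)" is the byte counter times 8, divided by the elapsed time, in units of 10^6 bits (the function name is illustrative):

```python
def average_mbits_per_sec(total_bytes: int, elapsed_sec: float) -> float:
    """Average rate in megabits per second (10**6 bits) over the elapsed time."""
    return total_bytes * 8 / elapsed_sec / 1_000_000


if __name__ == "__main__":
    # EIC counters from the example display, elapsed time 9178 seconds.
    print(round(average_mbits_per_sec(13668246768, 9178), 2))  # Xmt
    print(round(average_mbits_per_sec(13666059504, 9178), 2))  # Rcv
```

Both values round to 11.91, matching the "All pk" row of the display.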


LAN DELAY STATUS/HISTOGRAM Example

WAN$SDA(X-1) Extension on VLAN4 (HP rx4640 (1.30GHz/3.0MB)) at 27-AUG-2006 13:32:33.17

---------------------------------------------------------------------------------------

Device 1: EIC (Active) Device 2: EID (Active) CPU affinity: 2

Delay (usec): 0 Max packet age (usecs): 0 Loss rate (pk/sec): 0

Bandwidth (mbits/sec): 0 Max buffering (bytes): 0 Total loss (pks): 0

EIC Delay Variance (0..49+ usec): - - - 23% 44% 19% 10% - - 1% 3% - - - - -

- - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - -

EIC Packets Outstanding (0..960+): 100% - - - - - - - - - - - - - - -

EIC Bytes Outstanding (0..960k+) : 100% - - - - - - - - - - - - - - -

EIC Packet Length: 64+ 128+ 192+ 256+ 384+ 448+ 512+ 756+ 1024 1280 1519 2048 3072 4096 6144 8192

EIC Packets: 33% 1% - 32% - - - - 17% 13% - - - - - 4%

EID Delay Variance (0..49+ usec): - - - 25% 46% 23% 2% - - 1% 3% - - - - -

- - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - -

EID Packets Outstanding (0..960+): 100% - - - - - - - - - - - - - - -

EID Bytes Outstanding (0..960k+) : 100% - - - - - - - - - - - - - - -

EID Packet Length: 64+ 128+ 192+ 256+ 384+ 448+ 512+ 756+ 1024 1280 1519 2048 3072 4096 6144 8192

EIC Packets: 33% 1% - 32% - - - - 17% 13% - - - - - 4%

Real-Life Examples

Real-Life Example: Credit Lyonnais, Paris

• Credit Lyonnais fire in May 1996

• OpenVMS multi-site cluster with data replication between sites (Volume Shadowing) saved the data

• Fire occurred over a weekend, and DR site plus quick procurement of replacement hardware allowed bank to reopen on Monday

Source: Metropole Paris

“In any disaster, the key is to protect the data. If you lose your CPUs, you can replace them. If you lose your network, you can rebuild it. If you lose your data, you are down for several months. In the capital markets, that means you are dead. During the fire at our headquarters, the DIGITAL VMS Clusters were very effective at protecting the data.”

Jordan Doe, Patrick Hummel, IT Director, Capital Markets Division, Credit Lyonnais

Headquarters for Manhattan's Municipal Credit Union (MCU) were across the street from the World Trade Center, and were devastated on Sept. 11. "It took several days to salvage critical data from hard-drive arrays and back-up tapes and bring the system back up" ... "During those first few chaotic days after Sept. 11, MCU allowed customers to withdraw cash from its ATMs, even when account balances could not be verified. Unfortunately, up to 4,000 people fraudulently withdrew about $15 million."

Ann Silverthorn, Network World Fusion, 10/07/2002

http://www.nwfusion.com/research/2002/1007feat2.html


Real-Life Examples: Commerzbank on 9/11

• Datacenter near WTC towers

• Generators took over after power failure, but dust & debris eventually caused A/C units to fail

• Data replicated to remote site 30 miles away

• One AlphaServer continued to run despite 104° F temperatures, running off the copy of the data at the opposite site after the local disk drives had succumbed to the heat

• See http://h71000.www7.hp.com/openvms/brochures/commerzbank/


Real-Life Examples of OpenVMS: International Securities Exchange

• All-electronic stock derivatives (options) exchange

• First new stock exchange in the US in 26 years

• Went from nothing to majority market share in 3 years

• OpenVMS Disaster-Tolerant Cluster at the core, surrounded by other OpenVMS systems

• See http://h71000.www7.hp.com/openvms/brochures/ise/


“OpenVMS is a proven product that’s been battle tested in the field. That’s why we were extremely confident in building the technology architecture of the ISE on OpenVMS AlphaServer systems.”

Danny Friel, Sr. Vice President, Technology / Chief Information Officer, International Securities Exchange


“We just had a disaster at one of our 3 sites 4 hours ago. Both the site's 2 nodes and 78 shadow members dropped when outside contractors killed all power to the computer room during maintenance. Fortunately the mirrored site 8 miles away and a third quorum site in another direction kept the cluster up after a minute of cluster state transition.”

Lee Mah, Capital Health Authority

writing in comp.os.vms, Aug. 20, 2004


“I have lost an entire data center due to a combination of a faulty UPS combined with a car vs. powerpole, and again when we needed to do major power maintenance. Both times, the remaining half of the cluster kept us going.”

Ed Wilts, Merrill Corporation

writing in comp.os.vms, July 22, 2005


Business Continuity


Business Continuity: Not Just IT

•The goal of Business Continuity is the ability for the entire business, not just IT, to continue operating despite a disaster.

•Not just computers and data:

−People

−Facilities

−Communications: Data networks and voice

−Transportation

−Supply chain, distribution channels

−etc.


Useful Resources


Business Continuity Resources

• Disaster Recovery Journal:

− http://www.drj.com/

• Continuity Insights Magazine:

− http://www.continuityinsights.com//

• Contingency Planning & Management Magazine

− http://www.contingencyplanning.com/

• All are high-quality journals. The first two are available free to qualified subscribers

• All hold conferences as well


Multi-OS Disaster-Tolerant Reference Architectures Whitepaper

• Entitled “Delivering high availability and disaster tolerance in a multi-operating-system HP Integrity server environment”

• Describes DT configurations across all of HP’s platforms: HP-UX, OpenVMS, Linux, Windows, and NonStop

• http://h71028.www7.hp.com/ERC/downloads/4AA0-6737ENW.pdf


Tabb Research Report

• "Crisis in Continuity: Financial Markets Firms Tackle the 100 km Question"

−available from https://h30046.www3.hp.com/campaigns/2005/promo/wwfsi/index.php?mcc=landing_page&jumpid=ex_R2548_promo/fsipaper_mcc%7Clanding_page


Draft Interagency White Paper

• “Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System”

− http://www.sec.gov/news/studies/34-47638.htm

• Agencies involved: Federal Reserve System, Department of the Treasury, Securities & Exchange Commission (SEC)

• Applies to: Financial institutions critical to the US economy

• But many other agencies around the world are adopting similar rules


Business Continuity and Disaster Tolerance Services from HP

Web resources:

• BC Services:

− http://h20219.www2.hp.com/services/cache/10107-0-0-225-121.aspx

• DT Services:

− http://h20219.www2.hp.com/services/cache/10597-0-0-225-121.aspx


OpenVMS Disaster-Tolerant Cluster Resources

• OpenVMS Documentation at OpenVMS website:

− OpenVMS Cluster Systems

− HP Volume Shadowing for OpenVMS

− Guidelines for OpenVMS Cluster Configurations

• OpenVMS High-Availability and Disaster-Tolerant Cluster information at the HP corporate website: http://h71000.www7.hp.com/availability/index.html and http://h18002.www1.hp.com/alphaserver/ad/disastertolerance.html

• More-detailed seminar and workshop notes at http://www2.openvms.org/kparris/ and http://www.geocities.com/keithparris/

• Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9


Questions?


Speaker Contact Info:

•Keith Parris

•E-mail: [email protected] [email protected]

•Web: http://www2.openvms.org/kparris/
