“z/OS Multi-Site Business Continuity” September, 2012
DESCRIPTION
Learn about the “z/OS Multi-Site Business Continuity” paper of September, 2012. This paper explores the various GDPS configuration deployments that clients have implemented to provide high availability/continuous operations locally and/or out of region disaster recovery protection. It also explores the trend towards trying to reduce D/R testing costs by moving toward a ‘regular site switch’ or ‘site toggle’ model. For more information on IBM System z, visit http://ibm.co/PNo9Cb. Visit the official Scribd Channel of IBM India Smarter Computing at http://bit.ly/VwO86R to get access to more documents.
TRANSCRIPT
Page 1 9/11/2012
2 © Copyright IBM Corporation, 2012
“z/OS Multi-Site Business Continuity” September, 2012
Robert F. Kern
E-mail: [email protected],
Notices
Copyright © 2012 by International Business Machines Corporation.
No part of this document may be reproduced or transmitted in any form without written
permission from IBM Corporation.
The information provided in this document is distributed “AS IS” without any warranty,
either express or implied. IBM EXPRESSLY DISCLAIMS any warranties of
merchantability, fitness for a particular purpose OR INFRINGEMENT.
IBM shall have no responsibility to update this information.
IBM products are warranted according to the terms and conditions of the agreements (e.g.,
IBM Customer Agreement, Statement of Limited Warranty, International Program
License Agreement, etc.) under which they are provided. IBM is not responsible for the
performance or interoperability of any non-IBM products discussed herein.
The provision of the information contained herein is not intended to, and does not, grant
any right or license under any IBM patents or copyrights. Inquiries regarding patent or
copyright licenses should be made, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
USA
Trademarks
The following trademarks may appear in this paper.
AIX, AS/400, DS8000, Enterprise Storage Server, Enterprise Storage Server Specialist,
ESCON, FICON, FlashCopy, Geographically Dispersed Parallel Sysplex, HyperSwap,
IBM, iSeries, OS/390, RMF, System/390, S/390, Tivoli, TotalStorage, z/OS, and zSeries
are trademarks of International Business Machines Corporation or Tivoli Systems Inc.
Other company, product, and service names may be trademarks or registered trademarks
of their respective companies.
Abstract
Clients look for ways to reduce their TCO, simplify operations, and provide better service
to their customers. A trend in the area of Business Continuity today is that more and
more clients are looking to develop multi-site Continuous Operations and D/R strategies,
with the idea of regularly switching the site at which production runs. The concept
of toggling between sites, or doing site flip/flops, is receiving more attention. Most clients
today who exploit toggling between sites do so with the full GDPS/PPRC HyperSwap
functionality deployed with a Multi-Site Workload. This configuration provides the
ability to switch sites in real time, with minimal interruption to the business.
Another emerging trend is for clients with out of region data centers to start examining
how they might best accomplish the same business objective of switching sites
while minimizing the impact to their business during the site switch operation.
Introduction
This paper explores the various GDPS configuration deployments that clients have
implemented to provide high availability/continuous operations locally and/or out of
region disaster recovery protection. It also explores the trend towards trying to reduce
D/R testing costs by moving toward a ‘regular site switch’ or ‘site toggle’ model. To do
this, the paper will examine each of these aspects:
- 2 sites within metro/sysplex distance
  - active/active (multi-site workload) with HyperSwap and parallel
    sysplex exploitation - non-disruptive flip/flop
  - active/standby (single site workload) with HyperSwap and parallel
    sysplex exploitation - non-disruptive flip/flop possible with appropriate
    configuration and temporary performance impact. Applications that do
    not exploit sysplex incur an outage during the site move. Disruptive
    site switches are typically automated to minimize the outage duration.
- 2 sites beyond metro/sysplex distance or using asynchronous data replication
  - Disruptive switch, but automated to minimize outage duration
  - Active/Standby: DB2 & IMS application disaster recovery at
    distance. Two separate Sysplexes at distance with application
    level Active/Standby across the two Sysplexes utilizing application
    specific software based data replication technology.
- 3-site configurations & benefits
- Future vision
The traditional two site model provides for Site 1 as the “primary production” site
and Site 2 as the “backup or remote recovery” site. The regular site toggle model is a
peer to peer relationship model where production can run at either site and switching
sites for “business reasons” on a regular basis becomes the business norm. An
active/active model, that enables a site switch with minimal performance impacts, can
be realized by clients through the following:
- Sysplex enabled applications
- Deployment of a multi-site workload under GDPS/PPRC with HyperSwap
- Duplication of all site resources across the two sites
As distances between sites increase, data replication must switch from synchronous to
asynchronous techniques to avoid application performance impacts. In addition,
parallel sysplex distances are typically determined by the acceptable CF Link
performance for the various applications as well as the maximum STP Timer distances
(200km maximum). With these types of configurations a site switch is possible, but
an automated sysplex wide IPL is required. End to end automation like GDPS can
minimize the outage time to perform the site switch.
This paper will discuss trends and directions in this arena for z/OS.
High Availability/Continuous Operations & Out of Region Disaster
Protection
IT Infrastructure Availability can be broken down into three pieces: High
Availability, Continuous Operations and Disaster/Recovery. Each brings unique
requirements to clients when addressing Business Continuity. Through an
understanding of the client business requirements in this arena, IBM can help tailor the
right solution at the right cost point for any IT infrastructure.
[Figure: Business Continuity - Aspects of Availability]
- High Availability: fault-tolerant, failure-resistant infrastructure supporting continuous application processing
- Continuous Operations: non-disruptive backups and system maintenance coupled with continuous availability of applications
- Disaster Recovery: protection against unplanned outages such as disasters through reliable, predictable recovery
Figure callouts: protection of critical business data; recovery is predictable and reliable; operations continue after a disaster; costs are predictable and manageable.
GDPS Solutions Overview
GDPS (Geographically Dispersed Parallel Sysplex) shipped originally in 1998 and
introduced the concept of multi-site IT Infrastructure resource management, for the
Sysplex. GDPS automation enhances the z/OS base sysplex and parallel sysplex
management to an end to end “server, workload, and data, with a coordinated network
switch” solution of resource management within the same or across multiple sites
providing continuous operations for clients. To accomplish this, GDPS automation
interfaces with many different System z hardware & software interfaces to reduce the
necessity of skilled personnel to perform various operations during a site switch. Some of
these interfaces include:
- System z Hardware Management Console (HMC) to manage the System z
  hardware reconfigurations dynamically (e.g., CBU, expanding LPARs, system IPLs,
  etc.)
- Sysplex & STP Timer interfaces
- CF Duplexing interfaces
- DS8000 data replication functions: FlashCopy, z/OS Global Mirror (XRC),
  Metro Mirror (PPRC), and Global Mirror
- Various z/OS system interfaces
- z/OS integrated with various DS8000 synergy items
GDPS is storage vendor independent as all major storage vendors on the System z
platform can participate in solutions using their implementation of the IBM DS8000 Disk
Storage Subsystem data replication architecture of Metro Mirror, FlashCopy and zGM
(XRC). New features and functions are developed with the IBM Systems Storage team
on the DS8000. IBM sells the Host to Storage Subsystem “architecture” to the other
storage vendors. Those vendors then implement the feature/function on their disk
subsystems based on the Host to Disk Storage Subsystem architected interfaces. So, the
disk storage subsystem internal processing for a feature or function may be different from
one vendor to another. Depending on the specific feature/function there generally is
some time where the feature/function is only available on the DS8000. One should
consult with each storage vendor to understand specific feature/function support for any
DS8000 storage subsystem enhancement.
In addition, the GDPS automation inter-operates with all major system automation
packages available for System z.
Relative to Business Resiliency/Business Continuity, IBM’s Flagship product is GDPS.
GDPS comes in a variety of different flavors/solutions. The following two charts
illustrate the various solutions.
GDPS provides an entry level solution called GDPS HyperSwap Manager, focused on
providing the HyperSwap availability solution for z/OS on the same data center floor or
across two local area data centers up to 200km apart with Parallel Sysplex.
GDPS/PPRC HyperSwap is the full function version, to which HyperSwap Manager can
easily be upgraded. The full function GDPS/PPRC HyperSwap supports zVM and
zLinux data along with z/OS data. In addition to masking disk subsystem failures, the full
function version exploits Parallel Sysplex to mask CEC failures, persistent sessions to
coordinate a network switch, CF Duplexing to manage CF structure failures and VTS PtP
to mask tape subsystem failures. Finally, if the failures evolve into a disaster scenario,
GDPS provides a complete end to end site failover/fallback capability for both planned
and unplanned site switches. With one mouse click, the server, data, workload and a
coordinated network site switch are performed via automation. All data is recovered, the
Sysplex is IPLed, and the data bases are restarted, followed by the applications. Skilled personnel
are no longer required to get the Sysplex up and running in the event of a disaster.
GDPS/GM (System z & Open Systems data) & GDPS/XRC (z/OS & zLinux only)
provide site failover/failback (FO/FB), typically “out of region” exploiting IBM’s Global
Mirror and zGM (XRC) data replication technologies.
GDPS/MzGM and GDPS/MGM provide a combination of high availability/continuous
operations locally coupled with out of region D/R protection. All GDPS solutions are
fully automated, proven, auditable, and in the case of PPRC and zGM (XRC) storage
vendor independent!
The various GDPS solutions also support zVM and zLinux data through a feature called
x/DR.
The GDPS System z umbrella also includes the ability for GDPS automation to inter-operate
with System p, x, i (Linux), Windows, HP and Sun through the GDPS/DCM
(Distributed Cluster Manager) automation “inter-operability code” feature that works in
conjunction with Tivoli System Automation Application Manager (SA AppMan) and/or
the Symantec Veritas Cluster Server solutions. With GDPS and the x/DR and/or DCM
features, a single mouse click can yield a coordinated site failover/fallback of all of the
customer’s systems (e.g., System z (z/OS, zLinux, zVM) coordinated with, say, System p
AIX systems). The disk replication functions can be managed separately with the GDPS
and DCM automation or together, depending on the client’s requirements for cross
platform data consistency.
GDPS is built upon the IBM DS8000 storage based data replication architecture for
FlashCopy, Metro Mirror, z/OS Global Mirror and Global Mirror. As new features and
functions are implemented in the DS8000, GDPS automation is modified to exploit those
features and functions. In addition, GDPS supports various DS8000 base box features
used in conjunction with the various advanced functions.
IBM DS8000 Metro Mirror and Global Mirror support a function known as ‘Open Lun
Support’, such that through an ECKD device address, GDPS automation is able to
manage the Metro Mirror and/or Global Mirror functions for a distributed system Lun(s).
This is also true for Metro Global Mirror configurations. With the Open Lun support,
GDPS can provide a single restart point across the platforms. More systems and data
replication alternatives will continue to be provided in the future based on client
requirements. This is especially important for clients that have Multi-Platform
Applications where transactions are for example initially received by a Windows system,
then routed to say an AIX system and then to the “backend” z/OS System. Each system
may save data and as a result to recover the “application”, multiple platforms must be
recovered to the same point in time. GDPS inter-operability with Tivoli AppMan and/or
Symantec Veritas Cluster Server can provide such a solution for clients.
Open Lun Support is also important for clients with applications like SAP where the user
interfaces are typically on non System z platforms and the backend data base runs on
z/OS. In some cases clients have moved the application’s parts that were running on non-
System z platforms to zLinux, but many clients resist introducing the risk of any change
to critical production applications that have been running for some time. Open Lun
Support can provide a data consistency solution for multi-platform application(s). All
data is recovered to a single point in time enabling each platform’s data base to perform a
data base Restart operation instead of a data base Recover operation when a site switch
occurs. The data base restart process manages all “in flight” and “in doubt” transactions,
which in turn permits the application(s) parts spread across the different platforms to
resume processing from the restarted point in time forward. GDPS automation when
combined with the DCM automation feature can inter-operate across the enterprise to
provide a complete business solution for clients in the area of IT business continuity.
This critical business function is made possible by the DS8000 ‘open Lun support’.
Two Local Data Centers - 2 sites within metro/sysplex distance
The full GDPS/PPRC HyperSwap implementation can be configured as an active/active
“multi-site workload” or active/standby “single-site workload”, providing real time
planned and unplanned site switches through the deployment of the following
features/functions:
- Parallel Sysplex – permits the movement of a workload from one processor at site 1 to an
alternate CEC in site 2.
- Sysplex enabled Applications. (required for multi-site workloads)
- HyperSwap – permits the ability for disk access to switch from a Metro Mirror Primary
volume(s) to the target volume(s) and reverse the mirror without an IPL of the parallel Sysplex.
- VTS Peer to Peer Tape configuration permits real time tape mirroring across multiple
physical Tape libraries without interrupting operations.
- Multiple Sysplex Timers permit timer switches in real time.
- CF Duplexing permits the switching of data structure access in real time.
- The concept of persistent sessions enables real time network switches.
Some customer applications have affinities. (e.g., all transactions for a given type must
be routed to a specific system, one transaction passes information onto the next
transaction, etc.). A sysplex enabled application requires that all affinities be removed
so a transaction can be routed to & execute on any clone of the application on any
system in the sysplex. When this is done, the application can then be run in an
active/active, multi-site workload configuration. Transactions can be distributed to run
on any system within the Sysplex, independent of their physical location.
Through the GDPS automation, more and more clients perform both Planned and
Unplanned site switches on a regular basis. Planned site switches are used to minimize
the production risks associated with site or equipment maintenance. Once a lights out
data center opens its doors for maintenance operations, the possibility exists for
production impacts. These can be minimized by switching production to the alternate
site in real time with a multi-site workload configuration. Providing the ability for a
client to exploit this type of operational functionality has spurred clients to think of new
approaches and new business exploitations of the technology.
[Figure: GDPS/PPRC: a Continuous Availability and/or Disaster Recovery Solution - Metropolitan Distance. Diagram labels: Site 1, Site 2, Network.]
- Manages Multi-Site Parallel Sysplex, Processors, CBU, CF, Couple Data Sets
- Manages Disk Remote Copy (System z & open LUN)
- Manages Tape Remote Copy (PtP VTS)
- Exploits HyperSwap & FlashCopy function
- Automated planned and unplanned actions (z/OS, CF, disk, tape, site)
- Improves availability of heterogeneous System z business operations
- Planned and unplanned exception conditions
The above diagram shows a high-level view of the GDPS/PPRC topology. The physical
topology of a GDPS/PPRC consists of a base or Parallel Sysplex cluster spread across
two sites (known as site 1 and site 2) with one or more z/OS systems at each site,
separated by up to 200 kilometers (km). The multi-site sysplex must be configured with
redundant hardware (e.g., a Coupling Facility and a Sysplex Timer in each site) and the
cross site connections (typically dedicated or ‘dark’ fibre) must be redundant. All critical
data is mirrored from the primary site (site 1 in this diagram) to the secondary site (site 2).
All Shared CF structures are located on the primary site coupling facilities. Therefore,
when transactions are executed on the processors at the remote site, disk I/O and Shared
CF structure access is through links from the secondary site to the primary site and the
disk I/O and CF structure updates are then mirrored in a synchronous manner back to the
remote site. This adds additional overheads to the applications disk I/O as well as any
access to shared CF structures. Before a customer elects to deploy a multi-site
configuration, the customer must first ensure that the applications are sysplex enabled, after which
careful consideration must be given to the system & application performance impacts of
these two accesses when a transaction is executed at the remote site. In many cases the
application performance impact will limit the effective distance that an active/active
configuration can actually sustain.
For disk I/O, the rule of thumb for the performance impact of Metro Mirror (MM) is:
1. Disk subsystem overhead of MM at zero distance + (plus)
2. speed of light through dedicated “dark” fibre for a single protocol exchange (linear function
of 1ms/100km or .1ms/10km) x (times)
3. the # of protocol exchanges implemented in the specific MM disk to disk implementation (for
IBM DS8000 MM, a single protocol exchange is accomplished through a feature called pre-deposit
write) + (plus)
4. other device overheads that may be on the fibre path (e.g., switches, DWDMs, compression
and/or encryption devices, channel extenders, etc.)
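The four terms of this rule of thumb can be written out as a small calculation. The following Python sketch is illustrative only: the function name, the parameter names and the sample overhead value are assumptions made for this example and are not measured DS8000 figures. Only the 1ms/100km fibre term and the single protocol exchange for DS8000 Metro Mirror are taken from the text.

```python
FIBRE_MS_PER_KM = 1.0 / 100.0  # 1 ms per 100 km per protocol exchange (rule of thumb)


def metro_mirror_write_impact_ms(distance_km,
                                 zero_distance_overhead_ms,
                                 protocol_exchanges=1,
                                 other_device_overhead_ms=0.0):
    """Estimate the response time added to one mirrored write, in milliseconds.

    zero_distance_overhead_ms and other_device_overhead_ms are inputs that
    must come from the storage and network vendors; they are not known here.
    protocol_exchanges defaults to 1, the value the paper gives for DS8000 MM.
    """
    if distance_km < 0:
        raise ValueError("distance_km must not be negative")
    if protocol_exchanges < 1:
        raise ValueError("protocol_exchanges must be at least 1")
    fibre_ms = FIBRE_MS_PER_KM * distance_km * protocol_exchanges
    return zero_distance_overhead_ms + fibre_ms + other_device_overhead_ms


# Hypothetical inputs: 50 km of fibre and an assumed 0.2 ms zero-distance overhead.
estimate = metro_mirror_write_impact_ms(50, zero_distance_overhead_ms=0.2)
print(f"estimated added write latency: {estimate:.2f} ms")
```

With these assumed inputs the fibre term alone contributes 0.5 ms, which shows why distance dominates the estimate once the sites are more than a few tens of kilometers apart.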
For CF signal latency, the rule of thumb is:
- Signal latency impact (round trip) = 10 US/KM * fiber distance KM * # of protocol
  exchanges
- Example: assume two sites separated by 10 KM and a processor in site 1 is accessing disk in
  site 2; signal latency impact = 10 US/KM * 10 KM * 1 (FICON has one protocol exchange) or
  100 US impact
- Terminology:
  - Kilometer (KM): one KM equals approximately 5/8 mile
  - Millisecond (MS): 10**-3 second
  - Microsecond (US): 10**-6 second
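The signal latency formula and its 10 KM example can be reproduced in a few lines. This is a minimal sketch in Python; the function names are chosen for this example, and the list of distances in the loop is hypothetical apart from the 10 KM case and the 200 km maximum mentioned in the paper.

```python
US_PER_KM_ROUND_TRIP = 10  # microseconds per km of fibre, round trip (rule of thumb)


def signal_latency_us(fibre_km, protocol_exchanges=1):
    """Round-trip signal latency impact in microseconds."""
    return US_PER_KM_ROUND_TRIP * fibre_km * protocol_exchanges


def us_to_ms(microseconds):
    """Convert microseconds (10**-6 s) to milliseconds (10**-3 s)."""
    return microseconds / 1000.0


# The paper's example: 10 KM, one protocol exchange (FICON).
print(signal_latency_us(10, protocol_exchanges=1))  # 100

# The same formula at a few illustrative distances.
for km in (10, 50, 100, 200):
    print(km, "km:", us_to_ms(signal_latency_us(km)), "ms per exchange")
```

At 100 km the result is 1 ms per exchange, which agrees with the 1ms/100km figure used in the Metro Mirror rule of thumb.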
For most clients, the impact of CF signal latency beyond 40-50 km (25-30 miles) yields
too great an application impact. Because of this, GDPS/PPRC multi-site implementations
typically tend to be at campus or metro distances.
If customer applications are not sysplex enabled and/or the application performance
impact of a multi-site configuration is too great, then the choice for these clients
becomes GDPS/PPRC w/HyperSwap in a single-site (active/standby) configuration. In
this configuration, all hardware can be duplicated across the two sites. The secondary
site processor typically will run the GDPS control system, typically referred to as the K-sys.
Both a planned and an unplanned site switch will involve the re-IPL of all systems in the
Sysplex at the recovered site, after automation has recovered and switched all dependent
resources.
GDPS/PPRC prerequisites include NetView and System Automation for z/OS. GDPS
automation also interacts with any existing automation products. With a multi-site
Parallel Sysplex, this provides a Continuous Availability/Continuous Operations and a
Disaster Recovery solution. In addition, GDPS provides a set of panels for standard
actions as well as the ability to customize scripts for an installation.
GDPS/PPRC Multi-site sysplex. At least one system in Site 2 is in the site 1
production Sysplex. All production can run in site 1 while the GDPS “K-sys” runs in site 2,
or production can run in either or both of sites 1 & 2. Sysplex timers and CFs are in both
sites. Two (for availability) fiber trunks are recommended to connect both sites. For
unplanned reconfigurations, such as system failures and processor failures, systems can be restarted in
place or on the other site depending upon how they are defined.
GDPS/PPRC Single-site sysplex. All production images run at the primary site. The
GDPS “K-sys” typically runs at site 2 and all resources are typically available at both
sites. Sysplex timers and CFs are in both sites. Two (for availability) fiber trunks are
recommended to connect both sites.
The following outlines the typical resources available at each site for GDPS/PPRC
w/HyperSwap.
- Base Sysplex or Parallel Sysplex environment
- Manages unplanned reconfigurations
  - z/OS, CF, disk, tape, & coordinates network connections
  - Designed to maintain data consistency and integrity across all volumes
  - Fast, automated site failover
  - No or limited data loss
- Single point of control for
  - Standard actions
    - Stop, Remove, IPL system(s)
  - Parallel Sysplex Configuration management
    - Couple data set (CDS), Coupling Facility (CF) management
  - User defined script (e.g. Planned Site Switch)
  - PPRC Configuration management
2 Sites Beyond Metro/Sysplex Distance
GDPS solutions beyond metro/Sysplex distance include GDPS/XRC and GDPS/GM.
Clients select either the XRC or GM data replication technique based on their specific
requirements. XRC provides for the lowest possible RPO and only supports z/OS and
zLinux data. Global Mirror provides for a tunable RPO (3-5 seconds to 18 hours) and
supports all System z and distributed systems data.
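The choices described so far (multi-site or single-site workload within sysplex distance, XRC or Global Mirror beyond it) can be summarized as a simple decision sketch. The Python below is a deliberate simplification for illustration: the function and parameter names are hypothetical, the 200 km limit is the maximum STP distance cited earlier, and a real selection would weigh many more factors (RPO and RTO targets, cost, network, application performance measurements).

```python
SYSPLEX_MAX_KM = 200  # maximum STP timer distance cited in this paper


def suggest_gdps_configuration(distance_km,
                               apps_sysplex_enabled,
                               multi_site_performance_acceptable,
                               needs_distributed_systems_data=False):
    """Return a coarse suggestion that follows the paper's narrative.

    This is an illustration of the reasoning, not an IBM sizing or
    selection tool.
    """
    if distance_km <= SYSPLEX_MAX_KM:
        # Synchronous Metro Mirror with HyperSwap is possible.
        if apps_sysplex_enabled and multi_site_performance_acceptable:
            return "GDPS/PPRC HyperSwap, multi-site workload (active/active)"
        return "GDPS/PPRC HyperSwap, single-site workload (active/standby)"
    # Beyond metro/sysplex distance, replication is asynchronous.
    if needs_distributed_systems_data:
        # XRC supports only z/OS and zLinux data.
        return "GDPS/GM"
    return "GDPS/XRC or GDPS/GM, depending on RPO requirements"


print(suggest_gdps_configuration(30, True, True))
print(suggest_gdps_configuration(30, False, True))
print(suggest_gdps_configuration(800, True, True, needs_distributed_systems_data=True))
```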
With asynchronous data replication solutions a site switch will require an automated
Sysplex wide IPL. Asynchronous data replication can support a “Planned Site Switch”
with no loss of data, but to do this the applications must be shut down. Storage based
data replication technology today supports planned site Failover/Failback scenarios
such that only changed data need be copied back to resync the sites. This capability is
available today with the various flavors of GDPS 2-site and 3-site solutions. But, in
each case a Sysplex wide IPL, data replication disk/tape switch, and a client end user
network switch must be done in a coordinated manner. In this way the Sysplex is
restarted as well as all data bases and application(s) workloads at the remote site.
When the various data bases are restarted, “In Flight” and “In Doubt” transactions are
resolved as well as a “rebuild” of any and all coupling facility structures.
If a “planned outage” can be tolerated by the client, then switching sites on a regular
basis can help to minimize costs involved with D/R testing. Planned site switches can
verify that all the resources required to run the application are available in both sites.
This can then be fully tested to ensure that enough capacity (processor, storage, network,
etc.) is available at both sites for any and all combinations of the workload. In addition,
a client is testing the complete production application(s) end to end. Often, traditional
D/R tests only verify that the ‘system platform’ can be IPLed and, based on the time
available, some minimal subset of the production workload is executed. The best D/R
test can be executed by a site switch that in fact leaves production to run in each site for
a reasonably long period of time (e.g., 3-6 months). During this time, the application
typically has gone through various periods of the business cycle including end of day,
end of week, and end of quarter processing. Through careful planning, one can
eventually verify that all application processing can be executed independent of site.
This approach fits into some business models better than others. In some countries a
physical site utility check is required once a year. This requires a full electrical
shutdown. Therefore a site switch to the other production site may be easier in this
environment as the outage is minimized to the time to perform the site switch and have
the application(s) back up and running rather than also including the time to verify all
utilities at the original production site.
The simple approach to ensure that a client can easily switch sites and run all
applications with similar performance, scalability and capacity growth is to duplicate
all hardware and software resources across both sites. If a client currently has
deployed a 3-site GDPS configuration with GDPS/PPRC HyperSwap locally at the
production site, one would also want to deploy the same configuration at the target
sister production site. This would typically be called a 4-site configuration and is pictured
below.
The emerging thought is that money currently spent on Disaster Recovery testing
could be decreased if one could provide, on a regular basis, the ability to switch back
and forth across sites in an automated fashion. When implemented, planned site
switches provide this function. That means that D/R testing need only verify that the
unique automation required to perform a site switch for an unplanned scenario also
works. Customers minimize the differences between planned and unplanned site switch
scenarios today by deploying the “Test the way we recover and recover the way we
test” model. Typically today, several clients’ D/R testing is done at the remote site
while maintaining full D/R protection. This is done by making a PiT FlashCopy of the
data and performing all D/R testing against that copy of the data. When a disaster
occurs, as part of the recovery process, a FlashCopy of the data is created and used for
the D/R recovery process. This minimizes unique actions between a planned and
unplanned site failover scenario.
In both the planned and unplanned site switch scenarios, GDPS automation can
minimize the time of the outage or the RTO. GDPS automation can also help to
minimize the risk of performing a site switch as the automation is proven, repeatable
and minimizes human errors. The Recovery Time Objective (RTO) is a measure of the time it
takes from the time that a planned or unplanned site switch is identified until all
applications are up and running at the remote site. A key benefit of GDPS automation
is that, once implemented, the RTO is a known, proven, repeatable quantity.
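The definition above (time from when the site switch is identified until all applications are running at the remote site) can be measured from a timeline of a switch. The following Python sketch assumes a hypothetical record of timestamps; the function name, the application names and the times are invented for illustration and do not come from GDPS.

```python
from datetime import datetime, timedelta


def achieved_recovery_time(switch_identified, application_up_times):
    """Elapsed time from switch identification until the last application is up.

    application_up_times maps an application name to the time it was
    running again at the remote site.
    """
    if not application_up_times:
        raise ValueError("at least one application is required")
    last_up = max(application_up_times.values())
    if last_up < switch_identified:
        raise ValueError("applications cannot be up before the switch is identified")
    return last_up - switch_identified


# Hypothetical timeline for one planned site switch.
identified = datetime(2012, 9, 11, 2, 0)
up_times = {
    "db2": datetime(2012, 9, 11, 2, 25),
    "ims": datetime(2012, 9, 11, 2, 31),
    "online-banking": datetime(2012, 9, 11, 2, 42),
}
elapsed = achieved_recovery_time(identified, up_times)
print(elapsed)  # 0:42:00
```

Recording this value for every planned switch is one way to check that the achieved time stays within the stated objective.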
GDPS/Active/Standby - Application by Application Availability:
If all of a client’s application data is within a single data base (DB2 and/or IMS),
clients can implement high availability across two sites on an application by
application basis rather than managing high availability/disaster protection on a
platform(s) basis.
GDPS/Active/Standby automation enables automated ‘application level’ site switches
that typically provide an RTO on the order of seconds to minutes. Clients use DB2 to
DB2 software data replication with IBM Tivoli InfoSphere Replication Server for z/OS and/or
IMS to IMS software data replication with IBM Tivoli Classic InfoSphere Replication for z/OS.
In this case the DB2/IMS log entries are replicated between sites by DB2/IMS. An
active z/OS image with a copy of the DB2/IMS data base is running at the remote site
and all DB2/IMS updates are applied when received. In the event of a disaster or a
planned site switch for this application, the end user network is switched to route active
transactions to the remote site for processing with minimal data loss. The routing of
transactions is managed by the IBM Workload Distributor software.
This approach typically also requires the client to implement a strict change control
process across all systems to ensure that the various system components are always
updated in step to keep the z/OS images in sync. The following picture outlines the
GDPS/Active/Standby solution.
3-Site Configurations
Several clients with an out of region D/R implementation or with high availability
locally have moved to a 3-site configuration by implementing either GDPS/MzGM
w/HyperSwap or GDPS/MGM w/HyperSwap. These configurations provide ‘local’
high availability/continuous operations environments, with local real time planned
site switch scenarios as well as site failover/failback functionality for a local site disaster
with an RPO of zero. Some clients implement their ‘2nd’ local site on the same data center
floor, or across a fire wall on the same data center floor. A few customers have just
implemented HyperSwap locally to prevent a disk subsystem failure from causing a
Sysplex wide outage. In all cases, the implementation focus was on increasing
availability of IT to the business locally or adding out of region D/R protection. One key
cost component in developing a multi-site solution is the duplication of the client’s end
user network. Depending on the complexity and cost associated with replicating the end
user network, several clients prefer to implement a ‘3-site’ solution across only two
physical sites.
At this time, IBM has deployed some 80+ GDPS/MzGM w/HyperSwap or GDPS/MGM
with HyperSwap multi-site configurations. The following figures outline these
implementations.
[Figure: GDPS/MzGM w/HyperSwap & Incremental Resync. Diagram labels: Site1, Site2, Recovery Site, Metro Mirror, z/OS Global Mirror (SDM), Incremental Resync, ETR or STP. Legend: F = recommended for FlashCopy.]
- Data replication A->B & A->C
- Incremental resync B->C if Site1 or A-disk fails
- Maintains disaster recovery position
- Improved RTO
- Optional: CFs / production systems in Site2
The standard GDPS/MzGM w/HyperSwap with Incremental Resync configuration
enables data replication from A->B with HyperSwap and z/OS Global Mirror data
replication from A->C. On an A->B HyperSwap event, the Incremental
Resynchronization for GDPS/MzGM enables the reestablishment of the z/OS Global
Mirror session from A->C to B->C. GDPS manages the z/OS Global Mirror sessions
so that only changed tracks need to be sent to the recovery site, instead of requiring a
full-volume copy to reestablish the disaster recovery copy. This can greatly reduce the
time required (in some cases from hours down to minutes) to reconnect to your remote
site, reducing the risk of not being protected.
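The saving from incremental resynchronization can be illustrated with a back-of-the-envelope calculation. The sketch below is not GDPS code: the `Volume` class, the function names, the track size, the volume counts and the link speed are all assumptions chosen for illustration. It compares the number of tracks sent by a full-volume copy with the number sent when only changed tracks are transferred.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical model of one mirrored volume."""
    total_tracks: int
    # Tracks updated since replication to the remote site was interrupted.
    changed_tracks: set[int] = field(default_factory=set)


def tracks_to_send(volumes: list[Volume], incremental: bool) -> int:
    """Count the tracks that must cross the link to rebuild the remote copy."""
    if incremental:
        return sum(len(v.changed_tracks) for v in volumes)
    return sum(v.total_tracks for v in volumes)


def resync_seconds(tracks: int, track_bytes: int, link_bytes_per_sec: float) -> float:
    """Idealized transfer time: ignores protocol overhead and ongoing writes."""
    return tracks * track_bytes / link_bytes_per_sec


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 volumes of 50,000 tracks,
    # with 1,000 changed tracks on each volume.
    volumes = [Volume(50_000, set(range(1_000))) for _ in range(1_000)]
    track_bytes = 56_664     # assumed track size in bytes
    link = 100_000_000       # assumed 100 MB/s of usable bandwidth
    for incremental in (False, True):
        n = tracks_to_send(volumes, incremental)
        label = "incremental" if incremental else "full copy"
        print(label, n, "tracks,", round(resync_seconds(n, track_bytes, link)), "seconds")
```

With these assumed figures the full copy takes on the order of hours while the incremental resync takes on the order of minutes, which is the effect described above. Real results depend on the write rate, the bandwidth and the time spent without a remote copy.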
[Figure: GDPS/MGM w/HyperSwap. Labels in the figure: Site1, Site2, Recovery Site; Kp, Kg, R; P1, P2, P1bkup, P2bkup; Non-z, Non-z Bkup; CF1, CF2, CF3; Metro Mirror; Global Mirror; ETR or STP; F Recommended for FlashCopy.]
• GM K-Sys runs in production LPAR
  ► HyperSwap protection
• Reduced resource requirement
The standard GDPS/MGM w/HyperSwap configuration provides data replication from
A->B->C. The GDPS/GM K-sys (Kg) can run in a GDPS/PPRC production system,
reducing the number of z/OS images required for an MGM configuration.
• Incrementally resync A->C if Site2 or B-disk fails
• Requires A->C bandwidth
• GDPS/GM K-sys runs in a production system
• HyperSwap protection for GDPS/GM K-sys
• Reduced resource requirement
• Maintain disaster recovery position following resync
• Improved RPO
The Kg system lives in P2, which is a production system. P2 runs GDPS/PPRC in one
NetView and the GDPS/GM K-sys function in another NetView. P2 disk is mirrored with
PPRC and protected by HyperSwap. This includes any disk that is related to the "Kg system
function".
P2 can live in either Site 1 or Site 2, and it has the Kg system as
its parasite. When you move P2, the Kg system function moves with it.
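The cascaded A->B->C layout and the incremental resync to A->C can be sketched as a small model of replication links. This is a conceptual illustration only, not a GDPS interface: the names `mgm_links` and `reaches` are hypothetical, and the model reduces each disk subsystem to a single letter.

```python
Link = tuple[str, str, str]  # (source disk, target disk, replication type)


def mgm_links(b_available: bool) -> list[Link]:
    """Replication links in a cascaded A->B->C layout, as described above."""
    if b_available:
        return [("A", "B", "Metro Mirror"), ("B", "C", "Global Mirror")]
    # After Site2 or the B-disk is lost, A is incrementally resynchronized to C.
    return [("A", "C", "Global Mirror")]


def reaches(links: list[Link], source: str, target: str) -> bool:
    """True if data written at `source` eventually arrives at `target`."""
    frontier, seen = [source], {source}
    while frontier:
        disk = frontier.pop()
        if disk == target:
            return True
        for src, dst, _ in links:
            if src == disk and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return False


if __name__ == "__main__":
    for b_ok in (True, False):
        links = mgm_links(b_ok)
        print("B available:", b_ok, links, "C protected:", reaches(links, "A", "C"))
```

In both states the recovery disk C remains reachable from A, which is what "maintain disaster recovery position" means in this model. The direct A->C link is also why the configuration requires A->C bandwidth.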
3-site configurations provide additional options, as well as additional considerations, when
performing site switches.
1. If the two local sites are physically separated for both high availability and local D/R protection,
is it still a requirement, after a remote site switch, to have two physically separate local sites at that
location as well? The alternative would be to have two logical sites within the same physical site,
perhaps separated by a physical fire wall. In the site toggle model this consideration may be very
different than if the remote site is only used in the event of a disaster. In the disaster site scenario,
high availability may be added to the remote site after the business is back up and
running again. The site toggle model views all sites as ‘production’ ready sites, whereas the
disaster/recovery site model views the remote site as being used only in the event of a
disaster. Both models are valid; the choice depends on the client’s business requirements.
2. The various fallback scenarios, and the additional copies of the disk
required to support each of them, should be investigated and understood for both the
GDPS/MzGM and the GDPS/MGM options.
3. As mentioned above, end user network connectivity to each data center can significantly influence
the cost of the ultimate solution.
A recognized customer requirement in this area is to provide the exact same
functionality at the target site (High Availability + Disaster Recovery protection) on
both a planned and, when possible, an unplanned site switch. That is, the ability to use
asynchronous data replication back to the original production site as well as
local HyperSwap functionality. With this functionality, both sites provide
equal functionality to the business, which enables a peer site configuration.
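The peer site requirement amounts to a symmetry check: every site that can host production must offer both local HyperSwap and asynchronous replication out to another site. The sketch below expresses that check under stated assumptions. The capability names and the functions `missing_capabilities` and `is_peer_configuration` are hypothetical and do not correspond to any GDPS setting.

```python
# Capabilities every production-ready site needs in this model (assumed names).
REQUIRED = frozenset({"local_hyperswap", "async_replication_out"})


def missing_capabilities(sites: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each site that falls short to the capabilities it lacks."""
    gaps = {}
    for name, capabilities in sites.items():
        lacking = sorted(REQUIRED - capabilities)
        if lacking:
            gaps[name] = lacking
    return gaps


def is_peer_configuration(sites: dict[str, set[str]]) -> bool:
    """True only if every site offers the full set of required capabilities."""
    return not missing_capabilities(sites)


if __name__ == "__main__":
    sites = {
        "Local": {"local_hyperswap", "async_replication_out"},
        "Remote": {"async_replication_out"},  # D/R-only site, no local HyperSwap
    }
    print(is_peer_configuration(sites), missing_capabilities(sites))
```

A remote site used only for disaster recovery would typically fail this check, whereas a site toggle model needs it to pass for every site.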
Distributed Systems
As mentioned earlier in this paper, with the GDPS/DCM capability, GDPS automation
can inter-operate with either Tivoli AppMan or Veritas Cluster Server to provide end to
end automated management of various distributed platforms in 2-site or 3-site
configurations. Cross-system data consistency can also be provided via the DS8000
Open LUN support. With this function, GDPS can provide a common restart point across
all z/OS and distributed systems data. Today, high availability of data is provided through
distributed systems software mirroring, typically called LVM mirrors. Data availability
for disaster recovery can be provided through hardware and software based data
replication functions. Functionality in this arena will continue to evolve as clients
develop more and more cross platform applications.
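Why a common restart point matters can be shown with a toy model of dependent writes that span a z/OS volume and a distributed-systems LUN. This is a conceptual illustration, not a description of how GDPS or the DS8000 implement consistency. The `Write` type, the volume names and both functions are assumptions made for the example.

```python
from typing import NamedTuple


class Write(NamedTuple):
    seq: int      # order in which the applications issued the write
    volume: str   # a z/OS volume or a distributed-systems LUN (made-up names)


def secondary_copy(writes: list[Write], freeze_at: dict[str, int]) -> list[Write]:
    """Writes present on the secondary if mirroring of each volume stops
    after the write with sequence number freeze_at[volume]."""
    return [w for w in writes if w.seq <= freeze_at[w.volume]]


def is_restartable(writes: list[Write], copy: list[Write]) -> bool:
    """The copy is a usable restart point only if no earlier write is missing,
    that is, it holds every write up to the latest one it contains."""
    copied = {w.seq for w in copy}
    latest = max(copied, default=0)
    return all(w.seq in copied for w in writes if w.seq <= latest)


if __name__ == "__main__":
    writes = [Write(1, "ZOS001"), Write(2, "LUN07"), Write(3, "ZOS001")]
    common = secondary_copy(writes, {"ZOS001": 2, "LUN07": 2})
    staggered = secondary_copy(writes, {"ZOS001": 3, "LUN07": 1})
    print("common freeze point restartable:", is_restartable(writes, common))
    print("staggered freeze restartable:", is_restartable(writes, staggered))
```

When both platforms stop mirroring at the same point, the secondary holds a clean prefix of the write sequence. When they stop at different points, the secondary can contain a later write without the earlier one it depends on.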
Future Vision
The next chart outlines the evolution from a single server into an Enterprise Wide
Business Continuity Solution. Single servers became clustered servers, and clustered
servers then spanned physical sites. This was then extended to end to end multi-site
heterogeneous clusters, followed by integrated end to end multi-site clusters. The
emerging trend for z/OS is toward multiple application level Active/Active Sites at
distance, coupled with the traditional platform based high availability and
disaster/recovery solutions.
Conclusion
The requirements for real time high availability, continuous operations and disaster
recovery for z/OS as well as distributed systems continue to push IBM to provide 24x7
computing environments with superior business resilience functionality.
New Smarter Planet applications typically deal with real time data that needs to be
captured, stored and analyzed in real time on a 24x7 basis. These applications and
volumes of data also introduce new requirements in scalability as well as challenges in
total cost of ownership. The management of IT Operations across a single site or multiple
sites, locally or at distance, presents the opportunity to optimize all compute resources to
maximize their utilization, as well as enable them to meet the business requirements of
the end user clients today and tomorrow. The ultimate goal of the emerging trends is to
enable applications and their platforms to be virtualized and run across physical data
centers located around the world. The z/OS platform, coupled with GDPS automation, has
become the leading edge of general purpose solutions toward this end.
Author: Bob Kern - IBM Advanced Technical Support Americas ( [email protected]). Mr. Kern is an IBM Master Inventor & Executive IT Architect. He has 36 years of experience in large system design and development and holds numerous patents in storage related topics. For the last 28 years, Bob has specialized in disk device support and is a recognized expert in continuous availability, disaster recovery and real time disk mirroring. He created the DFSMS/MVS subcomponents for Asynchronous Operations Manager and the System Data Mover. Bob was named a Master Inventor by the IBM Systems & Technology Group in 2003 and is one of the inventors of Concurrent Copy, PPRC, XRC, GDPS and zCDP solutions. He continues to focus in the Disk Storage Architecture area on HW/SW solutions for Continuous Availability and Data Replication. He is a member of the GDPS core architecture team and the GDPS Customer Design Council, with a focus on storage related topics.