TRANSCRIPT
IBM System p5 and eServer p5
© 2006 IBM Corporation
Introduction to High Availability Cluster Multi-Processing (HACMP) and HACMP Extended Distance (HACMP-XD)

Shawn Bodily, ATS HACMP Specialist, IBM
Although hardware is now very reliable, hardware failures account for only a minority of system outages. Several studies place the proportion between 20% and 45%. Human error, software error, and planned maintenance cause the majority of service outages.
Downtime and poor performance are expensive, both financially and in terms of customer perceptions.

"Overall downtime-costs average 3.6% of annual revenue." (Infonetics)

Many studies estimate the average cost of downtime at over $5,000/hour.

Popular Web sites estimate the cost of downtime at millions of dollars; a 22-hour crash in June 1999 cost eBay an estimated $5M.

Losses go beyond immediate sales revenue: to clients, availability equates to reliability and trustworthiness, and internal application failures prevent employees from working.
HACMP - Proven Technology for Business
Mature product now in its 17th major release
Averaging 40,000 licenses sold world-wide annually
Built on a decade of IBM cluster leadership
HACMP allows you to create highly available environments with minimal hardware.
HACMP is scalable up to 32 nodes, allowing your cluster to adapt to the growing demands of your business.
The optional XD feature allows your clusters to span unlimited geographic distances.
HACMP is NOT the right solution if:

Your environment is not secure
Network security is not in place
Change management procedures are not respected
You do not have a trained administrator
The environment is prone to "user fiddle faddle"
The application requires manual intervention

HACMP will never be an out-of-the-box solution to availability. A certain degree of skill will always be required.
Reducing both planned and unplanned downtime

Unplanned outages:
System failure: hardware, operating system crash, power loss, user error
Component failure: NIC, SCSI/SAN adapter, network hub/switch, SAN switch, disk failure (both OS and application data)

Planned outages:
Maintenance: system hardware changes/upgrades, OS and application upgrades and fixes
Testing: applied fixes, failure scenarios for HA and DR
HACMP™ protects against service outages by detecting problems and quickly "failing over" to backup hardware.

Example configuration:
Two nodes (A and B)
Two networks: a private (internal) network and a public (shared) network
Shared disk: all data in shared storage is available to both nodes
Critical applications: a database server, and a Web server dependent on the DB

[Diagram: pSeries nodes A and B linked by the private network, shared disk, and the company shared network; the database and Web server run across the two nodes.]
Example Failure #1: Node failure

Node A fails completely
Node B detects the loss of Node A
Node B starts up its own instance of the database
The database is temporarily taken over by Node B until Node A is brought back online
Example Failure #2: Loss of network connection

Node A loses a NIC
Because of NIC redundancy, the service IP swaps locally to the surviving interface
Operations continue normally while the problem is resolved
If total public network connectivity were lost, a fallover could occur
Failover possibilities

One to one
One to any
Any to one
Any to any
Custom Resource Groups

Startup preferences:
Online On Home Node Only (cascading) - OHNO
Online On First Available Node (rotating, or cascading with inactive takeover) - OFAN
Online On All Available Nodes (concurrent) - OAAN
Startup Distribution

Fallover preferences:
Fallover To Next Priority Node In The List - FOHP
Fallover Using Dynamic Node Priority - FDNP
Bring Offline (On Error Node Only) - BOEN

Fallback preferences:
Fallback To Higher Priority Node - FBHP
Never Fallback - NFB
Common resources to make highly available

Service IP address(es):
The IP addresses that users and client applications will use in production
Can be one or multiple addresses
Not limited by the number of interfaces when utilizing IP aliasing

Application (server):
The application(s) to be controlled and protected by HACMP
In many cases a user-provided start/stop script
May take advantage of pre-packaged application Smart Assists

Shared storage:
Volume groups, logical volumes, JFS, NFS
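To illustrate the IP aliasing mentioned above: HACMP places a service IP address on a boot interface as an alias, so one interface can carry several service addresses. A hedged, illustrative fragment of the underlying AIX-style commands (interface name and addresses are assumptions; HACMP issues the equivalent itself during resource group acquisition and release):

```shell
# Service IP 10.70.10.50 added as an alias on boot interface en0 (illustrative values)
ifconfig en0 alias 10.70.10.50 netmask 255.255.255.0   # acquire the service address
ifconfig en0 delete 10.70.10.50                        # release it again on fallover
```

Because the address is an alias, the interface's own boot address stays up throughout, which is what allows a local swap to another NIC without disturbing other traffic.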
Additional granular options: resource group dependencies

Parent/child relationships:
Great for multi-tier environments

Location dependencies:
Online On Same Node: all resource groups must be online on the same node
Online On Different Nodes: all resource groups must be online on different nodes
Online On Same Site: all resource groups must be online on the same site

Resource group priorities (for the different-nodes dependency): Low, Intermediate, High
Application Monitoring

HACMP can monitor applications in one of two ways:
Process monitor: detects the death of a process
Custom monitor: checks the health of the application using a monitor method you provide

Decisions upon failure:
Restart: establish a number of local restarts; if the application continues to fail after the specified restart count, you can escalate to a fallover
Notify: send an email notification
Fallover: move the application and its associated resource group to the next candidate node

Application monitoring can be suspended and resumed at any time.
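A custom monitor is simply a script whose exit status tells HACMP whether the application is healthy (0) or failed (non-zero). A minimal sketch, assuming the application exposes a recognizable process name; the name ora_pmon_PROD and the structure are illustrative, not an HACMP-supplied sample:

```shell
#!/bin/sh
# Hypothetical HACMP custom application monitor (sketch under stated assumptions).
# HACMP interprets exit status 0 as "application healthy" and non-zero as "failed",
# at which point the configured restart/notify/fallover policy is applied.

check_app() {
    # Return 0 if a process matching the pattern in $1 is running.
    # The "grep -v grep" keeps the grep command itself out of the match.
    if ps -ef | grep "$1" | grep -v grep > /dev/null; then
        return 0
    fi
    return 1
}

# In a real cluster the monitor would end with something like:
#   check_app "ora_pmon_PROD"
#   exit $?
```

The same skeleton extends naturally to deeper checks, e.g. probing a TCP port or running a trivial query, as long as the script stays quick and always exits with a meaningful status.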
DLPAR/CUoD configuration

[Diagram: a production database server and a DLPAR/CUoD server running applications (Web Server, Order Entry) on its active processors, with inactive processors held in reserve; both run HACMP and share disk.]

HACMP on the primary machine detects the failure. Running in a partition on another server, HACMP grows the backup partition, activates the required inactive processors, and restarts the application.
Recent HACMP releases greatly improve ease of use. Enhancements include:
Configuration wizard for typical two-node cluster
Automatic detection and configuration of IP networks
“Online Planning Worksheet” guides you through configuration
Simplified Web-based interface for management and monitoring
[Screenshot: Online Planning Worksheets for resource groups.]
With HACMP V5.x, you can configure a cluster by answering just five questions:

1. What is the address of the backup node?
2. What is the name of the application?
3. What script should HACMP use to start it?
4. What script should HACMP use to stop it?
5. What is the service IP label that clients will use to access the application?
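The start and stop scripts in questions 3 and 4 are ordinary shell scripts that you supply; HACMP only requires that they exit with status 0 on success. A minimal hedged skeleton (the application path, log location, and wrapper layout are assumptions for illustration):

```shell
#!/bin/sh
# Hypothetical HACMP application start/stop wrapper (paths are illustrative).
# HACMP runs the start script when it acquires the resource group and the
# stop script when it releases it; both must exit 0 on success.

LOG=/tmp/hacmp_app.log

app_start() {
    echo "`date`: starting application" >> "$LOG"
    # /opt/myapp/bin/startup.sh        # real start command would go here
    return 0
}

app_stop() {
    echo "`date`: stopping application" >> "$LOG"
    # /opt/myapp/bin/shutdown.sh       # real stop command would go here
    return 0
}

# Dispatcher: in practice you often register two separate scripts with HACMP,
# or a single wrapper invoked as "wrapper start" / "wrapper stop":
case "${1:-}" in
    start) app_start ;;
    stop)  app_stop ;;
esac
```

Keeping the scripts idempotent (safe to run when the application is already up or already down) makes cluster testing and repeated fallovers far less error-prone.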
WebSMIT Overview Demo
HACMP Cluster Test Tool
The Cluster Test Tool reduces implementation costs by simplifying validation of cluster functionality.
It reduces support costs by automating testing of an HACMP cluster to ensure correct behavior in the event of a real cluster failure.
The Cluster Test Tool executes a test plan, which consists of a series of individual tests.
Tests are carried out in sequence and the results are analyzed by the test tool.
Administrators may define a custom test plan or use the automated test procedure.
Test results and other important data are collected in the test tool's log file.
New features make HACMP V5.x easier to use and more flexible:

Automatic detection and correction of common cluster configuration problems
Enhanced support for complex multi-tier applications, relationships, and dependencies
Clusters can be configured with simple ASCII files
Parallel resource processing recovers applications faster
Simpler, more flexible configuration and management
New "Smart Assists" simplify HACMP implementation in DB2®, Oracle, and WebSphere® environments; an inexpensive option includes all three Smart Assists
HACMP with Oracle 10g fallover demo

Demo environment: one p52A, one p505, and one HMC; HACMP 5.4; AIX 5.3 TL5; Oracle 10g; DS4300 storage; LPARMon (http://www.alphaworks.ibm.com/tech/lparmon); Swingbench (http://www.dominicgiles.com/swingbench.html); Web-based System Manager.

The cluster shown was actually created using the two-node configuration assistant within HACMP.
HACMP Extended Distance (HACMP-XD)
HA/DR is a balance of recovery time requirements and cost

Do you really need HA or DR?
What is the target recovery time? Minutes? Hours? Days?
What are the costs of implementing and maintaining an HA or DR solution? Redundant hardware, inter-site networking, and operations staff.
Tiers of Disaster Recovery: Level Setting HACMP/XD

Recovery time tiers, based on the SHARE definitions, range from roughly 15 minutes at the top tier to days at the bottom. Data recreation grows as you move down the tiers: zero or near zero at the top, then minutes to hours, up to 24 hours, and 24-48 hours at the lower tiers.

Tier 7: highly automated, business-wide, integrated solution (examples: GDPS/PPRC/VTS P2P, AIX HACMP/XD, OS/400 HABP). HACMP/XD fits in here.
Tier 6: storage mirroring (examples: XRC, PPRC, VTS Peer-to-Peer)
Tier 5: software two-site, two-phase commit (transaction integrity)
Tier 4: batch/online database shadowing and journaling, point-in-time disk copy (FlashCopy), TSM-DRM
Tier 3: electronic vaulting, TSM, tape
Tier 2: PTAM, hot site, TSM
Tier 1: PTAM

Higher tiers suit applications with low tolerance to outage; the lowest tiers suit applications very tolerant of outage. Best D/R practice is to blend tiers of solutions in order to maximize application coverage at the lowest possible cost. One size, one technology, or one methodology doesn't fit all applications.

(PTAM = Pickup Truck Access Method with tape; TSM = Tivoli Storage Manager; GDPS = Geographically Dispersed Parallel Sysplex)
HACMP Extended Distance (XD) is an optional component for cross-site geographic disaster recovery
Backup systems may be physically separate from primary operations for protection in the event of power failure, flood, earthquake etc.
The XD option provides a basket of disaster recovery capabilities and integration points
XD provides multiple options:
IP-based data mirroring (GLVM, HAGEO)
Support for hardware-based data mirroring (Metro Mirror/PPRC)
HACMP XD – Extended Distance for Disaster Recovery

Data replication between sites ensures a copy of the data is available after a site-wide disaster. The choice of technology depends on distance and performance requirements.

Campus-wide: use LVM split-site mirroring (across the SAN and LAN/MAN).
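Conceptually, split-site LVM mirroring just extends a volume group across disks at both sites and keeps a copy of every logical volume on each side. A hedged, illustrative fragment using standard AIX LVM commands (the volume group and hdisk names are assumptions; a real configuration also needs quorum and forced-varyon behavior tuned for site loss):

```shell
# hdisk1 = local SAN storage, hdisk2 = storage at the remote campus site (assumed names)
extendvg appvg hdisk2        # bring the remote disk into the volume group
mirrorvg -S appvg hdisk2     # add a second copy of each LV on hdisk2 (-S: sync in background)
lsvg -l appvg                # verify each LV now reports two copies
```

Because both copies are ordinary LVM mirrors, a site failure looks to the surviving node like losing one mirror copy, which AIX can run on until the other site returns and resynchronizes.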
HACMP XD – Extended Distance for Disaster Recovery

Metro-wide: use SVC or ESS/PPRC mirroring.

[Diagram: Servers A and B at the production site with a primary ESS/DS, Servers C and D at the recovery site with a secondary ESS/DS, linked by routers; replication runs over PPRC/Metro Mirror (or eRCMF) between the ESS/DS units, or via SVC-to-SVC mirroring.]
HACMP XD – Extended Distance for Disaster Recovery

Unlimited distance: use GLVM mirroring.

A subset of disks is defined as "Remote Physical Volumes" (RPVs). The RPV driver replicates data over the WAN, and the volume group is mirrored with LVM so that each logical volume has a copy at each site. Both sites always have a complete copy of all mirrors.
The new HACMP "Geographic Logical Volume Manager" (GLVM) is a reliable, easy-to-use data mirroring and failover capability:

Provides unlimited-distance, IP-based data mirroring
Fully integrated with AIX 5L™ logical volume management
Easier to use than the existing HAGEO solution: no need to define and manage separate state maps
The long-term replacement for HAGEO
Automatically reverses the direction of data replication on failover
Supports all IBM TotalStorage® products certified with base HACMP
HACMP XD – HACMP automates the solution

HACMP integrates support for all the replication options
Manages data replication direction, switching, and resync after recovery
Recovers locally or moves the entire application to the backup site
A common infrastructure supports all solutions; choose the one that meets your performance and distance requirements
Thank You
Questions?
Backup Slides on Networking
Typical Local HACMP Clustering Configuration

A single network view on a common subnet (10.70.10.x); multiple networks can be used.

[Diagram: two nodes, each with interfaces en0 and en1, connected through a pair of switches on the 10.70.10.x subnet.]
HACMP Clustering Across Sites

Different subnets (10.70.10.x and 10.50.10.x), with routers connected to allow cross-subnet communication.

[Diagram: each site's nodes have interfaces en0 and en1 connected through a pair of switches on the site's own subnet; routers link the two subnets.]