TRANSCRIPT
IBM System p5 and eServer p5
© 2006 IBM Corporation
Introduction to High Availability Cluster Multi-Processing (HACMP) and HACMP Extended Distance (HACMP-XD)

Shawn Bodily, ATS HACMP Specialist, IBM
Although hardware is now very reliable, hardware failures account for only a minority of system outages. Several studies place the proportion between 20% and 45%. Human error, software error, and planned maintenance cause the majority of service outages.
Downtime and poor performance are expensive, both financially and in terms of customer perceptions.

"Overall downtime-costs average 3.6% of annual revenue." (Infonetics)

Many studies estimate the average cost of downtime at over $5,000/hour.

Popular Web sites estimate the cost of downtime at millions of dollars; a 22-hour crash in June 1999 cost eBay an estimated $5M.

Losses go beyond immediate sales revenue: to clients, availability equates to reliability and trustworthiness, and internal application failures prevent employees from working.
HACMP - Proven Technology for Business
Mature product now in its 17th major release
Averaging 40,000 licenses sold world-wide annually
Built on a decade of IBM cluster leadership
HACMP allows you to create highly available environments with minimal hardware.
HACMP is scalable up to 32 nodes, allowing your cluster to adapt to the growing demands of your business.
The optional XD feature allows your clusters to span unlimited geographic distances.
HACMP is NOT the right solution if:

Your environment is not secure
Network security is not in place
Change management procedures are not respected
You do not have a trained administrator
The environment is prone to "user fiddle faddle"
The application requires manual intervention

HACMP will never be an out-of-the-box solution to availability. A certain degree of skill will always be required.
Reducing both planned and unplanned downtime

Unplanned outages:
System failure: hardware, operating system crash, power loss, user error
Component failure: NIC, SCSI/SAN adapter, network hub/switch, SAN switch, disk failure (both OS and application data)

Planned outages:
Maintenance: system hardware changes/upgrades, OS and application upgrades and fixes
Testing: applied fixes, failure scenarios for HA and DR
HACMP™ protects against service outages by detecting problems and quickly "failing over" to backup hardware.

Example configuration:
Two nodes (A and B)
Two networks: a private (internal) network and a public (shared) network
Shared disk: all data in shared storage is available to both nodes
Critical applications: a database server, and a Web server dependent on the DB

[Diagram: pSeries nodes A and B linked by the private network, shared disk, and the company shared network; the database and Web server run across the two nodes.]
Example Failure #1: Node failure

Node A fails completely
Node B detects the loss of Node A
Node B starts up its own instance of the database
The database is temporarily taken over by Node B until Node A is brought back online
Example Failure #2: Loss of network connection

Node A loses a NIC
Because of NIC redundancy, the service IP swaps locally to the surviving interface
Operations continue normally while the problem is resolved
If total public network connectivity were lost, a fallover could occur
Failover possibilities

One to one
One to any
Any to one
Any to any
Custom Resource Groups

Startup preferences:
Online On Home Node Only (cascading) - OHNO
Online On First Available Node (rotating, or cascading with inactive takeover) - OFAN
Online On All Available Nodes (concurrent) - OAAN
Startup Distribution

Fallover preferences:
Fallover To Next Priority Node In The List - FOHP
Fallover Using Dynamic Node Priority - FDNP
Bring Offline (On Error Node Only) - BOEN

Fallback preferences:
Fallback To Higher Priority Node - FBHP
Never Fallback - NFB
Common resources to make highly available

Service IP address(es):
The IP addresses that users and client applications will use in production
Can be one or multiple addresses
Not limited by the number of interfaces when utilizing IP aliasing

Application (server):
The application(s) to be controlled and protected by HACMP
In many cases a user-provided start/stop script
May take advantage of pre-packaged application Smart Assists

Shared storage:
Volume groups, logical volumes, JFS, NFS
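To illustrate the IP aliasing mentioned above: HACMP places a service IP address on a boot interface as an alias, so one interface can carry several service addresses. A hedged, illustrative fragment of the underlying AIX-style commands (interface name and addresses are assumptions; HACMP issues the equivalent itself during resource group acquisition and release):

```shell
# Service IP 10.70.10.50 added as an alias on boot interface en0 (illustrative values)
ifconfig en0 alias 10.70.10.50 netmask 255.255.255.0   # acquire the service address
ifconfig en0 delete 10.70.10.50                        # release it again on fallover
```

Because the address is an alias, the interface's own boot address stays up throughout, which is what allows a local swap to another NIC without disturbing other traffic.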
Additional granular options: resource group dependencies

Parent/child relationships:
Great for multi-tier environments

Location dependencies:
Online On Same Node: all resource groups must be online on the same node
Online On Different Nodes: all resource groups must be online on different nodes
Online On Same Site: all resource groups must be online on the same site

Resource group priorities (for the different-nodes dependency): Low, Intermediate, High
Application Monitoring

HACMP can monitor applications in one of two ways:
Process monitor: detects the death of a process
Custom monitor: checks the health of the application using a monitor method you provide

Decisions upon failure:
Restart: establish a number of local restarts; if the application continues to fail after the specified restart count, you can escalate to a fallover
Notify: send an email notification
Fallover: move the application and its associated resource group to the next candidate node

Application monitoring can be suspended and resumed at any time.
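A custom monitor is simply a script whose exit status tells HACMP whether the application is healthy (0) or failed (non-zero). A minimal sketch, assuming the application exposes a recognizable process name; the name ora_pmon_PROD and the structure are illustrative, not an HACMP-supplied sample:

```shell
#!/bin/sh
# Hypothetical HACMP custom application monitor (sketch under stated assumptions).
# HACMP interprets exit status 0 as "application healthy" and non-zero as "failed",
# at which point the configured restart/notify/fallover policy is applied.

check_app() {
    # Return 0 if a process matching the pattern in $1 is running.
    # The "grep -v grep" keeps the grep command itself out of the match.
    if ps -ef | grep "$1" | grep -v grep > /dev/null; then
        return 0
    fi
    return 1
}

# In a real cluster the monitor would end with something like:
#   check_app "ora_pmon_PROD"
#   exit $?
```

The same skeleton extends naturally to deeper checks, e.g. probing a TCP port or running a trivial query, as long as the script stays quick and always exits with a meaningful status.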
DLPAR/CUoD configuration

[Diagram: a production database server and a DLPAR/CUoD server running applications (Web Server, Order Entry) on its active processors, with inactive processors held in reserve; both run HACMP and share disk.]

HACMP on the primary machine detects the failure. Running in a partition on another server, HACMP grows the backup partition, activates the required inactive processors, and restarts the application.
Recent HACMP releases greatly improve ease of use. Enhancements include:
Configuration wizard for typical two-node cluster
Automatic detection and configuration of IP networks
“Online Planning Worksheet” guides you through configuration
Simplified Web-based interface for management and monitoring
[Screenshot: Online Planning Worksheets for resource groups.]
With HACMP V5.x, you can configure a cluster by answering just five questions:

1. What is the address of the backup node?
2. What is the name of the application?
3. What script should HACMP use to start it?
4. What script should HACMP use to stop it?
5. What is the service IP label that clients will use to access the application?
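The start and stop scripts in questions 3 and 4 are ordinary shell scripts that you supply; HACMP only requires that they exit with status 0 on success. A minimal hedged skeleton (the application path, log location, and wrapper layout are assumptions for illustration):

```shell
#!/bin/sh
# Hypothetical HACMP application start/stop wrapper (paths are illustrative).
# HACMP runs the start script when it acquires the resource group and the
# stop script when it releases it; both must exit 0 on success.

LOG=/tmp/hacmp_app.log

app_start() {
    echo "`date`: starting application" >> "$LOG"
    # /opt/myapp/bin/startup.sh        # real start command would go here
    return 0
}

app_stop() {
    echo "`date`: stopping application" >> "$LOG"
    # /opt/myapp/bin/shutdown.sh       # real stop command would go here
    return 0
}

# Dispatcher: in practice you often register two separate scripts with HACMP,
# or a single wrapper invoked as "wrapper start" / "wrapper stop":
case "${1:-}" in
    start) app_start ;;
    stop)  app_stop ;;
esac
```

Keeping the scripts idempotent (safe to run when the application is already up or already down) makes cluster testing and repeated fallovers far less error-prone.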
WebSMIT Overview Demo
HACMP Cluster Test Tool
The Cluster Test Tool reduces implementation costs by simplifying validation of cluster functionality.
It reduces support costs by automating testing of an HACMP cluster to ensure correct behavior in the event of a real cluster failure.
The Cluster Test Tool executes a test plan, which consists of a series of individual tests.
Tests are carried out in sequence and the results are analyzed by the test tool.
Administrators may define a custom test plan or use the automated test procedure.
Test results and other important data are collected in the test tool's log file.
New features make HACMP V5.x easier to use and more flexible:

Automatic detection and correction of common cluster configuration problems
Enhanced support for complex multi-tier applications, relationships, and dependencies
Clusters can be configured with simple ASCII files
Parallel resource processing recovers applications faster
Simpler, more flexible configuration and management
New "Smart Assists" simplify HACMP implementation in DB2®, Oracle, and WebSphere® environments; an inexpensive option includes all three Smart Assists
HACMP with Oracle 10g fallover demo

Demo environment: one p52A, one p505, and one HMC; HACMP 5.4; AIX 5.3 TL5; Oracle 10g; DS4300 storage; LPARMon (http://www.alphaworks.ibm.com/tech/lparmon); Swingbench (http://www.dominicgiles.com/swingbench.html); Web-based System Manager.

The cluster shown was actually created using the two-node configuration assistant within HACMP.
HACMP Extended Distance (HACMP-XD)
HA/DR is a balance of recovery time requirements and cost

Do you really need HA or DR?
What is the target recovery time? Minutes? Hours? Days?
What are the costs of implementing and maintaining an HA or DR solution? Redundant hardware, inter-site networking, and operations staff.
Tiers of Disaster Recovery: Level Setting HACMP/XD

Recovery time tiers, based on the SHARE definitions, range from roughly 15 minutes at the top tier to days at the bottom. Data recreation grows as you move down the tiers: zero or near zero at the top, then minutes to hours, up to 24 hours, and 24-48 hours at the lower tiers.

Tier 7: highly automated, business-wide, integrated solution (examples: GDPS/PPRC/VTS P2P, AIX HACMP/XD, OS/400 HABP). HACMP/XD fits in here.
Tier 6: storage mirroring (examples: XRC, PPRC, VTS Peer-to-Peer)
Tier 5: software two-site, two-phase commit (transaction integrity)
Tier 4: batch/online database shadowing and journaling, point-in-time disk copy (FlashCopy), TSM-DRM
Tier 3: electronic vaulting, TSM, tape
Tier 2: PTAM, hot site, TSM
Tier 1: PTAM

Higher tiers suit applications with low tolerance to outage; the lowest tiers suit applications very tolerant of outage. Best D/R practice is to blend tiers of solutions in order to maximize application coverage at the lowest possible cost. One size, one technology, or one methodology doesn't fit all applications.

(PTAM = Pickup Truck Access Method with tape; TSM = Tivoli Storage Manager; GDPS = Geographically Dispersed Parallel Sysplex)
HACMP Extended Distance (XD) is an optional component for cross-site geographic disaster recovery
Backup systems may be physically separate from primary operations for protection in the event of power failure, flood, earthquake etc.
The XD option provides a basket of disaster recovery capabilities and integration points
XD provides multiple options:
IP-based data mirroring (GLVM, HAGEO)
Support for hardware-based data mirroring (Metro Mirror/PPRC)
HACMP XD – Extended Distance for Disaster Recovery

Data replication between sites ensures a copy of the data is available after a site-wide disaster. The choice of technology depends on distance and performance requirements.

Campus-wide: use LVM split-site mirroring (across the SAN and LAN/MAN).
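Conceptually, split-site LVM mirroring just extends a volume group across disks at both sites and keeps a copy of every logical volume on each side. A hedged, illustrative fragment using standard AIX LVM commands (the volume group and hdisk names are assumptions; a real configuration also needs quorum and forced-varyon behavior tuned for site loss):

```shell
# hdisk1 = local SAN storage, hdisk2 = storage at the remote campus site (assumed names)
extendvg appvg hdisk2        # bring the remote disk into the volume group
mirrorvg -S appvg hdisk2     # add a second copy of each LV on hdisk2 (-S: sync in background)
lsvg -l appvg                # verify each LV now reports two copies
```

Because both copies are ordinary LVM mirrors, a site failure looks to the surviving node like losing one mirror copy, which AIX can run on until the other site returns and resynchronizes.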
HACMP XD – Extended Distance for Disaster Recovery

Metro-wide: use SVC or ESS/PPRC mirroring.

[Diagram: Servers A and B at the production site with a primary ESS/DS, Servers C and D at the recovery site with a secondary ESS/DS, linked by routers; replication runs over PPRC/Metro Mirror (or eRCMF) between the ESS/DS units, or via SVC-to-SVC mirroring.]
HACMP XD – Extended Distance for Disaster Recovery

Unlimited distance: use GLVM mirroring.

A subset of disks is defined as "Remote Physical Volumes" (RPVs). The RPV driver replicates data over the WAN, and the volume group is mirrored with LVM so that each logical volume has a copy at each site. Both sites always have a complete copy of all mirrors.
The new HACMP "Geographic Logical Volume Manager" (GLVM) is a reliable, easy-to-use data mirroring and failover capability:

Provides unlimited-distance, IP-based data mirroring
Fully integrated with AIX 5L™ logical volume management
Easier to use than the existing HAGEO solution: no need to define and manage separate state maps
The long-term replacement for HAGEO
Automatically reverses the direction of data replication on failover
Supports all IBM TotalStorage® products certified with base HACMP
HACMP XD – HACMP automates the solution

HACMP integrates support for all the replication options
Manages data replication direction, switching, and resync after recovery
Recovers locally or moves the entire application to the backup site
A common infrastructure supports all solutions; choose the one that meets your performance and distance requirements
Thank You
Questions?
Backup Slides on Networking
Typical Local HACMP Clustering Configuration

A single network view on a common subnet (10.70.10.x); multiple networks can be used.

[Diagram: two nodes, each with interfaces en0 and en1, connected through a pair of switches on the 10.70.10.x subnet.]
HACMP Clustering Across Sites

Different subnets (10.70.10.x and 10.50.10.x), with routers connected to allow cross-subnet communication.

[Diagram: each site's nodes have interfaces en0 and en1 connected through a pair of switches on the site's own subnet; routers link the two subnets.]