
Clusters Part 4 - Systems
Lars Lundberg

The slides in this presentation cover Part 4 (Chapters 12-15) in Pfister’s book. We will, however, only present slides for chapter 12.

This part is the most important one in Pfister’s book!

High Availability

What we today call high availability was previously called fault tolerance. Traditionally, there have been hardware fault-tolerant systems. This means that faults are handled entirely by the hardware, and the software does not have to care.

Cluster systems offer fault tolerance in software, i.e. they use standard hardware.

Classes of Availability

Availability    Outage per Year            Class (# of 9s)
90%             More than one month        1
99%             Just under 4 days          2
99.9%           Just under 9 hours         3
99.99%          About an hour              4
99.999%         A little over 5 minutes    5
99.9999%        About half a minute        6
99.99999%       About 3 seconds            7
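A quick way to check these figures is to compute the allowed downtime for each availability level; a minimal Python sketch:

```python
# Convert an availability percentage into allowed downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability_percent):
    """Allowed outage time (in seconds) for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100.0)

for nines, availability in enumerate(
        [90, 99, 99.9, 99.99, 99.999, 99.9999, 99.99999], start=1):
    seconds = downtime_per_year(availability)
    print(f"Class {nines}: {availability}% -> "
          f"{seconds / 3600:.2f} hours ({seconds:.0f} s) outage per year")
```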

Measuring Availability

Availability is usually measured as the percentage of the time that a system is available, assuming that a system can be either fully available or not available at all.

Potential problems when measuring availability:
- What if the system is only partly available?
- Should we include periods when the system is not used?
- Should we include planned outages for maintenance, etc.?

Planned outages can be a real problem in non-stop operation environments.
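To make these questions concrete, a small sketch with a hypothetical outage log shows how much the answer depends on whether planned outages are counted:

```python
# Hypothetical outage log for one year: (duration in hours, planned?).
outages = [(4.0, True),    # planned maintenance window
           (0.5, False),   # operating system crash
           (1.5, True),    # planned software upgrade
           (0.25, False)]  # disk failure

HOURS_PER_YEAR = 365 * 24
total = sum(hours for hours, _ in outages)
unplanned = sum(hours for hours, planned in outages if not planned)

# Counting all outages vs. counting only the unplanned ones.
print(f"All outages counted:    {100 * (1 - total / HOURS_PER_YEAR):.4f}%")
print(f"Only unplanned counted: {100 * (1 - unplanned / HOURS_PER_YEAR):.4f}%")
```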

High Availability vs. Continuous Operation

If we separate the planned outages (maintenance, upgrades, etc.) from the unplanned ones (crashes, faults, etc.), we can make the distinction between:

High availability (few and short unplanned outages)

Continuous operation (few and short planned and unplanned outages)

High availability and continuous operation are not always equally important.

Reasons for unplanned outages

- Loss of power
- Application software
- Operating system software
- Subsystem software (e.g. databases)
- Hardware with moving parts (e.g. disks, fans, printers)
- I/O adapters
- Memory
- Processors, caches, etc.

Outage Duration

Hardware does not “break” as often as software, but when it does, it takes longer to repair.

Traditional hardware fault tolerance can recover from a fault faster than software fault-tolerant cluster systems.

Very few clusters can recover from a fault in less than 30 seconds. It often takes much longer.

Definition of High Availability

A system is highly available if:
- No replaceable piece is a single point of failure.
- The system is sufficiently reliable that you are likely to be able to repair or replace any broken parts before anything else breaks.

A single point of failure is a single element of hardware or software which, if it fails, brings down the entire system.

Summary of High Availability

For 24x365 operation (24 hours a day, 365 days per year), you must consider things like cooling and power supply, and also provide careful system management.

24x365 operation also implies dealing with planned outages and disasters, not just breakage and errors.

Disregarding power failure, software causes the largest number of outages.

The longest unplanned outages are caused as much by hardware as by software (again disregarding power failure).

Summary of High Availability (cont.)

Avoid single points of failure.

Clusters can help with planned outages and some unplanned errors in hardware and software.

Hardware-based fault tolerance fails over instantaneously, but does not help with software errors and planned outages.

There is no industry consensus on what “high availability” and “fault tolerance” mean.

Failover

[Figure: Alice watches Bozo’s “I am OK” messages on behalf of the client; when Bozo fails, Alice takes over and serves the client.]

One computer (Alice) is watching another computer (Bozo); if Bozo dies, Alice takes over Bozo’s work.

Failover problems

If Alice tries to take over control at the same time as Bozo comes back up again, we will have two computers struggling for control at the same time. This can cause a lot of problems.

Avoiding planned outages

If we want to upgrade Bozo, we can do the following (sketched below):
1. Do a controlled (forced) failover to Alice.
2. Upgrade Bozo while Alice is taking care of business.
3. Do a failback to Bozo.
4. Alice can now also be upgraded.

Consequently, one of the advantages of clusters is that we do not have to take the system down during upgrades and maintenance.

Problems may, however, occur when the upgrade includes a change of data format on disk, or when the software runs in parallel across the cluster nodes.
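A minimal sketch of this rolling-upgrade procedure; the Node class and failover calls are hypothetical placeholders, not a real cluster API:

```python
# Hypothetical two-node cluster: upgrade both nodes without a full outage.
class Node:
    def __init__(self, name):
        self.name = name
        self.active = False

    def upgrade(self):
        print(f"upgrading {self.name} (offline, service keeps running elsewhere)")

def controlled_failover(src, dst):
    """Forced failover: dst takes over the workload from src."""
    src.active, dst.active = False, True
    print(f"workload moved from {src.name} to {dst.name}")

bozo, alice = Node("Bozo"), Node("Alice")
bozo.active = True

controlled_failover(bozo, alice)   # 1. forced failover to Alice
bozo.upgrade()                     # 2. upgrade Bozo while Alice serves
controlled_failover(alice, bozo)   # 3. failback to Bozo
alice.upgrade()                    # 4. now Alice can be upgraded too
```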

Moving resources when failing over

When an application is moved from one node to another the resources that it needs must also be moved, e.g. files and IP-addresses.

Early high-availability cluster systems left this problem to the user, i.e. the user had to write a number of shell scripts that were executed during a failover.

One way to help the user is to define the dependencies between different applications and resources. The user then only has to define where a certain application should go, and the cluster software will move the necessary resources along with the application.
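A sketch of the dependency idea: the cluster software stores which resources each application needs and brings them online in dependency order on the takeover node. The resource names and graph below are invented for illustration:

```python
# Hypothetical dependency graph: each resource lists what must be online first.
depends_on = {
    "web-app":      ["ip-address", "shared-files"],
    "shared-files": ["disk-volume"],
    "ip-address":   [],
    "disk-volume":  [],
}

def bring_online(resource, online=None):
    """Recursively bring a resource online after its dependencies."""
    online = set() if online is None else online
    for dep in depends_on.get(resource, []):
        bring_online(dep, online)
    if resource not in online:
        online.add(resource)
        print(f"bringing {resource} online on the takeover node")
    return online

# Moving "web-app" also moves the IP address, files and underlying disk volume.
bring_online("web-app")
```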

Potential problems when moving resources

Resources may depend on individual cluster nodes, e.g. a certain disk may only be accessible on a certain node.

The procedure for bringing resources on-line may depend on the node, e.g. a printer queue may already be defined on some nodes, and redefining it may cause problems.

The information about the resource dependencies must be available and consistent throughout the cluster nodes, even when the node responsible for updating this information crashes.

Moving data - replication vs. switchover

Moving data from Bozo to Alice can be done in two ways:

Replication (separate disks/shared nothing, see Figure 108): Bozo and Alice have their own separate disks, and the changes made on Bozo are continuously sent to Alice. As an alternative, the changes on Bozo could be sent in batches at certain time intervals.

Switchover (shared disk, see Figure 109): A disk (or other storage device) is connected to both Bozo and Alice, and when Bozo crashes, Alice takes control over the disk.

Switchover is often preferred in high availability systems.
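A minimal sketch of the replication alternative, where Bozo records each change and ships the pending changes to Alice (here in a batch); the nodes and data are purely illustrative:

```python
# Illustrative shared-nothing replication: Bozo ships changes to Alice.
class ReplicatedNode:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = []          # changes not yet sent to the partner

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate_to(self, partner):
        """Send pending changes in a batch; could also be done per change."""
        for key, value in self.pending:
            partner.data[key] = value
        print(f"{self.name} sent {len(self.pending)} change(s) to {partner.name}")
        self.pending.clear()

bozo, alice = ReplicatedNode("Bozo"), ReplicatedNode("Alice")
bozo.write("account-42", 100)
bozo.write("account-17", 250)
bozo.replicate_to(alice)           # after this, Alice can take over Bozo's data
```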

Replication vs. switchover

Replication advantages:
- It is easier to add a new node when using replication; it can be difficult to synchronize the disks in switchover configurations, e.g. the two systems must agree on disk partitions, volume names, etc.
- In switchover, the disks are in one place. This limits the distance between the nodes and can also be a problem in case of disasters such as flooding of the room containing the disks.
- Replication can use simpler storage units, because the disks do not need to support dual access and the disks themselves are not a single point of failure.

Replication vs. switchover

Switchover advantages:
- It is easier to back up the disk.
- Less disk space is required.
- Less overhead: when using replication, Bozo must send copies of the changes to Alice, and Alice must write these updates to its local disks. This uses CPU and I/O capacity. If Bozo waits for Alice to signal that each update has been recorded correctly, performance will be degraded. If Bozo does not wait, data may be lost when a failure occurs.
- Failback is easier.

Avoiding corrupt data - transactions

When Bozo crashes, it might corrupt data or leave it in an inconsistent state.

Transactions are used to avoid this problem.

Transactions are usually implemented by having a log file on stable storage (e.g. a mirrored disk). No matter what happens (assuming the stable storage stays stable), a consistent state of the data can be recreated from the log file.

In replicated systems, transactions are implemented by a technique called two-phase commit.
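A toy sketch of the log-file idea: each transaction is appended to a log before its updates are considered committed, and the current state can be rebuilt by replaying the log after a crash. This is a simplified illustration, not the two-phase commit protocol mentioned above:

```python
import json

# Toy write-ahead log: append each committed transaction to stable storage
# (here just a file) before applying it; replay the log after a crash.
LOG_FILE = "txn.log"   # in a real system this would live on a mirrored disk

def commit(txn_id, updates):
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps({"txn": txn_id, "updates": updates}) + "\n")
        log.flush()                # the transaction is durable from this point

def recover():
    """Rebuild a consistent state from the log after a crash."""
    state = {}
    try:
        with open(LOG_FILE) as log:
            for line in log:
                for key, value in json.loads(line)["updates"].items():
                    state[key] = value
    except FileNotFoundError:
        pass                       # an empty log means an empty state
    return state

commit(1, {"account-42": 100})
commit(2, {"account-17": 250})
print(recover())                   # {'account-42': 100, 'account-17': 250}
```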

Failing over communication

When Alice takes over the job from Bozo, the communication from the client is redirected using IP takeover.

IP takeover is obtained by resetting one (or more) of the communication adapters on Alice to respond to the IP address(es) that Bozo was using.

Since most communication protocols have routines for retransmission after a time-out, the client computers never know the difference. However, the people at the client computers probably have to log in again, i.e. their sessions are usually aborted at failover.

An alternative way of failing over communication is that each client has a list of server IP addresses: the primary server, the secondary server, and so on. If the primary server does not respond, the client tries to contact the secondary server, and so on.
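A sketch of this client-side alternative: the client walks through a list of server addresses and uses the first one that answers. The host names and port are made up:

```python
import socket

# Hypothetical server list: primary first, then backups.
SERVERS = [("bozo.example.com", 5000), ("alice.example.com", 5000)]

def connect_to_service(servers, timeout=2.0):
    """Try each server in order; return the first connection that succeeds."""
    for host, port in servers:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            print(f"{host}:{port} did not respond, trying the next server")
    raise RuntimeError("no server responded")

# conn = connect_to_service(SERVERS)   # would fail over to Alice if Bozo is down
```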

Time for doing a failover

The time for reaching a fully operational state after a failover can be substantial. In best case scenarios the time can be as low as tens of seconds.

The failover times can be reduced by having pairs of processes: there is one process on Alice for each process on Bozo. Every time the process on Bozo changes its state, that change is reflected in the process on Alice. Tandem has claimed that by using this technique, sub-second failover is achievable.
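A minimal sketch of the process-pair idea: every state change in the primary is immediately mirrored to a backup, so the backup already holds the current state if the primary dies. Real process pairs run on separate nodes and communicate by messages; here both sides are plain Python objects:

```python
# Illustrative process pair: the primary mirrors every state change to a backup.
class BackupProcess:
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value        # backup stays in step with the primary

class PrimaryProcess:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        self.backup.apply(key, value)  # in a real pair: a message to another node

backup = BackupProcess()
primary = PrimaryProcess(backup)
primary.update("next-order-id", 1001)
# If the primary dies now, the backup already holds the current state:
print(backup.state)                    # {'next-order-id': 1001}
```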

Failover to where?

This question becomes interesting when there are more than two nodes in the cluster.

Simple add-on high-availability systems often use static schemes, e.g. if Bozo dies, put jobs A and B on Alice and the rest on Clara.

Sophisticated cluster systems provide mechanisms for automatic load balancing (possibly also considering some user-defined priorities).

Dynamic load balancing is easier in shared-disk clusters than in shared-nothing clusters. In shared-nothing clusters, replication is used, and this makes the backup order more static.
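The static scheme can be as simple as a fixed table; a naive dynamic alternative picks the least-loaded surviving node for each job. Node names, jobs, and loads below are invented:

```python
# Static scheme: a fixed table decides where each job goes if Bozo dies.
static_plan = {"job-A": "Alice", "job-B": "Alice", "job-C": "Clara", "job-D": "Clara"}

# Naive dynamic scheme: place each job on the least-loaded surviving node.
def dynamic_plan(jobs, load):
    plan = {}
    for job, cost in jobs.items():
        target = min(load, key=load.get)   # currently least-loaded node
        plan[job] = target
        load[target] += cost
    return plan

jobs_on_bozo = {"job-A": 30, "job-B": 20, "job-C": 25, "job-D": 10}
surviving_load = {"Alice": 40, "Clara": 15}
print(dynamic_plan(jobs_on_bozo, surviving_load))
```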

Global locks

In a shared-disk system, one must handle the problem of system-wide locks when a node crashes. The processes on the node that crashed were probably holding resources that processes on other nodes will have to use. If the locks are not released, the entire system will lock up.

There are two ways of handling this problem:
- Letting each application keep track of the locks that it was holding.
- Letting a global lock manager keep track of the locks that the applications on the crashed node were holding.
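A sketch of the second option: a global lock manager records which node holds each lock and releases everything held by a node that has been declared dead. The lock names are illustrative:

```python
# Illustrative global lock manager for a shared-disk cluster.
class GlobalLockManager:
    def __init__(self):
        self.holder = {}                     # lock name -> node holding it

    def acquire(self, lock_name, node):
        if lock_name in self.holder:
            return False                     # already held somewhere else
        self.holder[lock_name] = node
        return True

    def release_all(self, dead_node):
        """Called when a node is declared dead, so the cluster does not lock up."""
        freed = [name for name, node in self.holder.items() if node == dead_node]
        for lock_name in freed:
            del self.holder[lock_name]
        print(f"released {freed} held by {dead_node}")

glm = GlobalLockManager()
glm.acquire("row-1234", "Bozo")
glm.acquire("row-5678", "Alice")
glm.release_all("Bozo")                      # frees row-1234; Alice keeps her lock
```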

Heartbeats

Heartbeat messages are used for detecting when a node is dead. Each node sends short messages to the other nodes, telling them that the node is alive. If a heartbeat message does not arrive within a time-out period, the node is declared dead.

One problem with this approach is that the message could be delayed for various reasons, and in that case a node which is declared dead may be OK. This can cause a lot of problems.

Another problem with this approach is that the node may be OK, but the communication link for the heartbeat is not. This could also lead to the dangerous conclusion that an OK node is dead.

In order to improve the reliability of the heartbeat method, the cluster might send heartbeat signals on a number of different channels, e.g. the normal LAN, RS232 serial links, I/O links, etc.
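A minimal sketch of heartbeat-based failure detection: record when each peer was last heard from and declare it dead once that is older than the time-out. The transport and the time-out value are placeholders:

```python
import time

# Illustrative heartbeat monitor: declare a node dead after a time-out.
TIMEOUT = 5.0                                # seconds; a tuning decision in practice
last_heard = {}                              # node name -> time of last heartbeat

def on_heartbeat(node):
    """Called whenever a heartbeat message arrives from a node."""
    last_heard[node] = time.monotonic()

def dead_nodes():
    now = time.monotonic()
    return [n for n, t in last_heard.items() if now - t > TIMEOUT]

on_heartbeat("Bozo")
on_heartbeat("Alice")
# ... later, a periodic check; note the risks from the slide: a delayed message
# or a broken heartbeat link can make a healthy node look dead.
if dead_nodes():
    print("start failover for:", dead_nodes())
```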

Actions when Bozo is declared dead

- Establish a new heartbeat chain that excludes Bozo.
- Inform parallel subsystems that were running on Bozo, such as databases, of what has occurred and is about to happen.
- Fence Bozo off from its resources (e.g. disks).
- Form a cluster-wide, consistent plan defining how Bozo’s resources should be redistributed.
- Execute the plan, i.e. move the resources etc.
- Inform the subsystems that the resource reallocation has been completed.
- Resume normal operation.

Alternatives to heartbeats

Instead of heartbeats, one can use the opposite approach: a liveness check.

This means that Alice will at certain points ask Bozo if he is OK.

A liveness check suffers from the same kind of problems as heartbeats, i.e. it is hard to guarantee a response within certain limits.

If a cluster node has reasons to believe that the rest of the system thinks that the node is dead, the node had better commit suicide. This could happen when a node detects that its heartbeat signals have been delayed beyond the time-out limit.

IBM RS/6000 Cluster Technology (Phoenix)

The purpose of Phoenix is to help the developer build cluster-parallel applications that are highly available, i.e. Phoenix is a development tool and does not do anything by itself.

The product is highly scalable; it is designed for 512 nodes and has been run on clusters with more than 400 nodes.

There are three core services in Phoenix (see Figure 111):
- Topology Services: This service has no direct interface to the application. It manages heartbeats and maintains a dynamic map of the state of the other cluster nodes.
- Group Services: The key interface that helps the application deal with high-availability issues when some event happens.
- Event Manager: This service provides a way to inform a program running anywhere in the cluster when something interesting happens.

Microsoft’s Clustering Services (MSCS)

MSCS currently supports only two-node clusters; later versions will, however, support a larger number of nodes.

MSCS is, unlike Phoenix, a self-contained high-availability cluster product.

A key component of MSCS is the quorum resource, which is usually a disk. The purpose of the quorum resource is to make sure that only one of the two nodes thinks that it is in charge of the cluster.

Each node has access to a dynamic, but cluster-wide consistent, configuration database.
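The quorum principle can be illustrated with any mutually exclusive resource; in the sketch below an exclusively created lock file stands in for the shared quorum disk. This only illustrates the idea, it is not how MSCS implements its quorum resource:

```python
import os

# Illustration of quorum arbitration: whichever node manages to create the
# lock file first is in charge of the cluster; the other must stand down.
QUORUM_LOCK = "/tmp/quorum.lock"   # stand-in for the shared quorum disk

def try_to_own_cluster(node_name):
    try:
        # O_EXCL makes creation atomic: only one node can succeed.
        fd = os.open(QUORUM_LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, node_name.encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False

if try_to_own_cluster("Alice"):
    print("Alice owns the quorum and forms the cluster")
else:
    print("Another node owns the quorum; Alice must not form her own cluster")
```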

Scaling

The more nodes there are in a cluster, the less you pay for high availability, e.g.:
- The additional cost of handling a node failure in a one-node system is 100%, i.e. we need two computers instead of one.
- The additional cost of handling a node failure in a four-node system is 25%, i.e. we need five computers instead of four.

One implication of this is that it is desirable to use computers that cannot individually fulfill the job requirements.
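The figures above are simply 1/N, where N is the number of nodes needed to do the work; a one-line check:

```python
# Extra cost of tolerating one node failure: one spare node out of N working nodes.
for n in (1, 2, 4, 8):
    print(f"{n}-node system: extra cost {100 / n:.0f}% ({n + 1} computers instead of {n})")
```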

Disaster Recovery

Disasters differ from ordinary failures in that they are distributed over an area, e.g. flooding of a room, earthquakes, etc.

Shared-disk switchover solutions will not work for disasters.

Some crude and simple solutions are often used:
- Sending away a backup tape to a remote location at certain intervals.
- Sending away a backup electronically to a remote location at certain intervals.

The key difference between disaster recovery and normal clustering is the distance between the nodes. This causes delays which can strongly affect performance.

SMP and CC-NUMA Availability

If one processor node in an SMP or a CC-NUMA multiprocessor crashes, the entire system will crash.

There are a number of reasons for this, e.g.:
- The caches on the processor nodes may contain the only valid copy of a certain variable.
- The data structures in the operating system are shared between the processors, and if a processor crashes, it may corrupt the shared data.