
Windows NT Clustering Service

Rod Gamache, Rob Short, Mike Massa
Microsoft Corporation

The Windows NT Clustering Service supports high-availability file servers, databases, and generic applications and services that are essential to daily business operations.

A cluster is a collection of computer nodes—independent, self-contained computer systems—that work together to provide a more reliable and powerful system than a single node.[1] To be effective, the cluster must be easy to program and no more complex to manage than a single large computer. Clusters have three advantages over single nodes: they can grow larger than the largest single node, they can tolerate node failures and still continue offering service, and they can be built from inexpensive smaller computers.

In general, the goal of a cluster is to distribute a computing load over several systems without users or system administrators being aware of the independent systems running the services. If any hardware or software component fails, users might briefly lose access or experience degraded performance, but they will not lose access to the service.

The Microsoft Cluster Service[2] for Windows NT responds to market needs by addressing high-availability file servers, databases, Web servers, electronic mail servers, and generic applications and services. For many organizations, these servers are essential to daily business operations.

The NT clustering service detects and restarts failed hardware or software components, or migrates the failed component's functionality to another node if a local restart is not possible. This recovery strategy improves the mean time to repair for the system and thereby improves overall availability.

The cluster service also offers a much simpler user and programming interface. System administrators initially define requirements for operating behavior, and the clustering-service software automatically manages policy decisions regarding a cluster's state as well as applications running on a given node.

NT CLUSTER ABSTRACTIONS

The cluster service uses several abstractions—including resources, resource dependencies, and resource groups—to simplify both the cluster service itself and user-visible system administration.

Resource

A resource is the basic unit of management within the cluster: it allows the cluster software to treat objects managed by the cluster identically, even though they may provide totally different functionality. NT implements several resource types for its clustering service: physical hardware, such as SCSI disks; logical items, such as IP addresses, disk volumes, and file server shares; and resource types that allow existing applications to take advantage of the cluster service.

Each resource provides an identical set of interfaces to the cluster software, allowing the resource to be monitored and managed. A resource can inform the cluster service of a failure in the function it represents. In that case, the cluster service can restart the resource or any other resource that depends on it. Resources may, while under the control of the cluster service, migrate to other nodes in the cluster.

Application developers can design their own resource types and control libraries. To do this, the developer provides a resource DLL to extend the capabilities of the cluster software to manage a specific resource.

Resource dependencies

Resources often depend on the availability of other resources. An instance of an SQL database, for example, depends on the presence of the physical disks that store the database. These dependencies are declared and recorded in a dependency tree that the cluster maintains. The dependency tree describes the sequence in which the resources must be brought online and taken offline, as well as which resources need to migrate together. Dependencies cannot cross resource group boundaries.
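To make the online ordering concrete, the sketch below computes a bring-online order for a small dependency tree with a depth-first traversal: a resource comes online only after everything it depends on. The resource names and the array-based representation are invented for illustration; the article does not specify how MSCS stores the tree.

    #include <stdio.h>

    #define NRES 4

    /* Hypothetical resources: 0=physical disk, 1=IP address,
       2=network name (depends on IP), 3=SQL database (depends on disk). */
    static const char *name[NRES] = { "disk", "ip-addr", "net-name", "sql-db" };

    /* depends_on[r][d] != 0 means resource r depends on resource d. */
    static const int depends_on[NRES][NRES] = {
        { 0, 0, 0, 0 },   /* disk: no dependencies       */
        { 0, 0, 0, 0 },   /* ip-addr: no dependencies    */
        { 0, 1, 0, 0 },   /* net-name depends on ip-addr */
        { 1, 0, 0, 0 },   /* sql-db depends on disk      */
    };

    static int visited[NRES];

    /* Bring a resource online only after all of its dependencies. */
    static void bring_online(int r)
    {
        if (visited[r])
            return;
        visited[r] = 1;
        for (int d = 0; d < NRES; d++)
            if (depends_on[r][d])
                bring_online(d);
        printf("online: %s\n", name[r]);
    }

    int main(void)
    {
        for (int r = 0; r < NRES; r++)
            bring_online(r);
        return 0;   /* taking resources offline simply reverses this order */
    }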

Resource groups

A group is a collection of resources that must be managed as a single unit. Usually, a group contains all the elements necessary either to run an application or to connect client systems to an application's services. Groups allow a systems administrator to combine resources into larger logical units and manage them as a single entity. Operations performed on a group affect all resources contained within that group.


The group is the unit of failover management: if any resource in the group fails, the entire group is considered to have failed. For a group to be failed over to another system, the new system must be capable of supporting all the resources in the group.

The administrator of the cluster can assign a collection of independent resource dependency trees to a single resource group. When the group needs to migrate to another node in the cluster, all the resources in the group move to the new location. The administrator sets failover policies on a per-group basis, including the list of preferred owner nodes, the failback window (the time delay between a failed node rebooting and the cluster service moving applications back onto it), and so forth.
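A per-group failover policy might be modeled along the following lines. The field names and the fixed-size owner list are assumptions for illustration; the article describes the policy contents but not their representation.

    #include <time.h>

    #define MAX_NODES 16

    /* Hypothetical representation of the per-group failover policy
       described above: preferred owners plus a failback window. */
    struct group_policy {
        int    preferred_owners[MAX_NODES]; /* node IDs, in preference order */
        int    num_preferred;
        time_t failback_delay;              /* seconds to wait after a failed
                                               node reboots before moving its
                                               groups back onto it */
        int    restart_attempts;            /* local restarts before failover */
    };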

Figure 1 shows a view of a cluster from the cluster administration tool. The cluster itself is called RODGATEST, and it contains two computer systems, RODGA7 and RODGA8. Six resources are defined along with four resource groups. Each of these constructs is described in the following sections.

CLUSTER ARCHITECTURE

The key features of cluster operation are resource management, which controls the local state of resources, and failure management, which orchestrates the response to failure conditions.

Windows NT maintains configuration information in a local database, called the registry. In clustered systems, a cluster-wide registry allows the cluster service and applications to create and maintain a globally consistent view of the state of the cluster's resources as well as the current cluster configuration. Current cluster membership is recorded in the registry. Updates to the cluster registry are made using an atomic update protocol.

In a clustered system it is critical that all nodes maintain a consistent view of cluster membership. Membership agreement is achieved through a protocol based on the Tandem Regroup Algorithm, described in more detail later in this article.[3]

The cluster service relies heavily on the Windows core infrastructure, including RPC (remote procedure call) mechanisms, name management, network interface management, security, resource controls, and the file system. The cluster extends these base services to all nodes in the cluster with the levels of transparency and reliability described next.

Transparency

The cluster—even though it is built of a group of loosely coupled, independent computer systems—presents itself as a single system to clients outside it. Therefore, client applications interact with the cluster as if it were a single high-performance, highly reliable server. Clients are not affected by interaction with the cluster and do not need modifications to take advantage of the cluster service. However, certain clients may need to use transactional semantics when communicating with the cluster to guarantee that no data is lost in the event of a failure. Such clients should keep all information relating to an operation until they know that the server has safely stored the results of the operation on disk.

Similarly, system management tools access and manage the cluster as if it were a single server. Service and system execution information is available in single-image, cluster-wide logs.

Reliability

The cluster service can detect failures of the hardware and software resources it manages. In the case of an application or node failure, the cluster service can restart failed applications on other nodes in the cluster. The administrator can set the restart policy to meet the availability specification for that application, which is part of the cluster configuration. A failure can also cause control over resources (shared disks, network names, and so on) to migrate to other nodes in the system. Finally, administrators can upgrade hardware and software in a phased manner without interrupting the availability of the services in the cluster.

We explicitly did not address several issues in the first phase of the design, because of either technical complexity or schedule pressures. We provided no support for fault-tolerant applications (process pairs, primary-backup, active replication), no facilities for the nonstop migration of running applications, and no support for the recovery of shared state between client and server. These and other "traditional" cluster features will be added in phases to future versions of the product.

[Figure 1. A view of a cluster from the cluster administration tool. The cluster contains two computer systems and defines six resources and four resource groups.]

VIRTUAL SERVERS

Clients in client-server applications generally connect to a physical server to access their service. In Microsoft Cluster Service (MSCS), we created an abstraction—a virtual Windows NT server—that encapsulates the service and all the resources required to provide that service. The virtual server includes a network name and IP address that are visible and accessible from clients, no matter where in the cluster they are instantiated. Several of these virtual servers can run on a single node, much as Web servers now host multiple home pages at different IP addresses.

The application itself also runs within a virtual server environment. This environment provides the application with the illusion that it is running on a virtual NT node, with virtual services that appear consistent to the application even if the application is restarted on a different node in the cluster. So the virtual server environment provides the application, the administrators, and the clients with the illusion of a single, stable environment.

A virtual server resource group requires a node name resource (offered by NetBIOS and the Domain Name System) and an IP address resource. Together, these present consistent naming for the clients accessing the service. The same resource group must also include all the resources required to execute the server application (disks, databases, and so on).

Server encapsulation

Each virtual server environment provides a name and configuration space that is separated from other virtual servers running on the same node. Each application can perform registry access, service control, and IPC (interprocess communication) and RPC, and it has a consistent view even if it runs on another node in the cluster. This separation ensures that two instances of an application service running on the same node but in separate virtual server environments will not clash as they access configuration data or communicate internally.

Two significant changes were made to the base Windows NT code to create this virtual server environment.

• We changed the system calls that return or use the system name, such as GetComputerName or gethostname, to return the network name associated with the virtual server instead of the name of the node on which the application is running (see the sketch after this list).

• We modified the NT registry, which stores application configurations in a single monolithic hierarchical structure, into separate trees that store parts of the registry. Separate trees ensure that applications running in separate virtual servers on the same node can access unique configuration data and store their current state in the registry. Each unique tree represents a single virtual server and is internally dependent on the name associated with the virtual server.
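A minimal way to picture the name-redirection change: when a process runs inside a virtual server, the name-query call answers with the virtual server's network name rather than the physical node name. The sketch below models this with an ordinary wrapper function; the environment-variable mechanism and the VIRTUAL_SERVER_NAME variable are illustrative assumptions, not the actual MSCS implementation.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical wrapper: if the process was launched inside a virtual
       server, report that server's network name; otherwise fall back to
       the physical node name. VIRTUAL_SERVER_NAME is an invented variable. */
    static const char *effective_computer_name(const char *physical_name)
    {
        const char *vs = getenv("VIRTUAL_SERVER_NAME");
        return (vs != NULL && vs[0] != '\0') ? vs : physical_name;
    }

    int main(void)
    {
        /* An application asking "what machine am I on?" sees the virtual
           name, so it behaves identically after migrating to another node. */
        printf("name seen by application: %s\n",
               effective_computer_name("RODGA7"));
        return 0;
    }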

APPLICATION AND SERVICE AVAILABILITY

Applications generally need not be modified to operate within the clustering service environment. There are two modes in which a given application can become part of the cluster environment. In the simplest case, any Windows NT application or service can register as a generic application or service and use the supplied NT infrastructure. Alternatively, the application can provide a customized resource DLL (dynamically linked library) tailored to work specifically with the given application or service.

Generic application or service

As the name suggests, a generic application or service is one that has not been modified in any way to take advantage of MSCS's features. After having been registered, the application is simply represented by a resource and managed by the cluster service.

However, generic resources have very primitive failure detection and are therefore less reliable than a custom application or service resource DLL. For example, the generic application failure detector simply checks whether the process has exited to determine if the application is still running. The generic service detector similarly queries the Windows NT Service Control Manager to see if the service is still running. A hung application or service will not always be detected. For this reason, developers can provide a custom resource DLL for more accurate health detection.

Custom application or service

A custom application or service does not require that the actual application or service be modified. Instead, customization entails developing a cluster interface layer that allows the cluster software to correctly manage and respond to failures detected by the custom resource DLL. The principal interfaces allow the resource to be started and stopped, as well as monitored to see if it is still operating correctly. The advantage of developing a custom resource type is that the cluster software correctly controls the application or service, and a routine can be written to verify that the application is working properly. For example, a database service failure detector could send database queries into the service to detect whether the service is operational. This level of sophisticated detection cannot be done with a generic service, since the generic detector knows nothing about the specific service.
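The shape of such an interface layer might look like the sketch below: start and stop entry points plus health checks that exercise the service itself. The function names echo the LooksAlive/IsAlive callbacks mentioned later in this article, but the signatures and the query stubs are simplified assumptions, not the actual resource DLL API.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified resource-DLL-style interface for a
       database service. Real resource DLLs export a fixed set of entry
       points that the resource monitor calls; this only models the idea. */

    static bool db_online(void)
    {
        printf("database brought online\n");   /* start the service (stub) */
        return true;
    }

    static void db_offline(void)
    {
        printf("database taken offline\n");    /* stop the service (stub)  */
    }

    /* Cheap, frequent check: is the process merely present? */
    static bool db_looks_alive(void)
    {
        return true;  /* e.g., the process handle still shows it running */
    }

    /* Expensive, thorough check: send a real query into the service,
       which catches hangs that a process-exit check would miss. */
    static bool db_is_alive(void)
    {
        /* e.g., run a trivial query and verify a timely, valid reply */
        return true;
    }

    int main(void)
    {
        if (db_online() && db_looks_alive() && db_is_alive())
            printf("resource healthy\n");
        db_offline();
        return 0;
    }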

NETWORK INFRASTRUCTURE

Modern applications such as client-server, distributed, and management applications all rely on networks in one form or another. The capability to detect and survive network failures is essential for mission-critical, line-of-business applications and systems.


The Windows NT Clustering Service improves the availability of Windows NT networks by

• continuously monitoring the state and quality of all networks used by the clustering service;

• transparently and reliably switching traffic among networks if multiple networks are available; and

• automatically failing over network resources among cluster nodes.

To explain the operation of the networking software, we first introduce a high-level overview of the operation of the cluster service as a whole.

Clustering service network components

Figure 2 shows an overview of the components and their relationships in a single system of a Windows NT cluster. The clustering service controls all aspects of cluster operation on a cluster system. It is implemented as a Windows NT service and consists of eight closely related, cooperating components:

• Node manager handles cluster membership and watches the health of other cluster systems.

• Configuration database manager maintains the cluster configuration database.

• Resource manager/failover manager makes all resource and group management decisions and initiates appropriate actions, such as startup, restart, and failover.

• Event processor connects all components of the cluster service, handles common operations, and controls cluster service initialization.

• Communication manager manages communications with all other nodes of the cluster.

• Global update manager provides a global update service that is used by other components within the cluster service.

• Resource monitor monitors the health of each resource via callbacks to the resources. Strictly speaking, it is not part of the cluster service. It runs in a separate process, or processes, and communicates with the cluster service via RPC.

• Time service maintains consistent time within the cluster but is implemented as a resource rather than as part of the cluster service itself.

The node manager, resource manager (with its resource monitors), and communication manager are particularly interesting in the context of this article.

Node manager. Node manager maintains cluster membership and sends periodic messages, called heartbeats, to the other systems in the cluster to detect system failures. If one system detects a communication failure with another cluster node, it broadcasts a message to the entire cluster, causing all members to verify their view of the current cluster membership. This is called a regroup event. When this happens, writes to potentially shared devices are frozen until the membership has stabilized. If a node manager on a system does not respond during the regroup, that system is removed from the cluster and its active groups must be failed over ("pulled") to an active system. Note that the failure of a clustering service also causes all of its locally managed resources to fail.

Resource manager. Resource manager/failover manager is responsible for stopping and starting resources, managing resource dependencies, and initiating the failover of groups. It receives resource and system state information from the resource monitor and the node manager and uses this information to make decisions about groups. The resource monitor checks the health of a resource by periodically calling the IsAlive and LooksAlive callback functions in the resource DLLs.

Communication manager. Communication manager is responsible for providing communication among all nodes. In conjunction with a kernel-mode transport driver, it provides a reliable messaging infrastructure to the rest of the cluster software. The kernel-mode driver uses all available networks and, when one fails, reliably and transparently switches the traffic to the next available network.

Network reliability

The reliability of network hardware components has improved dramatically over the past few years. However, failures still happen, and for mission-critical applications the reliability offered by standard hardware is not sufficient. There are two alternatives:

• Use expensive, fault-tolerant, proprietary products.

• Use COTS (commercial off-the-shelf) components that provide similar reliability via software.

The clustering service uses the COTS model and improves network reliability by using all available networks, whether private (node to node) or public (cluster to client). When one network fails, cluster communication traffic is transparently moved to the next available network.

Availability. Most services provided by a clustered server are accessed through either of two resource types: a network name or an IP address. A network name is a friendly name of a device or service; it is published by the clustering service and included in any available naming services, such as DNS and the Windows Internet Naming Service (WINS). An IP address is a 32-bit number in dotted decimal format that represents an Internet Protocol (IP) address.


The availability of cluster resources can be only as good as the availability of the network used to access the cluster. For such services to be highly available, it is essential that any network failures be properly detected and the function failed over and restarted on another node or network.

Here we describe how the clustering service monitors the health of network components, what mechanisms are used to detect network component failures, and the actions taken when such a failure is detected.

Failure detection. Heartbeats are the primary method of detecting communication failures and determining their scope. A heartbeat is a simple message sent to another node in the cluster approximately every second. If more than two heartbeat messages are not received within a period of time, the cluster service decides that the network has failed. If a further three heartbeat messages are not received, the cluster service initiates a regroup event and makes every attempt to determine the state of the other node. The hardware interface driver may be queried to determine if low-level hardware failures have occurred.

The two-node case is especially problematic because of the potential damage done by a node incorrectly deciding that the other node is down. To help in fault isolation, a secondary method is tried: pinging external hosts or routers to see if they can communicate with the missing node. A quorum algorithm is then used in conjunction with other fault isolation mechanisms to determine whether either node can continue to operate as the surviving cluster member.
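The counting rule above can be sketched as a tiny state machine: roughly one heartbeat slot per second, a missed-message counter, a suspicion of network failure after more than two misses, and a regroup after three further misses. The thresholds come from the text; everything else (the loop, the names, the trace) is illustrative.

    #include <stdbool.h>
    #include <stdio.h>

    /* Thresholds from the text: >2 missed heartbeats means the network
       is suspected failed; 3 further misses trigger a regroup event. */
    #define SUSPECT_THRESHOLD 2
    #define REGROUP_THRESHOLD (SUSPECT_THRESHOLD + 3)

    static void on_heartbeat_tick(bool received, int *missed)
    {
        if (received) {
            *missed = 0;              /* peer is healthy; reset the count */
            return;
        }
        (*missed)++;
        if (*missed == SUSPECT_THRESHOLD + 1)
            printf("network suspected failed\n");
        else if (*missed == REGROUP_THRESHOLD + 1)
            printf("initiating regroup event\n");
    }

    int main(void)
    {
        /* Simulate one heartbeat slot per second: two beats, then silence. */
        bool trace[] = { true, true, false, false, false,
                         false, false, false, false };
        int missed = 0;
        for (int i = 0; i < (int)(sizeof trace / sizeof trace[0]); i++)
            on_heartbeat_tick(trace[i], &missed);
        return 0;
    }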

[Figure 2. Cluster service network components. The clustering service controls all aspects of cluster operation on a cluster system. It is implemented as a Windows NT service and consists of eight closely related, cooperating components.]

Network resource failover process. Based on the results of the network state analysis, the clustering service can determine whether it should fail over network resources. Three assumptions were made during the design process:

• Clients must use the IP protocol, as the clustering service can fail over only IP addresses.

• Clients must know how to gracefully reconnect or retry after failure (a reconnect sketch follows this list).

• All cluster nodes must be on the same LAN segment.
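A client that meets the second assumption typically wraps its connect call in a retry loop, as sketched below with POSIX sockets for brevity. The address, port, retry count, and delay are invented for illustration; the key point is that the client reconnects to the same address, which the cluster has re-instantiated on the surviving node.

    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Try to (re)connect to the virtual server's address; on failure,
       wait and retry, since failover re-creates the same IP address on
       the surviving node. Address, port, and limits are illustrative. */
    static int connect_with_retry(const char *ip, int port, int max_tries)
    {
        for (int attempt = 1; attempt <= max_tries; attempt++) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0)
                return -1;

            struct sockaddr_in addr;
            memset(&addr, 0, sizeof addr);
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            inet_pton(AF_INET, ip, &addr.sin_addr);

            if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
                return fd;              /* connected (or reconnected) */

            close(fd);
            fprintf(stderr, "attempt %d failed; retrying...\n", attempt);
            sleep(2);                   /* give failover time to finish */
        }
        return -1;
    }

    int main(void)
    {
        int fd = connect_with_retry("10.0.0.42", 1433, 10);
        if (fd >= 0)
            close(fd);
        return fd >= 0 ? 0 : 1;
    }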

During the failover process, the clustering service moves the network name and IP address to the surviving node. It rebinds the IP address to the new MAC (media access control) address and transmits an ARP (Address Resolution Protocol) broadcast containing the new mapping, thus flushing all ARP caches on the local segment.

When an address is instantiated on a surviving node, the transport on that node begins receiving packets from all clients that were connected to the primary prior to the failover.
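The ARP broadcast described above is commonly called a gratuitous ARP: the sender announces a mapping from the moved IP address to the new MAC address. The struct below lays out the standard Ethernet/IPv4 ARP payload and fills it the way such an announcement typically looks. The sample values are invented, and actually transmitting the frame would require a raw socket and privileges, which the sketch omits.

    #include <stdint.h>
    #include <string.h>

    /* Standard ARP payload for Ethernet/IPv4 (RFC 826 layout).
       Multi-byte fields are shown in host order here; they must be
       converted to network byte order before transmission. */
    struct arp_packet {
        uint16_t htype;        /* hardware type: 1 = Ethernet          */
        uint16_t ptype;        /* protocol type: 0x0800 = IPv4         */
        uint8_t  hlen;         /* hardware address length: 6           */
        uint8_t  plen;         /* protocol address length: 4           */
        uint16_t oper;         /* operation: 1 = request, 2 = reply    */
        uint8_t  sha[6];       /* sender hardware (MAC) address        */
        uint8_t  spa[4];       /* sender protocol (IP) address         */
        uint8_t  tha[6];       /* target hardware address              */
        uint8_t  tpa[4];       /* target protocol address              */
    };

    /* Build a gratuitous ARP announcing that 'ip' now lives at 'mac':
       sender and target IP are both the moved address, so every host
       on the segment updates its ARP cache entry for that IP. */
    static void build_gratuitous_arp(struct arp_packet *p,
                                     const uint8_t mac[6],
                                     const uint8_t ip[4])
    {
        memset(p, 0, sizeof *p);
        p->htype = 1;
        p->ptype = 0x0800;
        p->hlen  = 6;
        p->plen  = 4;
        p->oper  = 1;                  /* commonly sent as a request   */
        memcpy(p->sha, mac, 6);
        memcpy(p->spa, ip, 4);
        memcpy(p->tpa, ip, 4);         /* target IP == sender IP       */
        /* tha stays zeroed; the Ethernet frame itself is broadcast.   */
    }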

Managing network state

Changes in the states of the networks and network interfaces in a cluster must be reported consistently, accurately, and swiftly so that administrators and application software can take appropriate actions. Determining the location and scope of a fault, to the greatest extent possible, is an important part of this process. Such a determination enables applications to work around the problem and helps administrators make repairs.

When a failure occurs that can be isolated to a subset of the nodes on the network, IP address resources that depend on the affected interfaces should fail. The failure of an IP address resource triggers a failover of the associated resource group to another node, which can continue to offer service.

Network and interface states. Several rules must be followed when determining the network and interface states:

• All nodes must have a consistent and accurate view of interface and network states across the cluster.

• It must be possible to determine whether faults affect the entire network or only a subset of the attached nodes.

• Outages that are not long enough for other nodes to assume that a node has failed are not reported outside of the network software.

Only a node-down event is reported when a node suddenly crashes. Reporting all possible intermediate state transitions would slow the recovery process in the normal case, while providing no benefit in unusual failure scenarios.

Interface and network state definitions. Before describing the algorithms for network state and failure detection, we define and categorize the pertinent network management objects.

Two objects are of particular interest and are treated separately by the network software.

• The interface includes the hardware network interface and its low-level driver software.

• The network is the communication link between two or more interfaces.

This distinction is important, since an interface may fail while the network connection to another node still works over a second set of interfaces. Each node tracks the state of each network and interface with which it may need to communicate.

Network management process. Each node maintains a connectivity vector for each network to which it is attached. The connectivity vector is indexed by node ID and contains the node's view of the state of all interfaces on the network. Connectivity is determined by heartbeat exchanges. A single node, dubbed the accountant, performs all state calculations.

Vector entries are calculated using network topology information, node operational status information, heartbeat status information, and driver hard-error notifications. The vectors from all nodes are combined to form a complete connectivity matrix for each network. The matrix is used to calculate the state of the network and the state of each constituent interface. Network and interface states are recalculated whenever a network or interface state transition is reported anywhere in the cluster.
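One plausible reading of that calculation is sketched below: stack the per-node connectivity vectors into a matrix, call an interface up if some other node can still reach it, and call the network up if at least two interfaces remain up. The exact state rules are not given in the article, so the ones coded here are assumptions chosen to illustrate the data flow.

    #include <stdio.h>

    #define NNODES 3

    /* conn[i][j] != 0: node i reports that node j's interface on this
       network is reachable. Row i is node i's connectivity vector; the
       stacked rows form the connectivity matrix for one network. */
    static const int conn[NNODES][NNODES] = {
        { 1, 1, 0 },   /* node 0 sees itself and node 1, not node 2 */
        { 1, 1, 0 },   /* node 1 agrees                             */
        { 0, 0, 1 },   /* node 2 sees only itself                   */
    };

    int main(void)
    {
        int up_count = 0;

        /* Assumed rule: interface j is up if some *other* node reaches it. */
        for (int j = 0; j < NNODES; j++) {
            int up = 0;
            for (int i = 0; i < NNODES; i++)
                if (i != j && conn[i][j])
                    up = 1;
            printf("interface %d: %s\n", j, up ? "up" : "failed");
            up_count += up;
        }

        /* Assumed rule: the network is up if two or more interfaces are. */
        printf("network: %s\n", up_count >= 2 ? "up" : "down");
        return 0;
    }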

The node that forms a cluster automatically assumes the role of accountant. Other nodes are informed of the identity of the accountant when they join the cluster. Whenever a node detects a change in its connectivity with other nodes, it sends a message containing an updated connectivity vector to the accountant. As connectivity reports arrive, the accountant recalculates the network and interface states and broadcasts the resulting interface and network state changes to all nodes using a global update.

If the accountant dies, the surviving node with the smallest node ID automatically assumes the role of accountant. The new accountant issues a global update announcing its new duty and directing all nodes to report their current connectivity vectors. It then recalculates the network and interface states and broadcasts the results.


USING THE CLUSTERING SERVICE

A number of major server applications take advantage of the functionality offered by MSCS. We describe three of these: Microsoft SQL Server, the Oracle Parallel and Fail-Safe servers, and the SAP R/3 business application development system.

Microsoft SQL Server

SQL Server 6.5 is a client-server relational database system that consists of a main server process and a helper process, called the executive.

The SQL Server process manages a collection of databases that in turn map down to the NT file system. The server is a free-threaded process that listens for SQL requests and executes the requests against the database. It is common to encapsulate the database with stored procedures that act on the database. Client requests invoke these procedures, written in the Transact-SQL programming language, and the procedures execute a program that has control flow and can make read and write calls to the database. These procedures can also access other NT services (for example, mail is often used for operator notification).

The SQL executive performs housekeeping tasks, such as launching periodic jobs to back up or replicate the database so that it can be used if the primary system fails.

SQL Server database failover. Using MSCS, it was possible to design SQL Server to use server failover as its mechanism for high availability. We configured the SQL Server resource group as a virtual NT server with a virtual registry, virtual devices, virtual services, a virtual name, a virtual IP address, and whatever else was needed to create the fiction that SQL Server was running on a virtual NT node. Clients connect to the server that runs on this virtual node. The server application thinks it is running on this virtual server, and the virtual server can migrate among physical servers.

This design allows a two-node MSCS cluster to host two or more high-availability SQL Servers. The clients always see the same server name, even as the server migrates from node to node. The server likewise sees the same environment as it migrates. Administrators, who are really just clients, administer virtual servers just like real servers (no new concepts have been added). This simple design has proved very easy to understand and easy to explain to users.

Oracle database servers

Oracle has developed the Oracle Fail-Safe database server for the MSCS market. This product uses the MSCS shared-nothing model. It is based on a data-partitioning pattern in which each instance of the Fail-Safe database server manages one or more databases. Each database is stored on a shared disk but—in conformance with the MSCS model—this disk is accessible only to the node that controls it as a cluster resource.

Each Fail-Safe server is a dedicated server running at each node in the cluster. It maintains the correct dependencies among resources, databases, and database instances. It also monitors the database (by interacting with the resource monitors) and maintains the dynamic configuration of the client-server interface modules.

Clients access a database through the network names and IP addresses associated with the resource group that holds each database, respecting the virtual server model. If a database fails, the client briefly pauses and then reconnects to the database server under the same name and address, which will have moved to the surviving node. Application builders are encouraged to maintain transaction state information so that after a failure any in-flight transactions can be quickly resubmitted.

MSCS manages the collection of resources that make up each instance of a Fail-Safe database server. In case of failure, the MSCS resource group (disk, database, IP address, and network name) is migrated to another node in the cluster. Once the resources are brought online, the local Fail-Safe server is notified of the new resources, and the server instantiates new database servers for each of the new databases.

SAP R/3

SAP AG offers a scalable system for the development of business applications. SAP builds its systems around a sophisticated middleware system, SAP R/3. R/3 is a three-tier client-server system. Its three tiers separate presentation, application, and data storage into clearly isolated layers. Although all the business logic is in the application tier, there is no persistent state at the application servers. This separation allows application servers to be added to the system to achieve higher processing capacity and availability. Load balancing and availability management of the parallel application servers are done with dedicated R/3 management tools.

All persistent state of R/3 is maintained in the database tier, where a third-party database is used for data storage. To facilitate the use of different database vendors, the system avoids storing nonportable elements such as database stored procedures. There are three components of which only a single instance can be active in the system and which are thus single points of failure: the database, the message server, and the enqueue server. The failure of any of these servers will bring the complete system to a halt.

MSCS makes these services highly available. Because only a single instance of each server can be active in the system, MSCS can use an unpartitioned failover organization: one node of the cluster hosts the database server, and the other provides the message and enqueue services. If either node fails, the server and disk resources are moved to the remaining node in the cluster. The resource group that holds the network names and IP addresses under which the server was accessible also migrates to the surviving node.


The application servers in SAP R/3 are failover aware in that they can handle being disconnected from the database or the message/enqueue server temporarily and, after a waiting period, will try to reconnect. A failure in the database tier is transparent to the user, as the application servers mask the potential transaction failure that resulted from it. Migration of an application server is handled at the client by establishing a new login session at the new node hosting the application server.

Microsoft Cluster Service has been shipping for about a year on Windows NT version 4.0. The upcoming Windows NT 5.0 release of the Windows NT Clustering Service will improve ease of use through a wizard that guides the user through the creation of cluster resources. The Windows NT 5.0 Clustering Service will also improve integration with existing services. For example, it will provide a more available and easily managed DHCP (Dynamic Host Configuration Protocol) service.

In the future we will provide the infrastructure for building robust, distributed, and scalable enterprise-level applications. We are adding support for System Area Networks with high-bandwidth, low-latency interconnects. We will provide the system-level infrastructure for building multinode clusters, as well as a complete application environment based on the forthcoming COM+ version of Microsoft's object model technology. ❖

Acknowledgments

We thank Werner Vogels at Cornell and Jim Gray at Microsoft Research for sharing their knowledge and expertise. Tom Phillips and Bohdan Raciborski provided some badly needed help, since writing this article overlapped with the beta 2 release of NT 5.

References

1. G.F. Pfister, In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Prentice Hall, Englewood Cliffs, N.J., 1995.

2. W. Vogels et al., "The Design and Architecture of the Microsoft Cluster Service—A Practical Approach to High-Availability and Scalability," Proc. 28th Symp. Fault-Tolerant Computing, CS Press, 1998, pp. 422-431.

3. R. Carr, "Tandem Global Update Protocol," Tandem Systems Rev., Vol. 1, No. 2, 1985.

Rod Gamache is responsible for leading the development of the NT cluster product. He has been at Microsoft for the past four years, working on networking and Windows NT clusters. Previously, Gamache was at Digital Equipment Corp. for 17 years, where he worked on porting Windows NT to Alpha, the VMS operating system, and networking.

Rob Short is director for development of core operating system infrastructure for Windows NT. Apart from a year working on interactive television, he has worked on Windows NT since starting at Microsoft in 1988. Before that he spent 15 years at Digital designing hardware and software. Short received an MS in computer science from the University of Washington.

Mike Massa is responsible for the networking components of Microsoft clusters. He has been a member of the Windows NT development team at Microsoft since 1991. His areas of focus are network protocols and distributed systems. Massa received a BSE in computer science from the University of Pennsylvania.

Contact the authors at Microsoft Corp., One Microsoft Way, Redmond, WA 98052; [email protected].
