
Architect of an Open World™

Published: June 2009

Dr. Patrice Calegari, HPC Application Specialist, BULL S.A.S. Thomas Varlet, HPC Technology Solution Professional, Microsoft

A Hybrid OS Cluster Solution Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon


The proof of concept presented in this document is neither a product nor a service offered by Microsoft or BULL S.A.S.

The information contained in this document represents the current view of Microsoft Corporation and BULL S.A.S. on the issues discussed as of the date of publication. Because Microsoft and BULL S.A.S. must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft or BULL S.A.S., and Microsoft and BULL S.A.S. cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT and BULL S.A.S. MAKE NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation and BULL S.A.S.

Microsoft and BULL S.A.S. may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft or BULL S.A.S., as applicable, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2008, 2009 Microsoft Corporation and BULL S.A.S. All rights reserved.

NovaScale is a registered trademark of Bull S.A.S.

Microsoft, Hyper-V, Windows, Windows Server, and the Windows logo are trademarks of the Microsoft group of companies.

PBS GridWorks®, GridWorks™, PBS Professional®, PBS™ and Portable Batch System® are trademarks of Altair Engineering, Inc.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Initial publication: release 1.2, 52 pages, published in June 2008
Minor updates: release 1.5, 56 pages, published in Nov. 2008
This paper with meta-scheduler implementation: release 2.0, 76 pages, published in June 2009


Abstract

The choice of an operating system (OS) for a high performance computing (HPC) cluster is a critical decision for IT departments. The goal of this paper is to show that simple techniques are available today to optimize the return on investment by making that choice unnecessary, and keeping the HPC infrastructure versatile and flexible. This paper introduces Hybrid Operating System Clusters (HOSC). An HOSC is an HPC cluster that can run several OS's simultaneously. This paper addresses the situation where two OS's are running simultaneously: Linux Bull Advanced Server for Xeon and Microsoft® Windows® HPC Server 2008. However, most of the information presented in this paper can apply, with slight adaptations, to 3 or more simultaneous OS's, possibly from other OS distributions. This document gives general concepts as well as detailed setup information. First, the technologies necessary to design an HOSC are defined (dual-boot, virtualization, PXE, resource manager and job scheduler). Second, different approaches to HOSC architectures are analyzed and technical recommendations are given, with a focus on computing performance and management flexibility. The recommendations are then implemented to determine the best technical choices for designing an HOSC prototype. The installation setup of the prototype and the configuration steps are explained. A meta-scheduler based on Altair PBS Professional is implemented. Finally, basic HOSC administrator operations are listed and ideas for future work are proposed.

This paper can be downloaded from the following web sites:

http://www.bull.com/techtrends

http://www.microsoft.com/downloads

http://technet.microsoft.com/en-us/library/cc700329(WS.10).aspx


ABSTRACT

1 INTRODUCTION

2 CONCEPTS AND PRODUCTS
  2.1 MASTER BOOT RECORD (MBR)
  2.2 DUAL-BOOT
  2.3 VIRTUALIZATION
  2.4 PXE
  2.5 JOB SCHEDULERS AND RESOURCE MANAGERS IN AN HPC CLUSTER
  2.6 META-SCHEDULER
  2.7 BULL ADVANCED SERVER FOR XEON
    2.7.1 Description
    2.7.2 Cluster installation mechanisms
  2.8 WINDOWS HPC SERVER 2008
    2.8.1 Description
    2.8.2 Cluster installation mechanisms
  2.9 PBS PROFESSIONAL

3 APPROACHES AND RECOMMENDATIONS
  3.1 A SINGLE OPERATING SYSTEM AT A TIME
  3.2 TWO SIMULTANEOUS OPERATING SYSTEMS
  3.3 SPECIALIZED NODES
    3.3.1 Management node
    3.3.2 Compute nodes
    3.3.3 I/O nodes
    3.3.4 Login nodes
  3.4 MANAGEMENT SERVICES
  3.5 PERFORMANCE IMPACT OF VIRTUALIZATION
  3.6 META-SCHEDULER FOR HOSC
    3.6.1 Goals
    3.6.2 OS switch techniques
    3.6.3 Provisioning and distribution policies

4 TECHNICAL CHOICES FOR DESIGNING AN HOSC PROTOTYPE
  4.1 CLUSTER APPROACH
  4.2 MANAGEMENT NODE
  4.3 COMPUTE NODES
  4.4 MANAGEMENT SERVICES
  4.5 HOSC PROTOTYPE ARCHITECTURE
  4.6 META-SCHEDULER ARCHITECTURE

5 SETUP OF THE HOSC PROTOTYPE
  5.1 INSTALLATION OF THE MANAGEMENT NODES
    5.1.1 Installation of the RHEL5.1 host OS with Xen
    5.1.2 Creation of 2 virtual machines
    5.1.3 Installation of XBAS management node on a VM
    5.1.4 Installation of InfiniBand driver on domain 0
    5.1.5 Installation of HPCS head node on a VM
    5.1.6 Preparation for XBAS deployment on compute nodes
    5.1.7 Preparation for HPCS deployment on compute nodes
    5.1.8 Configuration of services on HPCS head node
  5.2 DEPLOYMENT OF THE OPERATING SYSTEMS ON THE COMPUTE NODES
    5.2.1 Deployment of XBAS on compute nodes
    5.2.2 Deployment of HPCS on compute nodes
  5.3 LINUX-WINDOWS INTEROPERABILITY ENVIRONMENT
    5.3.1 Installation of the Subsystem for Unix-based Applications (SUA)
    5.3.2 Installation of the Utilities and SDK for Unix-based Applications
    5.3.3 Installation of add-on tools
  5.4 USER ACCOUNTS
  5.5 CONFIGURATION OF SSH
    5.5.1 RSA key generation
    5.5.2 RSA key
    5.5.3 Installation of freeSSHd on HPCS compute nodes
    5.5.4 Configuration of freeSSHd on HPCS compute nodes
  5.6 INSTALLATION OF PBS PROFESSIONAL
    5.6.1 PBS Professional Server setup
    5.6.2 PBS Professional setup on XBAS compute nodes
    5.6.3 PBS Professional setup on HPCS nodes
  5.7 META-SCHEDULER QUEUES SETUP
    5.7.1 Just in time provisioning setup
    5.7.2 Calendar provisioning setup

6 ADMINISTRATION OF THE HOSC PROTOTYPE
  6.1 HOSC SETUP CHECKING
  6.2 REMOTE REBOOT COMMAND
  6.3 SWITCH A COMPUTE NODE OS TYPE FROM XBAS TO HPCS
  6.4 SWITCH A COMPUTE NODE OS TYPE FROM HPCS TO XBAS
    6.4.1 Without sshd on the HPCS compute nodes
    6.4.2 With sshd on the HPCS compute nodes
  6.5 RE-DEPLOY AN OS
  6.6 SUBMIT A JOB WITH THE META-SCHEDULER
  6.7 CHECK NODE STATUS WITH THE META-SCHEDULER

7 CONCLUSION AND PERSPECTIVES

APPENDIX A: ACRONYMS
APPENDIX B: BIBLIOGRAPHY AND RELATED LINKS
APPENDIX C: MASTER BOOT RECORD DETAILS
  C.1 MBR STRUCTURE
  C.2 SAVE AND RESTORE MBR
APPENDIX D: FILES USED IN EXAMPLES
  D.1 WINDOWS HPC SERVER 2008 FILES
    D.1.1 Files used for compute node deployment
    D.1.2 Script for IPoIB setup
    D.1.3 Scripts used for OS switch
  D.2 XBAS FILES
    D.2.1 Kickstart and PXE files
    D.2.2 DHCP configuration
    D.2.3 Scripts used for OS switch
    D.2.4 Network interface bridge configuration
    D.2.5 Network hosts
    D.2.6 IB network interface configuration
    D.2.7 ssh host configuration
APPENDIX E: HARDWARE AND SOFTWARE USED FOR THE EXAMPLES
  E.1 HARDWARE
  E.2 SOFTWARE
APPENDIX F: ABOUT ALTAIR AND PBS GRIDWORKS
  F.1 ABOUT ALTAIR
  F.2 ABOUT PBS GRIDWORKS
APPENDIX G: ABOUT MICROSOFT AND WINDOWS HPC SERVER 2008
  G.1 ABOUT MICROSOFT
  G.2 ABOUT WINDOWS HPC SERVER 2008
APPENDIX H: ABOUT BULL S.A.S.


1 Introduction

The choice of the right operating system (OS) for a high performance computing (HPC) cluster can be a very difficult decision for IT departments, and this choice usually has a significant impact on the Total Cost of Ownership (TCO) of the cluster. Parameters such as diverse user needs, application environment requirements and security policies add to the complex human factors involved in training, maintenance and support planning, all of which put the final return on investment (ROI) of the whole HPC infrastructure at risk. The goal of this paper is to show that simple techniques are available today to make that choice unnecessary, and keep your HPC infrastructure versatile and flexible.

In this white paper we will study how to provide the best flexibility for running several OS’s on an HPC cluster. There are two main types of approaches to providing this service depending on whether a single operating system is selected each time the whole cluster is booted, or whether several operating systems are run simultaneously on the cluster. The most common approach of the first type is called the dual-boot cluster (described in [1] and [2]). For the second type of approach, we introduce the concept of a Hybrid Operating System Cluster (HOSC): a cluster with some computing nodes running one OS type while the remaining nodes run another OS type. Several approaches to both types are studied in this document in order to determine their properties (requirements, limits, feasibility, and usefulness) with a clear focus on computing performance and management flexibility.

The study is limited to 2 operating systems: Linux Bull Advanced Server for Xeon 5 v1.1 and Microsoft Windows HPC Server 2008 (noted XBAS and HPCS respectively in this paper). To optimize interoperability between the two OS worlds, we use the Subsystem for Unix-based Applications (SUA) for Windows. The description of the methodologies is as general as possible in order to apply to other OS distributions, but examples are given exclusively in the XBAS/HPCS context. The concepts developed in this document could apply to 3 or more simultaneous OS's with slight adaptations; however, this is outside the scope of this paper.

We introduce a meta-scheduler that provides a single submission point for both Linux and Windows. It selects the cluster nodes with the OS type required by submitted jobs. The OS type of compute nodes can be switched automatically and safely without administrator intervention. This optimizes computational workloads by adapting the distribution of OS types among the compute nodes.

A technical proof of concept is given by designing, installing and running an HOSC prototype. This prototype can provide computing power under both XBAS and HPCS simultaneously. It has two virtual management nodes (aka head nodes) on a single server, and the distribution of OS types among the compute nodes can be changed dynamically. We chose Altair PBS Professional software to demonstrate a meta-scheduler implementation. This project is the result of the collaborative work of Microsoft and Bull.

Chapter 2 defines the main technologies used in an HOSC: the Master Boot Record (MBR), the dual-boot method, virtualization, the Pre-boot eXecution Environment (PXE), and resource manager and job scheduler tools. If you are already familiar with these concepts, you may want to skip this chapter and go directly to Chapter 3, which analyzes different approaches to HOSC architectures and gives technical recommendations for their design. The recommendations are implemented in Chapter 4 in order to determine the best technical choices for building an HOSC prototype. The installation setup of the prototype and the configuration steps are explained in Chapter 5. Appendix D shows the files that were used during this step. Finally, basic HOSC administrator operations are listed in Chapter 6 and ideas for future work are proposed in Chapter 7, which concludes this paper.

This document is intended for computer scientists who are familiar with HPC cluster administration.

All acronyms used in this paper are listed in Appendix A. Complementary information can be found in the documents and web pages listed in Appendix B.


2 Concepts and products

We assume that the readers may not be familiar with every concept discussed in the remaining chapters in both Linux and Windows environments. Therefore, this chapter introduces the technologies (Master Boot Record, Dual-boot, virtualization and Pre-boot eXecution Environment) and products (Linux Bull Advanced Server, Windows HPC Server 2008 and PBS Professional) mentioned in this document.

If you are already familiar with these concepts or are more interested in general Hybrid OS Cluster (HOSC) considerations, you may want to skip this chapter and go directly to Chapter 3.

2.1 Master Boot Record (MBR)

The 512-byte boot sector is called the Master Boot Record (MBR). It is the first sector of a partitioned data storage device such as a hard disk. The MBR is usually overwritten by operating system (OS) installation procedures; the MBR previously written on the device is then lost.

The MBR includes the partition table of the 4 primary partitions and a bootstrap code that can start the OS or load and run the boot loader code (see the complete MBR structure in Table 3 of Appendix C.1). A partition is encoded as a 16-byte structure with size, location and characteristic fields. The first 1-byte field of the partition structure is called the boot flag.

The Windows MBR starts the OS installed on the active partition. The active partition is the first primary partition that has its boot flag enabled. You can select an OS by activating the partition where it is installed. The diskpart.exe and fdisk tools can be used to change partition activation. Appendix D.1.3 and Appendix D.2.3 give examples of commands that enable/disable the boot flag.
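As an illustration, here is a minimal sketch of how the boot flag could be toggled from each OS, assuming the two OS's are installed on partitions 1 and 2 of the first disk (device names and partition numbers are example values; the commands actually used on the prototype are those listed in Appendix D.1.3 and D.2.3):

    # From Linux, with fdisk (interactive): toggle the boot flag of partition 2 on /dev/sda
    fdisk /dev/sda        # then type: a <enter> 2 <enter> w <enter>

    # From Windows, with diskpart: mark partition 1 of disk 0 as active
    diskpart
    DISKPART> select disk 0
    DISKPART> select partition 1
    DISKPART> active
    DISKPART> exit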

The Linux MBR can run a boot loader (e.g., GRUB or LILO). You can then select an OS interactively from its user interface at the console. If no choice is made at the console, the OS selection is taken from the boot loader configuration file, which you can edit in advance before a reboot (e.g., grub.conf for the GRUB boot loader). If necessary, the Linux boot loader configuration file (which resides in a Linux partition) can be replaced from a Windows command line with the dd.exe tool.
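For example, assuming a grub.conf in which entry 0 boots Linux and entry 1 chain-loads Windows (a hypothetical layout), the OS started at the next reboot can be selected non-interactively by editing the default entry:

    # /boot/grub/grub.conf (excerpt): entry 0 boots Linux, entry 1 chain-loads Windows
    default=1        # boot the second entry (Windows) at the next reboot
    timeout=5

    # the same change can be scripted from Linux, e.g.:
    sed -i 's/^default=.*/default=1/' /boot/grub/grub.conf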

Appendix C.2 explains how to save and restore the MBR of a device. It is very important to understand how the MBR works in order to properly configure dual-boot systems.
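As a quick illustration of what Appendix C.2 details, the 512-byte MBR of a disk can typically be saved and restored with dd (assuming the disk is /dev/sda; paths are example values):

    # save the MBR (bootstrap code and partition table) to a file
    dd if=/dev/sda of=/root/sda.mbr bs=512 count=1
    # restore the full MBR later
    dd if=/root/sda.mbr of=/dev/sda bs=512 count=1
    # restore only the bootstrap code, leaving the current partition table untouched
    dd if=/root/sda.mbr of=/dev/sda bs=446 count=1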

2.2 Dual-boot

Dual-booting is an easy way to have several operating systems (OS) on a node. When an OS is run, it has no interaction with the other OS installed, so the native performance of the node is not affected by the use of the dual-boot feature. The only limitation is that these OS's cannot be run simultaneously.

When designing a dual-boot node, the following points should be analyzed:

• The choice of the MBR (and choice of the boot loader if applicable)


• The disk partition restrictions (for example, Windows must have a system partition on at least one primary partition of the first device)

• The compatibility with Logical Volume Managers (LVM). For example, RHEL5.1 LVM creates a logical volume with the entire first device by default and this makes it impossible to install a second OS on this device.

When booting a computer, the dual-boot feature gives the ability to choose which OS to start among the multiple OS's installed on that computer. At boot time, the way you can select the OS of a node depends on the installed MBR. A dual-boot method that relies on the Linux MBR and GRUB is described in [1]. Another dual-boot method that exploits the properties of active partitions is described in [2] and [3].

2.3 Virtualization

The virtualization technique is used to hide the physical characteristics of computers and present only a logical abstraction of these characteristics. Virtual Machines (VM) can be created by the virtualization software: each VM has virtual resources (CPUs, memory, devices, network interfaces, etc.) whose characteristics (quantity, size, etc.) are independent of those available on the physical server. The OS installed in a VM is called a guest OS: the guest OS can only access the virtual resources available in its VM. Several VMs can be created and run on one physical node. These VMs appear as physical machines to the applications, the users and the other nodes (physical or virtual).

Virtualization is interesting in the context of our study for two reasons:

1. It makes it possible to install several management nodes (MN) on a single physical server. This is an important point for installing several OS's on a cluster without increasing its cost with an additional physical MN server.

2. It provides a fast and rather easy way to switch from one OS to another: by starting a VM that runs an OS while suspending another VM that runs another OS.

A hypervisor is a software layer that runs at a higher privilege level on the hardware. The virtualization software runs in a privileged partition (domain 0 or dom0), from where it controls how the hypervisor allocates resources to the virtual machines. The other domains, where the VMs run, are called unprivileged domains and are noted domU. A hypervisor normally enforces scheduling policies and memory boundaries. In some Linux implementations it also provides access to hardware devices via its own drivers; on Windows, it does not.
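With Xen, for instance, this control is exercised from dom0 with the xm toolstack; a minimal sketch, assuming a domU configuration file named /etc/xen/xbas-mn.cfg (a hypothetical name), could look like this:

    xm list                          # list running domains (dom0 and domUs) with their CPU/memory
    xm create /etc/xen/xbas-mn.cfg   # start the VM described by this configuration file
    xm pause xbas-mn                 # temporarily suspend the VM (it keeps its memory)
    xm unpause xbas-mn               # resume it
    xm shutdown xbas-mn              # cleanly shut the guest OS down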

The virtualization software can be:

• Host-based (like VMware): this means that the virtualization software is installed on a physical server with a classical OS called the host OS.

• Hypervisor-based (like Windows Server® 2008 Hyper-V™ and Xen): in this case, the hypervisor runs at a lower level than the OS. The “host OS” becomes just another VM that is automatically started at boot time. Such virtualization architecture is shown in Figure 1.


Figure 1 - Overview of hypervisor-based virtualization architecture

“Full virtualization” is an approach which requires no modification to the hosted operating system, providing the illusion of a complete system of real hardware devices. Such Hardware Virtual Machines (HVM) require hardware support, provided for example by Intel® Virtualization Technology (VT) and AMD-V technology. Recent Intel® Xeon® processors support full virtualization thanks to Intel® VT. “Para-virtualization” is an approach which requires modifications to the operating system in order to run in a VM. Windows is only supported on fully-virtualized VMs, not on para-virtualized VMs.

The market provides many virtualization software packages among which:

• Xen [6]: free software for Linux, included in the RHEL5 distribution, which allows a maximum of 8 virtual CPUs per virtual machine (VM). Oracle VM and Sun xVM VirtualBox are commercial implementations.

• VMware [7]: commercial software for Linux and Windows which allows a maximum of 4 virtual CPUs per VM.

• Hyper-V [8]: a solution provided by Microsoft which only works on Windows Server 2008 and allows only 1 virtual CPU per VM for non-Windows VM.

• PowerVM [9] (formerly Advanced POWER Virtualization): an IBM solution for UNIX and Linux on most processor architectures that does not support Windows as a guest OS.


• Virtuozzo [10]: a ‘Parallels, Inc’ solution designed to deliver near native physical performance. It only supports VMs that run the same OS as the host OS (i.e., Linux VMs on Linux hosts and Windows VMs on Windows hosts).

• OpenVZ [11]: an operating system-level virtualization technology licensed under GPL version 2. It is a basis of Virtuozzo [10]. It requires both the host and guest OS to be Linux, possibly of different distributions. It has a low performance penalty compared to a standalone server.

2.4 PXE

The Pre-boot eXecution Environment (PXE) is an environment for booting computers from a network interface, independently of available data storage devices or installed OS's. The end goal is to allow a client to network boot and receive a network boot program (NBP) from a network boot server.

In a network boot operation, the client computer will:

1. Obtain an IP address to gain network connectivity: when a PXE-enabled boot is initiated, the PXE-based ROM requests an IP address from a Dynamic Host Configuration Protocol (DHCP) server using the normal DHCP discovery process (see the detailed process in Figure 2). It will receive from the DHCP server an IP address lease, information about the correct boot server and information about the correct boot file.

2. Discover a network boot server: with the information from the DHCP server the client establishes a connection to the PXE servers (TFTP, WDS, NFS, CIFS, etc.).

3. Download the NBP file from the network boot server and execute it: the client uses Trivial File Transfer Protocol (TFTP) to download the NBP. Examples of NBP are: pxelinux.0 for Linux and WdsNbp.com for Windows Server.

When booting a compute node with PXE, the goal can be to install or run it with an image deployed through the network, or just to run it with an OS installed on its local disk. In the latter case, the PXE server simply answers the compute node's request by indicating that it must boot from the next boot device listed in its BIOS.
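For example, with the ISC DHCP server used on a Linux management node, a per-node PXE entry in /etc/dhcpd.conf might look like the following sketch (host name, MAC and IP addresses are example values matching the XBAS examples used later in this paper; the actual DHCP configuration of the prototype is listed in Appendix D.2.2):

    host xbas1 {
        hardware ethernet 00:30:19:D6:77:8A;   # MAC address of the compute node
        fixed-address 192.168.0.2;             # IP address leased to the node
        next-server 192.168.0.1;               # TFTP/PXE boot server (the management node)
        filename "pxelinux.0";                 # network boot program (NBP) to download
        option host-name "xbas1";
    }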

Figure 2 - DHCP discovery process (the node, a boot protocol client on port 68 broadcasting from IP source 0.0.0.0, and the DHCP server, a boot protocol server on port 67, exchange four messages: 1. DHCPDISCOVER with the client MAC address, 2. DHCPOFFER with a new IP address, 3. DHCPREQUEST for that IP address, 4. DHCPACK with the IP address and boot information)


2.5 Job schedulers and resource managers in an HPC cluster

In an HPC cluster, a resource manager (aka Distributed Resource Management System (DRMS) or Distributed Resource Manager (DRM)) gathers information about all cluster resources that can be used by application jobs. Its main goal is to give accurate resource information about the cluster usage to a job scheduler.

A job scheduler (aka batch scheduler or batch system) is in charge of unattended background executions. It provides a user interface for submitting, monitoring and terminating jobs. It is usually responsible for the optimization of job placement on the cluster nodes. For that purpose it deals with resource information, administrator rules and user rules: job priority, job dependencies, resource and time limits, reservations, specific resource requirements, parallel job management, process binding, etc. Over time, job schedulers and resource managers have evolved in such a way that they are now usually integrated under a single product name. Noteworthy products include:

• PBS Professional [12]: supported by Altair for Linux/Unix and Windows

• Torque [13]: an open source job scheduler based on the original PBS project. It can be used as a resource manager by other schedulers (e.g., Moab workload manager).

• SLURM (Simple Linux Utility for Resource Management) [14]: freeware and open source

• LSF (Load Sharing Facility) [15]: supported by Platform for Linux/Unix and Windows

• SGE (Sun Grid Engine) [16]: supported by Sun Microsystems

• OAR [17]: freeware and open source for Linux, AIX and SunOS/Solaris

• Microsoft Windows HPC Server 2008 job scheduler: included in the Microsoft HPC pack [5]

2.6 Meta-scheduler

According to Wikipedia [18], “Meta-scheduling or Super scheduling is a computer software technique of optimizing computational workloads by combining an organization's multiple Distributed Resource Managers into a single aggregated view, allowing batch jobs to be directed to the best location for execution”. In this paper, we consider that the meta-scheduler is able to submit jobs on cluster nodes with heterogeneous OS types and that it can automatically switch the OS type of these nodes when necessary (for optimizing computational workloads). Here is a partial list of meta-schedulers currently available:

• Moab Grid Suite and Maui Cluster scheduler [19]: supported by Cluster Resources, Inc.

• GridWay [20]: a Grid meta-scheduler by the Globus Alliance

• CSF (Community Scheduler Framework) [21]: an open source framework (an add-on to the Globus Toolkit v.3) for implementing a grid meta-scheduler, developed by Platform Computing

Recent job schedulers can sometimes be adapted and configured to behave as “simple” meta-schedulers.
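For instance, a job scheduler such as PBS Professional can expose one queue per OS type on a hybrid cluster, and users simply submit to the queue that matches their application. A minimal sketch (the queue names linuxq and windowsq and the job scripts are hypothetical; the actual queue setup of the prototype is described in Section 5.7):

    # submit a Linux job to the XBAS part of the cluster
    qsub -q linuxq -l select=4:ncpus=8 my_linux_job.sh
    # submit a Windows job to the HPCS part of the cluster
    qsub -q windowsq -l select=4:ncpus=8 my_windows_job.bat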


2.7 Bull Advanced Server for Xeon

2.7.1 Description

Bull Advanced Server for Xeon (XBAS) is a robust and efficient Linux solution that delivers total cluster management. It addresses each step of the cluster lifecycle with a centralized administration interface: installation, fast and reliable software deployments, topology-aware monitoring and fault handling (to dramatically lower time-to-repair), cluster optimization and expansion. Integrated, tested and supported by Bull [4], XBAS federates the very best of open source components, complemented by leading software packages from well-known Independent Software Vendors, and gives them a consistent view of the whole HPC cluster through a common cluster database: the clusterdb. XBAS is fully compatible with standard Red Hat Enterprise Linux (RHEL). The latest Bull Advanced Server for Xeon 5 release (v3.1) is based on RHEL5.3¹.

2.7.2 Cluster installation mechanisms

The installation of an XBAS cluster starts with the setup of the management node (see the installation & configuration guide [22]). The compute nodes are then deployed by automated tools.

BIOS settings must be set so that XBAS compute nodes boot from the network with PXE by default. The PXE files stored on the management node indicate whether a given compute node should be installed (i.e., its DEFAULT label is ks) or whether it is ready to be run (i.e., its DEFAULT label is local_primary).

In the first case, a new OS image should be deployed². During the PXE boot process, the operations to be executed on the compute node are written in the kickstart file. Tools based on PXE are provided by XBAS to simplify the installation of compute nodes. The “preparenfs” tool writes the configuration files with the information given by the administrator and with that found in the clusterdb. The generated configuration files are: the PXE files (e.g., /tftpboot/C0A80002), the DHCP configuration file (/etc/dhcpd.conf), the kickstart file (e.g., /release/ks/kickstart) and the NFS export file (/etc/exportfs). No user interface access (remote or local) to the compute node is required during its installation phase with the preparenfs tool. Figure 3 shows the sequence of interactions between a new XBAS compute node being installed and the servers running on the management node (DHCP, TFTP and NFS). On small clusters, the “preparenfs” tool can be used to install every CN. On large clusters, the ksis tool can be used to optimize the total deployment time of the cluster by cloning the first CN installed with the “preparenfs” tool.
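As an illustration, the PXE file generated for a compute node with IP address 192.168.0.2 (hexadecimal C0A80002) that is to be installed could look like the following sketch, reconstructed from the elements of Figure 3 (the exact files are given in Appendix D.2.1):

    # /tftpboot/C0A80002 (installation case, DEFAULT label is ks)
    DEFAULT ks
    LABEL ks
        KERNEL RHEL5.1/vmlinuz
        APPEND ksdevice=eth0 ip=dhcp ks=nfs:192.168.0.1:/release/ks/kickstart initrd=RHEL5.1/initrd.img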

In the second case, the CN is already installed and just needs to boot from its local disk. Figure 4 shows the XBAS compute node normal boot scheme.
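The corresponding PXE file for a node that is ready to run simply chain-loads the first local disk, along the following lines (again a sketch reconstructed from Figure 4):

    # /tftpboot/C0A80002 (local boot case, DEFAULT label is local_primary)
    DEFAULT local_primary
    LABEL local_primary
        KERNEL chain.c32
        APPEND hd0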

¹ The Bull Advanced Server for Xeon 5 release that was used to illustrate examples in this paper is v1.1, based on RHEL5.1, because this was the latest release when we built the first prototypes in May 2008.

² In this document, we define the “deployment of an OS” as the installation of a given OS on several nodes from a management node. A more restrictive definition that only applies to the duplication of OS images on the nodes is often used in technical literature.


Figure 3 – XBAS compute node PXE installation scheme (PXE/DHCP/TFTP/NFS exchanges between the management node, 192.168.0.1, and the compute node xbas1, 192.168.0.2: the node boots pxelinux.0, loads the RHEL5.1 kernel and initrd over TFTP, and installs RHEL5.1 over NFS following the kickstart file /release/ks/kickstart)

Figure 4 – XBAS compute node PXE boot scheme (the PXE file C0A80002 has the local_primary label, so the node loads chain.c32 and boots XBAS from its local disk)


2.8 Windows HPC Server 2008

2.8.1 Description

Microsoft Windows HPC Server 2008 (HPCS), the successor to Windows Compute Cluster Server (WCCS) 2003, is based on the Windows Server 2008 operating system and is designed to increase productivity, scalability and manageability. This new name reflects Microsoft HPC's readiness to tackle the most challenging HPC workloads [5]. HPCS includes key features such as new high-speed networking, highly efficient and scalable cluster management tools, advanced failover capabilities, a service-oriented architecture (SOA) job scheduler, and support for partners' clustered file systems. HPCS gives access to an HPC platform that is easy to deploy, operate, and integrate with existing enterprise infrastructures.

2.8.2 Cluster installation mechanisms

The installation of a Windows HPC cluster starts with the setup of the head node (HN). For the deployment of a compute node (CN), HPCS uses Windows Deployment Services (WDS), which fully installs and configures HPCS and adds the new node to the set of Windows HPC compute nodes. WDS is a deployment tool provided by Microsoft; it is the successor of Remote Installation Services (RIS), and it handles the whole compute node installation process and acts as a TFTP server.

During the first installation step, Windows Preinstallation Environment (WinPE) is the boot operating system. It is a lightweight version of Windows Server 2008 that is used for the deployment of servers. It is intended as a 32-bit or 64-bit replacement for MS-DOS during the installation phase of Windows, and can be booted via PXE, CD-ROM, USB flash drive or hard disk.

BIOS settings should be set so that HPCS compute nodes boot from the network with PXE (we assume that a private network exists and that CNs send PXE requests there first). From the head node's point of view, a compute node must be deployed if it does not have any entry in Active Directory (AD), or if the cluster administrator has explicitly specified that it must be re-imaged. When a compute node with no OS boots, it first sends a DHCP request in order to get an IP address, a valid network boot server and the name of a network boot program (NBP). Once the DHCP server has answered, the CN downloads the NBP, called WdsNbp.com, from the WDS server. Its purpose is to detect the architecture and to wait for further downloads from the WDS server.

Then, on the HPCS administration console of the head node, the new compute node appears as “pending approval”. The installation starts once the administrator assigns a deployment template to it. A WinPE image is sent to and booted on the compute node, files are transferred in order to prepare the Windows Server 2008 installation, and an unattended installation of Windows Server 2008 is performed. Finally, the compute node is joined to the domain and the cluster. Figure 5 shows the details of the PXE boot operations executed during the installation procedure.
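To give an idea of the files involved, the disk-partitioning script applied by WinPE before the unattended installation (diskpart.txt in Figure 5) could be as simple as the following sketch (hypothetical content; the files actually used for compute node deployment are listed in Appendix D.1.1):

    rem diskpart script: wipe the first disk and create one active NTFS system partition
    select disk 0
    clean
    create partition primary
    select partition 1
    active
    format fs=ntfs quick
    assign letter=C
    exit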

If the CN has already been installed, AD already contains the corresponding computer object, so the WDS server sends it an NBP called abortpxe.com, which boots the server using the next boot item in the BIOS without waiting for a timeout. Figure 6 shows the PXE boot operations executed in this case.


Figure 5 – HPCS compute node PXE installation scheme (the node downloads WdsNbp.com from the WDS server, waits for head node approval, boots WinPE with BOOT.WIM, partitions the disk with diskpart.txt, performs the unattended Windows Server 2008 installation with unattend.xml, installs the HPC Pack and joins the domain and the cluster)

Figure 6 – HPCS compute node PXE boot scheme (the node already has an AD account, so the WDS server sends abortpxe.com and the node boots Windows Server 2008 from its local disk)


2.9 PBS Professional

This section presents PBS Professional, the job scheduler that we used as a meta-scheduler for building the HOSC prototype described in Chapter 5. PBS Professional is part of the PBS GridWorks software suite. It is the professional version of the Portable Batch System (PBS), a flexible workload management system originally developed to manage aerospace computing resources at NASA. PBS Professional has since become the leader in supercomputer workload management and the de facto standard on Linux clusters. A few of the more important features of PBS Professional 10 are listed below:

• Enterprise-wide Resource Sharing provides transparent job scheduling on any PBS system by any authorized user. Jobs can be submitted from any client system both local and remote.

• Multiple User Interfaces provides a traditional command line and a graphical user interface for submitting batch and interactive jobs; querying job, queue, and system status; and monitoring jobs.

• Job Accounting offers detailed logs of system activities for charge-back or usage analysis per user, per group, per project, and per compute host.

• Parallel Job Support works with parallel programming libraries such as MPI. Applications can be scheduled to run within a single multi-processor computer or across multiple systems.

• Job-Interdependency enables the user to define a wide range of interdependencies between jobs.

• Computational Grid Support provides an enabling technology for metacomputing and computational grids.

• Comprehensive API includes a complete Application Programming Interface (API).

• Automatic Load-Leveling provides numerous ways to distribute the workload across a cluster of machines, based on hardware configuration, resource availability, keyboard activity, and local scheduling policy.

• Common User Environment offers users a common view of the job submission, job querying, system status, and job tracking over all systems.

• Cross-System Scheduling ensures that jobs do not have to be targeted to a specific computer system. Users may submit their job, and have it run on the first available system that meets their resource requirements.

• Job Priority allows users to specify the priority of their jobs.

• Username Mapping provides support for mapping user account names on one system to the appropriate name on remote server systems. This allows PBS Professional to fully function in environments where users do not have a consistent username across all hosts.

• Broad Platform Availability is achieved through support of Windows and every major version of UNIX and Linux, from workstations and servers to supercomputers.


3 Approaches and recommendations

In this chapter, we explain the different approaches to offering several OS's on a cluster. The approaches discussed in Sections 3.1 and 3.2 are summarized in Table 1 at the end of Section 3.1.

3.1 A single operating system at a time

Let us examine the case where all nodes run the same OS. The OS of the cluster is selected at boot time. Switching from one OS to another can be done by:

• Re-installing the selected OS on the cluster if necessary. But since this process can be long it is not realistic for frequent changes. This is noted as approach 1 in Table 1.

• Deploying a new OS image on the whole cluster depending on the OS choice. The deployment can be done on local disks or in memory with diskless compute nodes. It is difficult to deal with the OS change on the management node in such an environment: either the management node is dual-booted (this is approach 7 in Table 1), or an additional server is required to distribute the OS image of the MN. This can be interesting in some specific cases, for example on HPC clusters with diskless CNs when OS switches are rare. Otherwise, this approach is not very convenient. The deployment technique can be used in a more appropriate manner for clusters with 2 simultaneous OS's (i.e., 2 MNs); this will be shown in the next section with approaches 3 and 11.

• Dual-booting the selected OS from dual-boot disks. Dual-booting the whole cluster (management and compute nodes) is a good and very practical solution that was introduced in [1] and [2]. This approach, noted 6 in Table 1, is the easiest way to install and manage a cluster with several OS's, but it only applies to small clusters with few users, when no flexibility is required. If only the MNs are on a dual-boot server while the CNs are installed with a single OS (half of the CNs having one OS while the others have another), the solution makes no sense, because only half of the cluster can be used at a time (this is approach 5). If the MNs are on a dual-boot server while the CNs are installed in VMs (2 VMs being installed on each compute server), the solution does not make much sense either, because the added value of using VMs (quick OS switching, for instance) is cancelled out by the need to reboot the MN server (this is approach 8).

Whatever the OS switch method, a complete cluster reboot is needed at each change. This implies cluster unavailability during reboots, a need for OS usage schedules and potential conflicts between user needs, hence a real lack of flexibility.

In Table 1, approaches 1, 5, 6, 7, and 8 define clusters that can run 2 OS's, but not simultaneously. Even if such clusters do not fit the Hybrid Operating System Cluster (HOSC) definition given in Chapter 1, they can be considered as a simplified approach to its concept.


Table 1 crosses the compute node (CN) configuration for 2 CNs with 2 different OS's (columns: 1 OS per server on 2 servers; dual-boot on 1 server; OS image deployment on 1 server; virtualization with 2 CNs simultaneously on 1 server) with the management node (MN) configuration for 2 MNs with 2 different OS's (rows: 1 OS per server on 2 servers; dual-boot on 1 server; virtualization with 2 MNs simultaneously on 1 server). The 12 resulting approaches are:

MN configuration: 1 OS per server (2 servers)

1 - CNs: 1 OS per server (2 servers). Starting point: 2 half-size independent clusters with 2 OS's, or 1 full-size single-OS cluster re-installed with a different OS when needed: the expensive solution without flexibility.

2 - CNs: dual-boot (1 server). Good HOSC solution for large clusters with an OS flexibility requirement.

3 - CNs: OS image deployment (1 server). An HOSC solution that can be interesting for large clusters: with diskless CNs or when the OS type of CNs is rarely switched.

4 - CNs: virtualization (2 CNs simultaneously on 1 server). An HOSC solution with potential performance issues on compute nodes and extra cost for the additional management node.

MN configuration: dual-boot (1 server)

5 - CNs: 1 OS per server (2 servers). This "single OS at a time" solution makes absolutely no sense since only half of the CNs can be used at a time.

6 - CNs: dual-boot (1 server). Good classical dual-boot cluster solution.

7 - CNs: OS image deployment (1 server). A "single OS at a time" solution that can only be interesting for diskless CNs.

8 - CNs: virtualization (2 CNs simultaneously on 1 server). Having virtual CNs makes no real sense since the MN must be rebooted to switch the OS.

MN configuration: virtualization (2 MNs simultaneously on 1 server)

9 - CNs: 1 OS per server (2 servers). 2 half-size independent clusters with a single MN server: a bad HOSC solution with no flexibility and very little cost saving.

10 - CNs: dual-boot (1 server). Good HOSC solution for medium-sized clusters with an OS flexibility requirement (without additional hardware cost).

11 - CNs: OS image deployment (1 server). An HOSC solution that can be interesting for small clusters with diskless CNs.

12 - CNs: virtualization (2 CNs simultaneously on 1 server). Every node is virtual: the most flexible HOSC solution but with too many performance uncertainties at the moment.

Table 1 - Possible approaches to HPC clusters with 2 operating systems


3.2 Two simultaneous operating systems

The idea is to provide, within a single HPC cluster, the capability to run several OS's simultaneously. This is what we defined as a Hybrid Operating System Cluster (HOSC) in Chapter 1. Each compute node (CN) does not need to run every OS simultaneously: a single OS can run on a given CN while another OS runs on other CNs at the same time. The CNs can be dual-boot servers, diskless servers, or virtual machines (VM). The cluster is managed from separate management nodes (MN) with different OS's. The MNs can be installed on several physical servers or on several VMs running on a single server. In Table 1, approaches 2, 3, 4, 9, 10, 11 and 12 are HOSCs.

HPC users may consider HPC clusters with two simultaneous OS’s rather than a single OS at a time for four main reasons:

1. To improve resource utilization and adapt the workload dynamically by easily changing the ratio of OS’s (e.g., Windows vs. Linux compute nodes) in a cluster for different kinds of usage.

2. To be able to migrate smoothly from one OS to the other, giving time to port applications and train users.

3. Simply to be able to try a new OS without stopping the already installed one (e.g., install an HPCS cluster at low cost on an existing Bull Linux cluster, or install a Bull Linux cluster at low cost on an existing HPCS cluster).

4. To integrate specific OS environments (e.g., with legacy OS’s and applications) in a global IT infrastructure.

The simplest approach for running 2 OS’s on a cluster is to install each OS on half (or at least a part) of the cluster when it is built. This approach is equivalent to building 2 single OS clusters! Therefore it cannot be classified as a cluster with 2 simultaneous OS’s. Moreover, this solution is expensive with its 2 physical MN servers and it is absolutely not flexible since the OS distribution (i.e., the OS allocation to nodes) is fixed in advance. This approach is similar to approach 1 already discussed in the previous section.

An alternative to this first approach is to use a single physical server with 2 virtual machines for installing the 2 MNs. In this case there is no additional hardware cost but there is still no flexibility for the choice of the OS distribution on the CNs since this distribution is done when the cluster is built. This approach is noted 9.

On clusters with dual-boot CNs the OS distribution can be dynamically adapted to the user and application needs. The OS of a CN can be changed just by rebooting the CN aided by a few simple dual-boot operations (this will be demonstrated in Sections 6.3 and 6.4). With such dual-boot CNs, the 2 MNs can be on a single server with 2 VMs: this approach, noted 10, is very flexible and requires no additional hardware cost. It is a good HOSC solution, especially for medium-sized clusters.


With dual-boot CNs, the 2 MNs can also be installed on 2 physical servers instead of 2 VMs: this approach, noted 2, can only be justified on large clusters because of the extra cost due to a second physical MN.

A new OS image can be (re-)deployed on a CN on request. This technique allows changing the OS distribution on CNs on a cluster quite easily. However, this is mainly interesting for clusters with diskless CNs because re-deploying an OS image for each OS switch is slower and consumes more network bandwidth than the other techniques discussed in this paper (dual-boot or virtualization). This can also be interesting if the OS type of CNs is not switched too frequently. The MNs can then be installed in 2 different ways: either the MNs are installed on 2 physical servers (this is approach 3 that is interesting for large clusters with diskless CNs or when the OS type of CNs is rarely switched) or they are installed on 2 VMs (this is approach 11 that is interesting for small and medium size diskless clusters).

The last technique for installing 2 CNs on a single server is to use virtual machines (VM). In this case, every VM can be up and running simultaneously, or only a single VM may run on each compute server while the others are suspended. The switch from one OS to another can then be done very quickly. Using several virtual CNs of the same server simultaneously is not recommended since the total performance of the VMs is bounded by the native performance of the physical server, so no benefit can be expected from such a configuration. Installing CNs on VMs makes it easier and quicker to switch from one OS to another compared to a dual-boot installation, but the performance of the CNs may be decreased by the computing overhead of the virtualization software layer. Section 3.5 briefly presents articles that analyze the performance impact of virtualization for HPC. Once again, the 2 MNs can be installed on 2 physical servers (this is approach 4, for large clusters), or they can be installed on 2 VMs (this is approach 12, for small and medium-sized clusters). This latter approach is 100% virtual, with only virtual nodes. It is the most flexible solution, and very promising for the future; however it is too early to use it now because of performance uncertainties.

For the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the host OS can be Linux or Windows and any virtualization software could be used. The 6 approaches using VMs thus allow dozens of possible virtualization implementations.

The key points to check for choosing the right virtualization environment are listed here by order of importance:

1. List of supported guest OS’s

2. Virtual resource limitations (maximum number of virtual CPUs, maximum number of network interfaces, virtual/physical CPU binding features, etc.)

3. Impact on performance (CPU cycles, memory access latency and bandwidth, I/Os, MPI optimizations)

4. VM management environment (tools and interfaces for VM creation, configuration and monitoring)


Also, for the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the 2 nodes can be configured on 2 VMs, or one can be a VM while the other is installed directly on the server host OS. When upgrading an existing HPC cluster from a classical single-OS configuration to an HOSC configuration, it might look attractive at first glance to configure an MN (or a CN) on the host OS. For example, one virtual machine could be created on an existing management node and the second management node could be installed on this VM. Even though this configuration looks quick and easy to set up, it should never be used. Indeed, running applications or consuming resources on the host OS is not a recommended virtualization practice: it creates a non-symmetrical situation between applications running on the host OS and those running on the VM, which may lead to load-balancing issues and resource access failures.

On an HOSC with dual-boot CNs, re-deployed CNs or virtual CNs, the OS distribution can be changed dynamically without disturbing the other nodes. This could even be done automatically by a resource manager in a unified batch environment3.

The dual-boot technique limits the number of OS's installed on a server because only 4 primary partitions can be declared in the MBR. So, on an HOSC, if more OS's are necessary and no primary partition is available anymore, the best solution is to install virtual CNs and to run them one at a time on each compute server, the others being suspended (depending on the selected OS for that CN). The MNs should be installed on VMs as much as possible (as in approach 12), but several physical servers can be necessary (as in approach 4). This can happen in the case of large clusters, for which the cost of an additional server is negligible. It can also be necessary in order to keep a good level of performance when many OS's are installed on the HOSC and thus many MNs are needed.

3.3 Specialized nodes

In an HPC cluster, specialized nodes dedicated to certain tasks are often used. The goal is to distribute roles, for example, in order to reduce the management node (MN) load. We can usually distinguish 4 types of specialized nodes: the management nodes, the compute nodes (CN), the I/O nodes and the login nodes. A cluster usually has 1 MN and many CNs. It can have several login and I/O nodes. On small clusters, a node can be dedicated to several roles: a single node can be a management, login and I/O node simultaneously.

3.3.1 Management node

The management node (MN), named Head Node (HN) in the HPCS documentation, is dedicated to providing services (infrastructure, scheduler, etc.) and to running the cluster management software. It is responsible for the installation and setup of the compute nodes (e.g., OS image deployment).

3.3.2 Compute nodes

The compute nodes (CN) are dedicated to computation. They are optimized for code execution, so they run a limited number of services. Users are not supposed to log in to them.

3 The batch solution is not investigated in this study but could be considered in the future.


3.3.3 I/O nodes

I/O nodes are in charge of input/output requests for the file systems.

For I/O intensive applications, an I/O node is necessary to reduce the MN load. This is especially true when the MNs are installed on virtual machines (VM): when a virtual MN handles heavy I/O requests, it can dramatically impact the I/O performance of the second virtual MN.

If an I/O node is aimed at serving nodes with different OS's then it must have at least one network interface for each OS subnet (i.e., a subnet that is declared for every node that runs with the same OS). Sections 4.4 and 4.5 show an example of OS subnets.

An I/O node could be installed with Linux or Windows for configuring an NFS server. NFS clients and servers are supported on both OS's. But the Lustre file system (delivered by Bull with XBAS) is not available for Windows clusters, so Lustre I/O nodes can only be installed on Linux I/O nodes (for the Linux CN usage only4). Other commercial cluster / parallel file systems are available for both Linux and Windows (e.g., CXFS).

The I/O node can serve one file system shared by both OS nodes or two independent file systems (one for each OS subnet). In the case of 2 independent file systems, 1 or 2 I/O nodes can be used.

3.3.4 Login nodes

Login nodes are used as the cluster front end for user login, code compilation and data visualization. They are used in particular to:

• login

• develop, edit and compile programs

• debug parallel code programs

• submit a job to the cluster

• visualize the results returned by a job

Login nodes could run a Windows or Linux OS and they can be installed on dual-boot servers, virtual machines or independent servers. A login node is usually only connected to other nodes running the same OS as its own.

For the HPCS cluster, the use of a login node is not mandatory, as a job can be submitted from any Windows client with the Microsoft HPC Pack installed (with the scheduler graphical interface or command line) by using an account in the cluster domain. A login node can be used to provide a gateway into the cluster domain.

4 Lustre and GPFS™ clients for Windows are announced to be available soon.


3.4 Management services

From the infrastructure configuration point of view, we should study the potential interactions between services that can be delivered from each MN (e.g., DHCP, TFTP, NTP, etc.). The goal is to avoid any conflict between MN services while cluster operations or computations are done simultaneously on both OS's. This is especially complex during the compute node boot phase since the PXE procedure requires DHCP and TFTP access from its very early start time. A practical case with XBAS and HPCS is shown in Section 4.4.

At least the following services are required:

• a unique DHCP server (for PXE boot)

• a TFTP server (for PXE boot)

• an NFS server (for Linux compute node deployment)

• a CIFS server (for HPCS compute node deployment)

• a WDS server (for HPCS deployment)

• an NTP server (for the virtualization software and for MPI application synchronization)

3.5 Performance impact of virtualization

Many scientific articles deal with the performance impact of virtualization on servers in general. Some recent articles are more focused on HPC requirements.

One of these articles compares virtualization technologies for HPC (see [25]). It systematically evaluates VMs for computationally intensive HPC applications, running standard scientific benchmarks on VMware Server, Xen, and OpenVZ. It examines the suitability of full virtualization, para-virtualization, and operating system-level virtualization in terms of network utilization, SMP performance, file system performance, and MPI scalability. The analysis shows that none of them perfectly matches the performance of the base system: OpenVZ demonstrates low overhead and high performance; Xen demonstrates excellent network bandwidth but its exceptionally high latency hinders its scalability; and VMware Server, while demonstrating reasonable CPU-bound performance, is similarly unable to cope with the NPB MPI-based benchmark.

Another article evaluates the performance impact of Xen on MPI and process execution for HPC Systems (see [26]). It investigates subsystem and overall performance using a wide range of benchmarks and applications. It compares the performance of a para-virtualized kernel against three Linux operating systems and concludes that in general, the Xen para-virtualizing system poses no statistically significant overhead over other OS configurations.


3.6 Meta-scheduler for HOSC

3.6.1 Goals

The goal of a meta-scheduler used for an HOSC can be:

• Purely performance oriented: the most efficient OS is automatically chosen for a given run (based on backlog, statistics, knowledge database, input data size, application binary, etc.)

• OS compatibility driven: if an application is only available for a given OS then this OS must be used!

• High availability oriented: a few nodes with each OS are kept available all the time in case of requests that must be treated extremely quickly or in case of failure of running nodes.

• Energy saving driven: the optimal number of nodes with each OS is booted while the others are shut down (depending on the number of jobs in the queue, the profile of active users, job history, backlog, timetable, external temperature, etc.)

3.6.2 OS switch techniques

The OS switch techniques that a meta-scheduler can use are those already discussed at the beginning of Chapter 3 (see Table 1). The meta-scheduler must be able to handle all the processes related to these techniques:

• Reboot a dual-boot compute node (or power it on and off on demand)

• Activate/deactivate virtual machines that work as compute nodes

• Re-deploy the right OS and boot compute nodes (on diskless servers for example)

3.6.3 Provisioning and distribution policies

The OS type distribution among the nodes can be:

• Unplanned (dynamic): the meta-scheduler estimates dynamically the optimal size of node partitions with each OS type (depending on job priority, queue, backlog, etc.), then it grows and shrinks these partitions accordingly by switching OS type on compute nodes. This is usually called “just in time provisioning”.

• Planned (dynamic): the administrators plan the OS distribution based on time, dates, team budget, project schedules, people vacations, etc. The size of the node partitions with each OS type are fixed for given periods of time. This is usually called “calendar provisioning”.

• Static: the size of the node partitions with each OS type is fixed once and for all and the meta-scheduler cannot switch OS types. This is the simplest and least efficient case.


4 Technical choices for designing an HOSC prototype

We want to build a flexible medium-sized HOSC with XBAS and HPCS. We only have a small 5-server cluster to achieve this goal, but it is sufficient to simulate the usage of a medium-sized cluster. We start from this cluster, which has InfiniBand and Gigabit networks. The complete description of the hardware is given in Appendix E. We discussed the possible approaches in the previous chapter. Let us now see which choices should be made in the particular case of this 5-server cluster. In the remainder of the document, this cluster is named the HOSC prototype.

4.1 Cluster approach

According to the recommendations given in the previous chapter, the most appropriate approach for medium-sized clusters is the one with 2 virtual management nodes on one server and dual-boot compute nodes. This is the approach noted 10 in Table 1 of Chapter 3.

4.2 Management node

For the virtualization software choice, we cannot choose Hyper-V since the XBAS VM must be able to use more than a single CPU to serve the cluster management requests in the best conditions. We cannot choose virtualization software that does not support HPCS, for obvious reasons. Finally, we have to choose between VMware and Xen, which both fulfill the requirements for our prototype. RHEL5.1 is delivered with the XBAS 5v1.1 software stack and is thus the most consistent Linux choice for the host OS. So in the end, we chose Xen as our virtualization software since it is included in the RHEL5.1 distribution. Figure 7 shows the MN architecture for our HOSC prototype.

4.3 Compute nodes

We have chosen approach 10, so CNs are dual-boot servers with XBAS and HPCS installed on local disks. We chose the Windows MBR for dual-booting CNs because it is easier to change the active partition of a node than to edit its grub.conf configuration file at each OS switch request. This is especially true when the node is running Windows, since the grub.conf file is stored on the Linux file system: a common file system (on a FAT32 partition for example) would then be needed to share the grub.conf file.

When the OS type of CNs is switched manually, we decided to allow the OS type switch commands to be sent only from the MN that runs the same OS as the CN currently does. In other words, the HPCS MN can "give up" one of its CNs to the XBAS cluster and the XBAS MN can "give up" one of its CNs to the HPCS cluster, but no MN can "take" a CN from the cluster with a different OS. This rule was chosen to minimize the risk of switching the OS of a CN while it is used for computation with its current OS configuration. When the OS type of CNs is switched automatically by a meta-scheduler, OS type switch commands are sent from the meta-scheduler server. Simple scripts were written to help switching the OS on CNs from the MNs. They are listed in Appendices D.1.3 and D.2.3, and an example of their use is shown in Sections 6.3 and 6.4. Depending on the OS type that is booted, the node has a different hostname and IP address. This information is sent by a DHCP server whose configuration is updated at each OS switch request, as explained in the next section.
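As an illustration of the dual-boot operations involved, switching a CN from XBAS to HPCS essentially means marking the Windows partition active and rebooting (in addition to updating the DHCP configuration on the MN, see Section 5.2.2). The following is a minimal sketch only: the fdisk_commands.txt content shown here is hypothetical and assumes the partition layout of Section 5.2 (sda1 active for XBAS, sda4 for HPCS); the authoritative scripts are listed in Appendix D.2.3.

[xbas1:root] cat /opt/hosc/fdisk_commands.txt    # hypothetical content: toggle the active flag on sda1 and sda4
a
1
a
4
w
[xbas1:root] fdisk /dev/sda < /opt/hosc/fdisk_commands.txt
[xbas1:root] reboot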


Figure 7 - Management node virtualization architecture

4.4 Management services

We have to make choices to create a global infrastructure architecture for deploying, managing and using the two OS's on our HOSC prototype:

• The DHCP service is the critical part, as it is the single point of entry when a compute node boots. In our prototype it runs on the XBAS management node. The DHCP configuration file contains a section for each node with its characteristics (hostname, MAC and IP addresses) and the PXE information. Depending on the administrator's needs, this section can be changed for deploying and booting XBAS or HPCS on a compute node (see an example of dhcpd.conf file changes in Appendix D.2.2).

• WDS and/or TFTP server: each management node has its own server because the installation procedures are different. A booting compute node is directed to the correct server by the DHCP server.

• Directory Service is provided by Active Directory (AD) for HPCS and by LDAP for XBAS. Our prototype will not offer a unified solution, but since synchronization mechanisms between AD and LDAP exist, a unified solution could be investigated.

28

Page 29: A Hybrid Operating System Cluster Solution (PDF)

A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon

• DNS: this service can be provided by the XBAS management node or the HPCS head node. The DNS should be dynamic in order to simplify integration with AD. In our prototype, we set up a DNS server on the HPCS head node for the Windows nodes, and we use /etc/hosts files for name resolution on XBAS nodes.

Recommendations given in Section 3.4 can be applied to our prototype by configuring the services as shown in Figure 8 and in Table 2.

Figure 8 - Architecture of management services

Service                       Machine       OS        IP             Mask          Accessed by
NTP                           Xen domain0   RHEL5.1   192.168.10.1   255.255.0.0   all nodes
DHCP                          Linux VM      XBAS      192.168.0.1    255.255.0.0   all nodes
TFTP, NFS                     Linux VM      XBAS      192.168.0.1    255.255.0.0   XBAS nodes
WDS (TFTP, PXE), CIFS, DNS    Windows VM    HPCS      192.168.1.1    255.255.0.0   HPCS nodes

Table 2 – Network settings for the services of the HOSC prototype

The netmask is set to 255.255.0.0 because it must provide connectivity between Xen domain 0 and each DomU virtual machine.

Figures 9 and 10 describe respectively XBAS and HPCS compute node deployment steps, while Figures 11 and 12 describe respectively XBAS and HPCS compute node normal boot steps on our HOSC prototype. They show how the PXE operations detailed in Figures 3, 4, 5 and 6 of Chapter 2 are consistently adapted in our heterogeneous OS environment with a unique DHCP server on the XBAS MN and a Windows MBR on the CNs.


Figure 9 - Deployment of a XBAS compute node on our HOSC prototype

Figure 10 - Deployment of a HPCS compute node on our HOSC prototype


Figure 11 - Boot of a XBAS compute node on our HOSC prototype

Figure 12 - Boot of a HPCS compute node on our HOSC prototype


4.5 HOSC prototype architecture

The application network (i.e., InfiniBand network) should not be on the same subnet as the private network (i.e., gigabit network): we chose 172.16.0.[1-5] and 172.16.1.[1-5] IP address ranges for the application network address assignment.

The complete cluster architecture that results from the decisions taken in the previous sections is shown in Figure 13 below:

Figure 13 - HOSC prototype architecture

If for some reason the IB interface cannot be configured on the HN, you should set up a loopback network interface instead and configure it with the IPoIB IP address (e.g., 172.16.1.1 in Figure 13). If for some reason the IB interface cannot be configured on the MN, its setup can be skipped since connecting the IB interface of the MN is not mandatory.

In the next chapter we will show how to install and configure the HOSC prototype with this architecture.

[Figure 13 depicts the prototype: the Xen domain 0 (RHEL5.1) hosting the XBAS0 and HPCS0 management VMs, and the four dual-boot compute servers (XBAS[1-4] or HPCS[1-4]), connected by the InfiniBand network (IPoIB addresses 172.16.0.[1-5] and 172.16.1.[1-5]), the Gigabit private network (192.168.0.[1-5], 192.168.1.[2-5] and 192.168.10.1), and the Gigabit intranet/internet and IB switch management links (public addresses in 129.183.251.x).]


4.6 Meta-scheduler architecture

Without a meta-scheduler, users need to connect to the required cluster management node in order to submit their jobs. In this case, each cluster has its own management node with its own scheduler (as shown on the left side of Figure 14). By using a meta-scheduler, we offer a single point of entry to use the power of the HOSC, whatever the OS type required by the job (as shown on the right side of Figure 14).

Figure 14 - HOSC meta-scheduler architecture (in order to have a simpler scheme, the HOSC is represented as two independent clusters: one with each OS type)

On the meta-scheduler, we create two job queues, one for the XBAS cluster and another one for the HPCS cluster. According to the user request, the job is then automatically redirected to the correct cluster. The meta-scheduler also manages the switch from one OS type to the other according to the cluster workloads.

We chose PBS Professional as the meta-scheduler for our prototype because of the experience we already have with it on Linux and Windows platforms. The PBS server should be installed on a node that is accessible from every other node of the HOSC. We chose to install it on the XBAS management node. PBS MOM (Machine Oriented Mini-server) is installed on all compute nodes (HPCS and XBAS) so they can be controlled by the PBS server.

In the next chapter we will show how to install and configure this meta-scheduler on our HOSC prototype.


5 Setup of the HOSC prototype

This chapter describes the general setup of the Hybrid Operating System Cluster (HOSC) defined in the previous chapter. The initial idea was to install Windows HPC Server 2008 on an existing Linux cluster without affecting the existing Linux cluster installation. However, in our case it appeared that the installation procedure requires reinstalling the management node with the 2 virtual machines. So finally the installation procedure is given for an HOSC installation done from scratch.

5.1 Installation of the management nodes

5.1.1 Installation of the RHEL5.1 host OS with Xen

If you have an already configured XBAS cluster, do not forget to save the clusterdb (XBAS cluster data base) and all your data stored on the current management node before reinstalling it with RHEL5.1 and its virtualization software Xen.

Check that Intel Virtualization Technology (VT) is enabled in the BIOS settings of the server.

Install Linux RHEL5.1 from the DVD on the management server and select "virtualization" when optional packages are proposed. SELinux must be disabled. Erase all existing partitions and design your partition table so that enough free space is available in a volume group for creating logical volumes (LV). LVs are virtual partitions used for the installation of virtual machines (VM), each VM being installed on one LV. Volume groups and logical volumes are managed by the Logical Volume Manager (LVM). The advised size of an LV is 30-50 GB: leave at least 100 GB of free space on the management server for the creation of the 2 LVs.

It is advisable to install an up-to-date gigabit driver. One is included on the XBAS 5v1.1 XHPC DVD.

[xbas0:root] rpm -i XHPC/RPMS/e1000-7.6.15.4-2.x86_64.rpm

5.1.2 Creation of 2 virtual machines

A good candidate for the easy management of Xen virtual machines is the Bull Hypernova tool. Hypernova is an internal-to-Bull software environment based on RHEL5.1/Xen3 for the management of virtual machines (VM) on Xeon® and Itanium2® systems. HN-Master is the web graphical interface (see Figure 15) that manages VMs in the Hypernova environment. It can be used to create, delete, install and clone VMs; modify VM properties (network interfaces, number of virtual CPUs, memory, etc.); and start, pause, stop and monitor VM status.

2 virtual machines are needed to install the XBAS management node and the HPCS head node. Create these 2 Xen virtual machines on the management server. The use of HN-Master is optional: all operations done in the Hypernova environment could also be done with Xen commands in a basic Xen environment. To use HN-Master, the httpd service must be running (type "chkconfig --level 35 httpd on" to enable it automatically at boot time).


Figure 15 - HN-Master user interface

The following values are used to create each VM:

• Virtualization mode: Full

• Startup allocated memory: 2048

• Virtual CPUs number: 4 (see footnote 5)

• Virtual CPUs affinity type: mapping

• Logical Volume size: 50GB

Create 2 network interface bridges, xenbr0 and xenbr1, so that each VM can have 2 virtual network interfaces (one on the private network and one on the public network). Detailed instructions for configuring 2 network interface bridges are shown in Appendix D.2.4.
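As an alternative to HN-Master, a VM with the values above could also be created directly with the virt-install tool shipped with the RHEL5.1 virtualization packages. The following is only a hedged sketch: the volume group and logical volume names (vg00, lv_xbas0) and the installation media path are hypothetical.

[xbas0:root] lvcreate -L 50G -n lv_xbas0 vg00
[xbas0:root] virt-install --name=xbas0 --ram=2048 --vcpus=4 --hvm \
                          --file=/dev/vg00/lv_xbas0 \
                          --network bridge:xenbr0 --network bridge:xenbr1 \
                          --vnc --cdrom=/dev/cdrom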

5 In case of problems when installing the OS’s (e.g., #IRQ disabled while files are copied), select only 1 virtual CPU for the VM during the OS installation step.


5.1.3 Installation of XBAS management node on a VM

Install XBAS on the first virtual machine. If applicable, use the clusterdb and the network configuration of the initial management node. Update the clusterdb with the new management node MAC addresses: the xenbr0 and xenbr1 MAC addresses of the VM. Follow the instructions given in the BAS for Xeon installation and configuration guide [22], and choose the following options for the MN setup:

[xbas0:root] cd /release/XBAS5V1.1

[xbas0:root] ./install -func MNGT IO LOGIN -prod RHEL XHPC XIB

Update the clusterdb with the new management node MAC-address (see [22] and [23] for details).

5.1.4 Installation of InfiniBand driver on domain 0

The InfiniBand (IB) drivers and libraries should be installed on domain 0. They are available on the XIB DVD included in the XBAS distribution. Assuming that the content of the XIB DVD is copied on the XBAS management node (with the IP address 192.168.0.1) in directory /release as it is requested in the installation guide [22], the following commands should be executed:

[xbas0:root] mkdir /release

[xbas0:root] mount 192.168.0.1:/release /release

[xbas0:root] scp root@192.168.0.1:/etc/yum.repos.d/*.repo /etc/yum.repos.d

[xbas0:root] yum install perl-DBI perl-XML-Parser perl-XML-Simple

[xbas0:root] yum install dapl infiniband-diags libibcm libibcommon libibmad libibumad libibverbs libibverbs-utils libmlx4 libmthca librdmacm librdmacm-utils mstflint mthca_fw_update ofed-docs ofed-scripts opensm-libs perftest qperf --disablerepo=local-rhel

5.1.5 Installation of HPCS head node on a VM

Install Windows Server 2008 on the second virtual machine as you would do on any physical server. Then the following instructions should be executed in this order:

1. set the Head Node (HN) hostname

2. configure the HN network interfaces

3. enable remote desktop (this is recommended for a remote administration of the cluster)

4. set “Internet Time Synchronization” so that the time is the same on the HN and the MN

5. install the Active Directory (AD) Domain Services and create a new domain for your cluster with the wizard (dcpromo.exe), or configure the access to your existing AD on your local network

6. install the Microsoft HPC Pack


5.1.6 Preparation for XBAS deployment on compute nodes

Check that there is enough space on the first device of the compute nodes for creating an additional primary partition (e.g., on /dev/sda). If not, make some space by reducing the existing partitions or by redeploying XBAS compute nodes with the right partitioning (using the preparenfs command and a dedicated kickstart file). Edit the kickstart file according to an HOSC-compatible disk partitioning: for example, /boot on /dev/sda1, / on /dev/sda2 and SWAP on /dev/sda3. An example is given in Appendix D.2.1.
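For reference, the partitioning section of such a kickstart file might look like the following sketch (sizes follow the layout chosen in Section 5.2, and the fourth primary partition is left free for the later HPCS deployment; the authoritative example is given in Appendix D.2.1):

# hypothetical kickstart excerpt for an HOSC-compatible XBAS compute node
part /boot --fstype ext3 --size=100   --ondisk=sda --asprimary
part /     --fstype ext3 --size=51200 --ondisk=sda --asprimary
part swap                --size=16384 --ondisk=sda --asprimary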

Create a /opt/hosc directory and export it with NFS. Then mount it on every node of the cluster and install the HOSC files listed in Appendix D.2.3 in it:

• switch_dhcp_host

• activate_partition_HPCS.sh

• fdisk_commands.txt

• from_XBAS_to_HPCS.sh and from_HPCS_to_XBAS.sh

5.1.7 Preparation for HPCS deployment on compute nodes

First configure the cluster by following the instructions given in the HPC Cluster Manager MMC to-do list:

1. Configure your network:

a. Topology: compute nodes isolated on private and application networks (topology 3)

b. DHCP server and NAT: not activated on the private interface

c. Firewall is “off” on private network (this is for the compute nodes only because the firewall needs to be “on” for the head node)

2. Provide installation credentials

3. Configure the naming of the nodes (this step is mandatory even if it is not useful in our case: the new node names will be imported from an XML file that we will create later). You can specify: HPCS%1%

4. Create a deployment template with operating system and “Windows Server 2008” image

Bring the HN online in the management console: click on “Bring Online” in the “Node Management” window of the “HPC Cluster Manager” MMC.

Add a recent network adapter Gigabit driver to the OS image that will be deployed: click on “Manage drivers” and add the drivers for Intel PRO/1000 version 13.1.2 or higher (PROVISTAX64_v13_1_2.exe can be downloaded from Intel web site).


Add a recent IB driver (see [27]) that supports Network Direct (ND). Then edit the compute node template and add a "post install command" task that configures the IPoIB IP address and registers ND on the compute nodes. The IPoIB configuration can be done by the script setIPoIB.vbs provided in Appendix D.1.2. The ND registration is done by the command:

C:\> ndinstall -i

Two files used by the installation template must be edited in order to keep existing XBAS partitions untouched on compute nodes while deploying HPCS. For example, choose the fourth partition (/dev/sda4) for the HPCS deployment (see Appendix D.1.1 for more details):

• unattend.xml

• diskpart.txt

Create a shared C:\hosc directory and install the HOSC files listed in Appendix D.1.3 in it:

• activate_partition_XBAS.bat

• diskpart_commands.txt

• from_HPCS_to_XBAS.bat

5.1.8 Configuration of services on HPCS head node

The DHCP service is disabled on the HPCS head node (it was not activated during the installation step). The firewall must be enabled on the head node for the private network. It must be configured to drop all incoming network packets on local ports 67/UDP and 68/UDP in order to block any DHCP traffic that might be produced by the Windows Deployment Service (WDS). This is done by creating 2 inbound rules from the Server Manager MMC. Click on:

Server Manager → Configuration → Windows Firewall with Advanced Security → Inbound Rules → New Rule

Then select the following options:

1. Rule type: Port

2. Protocol and ports: UDP, local port 67 (or 68 for the second rule)

3. Action: Block the connection

4. Name: UDP/67 blocked (or UDP/68 blocked for the second rule)

Instead of blocking these ports, it is also possible to disable all inbound rules that are enabled by default on UDP ports 67 and 68.
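The same two rules can alternatively be created from an elevated command prompt on the head node; a hedged equivalent of the GUI steps above:

C:\> netsh advfirewall firewall add rule name="UDP/67 blocked" dir=in action=block protocol=UDP localport=67
C:\> netsh advfirewall firewall add rule name="UDP/68 blocked" dir=in action=block protocol=UDP localport=68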

6 Thanks a lot to Christian Terboven (research associate in the HPC group of the Center for Computing and Communication at RWTH Aachen University) for his helpful contribution to this configuration phase.


5.2 Deployment of the operating systems on the compute nodes

The order in which the OS's are deployed is not important but must be the same on every compute node. The order should thus be decided before starting any node installation or deployment. The installation scripts (such as diskpart.txt for HPCS or kickstart.<identifier> for XBAS) must be edited accordingly in the desired order. In this example, we chose to deploy XBAS first. The partition table we plan to create is:

/dev/sda1   /boot   100MB   ext3 (Linux)
/dev/sda2   /       50GB    ext3 (Linux)
/dev/sda3   SWAP    16GB    (Linux)
/dev/sda4   C:\     50GB    ntfs (Windows)

First, check that the BIOS settings of all CNs are configured for PXE boot (and not local hard disk boot). They should boot on the eth0 Gigabit Ethernet (GE) card. For example, the following settings are correct:

Boot order:
1 - USB key
2 - USB disk
3 - GE card
4 - SATA disk

5.2.1 Deployment of XBAS on compute nodes

Follow the instructions given in the BAS5 for Xeon installation & configuration guide [22]. Here is the information that must be entered into the preparenfs tool in order to generate a kickstart file (the kickstart file could also be written manually with this information on other Linux distributions):

1. RHEL DVD is copied in: /release/RHEL5.1

2. partitioning method is: automatic (i.e., a predefined partitioning is used)

3. interactive mode is: not used (the installation is unattended)

4. VNC is: not enabled

5. BULL HPC installer is in: /release/XBAS5V1.1

6. node function is: COMPUTEX

7. optional BULL HPC software is: XIB

8. IP of NFS server is the default: 192.168.0.99

9. the nodes to be installed are: xbas[1-4]

10. hard reboot done by preparenfs: No


Once generated, the kickstart file needs a few modifications in order to fulfill the HOSC disk partition requirements: see an example of these modifications in Appendix D.2.1.

When the modifications are done, boot the compute nodes and the PXE mechanisms will start to install XBAS on the compute nodes with the information stored in the kickstart file. Figure 16 shows the console of a CN while it is PXE booting for its XBAS deployment.

Intel(R) Boot Agent GE v1.2.36
Copyright (C) 1997-2005, Intel Corporation
CLIENT MAC ADDR: 00 30 48 33 4C F6  GUID: 53D19F64 D663 A017 8922 003048334CF6
CLIENT IP: 192.168.0.2  MASK: 255.255.0.0  DHCP IP: 192.168.0.1
PXELINUX 2.11 2004-08-16  Copyright (C) 1994-2004 H. Peter Anvin
UNDI data segment at:   000921C0
UNDI data segment size: 62C0
UNDI code segment at:   00098480
UNDI code segment size: 3930
PXE entry point found (we hope) at 9848:0106
My IP address seems to be COA80002 192.168.0.2
ip=192.168.0.2:192.168.0.1:0.0.0.0:255.255.0.0
TFTP prefix:
Trying to load: pxelinux.cfg/00-30-48-33-4C-F6
Trying to load: pxelinux.cfg/COA80002
boot:
Booting...

Figure 16 - XBAS compute node console while the node starts to PXE boot

It is possible to install every CN with the preparenfs tool, or to install a single CN with the preparenfs tool and then duplicate it on every other CN server with the help of the ksis deployment tool. However, the use of ksis is only possible if XBAS is the first installed OS, since ksis overwrites all existing partitions. So it is advisable to only use the preparenfs tool for CN installation on an HOSC.

Check that the /etc/hosts file is consistent on XBAS CNs (see Appendix D.2.5). Configure the IB interface on each node by editing file ifcfg-ib0 (see Appendix D.2.6) and enable the IB interface by starting the openibd service:

[xbas1:root] service openibd start
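A typical ifcfg-ib0 file for the first compute node could look like the following sketch (the IPoIB address follows the 172.16.0.[2-5] range shown in Figure 13; the authoritative example is given in Appendix D.2.6):

# /etc/sysconfig/network-scripts/ifcfg-ib0 (sketch for xbas1)
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
IPADDR=172.16.0.2
NETMASK=255.255.0.0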

In order to be able to boot Linux with the Windows MBR (after having installed HPCS on the CNs), install the GRUB boot loader on the first sector of the /boot partition by typing on each CN:

[xbas1:root] grub-install /dev/sda1

The last step is to edit all PXE files in the /tftpboot directory and set both the TIMEOUT and PROMPT variables to 0 in order to boot the compute nodes quicker.

5.2.2 Deployment of HPCS on compute nodes

On the XBAS management node, change the DHCP configuration file so the compute nodes point to the Windows WDS server when they PXE boot. Edit the DHCP configuration file /etc/dhcpd.conf for each CN host section and change the fields as shown in Appendix D.2.2 (filename, fixed-address, host-name, next-server and server-name). The DHCP configuration file changes can be done by using the switch_dhcp_host script (see Appendix D.2.3) for each compute node. Once the changes are done in the file, the dhcpd service must be restarted in order to take the changes into account. For example, type:

[xbas0:root] switch_dhcp_host xbas1

File /etc/dhcp.conf is updated with host hpcs1

[xbas0:root] switch_dhcp_host xbas2

File /etc/dhcp.conf is updated with host hpcs2

[...]

[xbas0:root] service dhcpd restart

Shutting down dhcpd: [ OK ]

Starting dhcpd: [ OK ]
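For reference, the kind of change that switch_dhcp_host applies to a host section is sketched below. The MAC and IP values are those of Figures 16 and 18; the server-name values and the WDS boot program path are assumptions, and the authoritative template is given in Appendix D.2.2.

# before the switch: the node PXE-boots as an XBAS node
host xbas1 {
  hardware ethernet 00:30:48:33:4C:F6;
  fixed-address 192.168.0.2;
  option host-name "xbas1";
  next-server 192.168.0.1;            # TFTP server on the XBAS MN
  server-name "xbas0";                # assumed
  filename "pxelinux.0";
}

# after the switch: the node PXE-boots as an HPCS node
host hpcs1 {
  hardware ethernet 00:30:48:33:4C:F6;
  fixed-address 192.168.1.2;
  option host-name "hpcs1";
  next-server 192.168.1.1;            # WDS server on the HPCS HN
  server-name "hpcs0";                # assumed
  filename "boot\\x64\\wdsnbp.com";   # assumed WDS boot program path
}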

Now prepare the deployment of the nodes for the HPCS management console: get the MAC addresses of all new compute nodes and create an XML file with the MAC address, compute node name and domain name of each node. An example of such an XML file (my_cluster_nodes.xml) is given in Appendix D.1.1. Import this XML file from the administrative console (see Figure 17) and assign a deployment "compute node template" to the nodes.

Figure 17 - Import node XML interface

Boot the compute nodes. Figure 18 shows the console of a CN while it is PXE booting for its HPCS deployment with a DHCP server on the XBAS management node (192.168.0.1) and a WDS server on the HPCS head node (192.168.1.1).


Intel(R) Boot Agent GE v1.2.36
Copyright (C) 1997-2005, Intel Corporation
CLIENT MAC ADDR: 00 30 48 33 4C F6  GUID: 53D19F64 D663 A017 8922 003048334CF6
CLIENT IP: 192.168.1.2  MASK: 255.255.0.0  DHCP IP: 192.168.0.1
Downloaded WDSNBP...
Architecture: x64
Contacting Server: 192.168.1.1 ............

Figure 18 - HPCS compute node console while the node starts to PXE boot

The nodes will appear with the “provisioning” state in the management console as shown in Figure 19.

Figure 19 - Management console showing the “provisioning” compute nodes

After a while the compute node console shows that the installation is complete as in Figure 20.

Figure 20 - Compute node console shows that its installation is complete

At the end of the deployment, the compute node state is “offline” in the management console. The last step is to click on “Bring online” in order to change the state to “online”. The HPCS compute nodes can now be used.


5.3 Linux-Windows interoperability environment

In order to enhance the interoperability between the two management nodes, we set up a Unix/Linux environment on the HPCS head node using the Subsystem for Unix-based Applications (SUA). We also install SUA supplementary tools such as openssh that can be useful for HOSC administration tasks (e.g., ssh can be used to execute commands from one management node to the other in a safe manner).

The installation of SUA is not mandatory for setting up an HOSC and many tools can also be found from other sources, but it is a rather easy and elegant way to have a homogeneous HOSC environment: firstly, it provides a lot of Unix tools on Windows systems, and secondly it provides a framework for porting and running Linux applications in a Windows environment.

The installation is done in 3 steps.

5.3.1 Installation of the Subsystem for Unix-based Applications (SUA)

The Subsystem for Unix-based Applications (SUA) is part of the Windows Server 2008 distribution. To turn the SUA features on, open the "Server Manager" MMC, select the "Features" section in the left frame of the MMC and click on "Add Features". Then check the box for "Subsystem for UNIX-based Applications" and click on "Next" and "Install".

5.3.2 Installation of the Utilities and SDK for Unix-based Applications

Download "Utilities and SDK for UNIX-based Applications_AMD64.exe" from the Microsoft web site [28]. Run the custom installation and select the following packages in addition to those included in the default installation: "GNU utilities" and "GNU SDK".

5.3.3 Installation of add-on tools

Download the "Power User" add-on bundle available from Interops Systems [29] on the SUA community web site [30]. The provided installer pkg-current-bundleuser60.exe handles all package integration, environment variables and dependencies. Install the bundle on the HPCS head node. This will install, configure and start an openssh server daemon (sshd) on the HPCS HN.

Other tools, such as proprietary compilers, can also be installed in the SUA environment.

5.4 User accounts

Users must have the same login name on all nodes (XBAS and HPCS). As mentioned in Section 4.4, we decided not to use LDAP on our prototype but it is advised to use it on larger clusters.

User home directories should at least be shared on all compute nodes running the same OS: for example, an NFS exported directory /home_nfs/test_user/ on XBAS CNs and a shared CIFS directory C:\Users\test_user\ on HPCS CNs for user test_user.

It is also possible (and even recommended) to have a unique home directory for both OS's by configuring Samba [36] on XBAS nodes.
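A minimal Samba share for such a unified home directory could look like the following sketch in smb.conf on the XBAS side (settings are illustrative, not the prototype's actual configuration):

[homes]
   comment = HOSC user home directories
   browseable = no
   writable = yes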


5.5 Configuration of ssh

5.5.1 RSA key generation

Generate your RSA keys (or DSA keys, depending on your security policy) on the XBAS MN (see [23]):

[xbas0:root] ssh-keygen -t rsa -N ''

This should also be done for each user account.

5.5.2 RSA key

Configure ssh so that it does not ask for a password when the root user connects from the XBAS MN to the other nodes. For the XBAS CNs (and the HPCS HN if openssh is installed with the SUA), copy the keys (private and public) generated on the XBAS MN.

For example, type:

[xbas0:root] cd /root/.ssh

[xbas0:root] cp id_rsa.pub authorized_keys

[xbas0:root] scp id_rsa id_rsa.pub authorized_keys root@hpcs0:.ssh/

[xbas0:root] scp id_rsa id_rsa.pub authorized_keys root@xbas1:.ssh/

Enter the root password when requested (it will not be requested again later).

This should also be done for each user account.

For copying the RSA key on the HPCS CNs see Section 5.5.4.

By default, the first time a server connects to a new host, it checks whether the host's "server" RSA public key (stored in /etc/ssh/) is already known and it asks the user to validate the authenticity of this new host. In order to avoid typing the "yes" answer for each node of the cluster, different ssh configurations are possible:

• The easiest, but less secure, solution is to disable the host key checking in file /etc/ssh/ssh_config by setting: StrictHostKeyChecking no

• Another way is to merge the RSA public key of all nodes in a file that is copied on each node: the /etc/ssh/ssh_known_hosts file. A trick is to duplicate the same server private key (stored in file /etc/ssh/ssh_host_rsa_key) and thus the same public key (stored in file /etc/ssh/ssh_host_rsa_key.pub) on every node. The generation of the ssh_known_hosts file is then easier since each node has the same public key. An example of such an ssh_known_hosts file is given in Appendix D.2.7.
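A sketch of this second option, assuming the duplicated host key trick and the /etc/hosts naming used on this prototype (the node list is illustrative and should be extended as needed):

# build /etc/ssh/ssh_known_hosts from the single shared host key, then copy it to the XBAS CNs
KEY=$(cut -d' ' -f1,2 /etc/ssh/ssh_host_rsa_key.pub)
for h in xbas0 xbas1 xbas2 xbas3 xbas4; do
    echo "$h,$(getent hosts $h | awk '{print $1}') $KEY"
done > /etc/ssh/ssh_known_hosts
for h in xbas1 xbas2 xbas3 xbas4; do
    scp /etc/ssh/ssh_known_hosts root@$h:/etc/ssh/
done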


5.5.3 Installation of freeSSHd on HPCS compute nodes

If you want to use PBS Professional and the OS balancing feature that was developed for our HOSC prototype, an ssh server daemon is required on each compute node. The sshd daemon is already installed by default on the XBAS CNs, and it should be installed on the HPCS CNs: we chose the freeSSHd [34] freeware. This software can be downloaded from [34] and its installation is straightforward: execute freeSSHd.exe, keep all default values proposed during the installation process and accept to "run FreeSSHd as a system service".

5.5.4 Configuration of freeSSHd on HPCS compute nodes

In the freeSSHd configuration window:

• add the user "root":

o select "Authorization: Public key (SSH only)"

o select "User can use: Shell"

• select "Password authentication: Disabled"

• select "Public key authentication: Required"

The configuration is stored in file C:\Program Files (x86)\freeSSHd\FreeSSHDService.ini so you can copy this file on each HPCS CN instead of configuring them one by one with the configuration window. You must modify the IP address field (SSHListenAddress=<CN_IP_address>) in the FreeSSHDService.ini file for each CN. The freeSSHd system service needs to be restarted to take the new configuration into account.

Then finish the setup by copying the RSA key file /root/.ssh/id_rsa.pub from the XBAS MN to file C:\Program Files (x86)\freeSSHd\ssh_authorized_keys\root on the HPCS CNs. Edit this file (C:\Program Files (x86)\freeSSHd\ssh_authorized_keys\root) and remove the @xbas0 string at the end of the file: it should end with the string root instead of root@xbas0.

5.6 Installation of PBS Professional

To install PBS Professional on the HOSC cluster, first install the PBS Professional server on a management node (or at least on a server that shares the same subnet as all the HOSC nodes), then install PBS MOM (Machine Oriented Mini-server) on each CN (HPCS and XBAS). The basic information is given in this section. For more detailed explanations follow the instructions of the PBS Professional Administrator's Guide [31].

7 Thanks a lot to Laurent Aumis (SEMEA GridWorks Technical Manager at ALTAIR France) for his valuable help and expertise in setting up this PBS Professional configuration.


5.6.1 PBS Professional Server setup

Install PBS server on the XBAS MN: during the installation process, select "PBS Installation: 1. Server, execution and commands" (see [31] for detailed instructions). By default, the MOM (Machine Oriented Mini-server) is installed with the server. Since the MN should not be used as a compute node, stop PBS with "/etc/init.d/pbs stop", disable the MOM by setting PBS_START_MOM=0 in file /etc/pbs.conf (see Appendix D.3.1) and restart PBS with "/etc/init.d/pbs start".

If UIDs/GIDs are not unified between the Windows and Linux nodes, you need to set the flag flatuid=true with the qmgr tool; the UID/GID known to the PBS server will then be used. Type:

[xbas0:root] qmgr

Qmgr: set server flatuid=True

Qmgr: exit

5.6.2 PBS Professional setup on XBAS compute nodes

Install PBS MOM on the XBAS CNs: during the installation process, select "PBS Installation: 2. Execution only" (see [31]). Add PBS_SCP=/usr/bin/scp in file /etc/pbs.conf (see Appendix D.3.1) and restart PBS MOM with "/etc/init.d/pbs restart".

It would also be possible to use $usecp in PBS to move files around instead of scp. Samba [36] could be configured on Linux systems to allow the HPCS compute nodes to drop files directly to Linux servers.

5.6.3 PBS Professional setup on HPCS nodes

First, log on to the HPCS HN and create a new user in the cluster domain for PBS administration: pbsadmin. Create an lmhosts file on each HPCS node with the PBS server hostname and IP address (as shown in Appendix D.3.2). Then install PBS Professional on each HPCS node:

1. select setup type “Execution” (only) on CNs and “Commands” (only) on the HN

2. enter pbsadmin user password (as defined on the PBS server: on XBAS MN in our case)

3. enter PBS server hostname (xbas0 in our case)

4. keep all other default values that are proposed by the PBS installer

5. reboot the node

5.7 Meta-Scheduler queues setup

Create queues for each OS type and set the default_chunk.arch accordingly (it must be consistent with the resources_available.arch field of the nodes).

Here is a summary of the PBS Professional configuration on our HOSC prototype. The following is a selection of the most representative information reported by the PBS queue manager (qmgr):


Qmgr: print server
# Create and define queue windowsq
create queue windowsq
set queue windowsq queue_type = Execution
set queue windowsq default_chunk.arch = windows
set queue windowsq enabled = True
set queue windowsq started = True
# Create and define queue linuxq
create queue linuxq
set queue linuxq queue_type = Execution
set queue linuxq default_chunk.arch = linux
set queue linuxq enabled = True
set queue linuxq started = True
# Set server attributes.
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 60
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 3600
set server license_count = "Avail_Global:0 Avail_Local:1024 Used:0 High_Use:8"
set server eligible_time_enable = False
Qmgr: print node xbas1
# Create and define node xbas1
create node xbas1
set node xbas1 state = free
set node xbas1 resources_available.arch = linux
set node xbas1 resources_available.host = xbas1
set node xbas1 resources_available.mem = 16440160kb
set node xbas1 resources_available.ncpus = 4
set node xbas1 resources_available.vnode = xbas1
set node xbas1 resv_enable = True
set node xbas1 sharing = default_shared
Qmgr: print node hpcs2
# Create and define node hpcs2
create node hpcs2
set node hpcs2 state = free
set node hpcs2 resources_available.arch = windows
set node hpcs2 resources_available.host = hpcs2
set node hpcs2 resources_available.mem = 16775252kb
set node hpcs2 resources_available.ncpus = 4
set node hpcs2 resources_available.vnode = hpcs2
set node hpcs2 resv_enable = True
set node hpcs2 sharing = default_shared
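With these two queues in place, users simply target an OS type by choosing the corresponding queue at submission time; for example (the job script names are hypothetical):

[xbas0:test_user] qsub -q linuxq my_linux_job.sh
[xbas0:test_user] qsub -q windowsq my_windows_job.bat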


5.7.1 Just in time provisioning setup

This paragraph describes the implementation of a simple example of “just in time” provisioning (see Section 3.6.3). We developed a Perl script (see pbs_hosc_os_balancing.pl in Appendix D.3.3) that gets PBS server information about queues, jobs and nodes for both OS’s (e.g., number of free nodes, number of nodes requested by jobs in queues, number of nodes requested by the smallest job). Based on this information, the script checks a simple rule that defines the cases when the OS type of CNs should be switched. If the rule is “true”, then the script selects free CNs and switches their OS type. In our example, we defined a conservative rule (i.e., the number of automatic OS switches is kept low):

“Let us define η as the smallest number of nodes requested by a queued job for a given OS type A. Let us define α (respectively β) as the number of free nodes with the OS type A (respectively B). If η>α (i.e., there are not enough free nodes to run the submitted job with OS type A) and if β≥η-α (at least η-α nodes are free with the OS type B) then the OS type of η-α nodes should be switched from B to A”.
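For example, if the smallest queued Linux job requests η=3 nodes while only α=1 XBAS node is free and β=4 HPCS nodes are free, then η>α and β≥η-α both hold, so the script switches η-α=2 free HPCS nodes to XBAS.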

The script is run periodically based on the schedule defined by the crontab of the PBS server host. The administrator can also switch more OS’s manually if necessary at any time (see Sections 6.3 and 6.4). The crontab setup can be done by editing the following lines with the crontab command8:

[xbas0:root] crontab -e

# run HOSC Operating System balancing script every 10 minutes (noted */10)
*/10 * * * * /opt/hosc/pbs_hosc_os_balancing.pl

The OS distribution balancing is then controlled by this cron job. Instead of running the pbs_hosc_os_balancing.pl script as a cron job, it would also be possible to call it as an external scheduling resource sensor (see [31] for information about PBS Professional scheduling resources), or to call it with PBS Professional hooks (see [31]). For developing complex OS balancing rules, the Perl script could be replaced by a C program (for details about PBS Professional API see [33]).

This simple script could be further developed in order to be more reliable. For example:

• check that the script is only run once at a time (by setting a lock file, for example; see the sketch after this list),

• allow switching the OS type of more than η-α nodes at once if the number of free nodes and the number of queued jobs are high (this can happen when many small jobs are submitted),

• impose a delay between two consecutive OS type switches on each compute node.
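For the first point, a minimal sketch (assuming the flock utility from util-linux is available on the PBS server host; the lock file name is arbitrary) is to wrap the cron entry with a lock so that overlapping runs are skipped:

*/10 * * * * /usr/bin/flock -n /tmp/pbs_hosc_os_balancing.lock /opt/hosc/pbs_hosc_os_balancing.pl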

5.7.2 Calendar provisioning setup

This paragraph just gives the main ideas for setting up calendar provisioning (see Section 3.6.3). As for the previous provisioning example, the setup should rely on the cron mechanism. A script that can switch the OS type of a given number of compute nodes could easily be written (by slightly modifying the scripts provided in the Appendix of this paper). This script could be run hourly as a cron job and could read the requested number of nodes with each OS type from a configuration file written by the administrator.
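As an even simpler sketch (assuming the node names of our prototype, an office-hours usage pattern, and sshd installed on the HPCS CNs for the switch back, see Section 6.4.2), the crontab of the PBS server host could schedule the switches directly; a production version should also take the nodes offline in PBS with pbsnodes -o before switching, as the balancing script does:

# switch 2 nodes to HPCS every weekday evening and back to XBAS every weekday morning
0 19 * * 1-5 /opt/hosc/from_XBAS_to_HPCS.sh xbas3; /opt/hosc/from_XBAS_to_HPCS.sh xbas4
0 7 * * 1-5 /opt/hosc/from_HPCS_to_XBAS.sh hpcs3; /opt/hosc/from_HPCS_to_XBAS.sh hpcs4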

8 “crontab -e” opens the /var/spool/cron/root file in a vi mode and restarts the cron service automatically.


6 Administration of the HOSC prototype

6.1 HOSC setup checking

Basically, the cluster checking is done as if there were 2 independent clusters. The fact that an HOSC is used does not change anything at this level. The usual cluster diagnosis tests should thus be used.

For HPCS, this means that the basic services and connectivity tests should be run first, followed by the automated diagnosis tests from the “cluster management” MMC.

For XBAS, the sanity checks can be done with basic Linux commands (ping, pdsh, etc.) and monitoring tools like Nagios (see [23] and [24] for details).
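For example, a quick connectivity and kernel-level sanity check of all XBAS CNs from the management node could be (assuming pdsh and its dshbak helper are installed):

[xbas0:root] pdsh -w xbas[1-4] uname -r | dshbak -c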

6.2 Remote reboot command

A reboot command can be sent remotely to compute nodes by the management nodes.

The HPCS head node can send a reboot command to its HPCS compute nodes only (soft reboot) with “clusrun”. For example:

C:\> clusrun /nodes:hpcs1,hpcs2 shutdown /r /f /t 5 /d p:2:4

Use “clusrun /all” for rebooting all HPCS compute nodes (the head node should not be declared as a compute node; otherwise this command would reboot it too).

The XBAS management node can send a reboot command to its XBAS compute nodes only (soft reboot) with pdsh. For example:

[xbas0:root] pdsh -w xbas[1-4] reboot

The XBAS management node can also reboot any compute node (HPCS or XBAS) with the NovaScale control “nsctrl” command (hard reboot). For example:

[xbas0:root] nsctrl reset xbas[1-4]

6.3 Switch a compute node OS type from XBAS to HPCS

To switch a compute node OS from XBAS to HPCS, type the from_XBAS_to_HPCS.sh command on the XBAS management node (you must be logged on as “root”). See Appendix D.2.3 for information on this command implementation. For example, if you want to switch the OS type of node xbas2, type:

[xbas0:root] from_XBAS_to_HPCS.sh xbas2

The compute node is then automatically rebooted with the HPCS OS type.
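Once the node has rebooted under HPCS, its state as seen by the meta-scheduler can be verified with, for example:

[xbas0:root] pbsnodes hpcs2

The hpcs2 vnode should be reported as free once its PBS MOM is up again (see also Section 6.7).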


6.4 Switch a compute node OS type from HPCS to XBAS

6.4.1 Without sshd on the HPCS compute nodes

To switch a compute node OS from HPCS to XBAS, first execute the switch_dhcp_host command on the XBAS management node and restart the dhcp service. This can be done locally on the XBAS MN console or remotely from the HPCS HN using a secure shell client (e.g., PuTTY or OpenSSH). Type:

[xbas0:root] switch_dhcp_host hpcs2

[xbas0:root] service dhcpd restart

Then take the node offline in the MMC and type the from_HPCS_to_XBAS.bat command in a “command prompt” window of the HPCS head node. See Appendix D.1.3 for information on this command implementation. For example, if you want to switch the OS of node hpcs2, type:

C:\> from_HPCS_to_XBAS.bat hpcs2

The compute node is then automatically rebooted with the XBAS OS type.

6.4.2 With sshd on the HPCS compute nodes

If you installed an SSH server daemon (e.g., FreeSSHd) on the HPCS CNs, then you can also type the following command from the XBAS management node. It executes all the commands listed in the previous section from the XBAS MN without having to log on to the HPCS HN. Type:

[xbas0:root] from_HPCS_to_XBAS.sh hpcs2

The compute node is then automatically rebooted with the XBAS OS type.

This script was mainly implemented to be used with a meta-scheduler, since it is not recommended to switch the OS type of an HPCS CN by sending a command from the XBAS MN (see Section 4.3).

6.5 Re-deploy an OS

The goal is to be able to re-deploy an OS on an HOSC without impacting the other OS that is already installed. Do not forget to save your MBR since it can be overwritten during the installation phase (see Appendix C.2).

For re-deploying XBAS compute nodes, the ksis tools cannot be used (they would erase the existing Windows partitions). The “preparenfs” command is the only tool that can be used. The partition declarations in the kickstart file should then be edited in order to reuse the existing partitions rather than removing them or creating new ones. The modifications are slightly different from those done for the first install. If the existing partitions are those created with the kickstart file shown as an example in Appendix D.2.1:

/dev/sda1   /boot   100MB   ext3   (Linux)
/dev/sda2   /       50GB    ext3   (Linux)
/dev/sda3   SWAP    16GB           (Linux)
/dev/sda4   C:\     50GB    ntfs   (Windows)


Then the new kickstart file used for re-deploying an XBAS compute node should include the lines below:

/release/ks/kickstart.<identifier>

…
part /boot --fstype="ext3" --onpart sda1
part / --fstype="ext3" --onpart sda2
part swap --noformat --onpart sda3
…

In the PXE file stored on the MN (e.g., /tftboot/C0A80002 for node xbas1), the DEFAULT label should be set back to ks instead of local_primary. The CN can then be rebooted to start the re-deployment process.

For re-deploying Windows HPC Server 2008 compute nodes, check that the partition number in the unattend.xml file is consistent with the existing partition table and, if necessary, edit it (in our example: <PartitionID>4</PartitionID>). Edit the diskpart.txt file so that it only re-formats the NTFS Windows partition without cleaning or removing the existing partitions (see Appendix D.1.1). Manually update/delete the previous computer and hostname declaration in the Active Directory before re-deploying the nodes, and then play the compute node deployment template as for the first install.

6.6 Submit a job with the meta-scheduler

For detailed explanations about using PBS Professional and submitting jobs, read the PBS Professional User’s Guide [32]. This paragraph just gives an example of the specificities of our meta-scheduler environment.

Let us suppose we have a user named test_user. This user has two applications to run: one for each OS type. He also has two job submission scripts: my_job_Win.sub for the Windows application and my_job_Lx.sub for the Linux application:

my_job_Win.sub

#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q windowsq
C:\Users\test_user\my_windows_application

my_job_Lx.sub

#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q linuxq
/home/test_user/my_linux_application

Whatever the OS type the application should run on, the scripts can be submitted from any Windows or Linux computer with the same qsub command. The only requirement is that the computer must have credentials to connect to the PBS Professional server.


The command lines can be typed from a Windows system:

C:\> qsub my_job_Win.sub

C:\> qsub my_job_Lx.sub

or the command lines can be typed from a Linux system:

[xbas0:test_user] qsub my_job_Win.sub

[xbas0:test_user] qsub my_job_Lx.sub

You can check the PBS queue status with the qstat command. Here is an example of its output:

[xbas0:root] qstat -n
xbas0:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
129.xbas0       thomas   windowsq my_job_Win   3316   2   8    --    --  R 03:26
   hpcs3/0*4+hpcs4/0*4
130.xbas0       laurent  linuxq   my_job_Lx.  21743   2   8    --    --  R 01:23
   xbas1/0*4+xbas2/0*4
131.xbas0       patrice  linuxq   my_job_Lx.     --   2   8    --    --  Q   --
   --
132.xbas0       patrice  linuxq   my_job_Lx.     --   1   4    --    --  Q   --
   --
133.xbas0       laurent  windowsq my_job_Win     --   2   8    --    --  Q   --
   --
134.xbas0       thomas   windowsq my_job_Win     --   1   4    --    --  Q   --
   --
135.xbas0       thomas   windowsq my_job_Win     --   1   1    --    --  Q   --
   --
136.xbas0       patrice  linuxq   my_job_Lx.     --   1   1    --    --  Q   --
   --

6.7 Check node status with the meta-scheduler

The status of the nodes can be checked with the PBS Professional monitoring tool. Each physical node appears twice in the PBS monitor window: once for each OS type. For example, the first node appears with two hostnames (xbas1 and hpcs1). The hostname associated with the running OS type is flagged as “free” or “busy” while the other hostname is flagged as “offline”. This gives a complete view of the OS type distribution on the HOSC.

Figure 21 shows that the first two CNs run XBAS while the other two CNs run HPCS. It also shows that all four CNs are busy. This corresponds to the qstat output shown as an example in the previous section. Figure 22 shows that there are three free CNs running XBAS and one busy CN running HPCS on our HOSC prototype. Since we do not want to run applications on the XBAS MN (xbas0), we disabled its MOM (see Section 5.6.1). That is why it is seen as “down” in both figures.


Figure 21 - PBS monitor with all 4 compute nodes busy (2 with XBAS and 2 with HPCS)

Figure 22 - PBS monitor with 1 busy HPCS compute node and 3 free XBAS compute nodes


7 Conclusion and perspectives

We studied 12 different approaches to HPC clusters that can run 2 OS’s. We particularly focused on those able to run the 2 OS’s simultaneously, which we named Hybrid Operating System Clusters (HOSC). The 12 approaches have dozens of possible implementations, among which the most common alternatives were discussed, resulting in technical recommendations for designing an HOSC.

This collaborative work between Microsoft and Bull gave the opportunity to build an HOSC prototype that provides computing power under Linux Bull Advanced Server for Xeon and Windows HPC Server 2008 simultaneously. The prototype has 2 virtual management nodes installed on 2 Xen virtual machines running on a single host server with RHEL5.1, and 4 dual-boot compute nodes that boot with the Windows master boot record. The methodology to dynamically and easily switch the OS type of some compute nodes without disturbing the other compute nodes was provided.

A meta-scheduler based on Altair PBS Professional was implemented. It provides a single submission point for both Linux and Windows, and it automatically adapts the distribution of OS types among the compute nodes to user needs (i.e., to the pool of submitted jobs), following simple rules given as examples.

This successful project could be continued with the aim of improving the current HOSC prototype. Possible improvements are to

• develop a unique monitoring tool for both OS compute nodes (e.g., based on Ganglia [35]);

• centralize user account management (e.g., with Samba [36]);

• work on interoperability between PBS and HPCS job scheduler (e.g., by using the tools of OGF, the Open Grid Forum [37]).

We could also work on security aspects that were intentionally overlooked during this first study. More intensive and exhaustive performance tests with virtual machines (e.g., the InfiniBand ConnectX virtualization feature, virtual processor binding, etc.) could also be done. Finally, a third OS could be installed on our HOSC prototype to validate the general nature of the method described.

More generally, the framework presented in this paper should be considered as a building block for more specific implementations. Various requirements of real applications, environments or loads could lead to significantly different or more sophisticated developments. We hope that this initial building block will help those who will add subsequent layers, and we are eager to hear about successful production environments designed from there9.

9 Do not hesitate to send your comments to the authors about this paper and your HOSC experiments: [email protected] and [email protected].


Appendix A: Acronyms

AD            Active Directory (Microsoft)
BAS           Bull Advanced Server
BIOS          Basic Input Output System
CIFS          Common Internet File System (Microsoft)
clusterdb     cluster management data base (Bull)
CN            Compute Node
CSF           Community Scheduler Framework
DHCP          Dynamic Host Configuration Protocol
Dom0          Domain 0 (Xen)
DomU          Unprivileged Domain (Xen)
DRM           Distributed Resource Manager
DRMS          Distributed Resource Management System
DSA           Digital Signature Algorithm
EHA           Ethernet Hardware Address (aka MAC address)
FAT32         File Allocation Table file system with 32-bit addresses
GE            Gigabit Ethernet
GID           Group IDentifier
GNU           GNU's Not Unix
GPFS          General Parallel File System™ (IBM)
GPL           GNU General Public License
GRUB          GRand Unified Bootloader
HN            Head Node (Windows)
HOSC          Hybrid Operating System Cluster
HPC           High Performance Computing
HPCS          Windows HPC Server® 2008 (Microsoft)
HVM           Hardware Virtual Machine
IB            InfiniBand
IP            Internet Protocol
IPoIB         Internet Protocol over InfiniBand protocol
IT            Information Technology
LDAP          Lightweight Directory Access Protocol
LILO          LInux LOader
LINUX         Linux Is Not UniX (Linus Torvalds' UNIX)
LSF           Load Sharing Facility
LVM           Logical Volume Manager
MAC address   Media Access Control address (aka EHA)


MBR           Master Boot Record
MMC           Microsoft Management Console
MN            Management Node (Bull)
MOM           Machine Oriented Mini-server (Altair)
MPI           Message Passing Interface
MULTICS       Multiplexed Information and Computing Service
NBP           Network Boot Program
ND            Network Direct (Microsoft)
NFS           Network File System
NPB           NASA Advanced Supercomputing (NAS) Parallel Benchmarks
NTFS          New Technology File System (Windows)
OGF           Open Grid Forum
OS            Operating System
PBS           Portable Batch System
PXE           Pre-boot eXecution Environment
RHEL          Red Hat Enterprise Linux
ROI           Return On Investment
RSA           Rivest, Shamir, and Adleman
SDK           Software Development Kit
SGE           Sun Grid Engine
SLURM         Simple Linux Utility for Resource Management
SSH           Secure SHell
SUA           Subsystem for Unix-based Applications
TCO           Total Cost of Ownership
TCP           Transmission Control Protocol
TFTP          Trivial File Transfer Protocol
UDP           User Datagram Protocol
UID           User IDentifier
UNIX          This is a pun on MULTICS (not an acronym!)
VM            Virtual Machine
VNC           Virtual Network Computing
VT            Virtualization Technology (Intel®)
WCCS          Windows Compute Cluster Server
WDS           Windows Deployment Services
WIM           Windows IMage (Microsoft)
WinPE         Windows Preinstallation Environment (Microsoft)
XBAS          Bull Advanced Server for Xeon
XML           eXtensible Markup Language


Appendix B: Bibliography and related links

[1] “Dual Boot: Windows Compute Cluster Server 2003 and Linux - Setup and Configuration Guide”, July 2007. This white paper describes the installation and configuration of an HPC cluster for a dual-boot of Windows Compute Cluster Server 2003 (WCCS) and Linux OpenSuSE. http://www.microsoft.com/downloads/details.aspx?FamilyID=1457BC0A-EAFF-4303-99ED-B199AB1C0857&displaylang=en

[2] “Dual Boot: Windows Compute Cluster Server and Rocks Cluster Distribution - Setup and Configuration Guide”, Jason Bucholtz, HPC Practice Lead, X-ISS, Michael Zebrowski, HPC Analyst, X-ISS, 2007. This white paper describes the installation and configuration of an HPC cluster for a dual-boot of WCCS 2003 and Rocks Cluster Distribution (formerly called NPACI Rocks). http://www.microsoft.com/downloads/details.aspx?FamilyID=e73a468e-2dbf-4782-8faa-aaa20acb63f8&DisplayLang=en

[3] “Dual-boot Linux and HPC Server 2008” on G. Marchetti blog: http://blogs.technet.com/gmarchetti/archive/2007/12/11/dual-boot-linux-and-hpc-server-2008.aspx

[4] BULL S.A.S. HPC solutions: http://www.bull.com/hpc

[5] Windows HPC Server: http://www.microsoft.com/hpc and http://www.windowshpc.net

[6] Xen: http://xen.xensource.com

[7] VMware: http://www.vmware.com

[8] Hyper-V: http://www.microsoft.com/windowsserver2008/en/us/virtualization-consolidation.aspx

[9] PowerVM: http://www-03.ibm.com/systems/power/software/virtualization/index.html

[10] Virtuozzo: http://www.parallels.com/en/products/virtuozzo/

[11] OpenVZ: http://openvz.org

[12] PBS Professional: http://www.pbsgridworks.com/ and http://www.altair.com/

[13] Torque: http://www.clusterresources.com/pages/products/torque-resource-manager.php

[14] SLURM: https://computing.llnl.gov/linux/slurm/

[15] LSF: http://www.platform.com/Products/platform-lsf

[16] SGE: http://gridengine.sunsource.net

[17] OAR: http://oar.imag.fr/index.html

[18] Wikipedia: http://www.wikipedia.org

[19] Moab and Maui: http://www.clusterresources.com

[20] GridWay: http://www.gridway.org

[21] Community Scheduler Framework: http://sourceforge.net/projects/gcsf


[22] “BAS5 for Xeon - Installation & Configuration Guide”, Ref: 86 A2 87EW00, April 2008

[23] “BAS5 for Xeon - Administrator’s guide”, Ref: 86 A2 88EW, April 2008

[24] “BAS5 for Xeon - User’s guide”, Ref: 86 A2 89EW, April 2008

[25] "A Comparison of Virtualization Technologies for HPC", John Paul Walters, Vipin Chaudhary, Minsuk Cha, Salvatore Guercio Jr., Steve Gallo, In Proceedings of the 22nd International Conference on Advanced Information Networking and Applications (AINA 2008), pp. 861-868, 2008 DOI= http://doi.ieeecomputersociety.org/10.1109/AINA.2008.45

[26] “Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC Systems”, Youseff, L., Wolski, R., Gorda, B., and Krintz, C. In Proceedings of the 2nd international Workshop on Virtualization Technology in Distributed Computing, Virtualization Technology in Distributed Computing, IEEE Computer Society, 2006, DOI= http://dx.doi.org/10.1109/VTDC.2006.4

[27] Mellanox Basic InfiniBand Software Stack for Windows HPC Server 2008 including NetworkDirect support http://www.mellanox.com/products/MLNX_WinOF.php

[28] Utilities and SDK for Subsystem for UNIX-based Applications (SUA) in Microsoft Windows Vista RTM/Windows Vista SP1 and Windows Server 2008 RTM: http://www.microsoft.com/downloads/details.aspx?familyid=93ff2201-325e-487f-a398-efde5758c47f&displaylang=en

[29] Interops Systems: http://www.interopsystems.com

[30] SUA Community: http://www.suacommunity.com

[31] PBS Professional 10.0 Administrator’s Guide, 610 pages, GridWorks, Altair, 2009

[32] PBS Professional 10.0 User’s Guide, 304 pages, GridWorks, Altair, 2009

[33] PBS Professional 10.0 external reference specification, GridWorks, Altair, 2009

[34] freeSSHd and freeFTPd: http://www.freesshd.com

[35] Ganglia: http://ganglia.info

[36] Samba: http://www.samba.org

[37] Open Grid Forum: http://www.ogf.org

[38] Top500 supercomputing site: http://www.top500.org

This paper can be downloaded from the following web sites:

http://www.bull.com/techtrends

http://www.microsoft.com/downloads

http://technet.microsoft.com/en-us/library/cc700329(WS.10).aspx


Appendix C: Master boot record details

C.1 MBR Structure

The Master Boot Record (MBR) defined in Section 2.1 occupies the first sector of a device (we assume that the size of a sector is always 512 bytes). Its structure is shown in Table 3 below.

Address (Hex)   Address (Dec)   Description                                                        Size in bytes
0000            0               Code Area                                                          ≤ 446
01B8            440             Optional disk signature                                            4
01BC            444             Usually null: 0x0000                                               2
01BE            446             Table of primary partitions (four 16-byte partition structures)   64
01FE            510             55h  (MBR signature: 0xAA55)                                       2
01FF            511             AAh
                                MBR total size: 446 + 64 + 2 = 512

Table 3 - Structure of a Master Boot Record

C.2 Save and restore MBR

If you want to save an MBR, just copy the first sector of the first device to a file (keep a copy of that file on another medium if you want to protect it from a device failure).

On Linux, type:

[xbas0:root] dd if=/dev/sda of=mbr.bin bs=512 count=1

If you want to restore the MBR replace the first sector with the saved file.

On Linux, type:

[xbas0:root] dd if=mbr.bin of=/dev/sda bs=512 count=1

On Windows Server 2008, MBR can be restored even if not previously saved. Type:

C:\> bootrec /FixMbr


Appendix D: Files used in examples

Here are the files (scripts, configuration files, etc.) written or modified to build the HOSC prototype and to validate information given in this document.

D.1 Windows HPC Server 2008 files

D.1.1 Files used for compute node deployment

The first 2 files are used by the deployment template and they need to be modified in order to fulfill the HOSC requirements. The 3rd XML file is used for template deployment based on CN MAC addresses.

C:\Program Files\Microsoft HPC Pack\Data\InstallShare\unattend.xml

Original:

…
<InstallTo>
  <DiskID>0</DiskID>
  <PartitionID>1</PartitionID>
</InstallTo>
…

Modified for the HOSC (if XBAS uses the first 3 partitions then Windows can be installed on the 4th partition):

…
<InstallTo>
  <DiskID>0</DiskID>
  <PartitionID>4</PartitionID>
</InstallTo>
…

C:\Program Files\Microsoft HPC Pack\Data\InstallShare\Config\diskpart.txt

Original:

select disk 0
clean
create partition primary
assign letter=c
format FS=NTFS LABEL="Node" QUICK OVERRIDE
active
exit

The “clean” instruction removes all existing partitions. It must be deleted to preserve existing partitions.

Modified for deployment:

select disk 0
create partition primary
select volume 0
remove
select volume 1
assign letter=c
format FS=NTFS LABEL="Node" QUICK OVERRIDE
active
exit

The volume handling is needed because of a removable USB device declared as “volume C” by default on R421 systems. This must be adapted to your system.

Modified for re-deployment (existing partitions, Linux and Windows, are kept and the Windows partition is re-formatted):

select volume 0
format FS=NTFS LABEL="Node" QUICK OVERRIDE
active
exit


my_cluster_nodes.xml


<?xml version="1.0" encoding="utf-8"?>
<Nodes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       xmlns="http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12">
  <Node Name="hpcs1" Domain="WINISV">
    <MacAddress>003048334cf6</MacAddress>
  </Node>
  <Node Name="hpcs2" Domain="WINISV">
    <MacAddress>003048334d04</MacAddress>
  </Node>
  <Node Name="hpcs3" Domain="WINISV">
    <MacAddress>003048334d3c</MacAddress>
  </Node>
  <Node Name="hpcs4" Domain="WINISV">
    <MacAddress>003048347990</MacAddress>
  </Node>
</Nodes>

D.1.2 Script for IPoIB setup

setIPoIB.vbs

set objargs=wscript.arguments
Set fs=CreateObject("Scripting.FileSystemObject")
Set WshNetwork = WScript.CreateObject("WScript.Network")
wscript.sleep(10000)
hostname=WshNetwork.ComputerName
ip=GetIP(hostname)
Set logFile = fs.opentextfile("c:\netconfig.log",8,True)
WScript.Echo "Computername: " & hostname
WScript.Echo "IP: " & ip
logfile.writeline("Computername: " & hostname)
logfile.writeline("IP: " & ip)
res=setIPoIB(ip)
logfile.writeline(res)
wscript.echo res

'-------------------------------------------------------------------------
Function GetIP(hostname)
  set sh = createobject("wscript.shell")
  set fso = createobject("scripting.filesystemobject")
  workfile = "c:\PrivateIPadress.txt"
  sh.run "%comspec% /c netsh interface ip show addresses private > " & workfile,0,true
  Set ts = fso.opentextfile(workfile)
  data = split(ts.readall,vbcr)
  ts.close
  fso.deletefile workfile
  for n = 0 to ubound(data)
    if instr(data(n),"Address") then
      parts = split(data(n),":")
      GetIP= trim(cstr(parts(1)))
    end if
    IP = "could not resolve IP address"
  Next
End Function

'---------------------------------------------------------------------
Function setIPoIB(IPAddress)
  PartialIP=Split(ipaddress,".")
  strIPAddress = Array("10.1.0." & PartialIP(3))
  strSubnetMask = Array("255.255.255.0")
  strGatewayMetric = Array(1)
  WScript.Echo "IB: " & strIPAddress(0)
  strComputer = "."
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colNetAdapters = objWMIService.ExecQuery _
    ("select * from win32_networkadapterconfiguration where IPEnabled=true and description like 'Mellanox%'")
  For Each objNetAdapter in colNetAdapters
    errEnable = objNetAdapter.EnableStatic(strIPAddress, strSubnetMask)
    If errEnable = 0 Then
      SetIPoIB="The IP address on Infiniband has been changed"
    Else
      SetIPoIB="The IP address on IB could not be changed. Error: " & errEnable
    End If
  Next
End Function

D.1.3 Scripts used for OS switch

Here are the scripts developed on the HPCS head node to switch the OS type of a compute node from HPCS to XBAS:

C:\hosc\activate_partition_XBAS.bat

@echo off
rem the argument is the head node hostname for shared file system mount. For example: \\HPCS0
echo ... Partitioning disk...
diskpart.exe /s %1\hosc\diskpart_commands.txt
echo ... Shutting down node %COMPUTERNAME% ...
shutdown /r /f /t 20 /d p:2:4

C:\hosc\diskpart_commands.txt

select disk 0
select partition 1
active

C:\hosc\from_HPCS_to_XBAS.bat

@echo off
rem the argument is the node hostname. For example: hpcs1
echo Check that file dhcpd.conf is updated on the XBAS management node !
if NOT "%1"=="" clusrun /nodes:%1 %LOGONSERVER%\hosc\activate_partition_XBAS.bat %LOGONSERVER%
if "%1"=="" echo "usage: from_HPCS_to_XBAS.bat <hpcs_hostname>"


D.2 XBAS files

D.2.1 Kickstart and PXE files

Here is an example of modifications that must be done in the kickstart file generated by the preparenfs tool in order to fulfill the HOSC requirements:

/release/ks/kickstart.<identifier> (for example kickstart.22038)

Original partitioning section (generated by preparenfs):

…
part / --asprimary --fstype="ext3" --ondisk=sda --size=10000
part /usr --asprimary --fstype="ext3" --ondisk=sda --size=10000
part /opt --fstype="ext3" --ondisk=sda --size=10000
part /tmp --fstype="ext3" --ondisk=sda --size=10000
part swap --fstype="swap" --ondisk=sda --size=16000
part /var --fstype="ext3" --grow --ondisk=sda --size=10000
…

Modified partitioning section (leaves room for the Windows partition):

…
part /boot --asprimary --fstype="ext3" --ondisk=sda --size=100
part / --asprimary --fstype="ext3" --ondisk=sda --size=50000
part swap --fstype="swap" --ondisk=sda --size=16000
…

Here is an example of a PXE file generated by preparenfs for node xbas1. Before deployment, the DEFAULT label is set to ks; after deployment, the DEFAULT label is automatically set to local_primary.

/tftboot/C0A80002 (complete file before compute node deployment)

# GENERATED BY PREPARENFS SCRIPT
TIMEOUT 10
DEFAULT ks
PROMPT 1
LABEL local_primary
  KERNEL chain.c32
  APPEND hd0
LABEL ks
  KERNEL RHEL5.1/vmlinuz
  APPEND console=tty0 console=ttyS1,115200 ksdevice=eth0 lang=en ip=dhcp ks=nfs:192.168.0.99:/release/ks/kickstart.22038 initrd=RHEL5.1/initrd.img driverload=igb
LABEL rescue
  KERNEL RHEL5.1/vmlinuz
  APPEND console=ttyS1,115200 ksdevice=eth0 lang=en ip=dhcp method=nfs:192.168.0.99:/release/RHEL5.1 initrd=RHEL5.1/initrd.img rescue driverload=igb


/tftboot/C0A80002 (head of the file after compute node deployment)


# GENERATED BY PREPARENFS SCRIPT
TIMEOUT 10
DEFAULT local_primary
PROMPT 1

The remainder of the file is unchanged. Set TIMEOUT and PROMPT to 0 in order to boot nodes quicker.
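For example, with this change the head of the file would read:

# GENERATED BY PREPARENFS SCRIPT
TIMEOUT 0
DEFAULT local_primary
PROMPT 0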

D.2.2 DHCP configuration

The initial DHCP configuration file must be changed for HPCS CN deployment: the global next-server field must be deleted and each CN host section must be modified as shown below:

/etc/dhcpd.conf

Initial configuration (XBAS CN):

next-server 192.168.0.99;
########### END GLOBAL PARAMETERS
subnet 192.168.0.0 netmask 255.255.0.0 {
  authoritative;
  host xbas1 {
    filename "pxelinux.0";
    fixed-address 192.168.0.2;
    hardware ethernet 00:30:48:33:4c:f6;
    option host-name "xbas1";
  }

Modified configuration (for HPCS CN deployment):

# global "next-server" entry is removed.
########### END GLOBAL PARAMETERS
subnet 192.168.0.0 netmask 255.255.0.0 {
  authoritative;
  host hpcs1 {
    filename "Boot\\x64\\WdsNbp.com";
    fixed-address 192.168.1.2;
    hardware ethernet 00:30:48:33:4c:f6;
    option host-name "hpcs1";
    next-server 192.168.1.1;
    server-name "192.168.1.1";
    option domain-name-servers 192.168.1.1;
  }

The NBP file path must be written with a double \\ in order to be correctly interpreted during the PXE boot. Remark: this modification can be done by the switch_dhcp_host script.

D.2.3 Scripts used for OS switch

Here are the scripts developed on the XBAS management node to switch the OS of a compute node:

/opt/hosc/switch_dhcp_host

#!/usr/bin/python -t
import os, os.path, sys
############## Cluster characteristics must be written here ################
xbas_hostname_base='xbas'
hpcs_hostname_base='hpcs'
field_dict = {hpcs_hostname_base:{'filename':'"Boot\\\\x64\\\\WdsNbp.com";\n',
                                  'fixed-address':'192.168.1.',
                                  'next-server':'192.168.1.1;\n',
                                  'server-name':'"192.168.1.1";\n'},
              xbas_hostname_base:{'filename':'"pxelinux.0";\n',
                                  'fixed-address':'192.168.0.',
                                  'next-server':'192.168.0.1;\n',
                                  'server-name':'"192.168.0.1";\n'}}

if (len(sys.argv) <> 2):
    print ('usage: switch_dhcp_host <current compute node hostname>')
    sys.exit(1)
elif (len(str(sys.argv[1]))>1) and (str(sys.argv[1])[-2:].isdigit()):
    node_base = str(sys.argv[1])[:-2]
    node_rank = str(sys.argv[1])[-2:]
else:
    node_base = str(sys.argv[1])[:-1]
    node_rank = str(sys.argv[1])[-1:]
if (node_base == xbas_hostname_base):
    old_hostname = xbas_hostname_base + node_rank
    new_hostname = hpcs_hostname_base + node_rank
    new_node_base = hpcs_hostname_base
elif (node_base == hpcs_hostname_base):
    old_hostname = hpcs_hostname_base + node_rank
    new_hostname = xbas_hostname_base + node_rank
    new_node_base = xbas_hostname_base
else:
    print ('unknown hostname: ' + sys.argv[1])
    sys.exit(1)
file_name = '/etc/dhcpd.conf'
if not os.path.isfile(file_name):
    print file_name + ' does not exists !'
    sys.exit(1)
status = 'File ' + file_name + ' was not modified'
file_name_save = file_name + '.save'
file_name_temp = file_name + '.temp'
old_file = open(file_name,'r')
new_file = open(file_name_temp,'w')
S = old_file.readline()
while S:
    if (S[0:11] == 'next-server'):
        S = old_file.readline()   # Removes global next-server line
    if (S.find('host ' + old_hostname) <> -1):
        while (S.find('hardware ethernet') == -1):
            S = old_file.readline()   # Skips old host section lines
        hardware_ethernet = S.split()[2]   # Gets host Mac address
        while (S.find('}') == -1):
            S = old_file.readline()   # Skips old host section lines
        # Writes new host section lines:
        new_file.write(' host ' + new_hostname + ' {\n')
        new_file.write('   filename ' + field_dict[new_node_base]['filename'])
        new_file.write('   fixed-address ' + field_dict[new_node_base]['fixed-address'] + str(int(node_rank)+1) + ';\n')
        new_file.write('   hardware ethernet ' + hardware_ethernet + '\n')
        new_file.write('   option host-name ' + '"' + new_hostname + '";\n')
        new_file.write('   next-server ' + field_dict[new_node_base]['next-server'])
        new_file.write('   server-name ' + field_dict[new_node_base]['server-name'])
        if (new_node_base == hpcs_hostname_base):
            new_file.write('   option domain-name-servers ' + field_dict[new_node_base]['next-server'])
        new_file.write(' }\n')
        status = 'File ' + file_name + ' is updated with host ' + new_hostname
    else:
        new_file.write(S)   # Copies the line from the original file without modifications
    S = old_file.readline()
# End while loop
old_file.close()
new_file.close()
if os.path.isfile(file_name_save):
    os.remove(file_name_save)
os.rename(file_name, file_name_save)
os.rename(file_name_temp, file_name)
print status
print ('Do not forget to validate changes by typing: service dhcpd restart')
sys.exit(0)
# End of switch_dhcp_host script

/opt/hosc/activate_partition_HPCS.sh

#!/bin/sh
# the argument is the node hostname. For example: xbas1
ssh $1 fdisk /dev/sda < /opt/hosc/fdisk_commands.txt

/opt/hosc/fdisk_commands.txt

a
4
a
1
w
q

/opt/hosc/from_XBAS_to_HPCS.sh

#!/bin/sh
# the argument is the node hostname. For example: xbas1
/opt/hosc/switch_dhcp_host $1
/sbin/service dhcpd restart
/opt/hosc/activate_partition_HPCS.sh $1
ssh $1 shutdown -r -t 20 now

/opt/hosc/from_HPCS_to_XBAS.sh

#!/bin/sh
# this script requires a ssh server daemon to be installed on the HPCS compute nodes
# the argument is the compute node hostname. For example: hpcs1
# HPCS head node hostname is hard coded in this script as: hpcs0
/opt/hosc/switch_dhcp_host $1
/sbin/service dhcpd restart
ssh $1 -l root cmd /c \\\\hpcs0\\hosc\\activate_partition_XBAS.bat \\\\hpcs0

D.2.4 Network interface bridge configuration

For configuring 2 network interface bridges, xenbr0 and xenbr1, replace the following line in file /etc/xen/xend-config.sxp:

(network-script network-bridge)

with:

(network-script my-network-bridges)


Then create file:

/etc/xen/scripts/my-network-bridges

#!/bin/bash
XENDIR="/etc/xen/scripts"
$XENDIR/network-bridge "$@" netdev=eth0 bridge=xenbr0 vifnum=0
$XENDIR/network-bridge "$@" netdev=eth1 bridge=xenbr1 vifnum=1

D.2.5 Network hosts

The hosts file declares the IP addresses of the network interfaces of the Linux nodes. The XBAS CNs need to have the same hosts file. Here is an example for our HOSC cluster:

/etc/hosts

127.0.0.1    localhost.localdomain localhost
192.168.0.1  xbas0
192.168.0.2  xbas1
192.168.0.3  xbas2
192.168.0.4  xbas3
192.168.0.5  xbas4
172.16.0.1   xbas0-ic0
172.16.0.2   xbas1-ic0
172.16.0.3   xbas2-ic0
172.16.0.4   xbas3-ic0
172.16.0.5   xbas4-ic0

D.2.6 IB network interface configuration

For configuring the IB interface on each node, create/edit the following file with the right IP address. Here is an example for the compute node xbas1:

/etc/sysconfig/network-scripts/ifcfg-ib0

DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
NETWORK=192.168.220.0
IPADDR=192.168.220.2
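The interface can then be brought up without rebooting the node, for example:

[xbas1:root] ifup ib0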

D.2.7 ssh host configuration

/etc/ssh/ssh_known_hosts

xbas0,192.168.0.1 ssh-rsa AAAB3NzaC1yc2EAAABIwAAAQE/yiPG/x5gl+dq5XXhffF456fggDFt … lC92dxQUE5qQ==
xbas1,192.168.0.2 ssh-rsa AAAB3NzaC1yc2EAAABIwAAAQE/yiPG/x5gl+dq5XXhffF456fggDFt … lC92dxQUE5qQ==
xbas2,192.168.0.3 ssh-rsa AAAB3NzaC1yc2EAAABIwAAAQE/yiPG/x5gl+dq5XXhffF456fggDFt … lC92dxQUE5qQ==
xbas3,192.168.0.4 ssh-rsa AAAB3NzaC1yc2EAAABIwAAAQE/yiPG/x5gl+dq5XXhffF456fggDFt … lC92dxQUE5qQ==
xbas4,192.168.0.5 ssh-rsa AAAB3NzaC1yc2EAAABIwAAAQE/yiPG/x5gl+dq5XXhffF456fggDFt … lC92dxQUE5qQ==


D.3 Meta-scheduler setup files

D.3.1 PBS Professional configuration files on XBAS

Here is an example of the PBS Professional configuration file for the PBS server on the XBAS MN:

/etc/pbs.conf

PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
PBS_SERVER=xbas0
PBS_SCP=/usr/bin/scp

Here is an example of PBS Professional configuration file for PBS MOM on the XBAS CNs:

/etc/pbs.conf

PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=0
PBS_START_MOM=1
PBS_START_SCHED=0
PBS_SERVER=xbas0
PBS_SCP=/usr/bin/scp

D.3.2 PBS Professional configuration files on HPCS

Here is an example of the lmhosts file needed on HPCS nodes:

C:\Windows\System32\drivers\etc\lmhosts

192.168.0.1 xbas0 #PBS server for HOSC

D.3.3 OS load balancing files

This script gets information from the PBS server and switches the OS type of compute nodes according to the rule defined in Section 5.7.1:

“Let us define η as the smallest number of nodes requested by a queued job for a given OS type A. Let us define α (respectively β) as the number of free nodes with the OS type A (respectively B). If η>α (i.e., there are not enough free nodes to run the submitted job with OS type A) and if β≥η-α (at least η-α nodes are free with the OS type B) then the OS type of η-α nodes should be switched from B to A”.


/opt/hosc/pbs_hosc_os_balancing.pl


#!/usr/bin/perl
#use strict;

# Gets information with pbsnodes about free nodes
$command_pbsnodes = "/usr/pbs/bin/pbsnodes -a |";
open (PBSC, $command_pbsnodes ) or die "Failed to run command: $command_pbsnodes";
@cmd_output = <PBSC>;
close (PBSC);
foreach $line (@cmd_output) {
    if (($line !~ /^(\s+)\w+/) && ($line !~ /^(\s+)$/) && ($line =~ /^(.*)\s+/)) {
        $nodename = $1;
        push (@pbsnodelist, $nodename);
        $pbsnodes->{$nodename}->{state} = 'unknown';
        $pbsnodes->{$nodename}->{arch} = 'unknown';
    } elsif ($line =~ "state") {
        $pbsnodes->{$nodename}->{state} = (split(' ', $line))[2];
    } elsif ($line =~ "arch") {
        $pbsnodes->{$nodename}->{arch} = (split(' ', $line))[2];
    }
}
foreach my $node (@pbsnodelist) {
    if ($pbsnodes->{$node}->{state} =~ "free") {
        if ($pbsnodes->{$node}->{arch} =~ "linux") {
            push (@free_linux_nodes, $node);
        } else {
            push (@free_windows_nodes, $node);
        }
    }
}

# Gets information with qstat about the number of nodes requested by queued jobs
$command_qstat = "/usr/pbs/bin/qstat -a |";
open (PBSC, $command_qstat ) or die "Failed to run command: $command_qstat";
@cmd_output = <PBSC>;
close (PBSC);
$nb_windows_nodes_of_smallest_job = 1e09;
$nb_linux_nodes_of_smallest_job = 1e09;
foreach $line (@cmd_output) {
    if ((split(' ', $line))[9] =~ "Q") {
        $nb_nodes = (split(' ', $line))[5];
        if ($line =~ "windowsq") {
            $nb_windows_nodes_queued += $nb_nodes;
            if ($nb_nodes < $nb_windows_nodes_of_smallest_job) {
                $nb_windows_nodes_of_smallest_job = $nb_nodes;
            }
        } elsif ($line =~ "linuxq") {
            $nb_linux_nodes_queued += $nb_nodes;
            if ($nb_nodes < $nb_linux_nodes_of_smallest_job) {
                $nb_linux_nodes_of_smallest_job = $nb_nodes;
            }
        }
    }
}


# STDOUT is redirected to a LOG file
open LOG, ">>/tmp/pbs_hosc_log.txt";
select LOG;

# Compute the number of possible requested nodes whose OS type should be switched
$requested_windows_nodes = $nb_windows_nodes_of_smallest_job - scalar @free_windows_nodes;
$requested_linux_nodes = $nb_linux_nodes_of_smallest_job - scalar @free_linux_nodes;

# The decision rule based on previous information is applied
if (($nb_windows_nodes_of_smallest_job > scalar @free_windows_nodes)
    && (scalar @free_linux_nodes >= $requested_windows_nodes)) {
    # switch $requested_windows_nodes nodes from XBAS to HPCS
    for ($i = 0; $i < $requested_windows_nodes; $i++) {
        $command_offline = "/usr/pbs/bin/pbsnodes -o $free_linux_nodes[$i]";
        system ($command_offline);
        $command_switch_to_HPCS = "/opt/hosc/from_XBAS_to_HPCS.sh $free_linux_nodes[$i]";
        system ($command_switch_to_HPCS);
        ($new_node = $free_linux_nodes[$i]) =~ s/xbas/hpcs/;
        $command_online = "/usr/pbs/bin/pbsnodes -c $new_node";
        system ($command_online);
        print "switch OS type from XBAS to HPCS: $free_linux_nodes[$i] -> $new_node\n";
    }
} elsif (($nb_linux_nodes_of_smallest_job > scalar @free_linux_nodes)
         && (scalar @free_windows_nodes >= $requested_linux_nodes)) {
    # switch $requested_linux_nodes nodes from HPCS to XBAS
    for ($i = 0; $i < $requested_linux_nodes; $i++) {
        $command_offline = "/usr/pbs/bin/pbsnodes -o $free_windows_nodes[$i]";
        system ($command_offline);
        $command_switch_to_XBAS = "/opt/hosc/from_HPCS_to_XBAS.sh $free_windows_nodes[$i]";
        system ($command_switch_to_XBAS);
        ($new_node = $free_windows_nodes[$i]) =~ s/hpcs/xbas/;
        $command_online = "/usr/pbs/bin/pbsnodes -c $new_node";
        system ($command_online);
        print "switch OS type from HPCS to XBAS: $free_windows_nodes[$i] -> $new_node\n";
    }
}
close LOG;

The above script is run periodically every 10 minutes as defined by the crontab file:

/var/spool/cron/root

# run HOSC Operating System balancing script every 10 minutes (noted */10)
*/10 * * * * /opt/hosc/pbs_hosc_os_balancing.pl


Appendix E: Hardware and software used for the examples

Here are the details of the hardware and software configuration used to illustrate examples. They were used to build the HOSC prototype and to validate information given in this document. Any Bull NovaScale or bullx cluster with Linux Bull Advanced Server for Xeon and Windows HPC Server 2008 could be used in the same manner.

E.1 Hardware

• 1 Bull NovaScale R460 server

o 2 dual core Intel® Xeon® processors (5130 - Woodcrest) at 2GHz

o 8 GB memory, 2x 146GB SAS disks

• 4 Bull NovaScale R421 servers

o 2 dual core Intel® Xeon® processors (5160 - Woodcrest) at 3GHz

o 16 GB Memory, 2x 160GB SATA disks

• Voltaire ISR 9024D-M InfiniBand Switch and 5 HCA-410EX-D (4X)

• Cisco Gigabit switch (24 ports)

E.2 Software

• Windows

o Windows HPC Server 2008: Windows Server 2008 Standard and the Microsoft HPC Pack

o Intel® network adapter driver for Windows Vista and Server 2008 x64 v13.1.2

o Mellanox InfiniBand Software Stack for Windows HPC Server 2008 v1.4.1

o Microsoft Utilities and SDK for UNIX-based Applications AMD64 (v. 10.0.6030.0) and Interops Systems “Power User” add-on bundle (v. 6.0)

o PBS Professional 10.1 for Windows Server 2008 x86_64

o freeSSHd 1.2.1

• Linux

o Bull Advanced Server for Xeon 5v1.1: Red Hat Enterprise Linux 5.1 including Xen 3.0.3 with Bull XHPC and XIB packs (optional: Bull Hypernova 1.1.B2)

o PBS Professional 10.1 for Linux x86_64


Appendix F: About Altair and PBS GridWorks

F.1 About Altair

Altair empowers client innovation and decision-making through technology that optimizes the analysis, management and visualization of business and engineering information. Privately held with more than 1,400 employees, Altair has offices throughout North America, South America, Europe and Asia/Pacific. With a 20-year-plus track record for product design, advanced engineering software and grid computing technologies, Altair consistently delivers a competitive advantage to customers in a broad range of industries.

To learn more, please visit http://www.altair.com.

F.2 About PBS GridWorks

Altair's PBS GridWorks is a suite of on-demand grid computing technologies that allows enterprises to maximize ROI on computing infrastructure assets. PBS GridWorks is the most widely implemented software environment for grid-, cluster- and on-demand computing worldwide. The suite's flagship product, PBS Professional, provides a flexible, on-demand computing environment that allows enterprises to easily share diverse (heterogeneous) computing resources across geographic boundaries.

To learn more, please visit http://www.pbsgridworks.com.


Appendix G: About Microsoft and Windows HPC Server 2008

G.1 About Microsoft

Founded in 1975, Microsoft (Nasdaq “MSFT”) is the worldwide leader in software, services and solutions that help people and businesses realize their full potential.

More information about Microsoft is available at: http://www.microsoft.com.

G.2 About Windows HPC Server 2008

Windows HPC Server 2008, the next generation of high-performance computing (HPC), provides enterprise-class tools for a highly productive HPC environment. Built on Windows Server 2008 64-bit technology, Windows HPC Server 2008 can efficiently scale to thousands of processing cores and includes management consoles that help you proactively monitor and maintain system health and stability. Job scheduling interoperability and flexibility enable integration between Windows and Linux based HPC platforms, and support batch and service-oriented application (SOA) workloads. Enhanced productivity, scalable performance, and ease of use are some of the features that make Windows HPC Server 2008 best-of-breed for Windows environments.

More information and resources for Windows HPC Server 2008 are available at:

Windows HPC Server 2008 Web site: http://www.microsoft.com/hpc

Windows HPC Community Web site: http://windowshpc.net


Appendix H: About BULL S.A.S.

Bull is one of the leading European IT companies, and has become an indisputable player in the High-Performance Computing field in Europe, with exceptional growth over the past four years, major contracts, numerous records broken, and significant investments in R&D.

In June 2009, Bull confirmed its commitment to supercomputing, with the launch of its bullx range: the first European-designed supercomputers to be totally dedicated to Extreme Computing. Designed by Bull’s team of specialists working in close collaboration with major customers, bullx embodies the company’s strategy to become one of the three worldwide leaders in Extreme Computing, and number one in Europe. The bullx supercomputers benefit from the know-how and skills of Europe’s largest center of expertise dedicated to Extreme Computing. Delivering anything from a few teraflops to several petaflops of computing power, they are easy to implement by everyone from a small R&D office to a world-class data center.

Bull has now won worldwide recognition thanks to several TOP500-class systems (see [38]). Bull has gathered significant momentum in HPC in recent years, with over 120 customers in 15 countries across three continents. The spread of countries and industry sectors covered, as well as the sheer diversity of solutions that Bull has sold, illustrates the reputation that the company now enjoys. Its installations range from the first major supercomputer installed at the CEA to the numerous supercomputers delivered to higher education establishments in Brazil, France, Spain, Germany and the United Kingdom, such as the two large clusters acquired by the Jülich Research Center, which deliver a global peak performance of more than 300 teraflops. In industry, prestigious customers including Alcan, Pininfarina, Dassault-Aviation and Alenia have chosen Bull solutions. And Miracle Machines in Singapore implemented a Bull supercomputer that will be used to study and help predict tsunamis.

Alongside this commercial success, the breaking of a number of world records highlights Bull's expertise in the design and integration of the most advanced technologies. Bull systems have achieved some major performance records, particularly for ultra-large file systems, image searches in very large-scale databases (the engines of future research), and the search for new prime numbers. These systems have also been used to carry out the most extensive simulation ever of the formation of the structures of the Universe.

To prepare the systems of the future, Bull is a founder or a member of several important consortia, including Parma, which forms part of ITEA2 and brings together a large number of European research centers to develop the next generation of parallel systems. Finally, Bull is a founder member of the POPS consortium, under the auspices of the SYSTEM@TIC competitiveness cluster based in the Ile de France region, which is developing tomorrow's petascale systems.

Bull and the French Atomic Energy Authority (CEA) are currently collaborating to design and build Tera 100, the future petascale supercomputer to support the French nuclear simulation program.

For more information, visit http://www.bull.com/hpc
