
Page 1

M. Aldinucci, M. Torquati, et al. (Unipi)

P. Zuccato, A. Gervaso (Eurotech S.p.A.)

VirtuaLinux: An open source HPC clustering solution

http://sourceforge.net/projects/virtualinux/

Page 2

Eurotech HPC: http://www.eurotech.it

Pierfrancesco Zuccato, head of Eurotech HPC

Page 3

Industry & research

The VirtuaLinux project: entirely funded by Eurotech S.p.A.

Aims to solve industrial problems; these call for both pure and applied research

Scientific goals published in A-class conferences

Developed software released as open source GPL

Page 4

Problem statement

Cluster: a collection of (high-density) legacy independent machines connected by means of a LAN

Clusters are fragile: the master node is a single point of failure

disks-on-blades are a common source of node failure

Clusters are hard to install and to maintain: installation requires days

skilled administrators are needed (the root administrator problem)

legacy OSes

Page 5

Common industrial configuration

High-density blades + external SAN (or NAS): a RAID SAN is much more efficient & robust

Robustness and speed are separately addressed (often at the HW level)

Sometimes legally required (e.g. the USA Sarbanes-Oxley law)

A number of Linux distributions support this configuration; they require customization of the standard distribution

typically paths and configuration of services

[Figure: physical cluster + external SAN, InfiniBand + Ethernet. 4 nodes x 4 CPUs each (node1..node4), cluster InfiniBand 192.0.0.0/24, cluster Ethernet 192.0.1.0/24, Internet gateway 131.1.7.6, SAN with disk1 and disk2; each node is attached to both the IB and the Ethernet switch.]

Page 6

Common hardware producer problem

A single cluster configuration does not match all user expectations

"I want CentOS", "I need Ubuntu", "I believe in Uindoz" (the last one being much more a religion than an OS)

Cluster life-cycle in 5 easy steps

1. the cluster is installed with the factory distribution

2. the cluster falls into the hands of site system administrators

3. they mix up user requirements
• they believe themselves to be wizards; in reality they are sometimes sorcerer's apprentices
• of course wizards exist, but they want to be paid, and this is forbidden by the cluster owner's religion

4. after two days they destroy the cluster

5. ask for factory assistance, goto 1

Page 7

Virtualization: a brand-new solution?

Anyway, it works (it is abstraction, in the end):

high-level (e.g. JVM)

medium-level (e.g. FreeBSD jails)

low-level: simulation (e.g. Cell), binary translation (e.g. VMware, QEMU, ...), paravirtualization (Xen, KVM, ...)

Makes it possible to consolidate several OSes

makes it possible to share a physical resource

insulates user OSes from the lower-level OS

Robert P. Goldberg describes the then state of things in his 1974 paper titled Survey of Virtual Machines Research. He says: "Virtual machine systems were originally developed to correct some of the shortcomings of the typical third generation architectures and multi-programming operating systems - e.g., OS/360."

Christopher Strachey published a paper titled Time Sharing in Large Fast Computers in the International Conference on Information Processing at UNESCO, New York, in June, 1959. Later on, in 1974, he clarified in an email to Donald Knuth that:

" ... [my paper] was mainly about multi-programming (to avoid waiting for peripherals) although it did envisage this going on at the same time as a programmer who was debugging his program at a console. I did not envisage the sort of console system which is now so confusingly called time sharing.". Strachey admits, however, that "time sharing" as a phrase was very much in the air in the year 1960.

Page 8

VirtuaLinux approach

A meta-distribution: get a Linux distribution and automatically configure it

Master-less: any node can act as a master; nodes are fully symmetrical

Disk-less: provide the cluster with fully independent virtual volumes starting from a network-attached SAN

No customization of the OS is required

Transparently supports Virtual Clusters, and the tools to manage them

Page 9

Virtual clusters

[Figure: virtual clusters deployed on the physical cluster. Physical cluster + external SAN, InfiniBand + Ethernet, 4 nodes x 4 CPUs, cluster InfiniBand 192.0.0.0/24, cluster Ethernet 192.0.1.0/24, Internet gateway 131.1.7.6. On top of it: Virtual Cluster "green" (4 VMs x 1 VCPU, 10.0.3.0/24), Virtual Cluster "pink" (4 VMs x 4 VCPUs, 10.0.0.0/24), Virtual Cluster "tan" (2 VMs x 2 VCPUs, 10.0.1.0/24). The figure shows the operations: start VC green, start VC tan, start VC pink, suspend tan, restart & remap tan onto different nodes.]

Page 10

The Big Picture

[Figure: the big picture. m diskless SMP blades (N1..Nm), each running the Xen hypervisor (VMM) and a Linux kernel with iSCSI, EVMS and IPoIB support, connected by an InfiniBand 10x switch (e.g. 192.168.0.0/24) and a Gigabit Ethernet switch (e.g. 192.168.1.0/24). An iSCSI SAN, a set of RAID disks (e.g. Linux Openfiler), is presented through EVMS volumes (disk abstraction layer) as one private volume per blade plus a cluster-wide shared volume. OS services S1..S6 (e.g. TFTP, DHCP, NTP, LDAP, DNS, ...) are replicated on all blades, either actively (S1..S4 active everywhere) or passively (S5, S6 primary on one blade, backup on the others). Two virtual clusters are mapped onto the blades: VC "tan" (e.g. 1 VM per blade, 4 CPUs per VM, e.g. 10.0.1.0/24) and VC "red" (e.g. 2 VMs per blade, 1 CPU per VM, e.g. 10.0.2.0/24).]

Disks Abstraction Layer
A set of private and shared EVMS volumes is mounted via iSCSI on each node of the cluster: a private disk (root) and an OCFS2/GFS cluster-wide shared volume are mounted on each node. The EVMS snapshot technique is used for a time- and space-efficient creation of the private remote disk. A novel EVMS plug-in has been designed to implement this feature.

Linux OS services
All standard Linux services are made fault-tolerant via either active or passive replication.
Active: services are started on all nodes; a suitable configuration enforces load balancing of client requests. E.g. NTP, DNS, TFTP, DHCP.
Passive (primary-backup): Linux-HA with heartbeat is used as fault detector. E.g. LDAP, IP gateway.

Kernel Basic Features
All standard Linux modules. Xen hypervisor, supporting Linux paravirtualization, and Microsoft Windows via QEMU binary translation (experimental). Network connectivity, including Infiniband user-space verbs and IP over Infiniband. iSCSI remote storage access. OCFS2 and GFS shared file systems.

Virtual Clusters (VC)
Each virtual node (VM) of a VC is a virtual machine that can be configured at creation time. It exploits a cluster-wide shared storage. Each VC exploits a private network and can access the cluster's external gateway. VMs of a VC can be flexibly mapped onto the cluster nodes. VCs can be dynamically created, destroyed, and suspended on disk.

Page 11

High Availability, by way of the replication of services

Page 12

High availability

24/7 cluster availability

Redundant hardware: 5 power supplies, 4 independent network switches, ...

iSCSI-over-Infiniband or Fibre Channel attached RAID

Redundant services: all nodes are identical, and there is no master

All OS services are replicated on all nodes: any blade can be detached, or crash, at any point in time with no impact on system availability

Page 13

Provide services without a (single) server

Service      FT model          Notes
DHCP         active            Pre-defined map between IP and MAC
TFTP         active            All copies provide the same image
NTP          active            Pre-defined external NTPD fallback via GW
IB manager   active            Stateless service
DNS          active            Cache-only
LDAP         service-specific  Service-specific master redundancy
IP GW        passive           Heartbeat with IP takeover (via IP aliasing), sketched below
Mail         node-oriented     Local node and relays via DNS
SSH/SCP      node-oriented     Pre-defined keys
NFS          node-oriented     Pre-defined configuration
SMB/CIFS     node-oriented     Pre-defined configuration
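The passive (primary-backup) entries above, the IP gateway in particular, rely on heartbeat detection with IP takeover. The toy Python sketch below is not Linux-HA/heartbeat code: the service IP, the timeout and all names are invented, and the takeover is only printed (the real system uses IP aliasing).

```python
# Toy illustration of passive primary-backup failover: the backup watches
# heartbeats from the primary and takes over the service IP when they stop.
import time

class BackupNode:
    def __init__(self, service_ip: str, timeout: float = 3.0):
        self.service_ip = service_ip
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self) -> None:
        """Called whenever a heartbeat from the primary is received."""
        self.last_heartbeat = time.monotonic()

    def check(self) -> None:
        """Called periodically; promotes this node if the primary is silent."""
        silent_for = time.monotonic() - self.last_heartbeat
        if not self.active and silent_for > self.timeout:
            self.active = True
            # A real system would add the alias, e.g. via "ip addr add ... dev eth0".
            print(f"primary silent for {silent_for:.1f}s: taking over {self.service_ip}")

if __name__ == "__main__":
    backup = BackupNode("192.0.1.254")   # hypothetical gateway IP on the cluster Ethernet
    backup.on_heartbeat()                # heartbeats arrive while the primary is alive
    time.sleep(0.1)
    backup.check()                       # within the timeout: nothing happens
    backup.last_heartbeat -= 10          # simulate a primary that went silent
    backup.check()                       # timeout expired: IP takeover
```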

Page 14

Novel boot sequence, supporting the boot of master-less systems

Page 15

Install without a master

How to install a masterless cluster? Chicken and egg problem!

Solution: the Metamaster, a transient master node that is then transformed into a standard node

At the end of the installation, all nodes are identical and provide redundant services.

No master node, fully symmetrical nodes

Page 16

Storage Virtualization: a transparent, constant time-space solution

Page 17

Full vs no replication

[Figure: full replication vs no replication. Left: each node (physical or virtual, node 1..n) has its own disk or partition holding a complete root tree (/, boot, dev, etc, usr, tmp, var, bin, lib, include, ...). Right: all nodes share a single disk/partition whose root tree contains per-node directories (etc_node1, etc_node2, etc_node3, ..., var_node1, var_node2, var_node3, ...); on node i the generic names are symbolic links etc -> etc_nodei and var -> var_nodei.]

Full replication:
Each node (physical or virtual) has its own copy of the whole disk
Transparent, easy to build and update
OS does not need customisation
Inefficient in time and space, O(n*size): identical OS files are replicated

No replication (shared disk):
All nodes (physical or virtual) share a disk (a file system, actually)
Not transparent, complex to build and update
OS does need customisation
Efficient in time and space, O(size): OS files are not replicated

Page 18

VC requirements

Needs both transparency and time-space efficiency

Should be independent from the OS

A "frequent" and very time-consuming operation, e.g. 50 nodes x 10 GB at 100 MB/s ≈ 2 hours (see the check below)

a very optimistic forecast

just to start/suspend the VC (and the system becomes unresponsive meanwhile)

Consumes a lot of disk space, e.g. 50 nodes x 5 GB = 250 GB

for each VC, and for the OS only - no user data here
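For concreteness, the back-of-the-envelope numbers above in a short script: the raw sequential copy alone, ignoring every overhead, already takes well over an hour, which is why the roughly two-hour figure on the slide is called optimistic.

```python
# Cloning 50 node images of 10 GB each at 100 MB/s sustained, plus the
# disk space taken by 50 x 5 GB OS images for a single VC.
nodes = 50
image_gb = 10
bandwidth_mb_s = 100

clone_seconds = nodes * image_gb * 1024 / bandwidth_mb_s
print(f"cloning time: {clone_seconds / 3600:.1f} hours")   # ~1.4 h of pure copying
print(f"disk space per VC: {nodes * 5} GB")                # 250 GB of OS images
```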

Page 19

VC storage properties

Nodes of a VC are homogeneous (same OS): 99% of OS-related files are identical in all VMs

There is no reason to have heterogeneous VCs; just start more VCs of different kinds

It is low-level virtualization: the virtual is similar to the real

Just keep those files in a single copy, but don't tell the virtual nodes: they believe they are fully independent of each other

A novel usage of the snapshotting technique: copy-on-write

Page 20

Snapshot technique

[Figure: snapshot timeline. The original volume evolves over time under writes (0001 -> 0010 -> 0011 -> 1011 -> 1111). A snapshot taken at t1 preserves state 0010 and one taken at t2 preserves state 0011; each snapshot then evolves independently under its own writes (e.g. 0011 -> 0111) without affecting the original.]

Page 21

Snapshots

Used to build online backups: both the original and the snapshots can be written

File system independent, transparent

Supported by several tools, e.g. EVMS, LVM2; implemented by the standard Linux kernel

dm_snapshot module (device mapper)

Can be implemented in several ways: copy-on-write, redirect-on-write, split-mirror, ...

Page 22

Copy-on-write

[Figure: copy-on-write. Writing on the original first copies the affected block to the snapshot store and then overwrites it in place; writing on the snapshot stores the new block in the snapshot store, leaving the original untouched.]
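To make the copy-on-write scheme in the figure concrete, here is a minimal, self-contained Python sketch. It is a toy model, not VirtuaLinux or device-mapper code, and all names are invented: the snapshot stores only the blocks that diverge from the original, either because the original was overwritten or because the snapshot itself was written.

```python
class Volume:
    """An 'original' volume: a flat array of fixed-size blocks."""
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def write(self, index, data):
        # Copy-on-write: before overwriting, let every snapshot preserve
        # the old content of this block.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data

    def read(self, index):
        return self.blocks[index]


class Snapshot:
    """A writable snapshot: stores only blocks that differ from the origin."""
    def __init__(self, origin):
        self.origin = origin
        self.cow = {}            # block index -> private copy
        origin.snapshots.append(self)

    def preserve(self, index, old_data):
        # Called by the origin just before it overwrites a block;
        # only the first overwrite matters.
        self.cow.setdefault(index, old_data)

    def read(self, index):
        # Unmodified blocks are served from the origin: no data duplication.
        return self.cow.get(index, self.origin.read(index))

    def write(self, index, data):
        # Writes to the snapshot never touch the origin.
        self.cow[index] = data


if __name__ == "__main__":
    vol = Volume(["a", "b", "c", "d"])
    snap = Snapshot(vol)          # O(1): only metadata, no block copies
    vol.write(1, "B")             # origin changes; old block saved in snap.cow
    snap.write(2, "X")            # snapshot diverges without touching the origin
    print(vol.blocks)                            # ['a', 'B', 'c', 'd']
    print([snap.read(i) for i in range(4)])      # ['a', 'b', 'X', 'd']
```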

Page 23

Concurrent snapshots

Snapshots were not designed for concurrency: all snapshots would have to be kept active on all nodes

Linux cannot keep more than 8 snapshots per node active

Novel semantics for snapshots: relax the snapshot semantics while maintaining correctness

"Mark" parts of the blocks as read-only (one reading of this is sketched below)
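One way to read the relaxed semantics, an interpretation of the slide rather than the actual EVMS plug-in: if the blocks of the original volume are treated as read-only once snapshots exist, the origin never has to push old blocks into the snapshots' copy-on-write stores, so each node can activate just its own snapshot, with no cross-node coordination and without hitting the per-node snapshot limit. A sketch building on the previous one:

```python
class ReadOnlyOrigin:
    """Origin 'marked' read-only: it never changes after snapshot creation."""
    def __init__(self, blocks):
        self._blocks = tuple(blocks)   # frozen

    def read(self, index):
        return self._blocks[index]

    # No write(): since the origin never changes, a snapshot's private
    # copy-on-write table is the only state its node has to manage.


class IndependentSnapshot:
    """Per-node writable view; safe to activate on exactly one node."""
    def __init__(self, origin):
        self.origin = origin
        self.cow = {}

    def read(self, index):
        return self.cow.get(index, self.origin.read(index))

    def write(self, index, data):
        self.cow[index] = data
```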

Page 24

Storage virtualization

[Figure: storage virtualization with EVMS. The SAN exposes disks (sda, sdb), which EVMS layers into segments (sda1, sda2, sda3, segm1..segm3), a container (container_0), and regions (R_def, Ra_1..Ra_n, Rb_1..Rb_n, R_shared). Snapshots snap_1..snap_n of the default volume provide the per-node private volumes. The resulting volumes are /dev/evms/default (ext3), /dev/evms/node1 .. /dev/evms/noden (ext3, one per node), /dev/evms/swap1 .. /dev/evms/swapn (swap), and /dev/evms/shared (OCFS2, cluster-wide), exported to nodes node1..noden of the cluster.]
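A small sketch of the per-node volume layout implied by the figure above. It assumes the /dev/evms/* naming shown there; the /shared mount point and the helper name are illustrative assumptions, not VirtuaLinux code.

```python
def node_volumes(node_id: int) -> list[tuple[str, str, str]]:
    """Return (device, filesystem, mount point) for one physical node."""
    return [
        (f"/dev/evms/node{node_id}", "ext3",  "/"),        # private snapshot of the default volume
        (f"/dev/evms/swap{node_id}", "swap",  "none"),     # per-node swap region
        ("/dev/evms/shared",         "ocfs2", "/shared"),  # cluster-wide shared volume
    ]

if __name__ == "__main__":
    for dev, fs, mnt in node_volumes(2):
        print(f"{dev:22s} {mnt:10s} {fs}")
```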

Page 25

Storage virtualization (VCs)

[Figure: storage virtualization for VCs. On the SAN, EVMS holds a container_Dom0 with segment_Dom0 serving the physical cluster (nodes 1..n) and a container_DomU with segment_DomU serving the virtual clusters; the VMs VM1..VMn of VC "green" and of VC "pink" obtain their volumes from the DomU container.]

Page 26

[Figure: Intel MPI Benchmarks latency (usec): Barrier on 2 and 4 nodes, and PingPong, PingPing, Sendrecv on 2 nodes, comparing Dom0_IB, Dom0_IPoIB, Dom0_Geth and DomU_IPoIB.]

Dom0 IB: Ubuntu Dom0, Infiniband user-space verbs (MPI-gen2)
Dom0 IPoIB: Ubuntu Dom0, Infiniband IP-over-IB (MPI-TCP)
Dom0 Geth: Ubuntu Dom0, Giga-Ethernet (MPI-TCP)
DomU IPoIB: Ubuntu DomU, virtual net on top of Infiniband IP-over-IB (MPI-TCP)

Table 1: Intel MBI experiments legend.

cluster launches. With simple scripts it is possible to retrieve from the database the current cluster running status and the physical mapping.

• A command-line library for the creation, the activation and the destruction of the VCs. In the current implementation the library is composed of a bunch of Python scripts that implement three main commands (see the sketch after this list):

1. VC Create, for the creation of the VC; it fills the database with VC-related static information.

2. VC Control, able to launch a previously created VC and deploy it on the physical cluster according to a mapping policy. It is also able to stop, suspend or resume a VC.

3. VC Destroy, which purges a VC from the system; it performs the clean-up of the database.

• A communication layer used for the staging and the execution of the VMs. The current implementation is built on top of the Secure Shell support (ssh).

• A virtual cluster start-time support able to dynamically configure the network topology and the routing policies on the physical nodes. The VC Control command relies on these features for VC start-up and shutdown.
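For illustration, a hypothetical command-line front end mirroring the three commands described above. This is a sketch, not the actual VirtuaLinux Python scripts: option names, defaults and the mapping-policy flag are invented, and the command bodies are stubs.

```python
import argparse

def vc_create(args):
    # Fill the database with VC-related static information (name, #VMs, VCPUs, subnet).
    print(f"creating VC {args.name}: {args.vms} VMs x {args.vcpus} VCPUs on {args.subnet}")

def vc_control(args):
    # Deploy, stop, suspend or resume a previously created VC according to a mapping policy.
    print(f"{args.action} VC {args.name} (mapping policy: {args.policy})")

def vc_destroy(args):
    # Purge the VC from the system and clean up the database.
    print(f"destroying VC {args.name}")

def main():
    p = argparse.ArgumentParser(prog="vc")
    sub = p.add_subparsers(dest="command", required=True)

    c = sub.add_parser("create")
    c.add_argument("name")
    c.add_argument("--vms", type=int, default=4)
    c.add_argument("--vcpus", type=int, default=1)
    c.add_argument("--subnet", default="10.0.0.0/24")
    c.set_defaults(func=vc_create)

    k = sub.add_parser("control")
    k.add_argument("name")
    k.add_argument("action", choices=["start", "stop", "suspend", "resume"])
    k.add_argument("--policy", default="block", help="virtual-to-physical mapping policy")
    k.set_defaults(func=vc_control)

    d = sub.add_parser("destroy")
    d.add_argument("name")
    d.set_defaults(func=vc_destroy)

    args = p.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```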

6 Experiments

The implementation of VirtuaLinux has been recently completed and tested. We present here some preliminary experiments. The experimental platform consists of a 4U-case Eurotech cluster hosting 4 high-density blades, each of them equipped with two dual-core AMD Opteron processors and 8 GBytes of memory. Each blade has 2 Giga-Ethernet NICs and one 10 Gbit/s Infiniband NIC (Mellanox InfiniBand HCA). The blades are connected with a (not managed) Infiniband switch. Experimental data has been collected on two installations of VirtuaLinux based on two different base Linux distributions:

• Ubuntu Edgy 6.10, Xen 3.0.1 VMM, Linux kernel 2.6.16 Dom0 (Ub-Dom0) and DomU (Ub-DomU);

• CentOS 4.4, no VMM, Linux kernel 2.6.9 (CentOS).

Page 27


[Figure: Intel MPI Benchmarks Sendrecv bandwidth (MBytes/s) vs message size (1 Byte to 4 MBytes) on 2 and 4 nodes, comparing Dom0_IB, Dom0_IPoIB, Dom0_GEth and DomU_IPoIB.]

Page 28

CPU & OS Performances

Micro-benchmark              Unit   Ub-Dom0    Ub-DomU    CentOS
Simple syscall               usec   0.6305     0.6789     0.0822
Simple open/close            usec   5.0326     4.9424     3.7018
Select on 500 tcp fd's       usec   37.0604    37.0811    75.5373
Signal handler overhead      usec   2.5141     2.6822     1.1841
Protection fault             usec   1.0880     1.2352     0.3145
Pipe latency                 usec   20.5622    12.5365    9.5663
Process fork+execve          usec   1211.4000  1092.2000  498.6364
float mul                    nsec   1.8400     1.8400     1.8200
float div                    nsec   8.0200     8.0300     9.6100
double mul                   nsec   1.8400     1.8400     1.8300
double div                   nsec   9.8800     9.8800     11.3300
RPC/udp latency localhost    usec   43.5155    29.9752    32.1614
RPC/tcp latency localhost    usec   55.0066    38.7324    40.8672
TCP/IP conn. to localhost    usec   73.7297    57.5417    55.9775
Pipe bandwidth               MB/s   592.3300   1448.7300  956.21

Micro-benchmark              Ub-Dom0 vs CentOS   Ub-DomU vs CentOS   Ub-DomU vs Ub-Dom0
Simple syscall               +667%               +726%               +7%
Simple open/close            +36%                +34%                -2%
Select on 500 tcp fd's       +51%                +51%                0%
Signal handler overhead      +112%               +127%               +7%
Protection fault             +246%               +293%               +13%
Pipe latency                 +115%               +31%                -40%
Process fork+execve          +143%               +119%               -10%
float mul                    ≈0%                 ≈0%                 ≈0%
float div                    ≈0%                 ≈0%                 ≈0%
double mul                   ≈0%                 ≈0%                 ≈0%
double div                   ≈0%                 ≈0%                 ≈0%
RPC/udp latency localhost    +35%                -7%                 -31%
RPC/tcp latency localhost    +35%                -5%                 -30%
TCP/IP conn. to localhost    +32%                +3%                 -22%
Pipe bandwidth               -38%                +51%                +144%

Table 2: VirtuaLinux: evaluation of ISA and OS performance with the LMbench micro-benchmark toolkit, and their differences (percentage).
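Each entry of the second table is the relative difference between two columns of the first. A short script reproduces two of the reported values as a spot check:

```python
def rel_diff(a: float, b: float) -> float:
    """Percentage difference of a with respect to b."""
    return 100.0 * (a - b) / b

# Simple syscall, Ub-Dom0 vs CentOS: reported +667%
print(round(rel_diff(0.6305, 0.0822)))   # 667

# Pipe bandwidth, Ub-Dom0 vs CentOS: reported -38%
print(round(rel_diff(592.33, 956.21)))   # -38
```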


Page 29

Storage Virtualization Performances

Additional layer on top of iSCSI          read   write   rewrite
none (reference: raw iSCSI access)        60     88      30
EVMS standard volume                      66     89      32
EVMS snap, fresh files                    63     88      31
EVMS snap, files existing on original     63     7       31

Table 1. Performance (MBytes/s) of the proposed storage abstraction layer. Results refer to Bonnie block read/write/rewrite benchmarks on an iSCSI-attached SAN.

literature (and with the CentOS installation, not reported here). Since native drivers bypass the VMM, virtualization introduces no overheads. These drivers are not currently used within the VM (DomU), as they cannot be used to deploy standard Linux services, which are based on the TCP/IP protocol. To this aim, VirtuaLinux provides the TCP/IP stack on top of the Infiniband network (through the IPoIB kernel module). Experiments show that this additional layer is a major source of overhead (irrespective of the virtualization layer): the TCP/IP stack on top of the 10 Gigabit Infiniband (Dom0 IPoIB) behaves as a 2 Gigabit network. The performance of a standard Gigabit network is given as a reference testbed (Dom0 GEth). Network performance is further slowed down by the user-domain driver decoupling, which requires data copies between front-end and back-end network drivers. As a result, as shown by the DomU IPoIB figures, VC virtual networks on top of a 10 Gigabit network exhibit Giga-Ethernet-like performance.

The third class of experiments evaluates the storage abstraction layer performance. Performance figures for block accesses (see footnote) are shown in Table 1. For standard volumes, the storage abstraction layer adds no overhead with respect to direct access to the iSCSI device (first vs second line). The small performance gain is probably due to the larger block size for storage accesses (the block size in the second case is 4 times that of the first case). As expected, writing fresh files on a snapshot has the same performance as writing on a standard volume (second vs third line). This case models the normal usage of virtual disks. On the contrary, due to the copy-on-write behaviour, a serious performance penalty is paid the first time a file existing in the original is overwritten (third vs fourth line). In this case, the linking between snapshot and original blocks on the disk must be broken before the data can be written on the snapshot (with the consequent operations on meta-data). Notice, however, that this penalty is paid just the first time, as the rewrite operation exhibits the same performance in the two cases: after the first write, the file on the snapshot becomes "fresh", i.e. fully independent from the original. Observe also that, in VC usage, snapshots differentiate at the first boot, thus the performance penalty is paid just once and has no impact after the first boot. In the proposed usage, just a small fraction of guest OS files differentiates at the first boot. The storage abstraction layer exhibits a similar behaviour for random accesses on small files: an observable overhead is introduced only at the first touch of files on the snapshots that are actually stored in the original. Eventually, the proposed solution does not change the EVMS snapshot creation speed, which is easily more than one order of magnitude faster than full data cloning (only meta-data is copied).

Footnote: Reads and writes are performed on a large set of 1 MByte files. The total size is larger than double the main memory size in order to minimize the impact of OS caching. The rewrite operation reads the file in blocks, modifies a byte per block, then overwrites the block.
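A minimal sketch of the rewrite access pattern described in the footnote (read a block, flip one byte, write the block back in place). The block size is an assumption, and this only reproduces the access pattern, not the Bonnie benchmark itself.

```python
def rewrite(path: str, block_size: int = 64 * 1024) -> None:
    """Read each block of an existing file, modify one byte, write it back in place."""
    with open(path, "r+b") as f:
        while True:
            pos = f.tell()
            block = f.read(block_size)
            if not block:
                break
            data = bytearray(block)
            data[0] ^= 0xFF           # modify one byte per block
            f.seek(pos)
            f.write(data)             # overwrite the block in place
            f.seek(pos + len(block))  # continue with the next block
```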

6. Related Work
The Cluster Virtualization idea appears in several contexts as an evolution of single-machine virtualization.

In the grid context, the idea evolved from the virtualization of the grid node, aiming at addressing several issues, such as the support for legacy applications, security against untrusted code and users, and computation deployment independent of site administration. Among the former proposals, Figueiredo et al. have proposed the full virtualization of nodes [9], whereas the Denali project has focused on supporting lightweight VMs [30]. The support for VCs and their management tends to be integrated in the grid middleware, as an example in the Globus Toolkit 4 [10]. The Xenoserver project [22] is building a geographically distributed infrastructure as an extension of the Xen VMM. The In-VIGO project [1] proposed a distributed Grid infrastructure based on VMs, while the Virtuoso [28] and Violin [24] projects explore relevant networking issues. In this context, VCs are composed of heterogeneous and geographically distributed virtual nodes. In a sense, they are more grids than clusters of virtual nodes. Therefore, the research is mainly focused on typical problems of grids, such as VM configuration generation, VM deployment, security, taming long network latencies, and instability of physical resources.

Also, VMware Lab Manager appears affine to a VC manager [29]. It enables the creation of a centralized pool of virtualized servers, storage and networking equipment shared across software development teams, to automatically and rapidly set up and tear down complex, multi-machine software configurations for use in development and test activities.

The possibility of using snapshots as virtual disks is mentioned in the LVM user guide [21], but only for a single platform. As discussed along the paper, even on a shared SAN, native LVM/EVMS snapshots can hardly be used "as is" due to scalability and security limitations. These limitations spring from the need to activate all snapshots on all physical nodes, independently of the virtual-to-physical mapping and the status of the VC.

A plethora of existing solutions supports disk-less clusters, typically via NFS. Some of them natively support disk-less clusters on top of iSCSI [26]. Almost all of them use explicit copies of node private data (either the full root directory or parts of it). These solutions greatly increase the complexity and the fragility of the installation, since some of the standard OS packages have to be reconfigured to write/read configuration and log files in different paths. VirtuaLinux differentiates itself from all of them since it enables the transparent and efficient usage of standard Linux distributions, and natively includes full support for Xen-based VCs. To the best of our knowledge, no Linux distribution currently achieves all the described goals.

7. Conclusions
We have presented a general characterization of a Virtual Cluster, i.e. a coherent aggregate of Virtual Machines sharing a local virtual network and a virtual storage. The Virtual Cluster is a natural evolution of the Virtual Machine idea; we have shown that a basic implementation of a system supporting Virtual Clusters can be derived from already existing tools and, notably, fairly independently from a particular VMM.

We have elaborated on the basic design to provide the Virtual Cluster implementation with an efficient storage on a reasonably general physical cluster architecture. To this aim we have proposed a storage abstraction layer specifically designed for homogeneous Virtual Clusters, which leverages the affinity of virtual disks within a Virtual Cluster. Data replicated on several virtual disks (in particular guest OS files) and mapped on the same physical disk are stored without duplication. This enables the fast creation of Virtual Clusters


Page 30

Scientific results

1. M. Aldinucci, M. Torquati, M. Vanneschi, M. Cacitti, A. Gervaso, P. Zuccato. VirtuaLinux design principles. Technical Report TR-07-13, Computer Science Dept., University of Pisa, June 2007.

2. M. Aldinucci, P. Zuccato. Virtual clusters with no single point of failure. Intl. Supercomputing Conference (ISC2007), poster session, Dresden, Germany, June 2007.

3. M. Aldinucci, M. Danelutto, M. Torquati, F. Polzella, G. Spinatelli, M. Vanneschi, A. Gervaso, M. Cacitti, and P. Zuccato. VirtuaLinux: virtualized high-density clusters with no single point of failure. In Proc. of the Intl. Conference ParCo 2007: Parallel Computing, Jülich, Germany, September 2007.

4. M. Aldinucci, M. Torquati, P. Zuccato, M. Vanneschi. The VirtuaLinux Storage Abstraction Layer for Efficient Virtual Clustering. Intl. Conference Euromicro PDP 2008: Parallel, Distributed and network-based Processing. Accepted, to appear in February 2008.

5. VirtuaLinux at Linux Day 2007 Meeting (invited talk, October 27th, 2007)

6. VirtuaLinux at NSS07 Conference (invited talk, November 27th, 2007)

Page 31

OpenSource

Page 32

Xen architecture

[Figure: Xen architecture. The hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE) is controlled by the Xen hypervisor (Virtual Machine Monitor), which exposes a control interface, a safe HW interface, event channels, virtual CPUs and a virtual MMU. Dom0 (VM0) runs a para-virtualized guest OS (e.g. Linux) hosting the device manager and control software, the native device drivers and the back-end device drivers. DomU VM1 and VM3 run para-virtualized guest OSes (e.g. Linux) with unmodified user software; DomU VM2 runs an unmodified guest OS (e.g. WinXP) with unmodified user software through the QEMU binary translator. The DomU VMs use front-end device drivers that talk to the back-end drivers in Dom0.]