Scalability Testing of Kadeploy using Virtual Machines on Grid’5000

Luc Sarzyniec, Sébastien Badia, Emmanuel Jeanvoine, Lucas Nussbaum

Grid’5000

Scalability testing of Kadeploy on Grid’5000 1 / 10

Kadeploy – OS provisioning for clusters

- Used by sysadmins to install/reinstall compute nodes
- Designed for scalability
  - That matters: faster reinstallation → shorter downtime
- Built on top of PXE, DHCP, TFTP (or HTTP)
- Support of a broad range of systems (Linux, Xen, *BSD, etc.)
- Manages catalog of images and user permissions
- Open Source (GPL)

http://kadeploy3.gforge.inria.fr/

Process overview

[Figure: Kadeploy, DHCP, TFTP/HTTP; arrow (2): triggers reboot using IPMI or SSH]

1. Kadeploy configures PXE profiles
2. Kadeploy triggers reboot using IPMI or SSH
3. Nodes boot to minimal deployment system sent over the network
4. Kadeploy configures nodes and sends system image
5. Kadeploy configures PXE profiles again and triggers reboot
6. Nodes boot to newly installed system
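The six steps above can be written down as an ordered table of (actor, action) pairs. This is only an illustrative sketch: `STEPS`, `run_workflow` and `reboots_per_deployment` are hypothetical names, not part of Kadeploy, and the code models the sequence rather than performing any deployment.

```python
# Hypothetical model of the six-step workflow listed on the slide.
# Names and structure are illustrative only, not Kadeploy's API.
STEPS = [
    ("server", "configure PXE profiles"),
    ("server", "trigger reboot using IPMI or SSH"),
    ("node", "boot minimal deployment system sent over the network"),
    ("server", "configure nodes and send system image"),
    ("server", "configure PXE profiles again and trigger reboot"),
    ("node", "boot newly installed system"),
]


def run_workflow(node_name):
    """Return one log line per step, in order, for a single node."""
    return [
        f"{index}. [{actor}] {node_name}: {action}"
        for index, (actor, action) in enumerate(STEPS, start=1)
    ]


def reboots_per_deployment():
    """Count the steps in which the server triggers a reboot."""
    return sum(1 for _actor, action in STEPS if "reboot" in action)


if __name__ == "__main__":
    for line in run_workflow("node-1"):
        print(line)
    print("reboots per deployment:", reboots_per_deployment())
```

Each deployment therefore goes through two network boots per node, which is why the reboot phases matter for scalability.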

Scalable remote command execution with TakTuk

Sequential?

Sequential + sliding window (pdsh-like)?

In Kadeploy: tree-based → logarithmic complexity (vs. linear)

- Using TakTuk – http://taktuk.gforge.inria.fr/
- HPDC’2009 paper: B. Claudel, G. Huard and O. Richard. TakTuk, Adaptive Deployment of Remote Executions.
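To make the linear-versus-logarithmic contrast concrete, the sketch below counts communication rounds for the three strategies named on the slide. It is an idealized model under stated assumptions, not TakTuk's actual adaptive algorithm: every contact takes one round, the window size of 32 is arbitrary, and in the tree case every already-reached node contacts `fanout` new nodes per round.

```python
import math


def sequential_rounds(n):
    """One node contacted per round: linear in n."""
    return n


def windowed_rounds(n, window):
    """Sliding window (pdsh-like): `window` nodes contacted per round."""
    return math.ceil(n / window)


def tree_rounds(n, fanout):
    """Idealized tree: each reached node contacts `fanout` new nodes per round.

    The root (the server) counts as reached at round 0, so the number of
    reached hosts is multiplied by (fanout + 1) at every round.
    """
    reached, rounds = 1, 0
    while reached - 1 < n:
        reached += reached * fanout
        rounds += 1
    return rounds


if __name__ == "__main__":
    n = 3999  # number of virtual nodes in the experiment
    print("sequential:", sequential_rounds(n))
    print("window of 32 (assumed):", windowed_rounds(n, 32))
    print("tree, fanout 2 (assumed):", tree_rounds(n, 2))
```

Under these assumptions, 3999 nodes need 3999 rounds sequentially, 125 rounds with a window of 32, and 8 rounds with a fanout of 2. The window only divides the linear cost by a constant, while the tree grows with the logarithm of the node count.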

Broadcast of system images

[Figure: images server and client nodes]

Send from server node to every client?

Use P2P?

In Kadeploy: topology-aware chained broadcast

- Limiting factor: backplane bandwidth of switches
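A rough sketch of the idea behind a topology-aware chain: order the nodes so that neighbours in the chain sit on the same switch, then forward the image from one node to the next in a pipeline. Everything here is an assumption for illustration (the function names, the `(name, switch)` input format, and the simple timing model, which ignores latency, disk speed and switch backplane limits); it is not Kadeploy's implementation.

```python
def build_chain(nodes):
    """Order (name, switch) pairs so nodes on the same switch are adjacent."""
    ordered = sorted(nodes, key=lambda item: (item[1], item[0]))
    return [name for name, _switch in ordered]


def inter_switch_hops(chain, switch_of):
    """Count chain links whose two endpoints are on different switches."""
    return sum(
        1 for a, b in zip(chain, chain[1:]) if switch_of[a] != switch_of[b]
    )


def star_time(n, size, bandwidth):
    """Server sends the full image to each of n clients over one link."""
    return n * size / bandwidth


def chain_time(n, size, bandwidth, chunk):
    """Pipelined chain: the last node finishes after the image itself
    plus (n - 1) chunk-forwarding delays."""
    return (size + (n - 1) * chunk) / bandwidth


if __name__ == "__main__":
    nodes = [("a1", "sw-a"), ("b1", "sw-b"), ("a2", "sw-a"), ("b2", "sw-b")]
    switch_of = dict(nodes)
    naive = [name for name, _ in nodes]
    chain = build_chain(nodes)
    print("naive order hops:", inter_switch_hops(naive, switch_of))
    print("topology-aware hops:", inter_switch_hops(chain, switch_of))
    # Assumed figures: 100 nodes, 1000 MB image, 100 MB/s links, 1 MB chunks.
    print("star:", star_time(100, 1000, 100), "s")
    print("chain:", chain_time(100, 1000, 100, 1), "s")
```

In this toy model the chain's total time is almost independent of the number of nodes, and sorting by switch keeps most transfers inside a switch, so the traffic crossing between switches stays small.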

Testing the scalability of Kadeploy

- Rather specific requirements
  - Many reinstallable nodes (infrastructure + deployed nodes)
  - DHCP server
- Testbed: Grid’5000 - http://www.grid5000.fr/
  - Testbed for research on distributed systems: HPC, Grids, P2P, Cloud
  - 10 sites, 25 clusters, 1300 nodes, 7400 cores
  - Unique features including:
    - Hardware-as-a-Service Cloud: redeployment of OS on the bare metal by users (using Kadeploy)
    - Dedicated backbone network
    - KaVLAN: network isolation
- Still not enough nodes → virtual machines (KVM) on all nodes

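[Figure: map of Grid’5000 sites (scale: 800 km) connected by an isolated L2 network. P = physical nodes, V = virtual machines; 3-18 VMs per node.]

Sites used:

- Lille (P: 60, V: 428)
- Nancy (P: 336, V: 2160)
- Rennes (P: 102, V: 790)
- Sophia (P: 137, V: 661)

Other sites shown: Bordeaux, Grenoble, Luxembourg, Lyon, Reims, Toulouse

Totals:

- Physical: 635
- Virtual: 3999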

Experimental process (fully automated)

1. Virtual testbed preparation
   - Reserve and reinstall all nodes → 20 mins
   - Prepare 33 infrastructure nodes and 635 VM-hosting nodes; configure everything; start virtual machines → 20 mins
2. One or more Kadeploy runs
   - e.g. 3999 virtual nodes (3838 successful) → 57 mins
   - Hotspots:
     - First reboot: 11 mins
     - Broadcast: 15 mins
     - Second reboot: 7 mins
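The figures reported for this example run can be combined into a success rate and a hotspot share. The snippet below is plain arithmetic on the numbers from the slide; only the variable names are introduced here.

```python
# Figures taken from the slide for the example Kadeploy run.
total_nodes = 3999
successful = 3838
run_minutes = 57
hotspots = {"first reboot": 11, "broadcast": 15, "second reboot": 7}

failed = total_nodes - successful
success_rate = successful / total_nodes
hotspot_minutes = sum(hotspots.values())
hotspot_share = hotspot_minutes / run_minutes

if __name__ == "__main__":
    print(f"failed nodes: {failed}")
    print(f"success rate: {success_rate:.1%}")
    print(f"hotspots: {hotspot_minutes} of {run_minutes} mins "
          f"({hotspot_share:.0%})")
```

So 161 virtual nodes failed, roughly 96% deployed successfully, and the three listed hotspots account for 33 of the 57 minutes.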

Limits to scalability and future work

Two major limiting factors:

- Nodes reboot
  - Relies on unreliable protocols: DHCP, TFTP (HTTP if iPXE)
  - Mitigated in Kadeploy by using reboot windows
- Remote command execution and broadcast of system image
  - Heavily stresses the network → ARP and TCP timeouts
  - Dynamic TakTuk tree → more ARP needed
  - Large Cloud infrastructures use per-rack L2 networks
- Future work:
  - Robustify ARP and TCP (iPXE + kernel tuning)
  - Improve fault tolerance of image broadcast
  - InfiniBand support
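The slide says only that reboot windows are used to mitigate the load on DHCP and TFTP; it does not say how Kadeploy sizes or schedules them. As a hedged illustration of the general idea, the sketch below reboots nodes in batches of bounded size so that only a limited number hit the boot servers at once. `reboot_windows`, `reboot_all` and the `reboot` callback are hypothetical names.

```python
def reboot_windows(nodes, window_size):
    """Yield successive batches of at most `window_size` nodes."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    for start in range(0, len(nodes), window_size):
        yield nodes[start:start + window_size]


def reboot_all(nodes, window_size, reboot):
    """Reboot nodes batch by batch and return the number of batches.

    `reboot` is a caller-supplied callable (hypothetical); a real tool
    would also wait for each batch to get past DHCP/TFTP before
    starting the next one.
    """
    batches = 0
    for batch in reboot_windows(nodes, window_size):
        for node in batch:
            reboot(node)
        batches += 1
    return batches


if __name__ == "__main__":
    nodes = [f"vm-{i}" for i in range(10)]
    rebooted = []
    count = reboot_all(nodes, 4, rebooted.append)
    print(count, "batches;", len(rebooted), "nodes rebooted")
```

A smaller window reduces the peak load on the DHCP and TFTP servers at the cost of a longer overall reboot phase.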

Conclusions

- Tested the scalability of the Kadeploy OS provisioning solution
  - Critical service in cluster environments
- Configured a Cloud of KVM virtual machines:
  - Using our own VM management scripts
  - Of 3999 virtual machines
  - On 668 physical machines
  - From 4 sites of the Grid’5000 testbed
  - In an L2 network spanning 1000 km
- Reinstalled those virtual machines using Kadeploy
  - < 1 hour for 3838 machines successfully installed
- Fully automated process; no special Grid’5000 privileges required
- Identified several bottlenecks and ideas for future work
