Transcript
Page 1: Preliminary tests with  co-scheduling and the  Condor parallel universe

www.eu-etics.org

ETICS All Hands meeting
Bologna, October 23-25, 2006

Preliminary tests with co-scheduling and the Condor parallel universe

Marian ZUREK for WP2

Page 2: Preliminary tests with  co-scheduling and the  Condor parallel universe

Bologna -- All Hands Meeting 2

What’s the …

• Context
• Use case
• Past
• Condor / NMI setup
• Results
• gLite-specific issues
• Next steps
• Discussion

Page 3: Preliminary tests with  co-scheduling and the  Condor parallel universe


Context

• Testing of the gLite software stack is highly manual; this motivated automating the process and making it easy to reproduce

• In the future, the system tests should become part of the release process (reports stored in the DB, easily accessible for trend creation, performance analysis, bug reproduction, etc.)

Page 4: Preliminary tests with  co-scheduling and the  Condor parallel universe


Service Overview

[Diagram: ETICS service overview. Labelled components: Clients (via browser, command-line tools), Web Application, Web Service, Project DB, Report DB, Build/Test Artefacts, NMI Client Wrapper, NMI Scheduler, WNs, ETICS Infrastructure, Continuous Builds]

Page 5: Preliminary tests with  co-scheduling and the  Condor parallel universe


Use case for gLite

• We have to deploy six services on six different nodes: UI, CE, WMSLB, VOMS_mysql, RGMA, WN

• There are interdependencies between them:
– UI: [RGMA, VOMS_mysql, WMSLB, CE, WN]
– CE: [WMSLB, WN]
– WMSLB: [VOMS_mysql]
– VOMS_mysql: [RGMA]
– WN: [CE]
– RGMA: []

• No auto-discovery is possible: the order of service startup must be preserved and the run-time environment defined

• A successful “real job” submission requires all the services to be operational
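A minimal sketch of how a startup order could be derived from such a dependency map. The function name startup_order is hypothetical and this is not the ETICS/NMI implementation. Note that, as listed on this slide, CE and WN reference each other, so no strict ordering exists for the full six-service list; the example therefore uses the five-service map from flow-spec.yaml and reports a cycle as an error.

```python
def startup_order(deps):
    """Order services so that each one starts after all of its dependencies.

    `deps` maps a service name to the list of services it depends on.
    Raises ValueError if the remaining services cannot be ordered
    (a dependency cycle, or a dependency that is not itself a key).
    """
    remaining = {svc: set(needed) for svc, needed in deps.items()}
    order = []
    while remaining:
        # Services whose dependencies have all been started already.
        ready = sorted(svc for svc, needed in remaining.items() if not needed)
        if not ready:
            raise ValueError("cannot order: " + ", ".join(sorted(remaining)))
        for svc in ready:
            order.append(svc)
            del remaining[svc]
        for needed in remaining.values():
            needed.difference_update(ready)
    return order


# Dependency map as given in flow-spec.yaml (WN is not part of it).
FLOW_DEPS = {
    "RGMA": [],
    "VOMS_mysql": ["RGMA"],
    "WMSLB": ["VOMS_mysql"],
    "CE": ["WMSLB"],
    "UI": ["RGMA", "VOMS_mysql", "WMSLB", "CE"],
}

print(startup_order(FLOW_DEPS))
# ['RGMA', 'VOMS_mysql', 'WMSLB', 'CE', 'UI']
```

Adding CE: [WMSLB, WN] and WN: [CE] to the map makes the function raise ValueError, which is one way to surface the mutual CE/WN dependency before any node is claimed.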

Page 6: Preliminary tests with  co-scheduling and the  Condor parallel universe


Back to the past

• The gLite software stack requires administration rights on the target node, so a root-enabled schema has been developed to address this

• Root-enabled jobs should be performed only on predefined sets of hosts

• The service installation should reflect its operational status by writing to a file, e.g. /etc/nmi/publish_services.list:
– runs_VOMS_server="true", timeOut=3600
– runs_RGMA_server="true"

• The timeOut (expressed in seconds) defines the service operational time. After the timeOut the node will be released. Absence of a timeOut means that the machine is released immediately after the job has finished.
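A small sketch of how the two example entries could be read. The slide does not specify the file grammar, so this assumes one entry per line with comma-separated key=value pairs; parse_entry and release_delay are hypothetical names, not part of NMI.

```python
def parse_entry(line):
    """Split one entry such as 'runs_VOMS_server="true", timeOut=3600'
    into a dict of string values (surrounding double quotes removed)."""
    fields = {}
    for part in line.split(","):
        key, _, value = part.strip().partition("=")
        fields[key.strip()] = value.strip().strip('"')
    return fields


def release_delay(fields):
    """Seconds the node stays reserved after the job ends.

    A missing timeOut means the node is released immediately (0 seconds).
    """
    return int(fields.get("timeOut", 0))


voms = parse_entry('runs_VOMS_server="true", timeOut=3600')
rgma = parse_entry('runs_RGMA_server="true"')
print(release_delay(voms), release_delay(rgma))
```

With the two entries from the slide, the VOMS node would stay reserved for 3600 seconds and the RGMA node would be released as soon as its job finishes.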

Page 7: Preliminary tests with  co-scheduling and the  Condor parallel universe


Condor / NMI setup

• Experiment on a predefined set of nodes
– special STARTD expressions for defining the Condor VMx availability
– nodes still available for regular submissions
– synchronisation using Condor-chirp messages

• Custom (outside NMI/Condor) scratching mechanism:
– watchdog style (an outside process monitoring the node’s “limbo” state)
– initial trouble with lost/stuck jobs resolved with extra wait time
– node down-time < 10 mins
– very good candidate for virtualisation, as no re-installation is needed (a simple VM restart is enough)
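The watchdog idea can be illustrated roughly as follows. This is only a sketch under assumptions: the state names, the probe and scratch callables, and the grace_polls parameter are all hypothetical, and the actual mechanism used in the tests is not described on the slide. The grace period stands in for the “extra wait time” that avoided acting on jobs that were merely slow.

```python
import time


def watch_node(probe, scratch, grace_polls=3, max_polls=100, interval=0.0):
    """Poll a node and scratch it once it has stayed in "limbo" long enough.

    probe()   -> returns the node state as a string (hypothetical: "busy", "limbo")
    scratch() -> resets the node (e.g. a VM restart); called at most once
    Returns True if the node was scratched, False if max_polls was reached.
    """
    streak = 0  # consecutive polls that reported "limbo"
    for _ in range(max_polls):
        if probe() == "limbo":
            streak += 1
        else:
            streak = 0  # the node recovered or picked up a job; start over
        if streak >= grace_polls:
            scratch()
            return True
        time.sleep(interval)
    return False


# Usage with canned states instead of a real node.
states = iter(["busy", "limbo", "limbo", "busy", "limbo", "limbo", "limbo"])
scratched = []
result = watch_node(lambda: next(states), lambda: scratched.append("node"),
                    grace_polls=3, max_polls=7)
print(result, scratched)
```

Requiring several consecutive “limbo” readings before scratching trades a slightly longer down-time for fewer lost jobs.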

Page 8: Preliminary tests with  co-scheduling and the  Condor parallel universe


Results

• Using NMI and the Condor parallel universe we were able to address the scenario described above

• The delays were minimal, and the experimental timeOuts were adjusted for optimal performance

• The developed code can be consulted in CVS, module: org.etics.nmi.system-tests

• Non-conditional persistency: the node on which the service runs remains operational for a predefined period of time
– sleep appended to the code
– expiry-time communication via the NMI/Hawkeye module

Page 9: Preliminary tests with  co-scheduling and the  Condor parallel universe


Results

• Conditional persistency: the node should be frozen in case the job fails (not implemented yet, but easy)

• Failure propagation: should one of the parallel tests fail, the whole job flow is immediately aborted

• The set of parallel job nodes exits immediately when the node_0 job exits (let node_0 be the “last” in the chain)

• Output format definition is up to the submitter
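The failure-propagation behaviour can be modelled locally as below. This is a sketch with threads standing in for the parallel job nodes; it is not how Condor implements the parallel universe, and run_flow and the cooperative abort event are assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_EXCEPTION


def run_flow(tests):
    """Run all tests in parallel and abort the whole flow on the first failure.

    `tests` maps a name to a callable taking an `abort` event; a test fails by
    raising an exception and should return early once `abort` is set.
    Returns (ok, failed_names).
    """
    abort = threading.Event()
    with ThreadPoolExecutor(max_workers=len(tests)) as pool:
        futures = {pool.submit(fn, abort): name for name, fn in tests.items()}
        # Returns as soon as one test raises, or when all have finished.
        done, _pending = wait(futures, return_when=FIRST_EXCEPTION)
        failed = sorted(futures[f] for f in done if f.exception() is not None)
        if failed:
            abort.set()  # ask the tests that are still running to stop
    return (not failed, failed)


def passing(abort):
    return "ok"


def failing(abort):
    raise RuntimeError("service did not start")


def long_running(abort):
    abort.wait(timeout=5)  # stops early when the flow is aborted
    return "stopped"


print(run_flow({"RGMA": passing, "CE": failing, "UI": long_running}))
```

Here the failure of the CE test sets the abort event, so the UI test returns without waiting out its full five seconds.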

Page 10: Preliminary tests with  co-scheduling and the  Condor parallel universe


Results

• Context, name spaces: assured thanks to the Condor/NMI design

• A tester may want to use their own (external) service instance, e.g. VOMS_server: possible, but reproducibility is not guaranteed

• Multi-site / across-firewall tests: possible (see Andy’s talk)

• Is the test job different from a standard build submission? Not from the WP2 point of view

• Proposal of the YAML format for the dependency definitions (see flow-spec.yaml)

Page 11: Preliminary tests with  co-scheduling and the  Condor parallel universe


flow-spec.yaml

# First, a list of all jobs.
- RGMA
- VOMS_mysql
- WMSLB
- CE
- UI
---
# Now, mapping the job name to its script.
RGMA: RGMA.sh
VOMS_mysql: VOMS_mysql.sh
WMSLB: WMSLB.sh
CE: CE.sh
UI: UI.sh
---
# Now, a hash mapping each job to its dependencies at the nodeid discovery
# stage.
RGMA: []
VOMS_mysql: [RGMA]
WMSLB: [VOMS_mysql]
CE: [WMSLB]
UI: [RGMA, VOMS_mysql, WMSLB, CE]
---
# Timeouts for nodeid discovery stage.
RGMA: 0
VOMS_mysql: 10
WMSLB: 15
CE: 25
UI: 35
---
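A possible consistency check for the four sections of flow-spec.yaml, written against plain Python structures so that no YAML library is assumed. check_flow_spec is a hypothetical helper, not part of the org.etics.nmi.system-tests module.

```python
def check_flow_spec(jobs, scripts, deps, timeouts):
    """Return a list of problems; an empty list means the sections agree."""
    problems = []
    known = set(jobs)
    sections = (("scripts", scripts), ("dependencies", deps), ("timeouts", timeouts))
    for section, mapping in sections:
        for job in jobs:
            if job not in mapping:
                problems.append(f"{job}: missing from {section}")
        for job in mapping:
            if job not in known:
                problems.append(f"{job}: in {section} but not in the job list")
    for job, needed in deps.items():
        for dep in needed:
            if dep not in known:
                problems.append(f"{job}: depends on unknown job {dep}")
    return problems


# The four sections as they appear on the slide.
JOBS = ["RGMA", "VOMS_mysql", "WMSLB", "CE", "UI"]
SCRIPTS = {job: job + ".sh" for job in JOBS}
DEPS = {
    "RGMA": [],
    "VOMS_mysql": ["RGMA"],
    "WMSLB": ["VOMS_mysql"],
    "CE": ["WMSLB"],
    "UI": ["RGMA", "VOMS_mysql", "WMSLB", "CE"],
}
TIMEOUTS = {"RGMA": 0, "VOMS_mysql": 10, "WMSLB": 15, "CE": 25, "UI": 35}

print(check_flow_spec(JOBS, SCRIPTS, DEPS, TIMEOUTS))
```

A check of this kind would catch, for example, a dependency on WN, which is listed in the use case but has no entry in this file.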

Page 12: Preliminary tests with  co-scheduling and the  Condor parallel universe


gLite/general issues

• Do we adopt the YAML format?

• Do we need to create temporary CA servers, or do we expect this from the testers/code submitters?
– pass-phrase problem

• Do we write the site-info.def file upfront, or do we assume future auto-discovery?

Page 13: Preliminary tests with  co-scheduling and the  Condor parallel universe


Next steps

• Virtualisation using the WoD (WindowsOnDemand) service
– Initial assessment very positive
– Customized installation à la ETICS WN
– Candidate for the “freeze” scenario: one can programmatically export/import the VM
– Free as of today, paid in the future (should we run a dedicated/private server?)

• Virtualisation using VMware
– Base installation (Alberto can say much more)
– API exists

• Virtualisation with Condor: see Andy’s talk

Page 14: Preliminary tests with  co-scheduling and the  Condor parallel universe


Next steps

• Demo for the PM12 (review)?

Page 15: Preliminary tests with  co-scheduling and the  Condor parallel universe


Discussion

• Q & A

