job submission, ce and wms

10
EGEE-II INFSO-RI- 031688 Enabling Grids for E-sciencE www.eu-egee.org www.glite.org Job submission, CE and WMS Claudio Grandi (INFN & CERN) GDB meeting 31/8/2007

Upload: hasad-mckee

Post on 30-Dec-2015

18 views

Category:

Documents


1 download

DESCRIPTION

Job submission, CE and WMS. Claudio Grandi (INFN & CERN) GDB meeting 31/8/2007. WMS/LB: acceptance criteria. Before being taken in certification: 5 days consecutive run without intervention 10K jobs/day handled by a single instance of the WMS number of jobs in non-final state < 0.5% - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Job submission, CE and WMS

EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.orgwww.glite.org

Job submission, CE and WMS

Claudio Grandi(INFN & CERN)

GDB meeting31/8/2007

Page 2: Job submission, CE and WMS

GDB - 31 August 2007 2

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

WMS/LB: acceptance criteria

• Before being taken in certification:– 5 days consecutive run without intervention– 10K jobs/day handled by a single instance of the WMS– number of jobs in non-final state < 0.5%– Proxy renewal must work at the 98% level

• Note that the LB should be installed on a separate machine (that can serve multiple WMS instances at the same time)

Page 3: Job submission, CE and WMS

GDB - 31 August 2007 3

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

WMS/LB: status• Easter:

– run one week at 15000 jobs/day without manual intervention with 0.3% of jobs in non-final state

• Stress tests:– almost 27000 jobs/day

• Check-point patches: #1116 on 02/04/07 #1140 on 20/04/07 #1167 on 15/05/07 #1203 on 21/06/07 #1251 on 18/07/07

– Patch #1251 on the PPS Will be released to PS next Wednesday if no major issues found

• Issue: all that is built in the gLite 3.0 environment– The SL4/VDT 1.6.0 release (from ETICS) is being tested now

27000 jobs/day

Small number of jobsnot assigned to CEs

} Job in WMS

} Job on CE

Final states

Page 4: Job submission, CE and WMS

GDB - 31 August 2007 4

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

CE: acceptance criteria

• Before being taken in certification:– 5000 simultaneous jobs per CE node– Job failure rates due to CE in normal operation: < 0.5%– Job failures due to restart of CE services or CE reboot <0.5%.– For 2007 dress rehearsals

50 user/role/submission_node combinations (Condor_C instances) per CE node

5 days unattended running with performance on day 5 equivalent to that on day 1

– In the longer term 1 CE node should support an unlimited number of

user/role/submission node combinations, from at least 20 VOs• On the gLiteCE requires user switching done by glexec in blah

1 month unattended running without performance degradation

Page 5: Job submission, CE and WMS

GDB - 31 August 2007 5

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

CE: additional requirements

• Support for commonly used batch systems– High priority: LSF, PBS-Torque/Maui– Condor, SGE

• Support for commonly used submission systems– gLite WMS, Condor-G, pre-WS GRAM– In future BES. WS-GRAM?– Note that by construction not all the submission systems may be

implemented by the same product may need more than one product at each site...

• Support for services needed on the infrastructure– GIP to publish on the bdII information system– Accounting data to the GOC (APEL, DGAS?)

• Interoperability with our major partner projects (OSG)

Page 6: Job submission, CE and WMS

GDB - 31 August 2007 6

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

CE: status 1/2

• The gLiteCE and CREAM passed the accepatance tests– gLite CE (GSI-enabled Condor-C) on SL3 (tests @CERN)

Implements the identity change with glexec (called by BLAH) Up to 8000 jobs in a day submitted with 40 users with no failures https://edms.cern.ch/document/863275/1

– CREAM on SL3 - Tests going on on SL4 now (tests @CNAF) Uses the same BLAH of

the gLite CE > 90000 in 8 days with

50 users and 111 failures (LSF errors)

https://edms.cern.ch/document/863276/1

~ 6 K jobs on the CE at any time

}

Load phase:

~ 1 K jobs/hour

CREAM

Run phase:

~10 K jobs/day

Page 7: Job submission, CE and WMS

GDB - 31 August 2007 7

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

CE: status 2/2

• LCG-CE has been ported to SL4 / VDT 1.6.0 – GT4 pre-WS GRAM (tests @CERN)

Uses the globus call-out implemented in GT4 and the new LCAS/LCMAPS that was developed for the gLiteCE

Does not scale to the level of the acceptance tests• ok for single user, but 21% success rate with 9000 jobs/day

submitted by 15 users and high load on the machine https://twiki.cern.ch/twiki/bin/view/LCG/LCGCETest

Page 8: Job submission, CE and WMS

GDB - 31 August 2007 8

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

CE: next steps• Assuming that the project cannot support more than

one CE in the long term, the TCG decided that:– The first priority is to deploy the LCG-CE on SL4 as it is

On the PPS in early September, on the PS by end of September

– Full develop CREAM and make it available on the production infrastructure Certification by December, in production in February-March 2008 https://twiki.cern.ch/twiki/bin/view/EGEE/CECheckList

– Sites supporting applications that use native globus interface for job submission will continue to deploy the LCG-CE The development of a new node type that combines the LCG-CE

and CREAM is in discussion and will be addressed after the release of CREAM in production

Ask VOs to switch to other supported interfaces

– No effort will be spent on the gLite CE We will continue to accept requirements on BLAH and glexec from

Condor that will be included in the RedHat and Fedora distributions– TCG: https://twiki.cern.ch/twiki/bin/viewfile/EGEE/TCGHome?rev=1;filename=CE-proposal.doc

Page 9: Job submission, CE and WMS

GDB - 31 August 2007 9

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

CE: summary

LCG-CE (GT4 pre-WS GRAM)– Globus client, Condor-G– Globus code not modified– Not passing acceptance tests

Condor-C (was gLite CE)– Condor-G– Maintained by Condor– VDT includes BLAH & glexec

CREAM (WS-I)– CREAM and BES client, ICE,

Condor-G (prototype)

Condor-G

Globusclient

gLite WMS

User

CREAMCEMon

ICE

CREAM orBES client

EGEE authZ,InfoSys,

AccountingIn production

Existing prototype

gLitecomponent

non-gLitecomponent

BatchSystem

LCG-CE(GT4 +

add-ons)Condor-C

BLAH

User / Resource

UI

Sit

e

GIP

Page 10: Job submission, CE and WMS

GDB - 31 August 2007 10

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Job Prioritization

• TCG-WG active since more than one year– Focus is on a short term solution that support access to batch

system shares based on the user credentials (VOMS)

• Situation summarized in a note (Andrea and Simone):– http://cern.ch/egee-intranet/NA1/TCG/wgs/Job%20Priorities%20Implementation%20Plan.doc

– Proposal: the VOView Access Control Base Rule (ACBR) is published in a way that reflects the behaviour of LCAS/LCMAPS VOViews are mutually exclusive (use DENY tags)

• Required modifications to the match maker (WMS), the Generic Information Provider (GIP) and YAIM– Implementation is done but a few details still prevent a correct

configuration of the GIP

• In the longer term a consistent way of treating authorization in middleware components is needed – Authorization service that will also be able to handle policies– Better mechanisms on the CE to map users to batch shares