
Page 1: CERN Deployment Area Status Report Ian Bird LCG, IT Division SC2 Meeting 8 August 2003

CERN

Deployment Area Status Report

Ian Bird, LCG, IT Division

SC2 Meeting, 8 August 2003

Page 2: Overview

• LCG-1 Status
• Middleware
• Deployment and architecture
• Security
• Operations & Support
• Milestone analysis
• Resources
• Plans for remainder of 2003

Page 3: LCG-1 Status

Two activities in parallel:

Middleware release:
• We have a candidate LCG-1 release!
• Quite robust compared with previous EDG releases
• Certification team & EDG Loose Cannons currently running tests
• Some problems (broken new functionality)
• No major showstoppers
• Currently trying to stress the system to find its limitations – it has not failed disastrously, but rather degrades in performance (RB)

Deployment:
• Deployment of the previous tag has started to 10 Tier 1 sites
• Going very slowly – only 5 up, slow responses; expect 8, dubious about 2 sites
• Want to start pushing out the LCG-1 release this week
• This will be an upgrade
• Hope experiments can start controlled testing before the end of August

Page 4: LCG-1 Release Candidate – Contents and Status

Page 5: LCG-1 Contents

• Based on VDT 1.1.8-9

  – 2 changes from LCG – a gridftp bug fix and a fix for GRIS instability – not yet in the official VDT
  – Soon will move to VDT 1.1.9 (or 1.1.10), which will be a converged US and LCG VDT
• EDG components
  – WP1 (Resource Broker)
  – WP2 – Replica Management tools, including the EDG Replica Location Service (RLS)
  – WP4 – gatekeeper, and the LCFG installation tool
• Information Services (see below)
  – MDS with EDG/LCG improvements and bug fixes
  – GLUE Schema v1.1 with extensions from LCG; information providers from WP3, WP4, WP5
• Storage access
  – "Classical" EDG Storage Element: disk pools accessed via gridFTP; tested in conjunction with the Replica Management tools
  – Will add access to Castor and Enstore via gridFTP in the next few weeks, once the system is deployed

Page 6: LCG-1 Contents – 2

• VDT 1.1.8-9
  – Excellent collaboration – very responsive to our needs, fast turnaround
• LCG contributions to VDT and EDG
  – Added accounting and security auditing (the connection between the process, the batch system and the submitter)
  – Fixed the gatekeeper log file growing infinitely by implementing a rotating-log scheme, avoiding gatekeeper restarts (sketched below)
  – Incorporated MDS bug fixes from NorduGrid, improving the timeout handling at the same time; this allows LCG to deploy MDS on a bigger scale; found and fixed bugs in GRIS
  – New LCG versions of the job managers that do not need shared home directories
    • Issue raised by deploying LCG-0
    • Solves scalability issues that prevented use of more than a few tens of worker nodes
  – Fixed a gass-cache inode leak: "dead" inodes were never removed, filling up the disk and eventually causing the service to crash

It was a significant effort to put all of this together in a coherent way.
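A minimal sketch of the copy-and-truncate idea behind the rotating-log fix above: the live file is copied aside and truncated in place, so the gatekeeper keeps its open file descriptor and never needs a restart. The path, size threshold and schedule are illustrative assumptions, not the actual LCG implementation.

```python
import os
import shutil
import time

LOG = "/var/log/globus-gatekeeper.log"   # illustrative path
MAX_BYTES = 100 * 1024 * 1024            # rotate above ~100 MB (assumed threshold)
KEEP = 5                                 # rotated copies to keep

def rotate(log=LOG):
    """Copy-and-truncate rotation: the daemon's open file descriptor
    stays valid, so no gatekeeper restart is needed."""
    if not os.path.exists(log) or os.path.getsize(log) < MAX_BYTES:
        return
    # Shift old copies (log.1 -> log.2, ...), dropping the oldest.
    for i in range(KEEP - 1, 0, -1):
        src = f"{log}.{i}"
        if os.path.exists(src):
            os.rename(src, f"{log}.{i + 1}")
    shutil.copy2(log, f"{log}.1")        # snapshot the current contents
    with open(log, "r+") as f:           # truncate in place; fd unchanged
        f.truncate(0)

if __name__ == "__main__":
    while True:                          # e.g. run as a simple daemon
        rotate()
        time.sleep(3600)                 # check hourly (assumed schedule)
```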

Page 7: Certification

• Tests so far have been on the LCG certification testbed
  – Local set of clusters – no WAN
  – There has been no time yet for true WAN tests
• Testing is being done by LCG and the Loose Cannons
  – We now have a matrix of tests as a baseline – we will not accept changes that break these tests
  – Will do regression testing against future changes
  – Push the baseline acceptance tests upwards – more stringent as we evolve
• Certification testbed
  – Intended to reproduce and test the actual deployed configuration
    • Test the information system architecture
    • Test various installation options
    • Test various batch systems and jobmanagers
    • Middleware functionality and robustness
  – Expanding now to Wisconsin and Hungary, to include also CNAF, FNAL and Moscow (although Moscow does not have sufficient resources at the moment)
  – Stress the system and determine its limits in parallel with the deployed LCG-1

Page 8: Certification Test Bed Architecture

[Diagram: the certification testbed layout – six clusters (Cluster_1 … Cluster_6) of CEs, SEs and worker nodes (CE_a/CE_b … CE_6, SE_a … SE_4, WN_a1 … WN_5); resource brokers and BDIIs (RB_a, RB_b, RB_3; BDII_a, BDII_b, BDII_3); MDS instances (MDS_a, MDS_b, MDS_3_a, MDS_3_b); RLS back ends on MySQL and Oracle; UIs (UI_1, UI_4); a proxy node; LSF and Condor batch systems; LCFGng "Lite" install.]

Page 9: Testing

• Progress in the last few weeks, with new Russian people involved
• We have the following tests, to define when LCG-1 can be deployed:
  – Sequential & parallel job submission (RB functionality)
  – Job storms (parametrizable; see the sketch after this list)
    • normal
    • replica manager
    • copy (gridFTP)
    • checksum (big files through the sandbox, with verification)
  – MDS testing suite
  – Replica Manager (simulates Monte Carlo production)
  – Globus functionality tests (through the VDT test suite)
  – Service node functionality tests
    • MDS x BDII coherence tests
    • LRC x RMC coherence tests
• Many of these are based on the EDG "stress tests" of ATLAS and CMS
• We still need to define a "site verification test suite", to be run as validation of a site installation before the site connects to the grid
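To make the job-storm idea concrete, here is a sketch of a storm driver that submits N jobs in S parallel streams and counts failures. The edg-job-submit invocation and the minimal JDL below are plausible reconstructions of the EDG user interface, not the actual LCG test suite.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Minimal JDL for a "normal" storm job; the real storms also exercised
# the replica manager, gridFTP copies and sandbox checksums.
JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

def submit_one(_):
    """Submit a single job via the EDG UI command (assumed form);
    returns True on a successful submission."""
    with tempfile.NamedTemporaryFile("w", suffix=".jdl") as f:
        f.write(JDL)
        f.flush()
        result = subprocess.run(["edg-job-submit", f.name],
                                capture_output=True, text=True)
    return result.returncode == 0

def storm(n_jobs=1000, streams=20):
    """Run a parametrizable job storm: n_jobs spread over parallel streams."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        results = list(pool.map(submit_one, range(n_jobs)))
    print(f"{n_jobs} jobs in {streams} streams: {results.count(False)} failed")

if __name__ == "__main__":
    storm()   # 1000 jobs / 20 streams matches the RB test on the next slide
```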

Page 10: LCG-1 status

• Resource Broker:

  – 1000 jobs in 20 parallel streams – 3 failed (1 in submission, which would normally work with auto-retry; 2 for local, non-grid reasons)
  – 500 long jobs (such that the proxy expired) in a single stream
    • 10 failed, but this was due to 1 job that put Condor-G in a strange state – under investigation, but hard to reproduce; subsequent runs did not fail
  – New functionality (output data file) created bad JDL
  – The RB can use all the CPU of a 2x800 MHz machine – we need a large machine for the RB (trying to understand why) – but it did not fail, it just degrades
• Replica Management:
  – A large variety of tests done – 1% failure of copy/register/replicate between 4 sites (due to a known problem in the BDII, under investigation)
  – 60 parallel streams replicating 10 1 GB files worked without problem (see the sketch after this slide)
  – Some new functionality did not work (block copy and register)
  – Oracle was used as the back-end service
• Combined functionality:
  – Matchmaking requiring files works (but was not stressing the system)
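The 60-stream replication test reads as a storm variant; a hedged sketch follows. The edg-rm command form, VO, logical file names and target SE are reconstructions for illustration, not the recorded test setup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

FILES = [f"lfn:stress-file-{i}" for i in range(10)]   # ten 1 GB test files (names hypothetical)
TARGET_SE = "se.example.org"                          # destination SE (hypothetical)

def replicate(lfn):
    """Replicate one logical file to the target SE via the EDG replica
    manager CLI (assumed command form)."""
    result = subprocess.run(
        ["edg-rm", "--vo=dteam", "replicateFile", lfn, "-d", TARGET_SE],
        capture_output=True, text=True)
    return result.returncode == 0

def replication_storm(streams=60):
    """60 parallel streams cycling over the 10 files, as in the test above."""
    tasks = FILES * (streams // len(FILES))           # 60 replication tasks
    with ThreadPoolExecutor(max_workers=streams) as pool:
        results = list(pool.map(replicate, tasks))
    print(f"{len(results)} replications: {results.count(False)} failed")

if __name__ == "__main__":
    replication_storm()
```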

Page 11: Middleware expectations

Middleware developments LCG would like in 2003:

– The cut-off for this is October, in order to have the system ready for 2004
– EDG's further developments will finish in September

• R-GMA:
  – Multiple registries are essential for a full production service – this work still has to be done and may not be simple
  – Will not be started until the initial R-GMA is stable
  – If not delivered, we have a single point of failure/bottleneck, or MDS (**)
• RLS:
  – Need the RLI component to have distributed LRCs
  – Code is ready; being stress-tested by the developers
  – Fallback is a single global catalog – but that is a single point of failure/bottleneck
• RLS: proxy service for Replica Management
  – Essential to remove the requirement of IP connectivity on the WNs
  – If not delivered, this limits the sites that can run LCG and the resources that can be included (**)
• VOMS (VO management service)
  – The service itself is ready and tested
  – Needs integration with the CE and Storage – this must be done by the developers of those services
  – Fallback is to continue as now, with basic authorization via grid-map
• GFAL (Grid File Access Library)
  – LCG development
  – Prototype available; expect a production version in August

Page 12: Middleware expectations – but

• Based on the last 6 months' experience, it seems unlikely that we get much more than bug fixes from EDG now
• Desirables:
  – gcc 3.2
  – Globus 2.4 (i.e. an update via VDT)
  – R-GMA: may get a 1st implementation, but it will have a single registry
    • Comparisons between MDS and R-GMA – essential
  – Replica Management: RLI implementation, proxy service
    • Also need to align the different RLS versions (Globus-EDG and EDG)
• Priorities:
  – The proxy service helps with clusters that have no outgoing connectivity
  – Aligning the RLS versions avoids two RLS universes
  – There is a plan to converge, but it means we don't get the RLI and proxy service this year

Page 13: Middleware development

• Recent experience has shown it is very difficult for the EDG developers to bug-fix "LCG" in parallel with developing EDG
• We have agreed a process with EDG
• Currently, the LCG release started from a consistent EDG tag
  – We have made specific LCG fixes – it happened that these did not have dependencies between packages
  – LCG-1 is not a consistent EDG tag
• Once EDG 2.0 has been released:
  – Re-align the LCG release with an EDG tag
  – Branch the CVS repository
    • Production branch (LCG-1) for bug fixes
    • Development branch for EDG
• We will not accept anything that does not meet our current baseline test matrix on the certification testbed

Page 14: Deployment

• Status
  – Deployment has started to the initial 10 Tier 1 sites (CERN, FNAL, BNL, CNAF, RAL, Taipei, Tokyo, FZK, IN2P3, Moscow); Hungary is also ready to join immediately
  – Started several weeks ago, with sites asked to set up LCFG (different from LCG-0) and complete the installation of an earlier release, with the intent of deploying the LCG-1 release as an update
  – Many sites have been very slow to respond and do the installation
  – The LCG-1 release is now prepared for deployment, which will start next week
• Caveats
  – The first deployment insisted on PBS as the batch system – we suggest sites add a CE for their favourite batch system and migrate
    • LSF, PBS and Condor work, while FBSng and BQS require a small amount of local modifications
  – The first deployment forced a full LCFGng install (i.e. including the OS) – the real LCG-1 distribution has a fully tested "lite" version (install on top of the OS)
• Deployment status pages
  – Site web pages and a general status page (see the LCG main web page)
  – The real status is the monitoring system

Page 15: Deployed sites

Site                 LCG-0   LCG-1pre
Tier 1
 0  CERN
 1  CNAF
 2  RAL
 3  FNAL
 4  Taipei
 5  FZK
 6  IN2P3                    ?
 7  BNL
 8  Russia (Moscow)
 9  Tokyo
Tier 2
10  Legnaro (INFN)   N/A

LCG-0: Spring deployment of pre-production middleware based on VDT 1.1.6 and the previous EDG version (1.4.x).

LCG-1pre: Deployment of the full system preparatory to installing the final release for LCG-1. Full installation procedure using the LCFGng tools. Pre-release tag.

LCG-1: The initial LCG-1 release, with tested middleware based on VDT 1.1.8-9 and EDG 2.0 components.

Key: Done / Started / ? : Unknown

Page 16: Deployed System Architecture

• RLS
  – A single LRC per VO, run at CERN with Oracle back ends
  – When the RLI is introduced, we propose to run an LRC with Oracle at all Tier 1s (agreed in principle by the GDB)
    • Tests started at Taipei, FNAL and RAL
• VO services
  – Run by NIKHEF for the experiments
  – LCG at CERN for the LCG-1 VO (signs the user rules) and dteam
  – Registration web server at CERN
• Configuration and installation servers
  – Run at CERN
• Batch system
  – Begin with PBS (most tested)
  – Add a parallel CE for LSF/Condor/FBSng/BQS and migrate
  – Start with a few WNs only – add more when the service is stable
• All sites run
  – a disk SE and a UI
  – Most run 1 RB; CERN will run 2
  – The (CERN) UI is available on AFS and Lxplus

Page 17: Deployed system architecture

[Diagram: Services at CERN – per-VO RLS instances (RMC & LRC) for ALICE, ATLAS, CMS, LHCb and the LCG team; the LCG registration server and LCG CVS server; a proxy; UIs for AFS users and on Lxplus; RB-1 and RB-2; a disk SE; CE-1/CE-2 in front of PBS worker nodes and CE-3/CE-4 in front of LSF worker nodes. The ALICE, ATLAS, CMS and LHCb VO servers run at NIKHEF; the LCG-Team VO runs at CERN. Services at other sites – a UI, proxy, RB-1, disk SE, a CE with PBS (or other) worker nodes, and CE-3/CE-4 in front of the site's favourite batch system.]

Page 18: Information system architecture

• LCG-1 uses MDS
  – The top level is the BDII (static interface), but modified to get regular and frequent updates
• Each site will run a site GIIS. On day one this will run on one of the CEs.
• The site GIIS registers with two or more regional GIISes.
  – These will be well known and part of the configuration that we distribute.
• The BDII system has been modified by LCG to handle multiple regions and to react if one instance of a regional GIIS fails.
• The problem of stale information has been limited by repopulating and swapping the LDAP trees.
• Every site that runs an RB will run its own BDII (a query sketch follows this slide).
• There is room to improve the way this system works via small modifications to the RB
  – (not requesting DNs, using alternate multiple BDIIs, …)
  – These changes can be handled after we gain experience with the first release(s)
  – In addition, we can try to register the GRISes directly with the regional GIISes to see how this improves reliability.
• US grid sites (non-LCG) will likely use the work we have done on MDS
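Because the BDII is just an LDAP endpoint, a client-side query illustrates what an RB sees at matchmaking time. A minimal sketch using the ldap3 Python library; the host name is hypothetical, and the port (2170) and base DN (mds-vo-name=local,o=grid) follow common MDS/BDII convention rather than anything stated on the slide.

```python
from ldap3 import Server, Connection

BDII_HOST = "lcg-bdii.example.org"        # hypothetical BDII host
BDII_PORT = 2170                          # conventional BDII port (assumed)
BASE_DN = "mds-vo-name=local,o=grid"      # conventional MDS base DN (assumed)

def list_free_cpus():
    """Ask the BDII for GLUE CE entries and their free CPU counts,
    roughly what the RB's matchmaking consults."""
    server = Server(BDII_HOST, port=BDII_PORT)
    conn = Connection(server, auto_bind=True)          # anonymous bind
    conn.search(BASE_DN, "(objectClass=GlueCE)",
                attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs"])
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs)
    conn.unbind()

if __name__ == "__main__":
    list_free_cpus()
```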

Page 19: LCG-1 First Launch Information System Overview

[Diagram: the RB queries a primary BDII (BDII A, LDAP) with a secondary (BDII B) as fallback; the BDIIs query the regional GIISes (RegionA1, RegionA2, RegionB1, RegionB2), with which the site GIISes (Sites A–D, each aggregating the GRISes of CE1, CE2, SE1 and SE2) register. The BDII alternates between /dataCurrent/.. and /dataNew/.. with a swap-and-restart. Using multiple BDIIs requires RB changes.]

While using the data from one directory, the BDII queries the regional GIISes to fill the other directory structure. When this has finished, the BDII is stopped, the directories are swapped, and the BDII is restarted; the restart takes less than 0.5 seconds. To improve availability during this window it was suggested (by David) that the TCP port be switched off and the TCP protocol left to take care of the retry; this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the regional GIISes.
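The repopulate-and-swap cycle described in the note above can be sketched as follows. The /dataCurrent and /dataNew names come from the diagram; the bdii-update helper, init-script path and refresh period are assumptions standing in for the real GIIS queries and service control.

```python
import os
import subprocess
import time

CURRENT = "/dataCurrent"
NEW = "/dataNew"
SCRATCH = "/dataOld"      # temporary name used mid-swap (illustrative)

def repopulate(target):
    """Fill `target` by querying the regional GIISes.  Stand-in only:
    the real BDII rebuilds its LDAP tree here."""
    subprocess.run(["/usr/sbin/bdii-update", target], check=True)  # hypothetical helper

def swap_and_restart():
    """Stop the LDAP server, swap the directory trees, restart.
    The slide reports the restart itself takes < 0.5 s."""
    subprocess.run(["/etc/init.d/bdii", "stop"], check=True)       # assumed init script
    os.rename(CURRENT, SCRATCH)
    os.rename(NEW, CURRENT)   # the freshly filled tree goes live
    os.rename(SCRATCH, NEW)   # the old tree becomes the next fill target
    subprocess.run(["/etc/init.d/bdii", "start"], check=True)

if __name__ == "__main__":
    while True:
        repopulate(NEW)       # keep serving /dataCurrent while refilling
        swap_and_restart()
        time.sleep(120)       # assumed refresh period
```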

Page 20: LCG-1 First Launch Information System Structure

[Diagram: the same hierarchy in detail – the RB queries BDII A (LDAP), fed by primary and secondary regional GIISes (RegionA1, RegionA2, RegionB1, RegionB2); each site GIIS (Sites A–D) aggregates the GRISes of its CE1, CE2, SE1 and SE2 nodes and registers with the regional GIISes.]

Page 21: LCG-1 First Launch Information System – Sites and Regions

• A region should not contain too many sites, since we have observed problems with MDS when a large number of sites are involved.
• To allow for future expansion, but not make the system too complex, we suggest starting with two regions and, if needed, splitting into smaller regions later.
• The regions are West and East of 0 degrees longitude.
• The idea is to have a large region and a small one and see how they work.
• For the West, 2 regional GIISes will be set up at the beginning; for the East, 3.

[Diagram: West region – RAL, FNAL and BNL register with the WEST1 and WEST2 regional GIISes; East region – CERN, CNAF, Lyon, Moscow, FZK, Tokyo and Taipei register with the EAST1, EAST2 and EAST3 regional GIISes.]

Page 22: Security

• The security group led by Dave Kelsey has been very active; it includes many site and experiment reps
  – Also includes reps of sites that overlap with LCG
• Has put in place the agreements and infrastructure needed for LCG-1
• Is actively planning the security policy and an implementation plan for 2004
• Has set up an incident response group as well as a contacts list
• The next few slides are from Dave Kelsey's report to the July GDB; the numbers refer to GDB documents 36–39 (http://cern.ch/lcg/Documents)

Page 23: Rules for Use of LCG-1 (#36)

• To be agreed to by all users (signed via private key in the browser) when they register with LCG-1
• Deliberately based on the current EDG Usage Rules
  – Does not override site rules and policies
  – Only allows professional use
• Once discussions start on changes, there is a chance we never converge!
• We know that they are far from perfect
• Are there major objections today?
  – One comment says we should define the list of user data fields (as agreed at the last GDB)
• Use them now and work on a better version for January 2004
  – Consult lawyers?

Page 24: Audit Requirements (#37)

• UI: none
• RB: none – look at later, for the origin of job submission
• CE: the gatekeeper maps the DN to a local account
  – Keep gatekeeper and jobmanager logs
• SE/GridFTP
  – Keep input and output data transfer logs
• Batch system
  – jobmanager logs (or batch system logs)
  – Need to trace process activity – pacct logs (these are large)
• Central storage of all log files, rather than on the WN?
• To be kept for at least 90 days by all sites
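For the 90-day retention requirement, a site could run a simple cleanup job along these lines. The log locations and archive path are hypothetical; the copy-to-archive step reflects the open question above about central storage of log files.

```python
import os
import shutil
import time

LOG_DIRS = ["/var/log/globus", "/var/log/gridftp"]   # hypothetical log locations
ARCHIVE = "/central/audit-archive"                   # hypothetical central store
RETENTION = 90 * 24 * 3600                           # keep at least 90 days, per policy

def archive_and_prune(dirs=LOG_DIRS, archive=ARCHIVE, retention=RETENTION):
    """Copy audit logs older than the retention window to the central
    archive, then delete the local copy.  Younger files are untouched."""
    cutoff = time.time() - retention
    os.makedirs(archive, exist_ok=True)
    for d in dirs:
        if not os.path.isdir(d):
            continue
        for name in os.listdir(d):
            path = os.path.join(d, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                shutil.copy2(path, archive)   # central copy before deletion
                os.remove(path)

if __name__ == "__main__":
    archive_and_prune()
```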

Page 25: Incident Response (#38)

• Procedures for the LCG-1 start (before the GOC)
  – Incidents, communications, enforcement, escalation, etc.
• The party discovering an incident is responsible for
  – taking local action
  – informing all other security contacts
• Difficult to be precise at this stage – we have to learn!
• We have created an operational security list (before the GOC)
  – The default site entry is the contact person, but an operational list would be better
• LCG-1 sites need to refine and improve the procedures
• All sites must buy in to the procedures

Page 26: User Registration & VO Management (#39)

• A user registers once with LCG-1
  – Accepts the User Rules
  – Gives the agreed set of personal data (last GDB)
  – Requests to join one VO/experiment
• We need robust VO Registration Authorities to check that
  – the user actually made the request
  – the user is a valid member of the experiment
  – the user is at the listed institution
  – all the user data looks reasonable (e.g. the mail address)
• The web form will warn that these checks will be made
• User data is distributed to all LCG-1 sites

Page 27: User Registration aims

• To provide LCG-1 with accurate information about users, for
  – pre-registration of accounts (where needed)
  – auditing (legal requirements)
• To ensure VO managers do the appropriate checks
  – To allow LCG-1 sites to open resources to the VO
• BUT… the current procedures have limited resources
  – To some extent this has to be "best efforts"
    • E.g. do we need backup VO managers?

Page 28: VO Registration (2)

• Today's VO managers
  – ALICE: Daniele Mura (INFN)
  – ATLAS: Alessandro De Salvo (INFN)
  – CMS: Andrea Sciaba (INFN)
  – LHCb: Joel Closier (CERN)
  – DTEAM: Ian Neilson (CERN)
• Plan to continue to use the existing VO servers and services (run by NIKHEF) and the current VO managers (all agree to continue)
  – DTEAM is run at CERN

Page 29: VO/Experiment RA

• For the LCG-1 start
• The VO manager checks each request via one of:
  – Direct personal knowledge or contact (not e-mail)
  – A check in an official CERN or experiment database
  – The official experiment contact person at the employing institute
    • Signed e-mail? (not done today)
• Identity and employing institute are the critical ones
• VO managers/the LCG registrar to maintain a list of institutes and contact persons
• Work needed on more robust procedures for 2004
  – That can scale
    • With distributed RAs?

Page 30: Operations

• Prototype has been set up by RAL

• Uses several monitoring tools

• GridIce (INFN) – significant effort to set up for LCG-1 by INFN and CERN groups

• Task force set up to define how this will evolve
  – Requirements
  – Tools
  – …

Page 31: Monitoring "Dashboard"

Page 32: Operations & Monitoring

Page 33: Support

• Initial User Support prototype has been implemented by FZK

• This will evolve over time

• The agreement is that initial problem triage will be done by the experiments' support teams
  – Experiment experts will submit problems to LCG support

• The next few slides are from Klaus-Peter Mickel's presentation to the PEB (http://agenda.cern.ch/fullAgenda.php?ida=a031492)

• User guide and installation guide are available as drafts

Page 34: The Support Model – three levels

Customer/Experiment level:
• Problem oriented: submit a problem, track a problem
• Information oriented: ask for the current Grid status, documentation, training

Support level: at least three identical support centres, with:
• Helpdesk application
• User, ticket and resource databases
• Knowledge base
• On-call service outside working hours

Local operations level: at the central Grid Operation Centre and at each Tier 1 centre (and also at each Tier 2 centre?), e.g.:
• Problem solving
• Maintenance
• Local services
• Resource management
• Preventive activities
• Problem announcements


Page 36: Deployment Milestones

• Recent
  – 1.4.1.1 Initial M/W delivery for LCG-1 (30/4/03)
    • Was not met – now delivered (~31/7/03)
    • LCG contributed 2 FTE to assist the integration process
    • LCG decided to use MDS as the information system
  – 1.4.1.2 Implement LCG-1 Security Model (30/6/03)
    • Was met (see above)
  – 1.4.1.3 Prototype operations and support service (30/6/03)
    • Met for support – FZK
    • Now met for operations (5/8/03) (see above) – RAL + GD team
  – 1.4.1.4 Deploy LCG-1 to 10 Tier 1 sites (15/7/03)
    • Is late, but in progress now – expect completion by 31/8/03
  – 1.4.1.6 Experiment verification of LCG-1 (31/7/03)
    • Is late – cannot happen before LCG-1 is deployed – will start around the end of August
• Upcoming
  – 1.4.2.19 Middleware functionality complete (30/9/03)
    • This will be a cut-off for significant new functionality available from EDG
  – 1.4.2.21 Job Execution Model defined (30/9/03)
    • Will specify how LCG-1 will be usable

Page 37: Resources

• 6 INFN fellows have been recruited
  – Starting September/October
  – 3 will work on experiment integration
  – 3 will work on certification, debugging and deployment
• The FZK post is being filled
  – Starting October?
  – To work on deployment, service operation and troubleshooting
• 1 Portuguese trainee has started
  – Grid systems administrator
• Moscow group
  – Have had 2 people (on 3-month rotations to CERN) working on testing (RLS and R-GMA) – this will be ongoing, building up effort in Moscow
• Taipei
  – 1 more physicist joined us on 1/8/03 for 1 year – deployment
  – 3-monthly rotations of 2 people (2 here now, 2 more arriving in September)
    • 1 working on the Oracle/RLS installation, 1 on the GOC task force, with the goal of building a GOC in Taipei

Page 38: Lessons learned

• Must have a running service, and must keep it running – this is the only basis on which to progress and evolve
• Big-bang integration (à la EDG) is unworkable – it must not be carried into EGEE
  – Must have a development service in parallel with the production service, on which we verify incremental changes – and back them out if they don't work
• Sites are not honest about available staffing
  – Bits of 1 overworked person are not equivalent to 2 FTE, even allowing for vacations
  – Committed resources count many dedicated FTE at most sites – clearly not true; we must adjust this to reflect reality
  – The buy-in commitment to LCG-1 was a minimum of 2 FTE, as well as machines
• Every site seems to over-commit resources – this is a real problem which we must resolve if we want to operate a service

Page 39: Summary

• The middleware for LCG-1 is ready
  – The tests that we and the Loose Cannons have done are promising
• Deployment of the precursor release (for configurations etc.) was completed at 5 of 10 sites (expect FNAL and RAL today?)
• Deployment of LCG-1 to the 10 sites will start next week
  – Will take a few days for already-configured sites, longer for the others
• Expect experiments to have access mid-August
  – In a controlled way at first, to monitor problems
• Planning for the next steps – expansion, features – is in hand
  – Once the Tier 1s are up and stable, we would like to start adding Tier 2s and other countries as they are ready

Page 40: Potential Issues for deployment

• Grid 3
  – It is not entirely clear what the relationship to LCG is, nor how it will affect the deployment of LCG middleware and services in the US
• Middleware support for the next year
  – For the EDG work packages we are assuming EGEE or institutional commitments, but this is not yet clear