LHC Computing Grid Project – LCG. Ian Bird – LCG Deployment Manager, IT Department, CERN


  • LHC Computing Grid Project – LCG

    Ian Bird, LCG Deployment Manager, IT Department, CERN, Geneva, Switzerland

    BNL, March 2005

    [email protected]


    Overview
    - LCG Project Overview: overview of the main project areas
    - Deployment and Operations: current LCG-2 status; operations and issues; plans for migration to gLite
    - Service Challenges
    - Interoperability
    - Outlook & Summary


    LHC Computing Grid Project
    - Aim of the project: to prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors
    - The Grid is just a tool towards achieving this goal
    - Applications: development environment, common tools and frameworks
    - Build and operate the LHC computing service


    Project Areas & Management

    Joint with EGEE
    - Project Leader: Les Robertson; Resource Manager: Chris Eck; Planning Officer: Jürgen Knobloch; Administration: Fabienne Baud-Lavigne
    - Applications Area (Torre Wenaus; Pere Mato from 1 March 05): development environment; joint projects; data management; distributed analysis
    - Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
    - CERN Fabric Area (Bernd Panzer): large cluster management; data recording; cluster technology; networking; computing service at CERN
    - Grid Deployment Area (Ian Bird): establishing and managing the Grid Service – middleware, certification, security, operations, registration, authorisation, accounting
    - Distributed Analysis – ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology


    Relation with EGEE


    Applications Area
    - All Applications Area projects have software deployed in production by the experiments: POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA, Savannah
    - 400 TB of POOL data produced in 2004
    - Pre-release of the Conditions Database (COOL); the 3D project (Distributed Deployment of Databases) will help POOL and COOL in terms of scalability
    - Geant4 successfully used in the ATLAS, CMS and LHCb Data Challenges with excellent reliability
    - GENSER MC generator library in production
    - Progress on integrating ROOT with other Applications Area components: improved I/O package used by POOL; common dictionary and maths library with SEAL
    - Pere Mato (CERN, LHCb) has taken over from Torre Wenaus (BNL, ATLAS) as Applications Area Manager
    - Plan for the next phase of the applications area being developed for internal review at the end of March


    The ARDA project
    - ARDA is an LCG project; its main activity is to enable LHC analysis on the grid
    - ARDA is contributing to EGEE NA4 and uses the entire CERN NA4-HEP resource

    Interface with the new EGEE middleware (gLite): by construction, use the new middleware; use the grid software as it matures; verify the components in an analysis environment (users!); provide early and continuous feedback

    Support the experiments in the evolution of their analysis systems

    Forum for activity within LCG/EGEE and with other projects/initiatives


    ARDA activity with the experiments

    The complexity of the field requires great care in the phase of middleware evolution and delivery:
    - Complex (evolving) requirements; new use cases to be explored (for HEP: large-scale analysis)
    - Different communities in the loop: the LHC experiments, middleware experts from the experiments, and other communities providing large middleware stacks (CMS GEOD, US OSG, LHCb Dirac, etc.)
    - The complexity of the experiment-specific part is comparable to (often larger than) the general one
    - The experiments do require seamless access to a set of sites (computing resources), but the real usage (and therefore the benefit for the LHC scientific programme) will come from exploiting the possibility to build their computing systems on a flexible and dependable infrastructure
    How to progress? Build end-to-end prototype systems for the experiments to allow end users to perform analysis tasks


    LHC prototype overview


    LHC experiment prototypes (ARDA): all prototypes have been demoed within the corresponding user communities


    Interactive Session

    Demo at Supercomputing 04


    Current Status

    A GANGA job submission handler for gLite has been developed; a DaVinci job runs on gLite, submitted through GANGA (a minimal Ganga sketch follows below)

    Demo in Rio de Janeiro
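To make the GANGA-based submission above concrete, here is a rough sketch of how such a DaVinci job might be described and submitted from inside a Ganga session (Ganga is Python-based and provides objects such as Job in its interactive session). The class names, the backend handler name and the options file are assumptions for illustration, not the exact 2005 API.

```python
# Illustrative sketch only, intended for an interactive Ganga session (which provides
# Job, application and backend classes); the names below are assumptions, not the exact API.

j = Job()                                   # a new Ganga job object
j.application = DaVinci()                   # hypothetical LHCb DaVinci application object
j.application.optsfile = "myAnalysis.opts"  # hypothetical Gaudi job-options file
j.backend = LCG()                           # grid backend; a gLite handler would be chosen the same way
j.submit()                                  # hand the job to the submission handler

print(j.status)                             # Ganga's monitoring loop updates the job status
```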


    Combined Test Beam

    Example: ATLAS TRT data analysis done by PNPI St. Petersburg (plot: number of straw hits per layer)

    Real data processed on gLite: standard Athena for testbeam; data from CASTOR; processed on a gLite worker node


    MonALISA

    User Job Monitoring

    Demo at Supercomputing 04


    CERN Fabric


    CERN Fabric
    - Fabric automation has seen very good progress: the new systems for managing large farms have been in production at CERN since January
    - New CASTOR Mass Storage System: being deployed first on the high-throughput cluster for the ongoing ALICE data recording computing challenge
    - Agreement on collaboration with Fermilab on Linux distribution: Scientific Linux, based on Red Hat Enterprise 3, improves uniformity between the HEP sites serving LHC and Run 2 experiments
    - CERN computer centre preparations: power upgrade to 2.5 MW; computer centre refurbishment well under way; acquisition process started


    Preparing for 7,000 boxes in 2008


    High Throughput Prototype (openlab + LCG prototype)
    - 80 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory) plus a further 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory)
    - 80 IA32 CPU servers (dual 2.8 GHz P4, 2 GB memory)
    - 36 disk servers (dual P4, IDE disks, ~1 TB disk space each) and 24 disk servers (P4, SATA disks, ~2 TB disk space each)
    - 2 x 50 Itanium 2 nodes (dual 1.3/1.5 GHz, 2 GB memory)
    - 28 TB IBM StorageTank; 2 x 12 tape servers (STK 9940B)
    - 4 Enterasys N7 10 GE switches and 2 Enterasys X-Series; 10 GE or 1 GE per node depending on the group; 4 GE connections to the backbone; 10 GE WAN connection
    - Experience with likely ingredients in LCG: 64-bit programming, next-generation I/O (10 Gb Ethernet, Infiniband, etc.)
    - High-performance cluster used for evaluations and for data challenges with experiments; flexible configuration – components moved in and out of the production environment
    - Co-funded by industry and CERN


    ALICE Data Recording Challenge
    - Target: one week sustained at 450 MB/sec
    - Used the new version of the CASTOR mass storage system
    - Note the smooth degradation and recovery after equipment failure


    Deployment and Operations


    [Diagram: tiered site structure – Tier-1 and Tier-2 centres including RAL, IN2P3, FNAL, USC, Krakow, CIEMAT, Rome, Taipei, LIP, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC, NIKHEF, TRIUMF, CNAF, FZK, BNL, PIC, ICEPP and Nordic, plus small centres, desktops and portables]

    LHC Computing Model (simplified!!)
    - Tier-0 – the accelerator centre: filter raw data; reconstruction produces event summary data (ESD); record the master copy of raw data and ESD
    - Tier-1 – managed mass storage: permanent storage of raw data, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service; data-heavy (ESD-based) analysis; re-processing of raw data; national and regional support; online to the data acquisition process; high availability, long-term commitment
    - Tier-2 – well-managed, grid-enabled disk storage: end-user analysis (batch and interactive); simulation


    Computing Resources: March 2005 (map legend: countries providing resources / countries anticipating joining)

    In LCG-2: 121 sites, 32 countries >12,000 cpu ~5 PB storage

    Includes non-EGEE sites: 9 countries 18 sites


    Infrastructure metrics: countries, sites and CPU available in the LCG-2 production service (charts cover EGEE partner regions and other collaborating sites)


    Service Usage
    VOs and users on the production service:
    - Active HEP experiments: the 4 LHC experiments, D0, CDF, Zeus, Babar
    - Other active VOs: Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics) – 6 disciplines
    - Registered users in these VOs: 500
    - In addition there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE

    Scale of work performed in the LHC Data Challenges 2004 (a back-of-envelope check follows below):
    - >1 M SI2K-years of CPU time (~1000 CPU-years)
    - 400 TB of data generated, moved and stored
    - 1 VO achieved ~4000 simultaneous jobs (~4 times CERN grid capacity)

    Number of jobs processed/month
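As a quick back-of-envelope check of the Data Challenge numbers quoted above: the pairing of ">1 M SI2K-years" with "~1000 CPU-years" implies an average worker-node CPU of roughly 1 kSI2K, which is an assumption of this sketch rather than a figure from the slide.

```python
# Back-of-envelope check of the 2004 Data Challenge figures quoted above.
# Assumption (not from the slide): an average 2004 worker-node CPU delivers ~1 kSI2K.

si2k_years = 1_000_000        # >1 M SI2K-years of CPU time quoted
si2k_per_cpu = 1_000          # assumed average CPU power in SI2K units
cpu_years = si2k_years / si2k_per_cpu
print(f"~{cpu_years:.0f} CPU-years")   # ~1000 CPU-years, matching the slide
```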


    Current production software (LCG-2)
    - Evolution through 2003/2004; the focus has been on making these components reliable and robust rather than adding functionality; respond to the needs of users, admins and operators
    - The software stack:
      - Virtual Data Toolkit: Globus (2.4.x), Condor, etc.
      - Higher-level components developed by the EU DataGrid project: workload management (RB, L&B, etc.); Replica Location Service (single central catalog) and replica management tools; R-GMA as accounting and monitoring framework; VOMS being deployed now
      - Components re-worked by the operations team: information system MDS GRIS/GIIS replaced by the LCG-BDII; edg-rm tools replaced and augmented as lcg-utils
      - Developments on disk pool managers (dCache, DPM) – not addressed by JRA1
      - Other tools as required, e.g. GridIce (EU DataTag project)
    - Maintenance agreements with: the VDT team (incl. Globus support); DESY/FNAL for dCache; EGEE/LCG teams for WLM, VOMS, R-GMA and data management


    Software (2)
    - Platform support was an issue – limited to RedHat 7.3; now ported to Scientific Linux (RHEL), Fedora, IA64, AIX, SGI
    - Another problem was the heaviness of installation; now much improved and simpler, with simple installation tools that allow integration with existing fabric management tools; much lighter, user-level installation on worker nodes


    Overall status
    - The production grid service is quite stable and the services are quite reliable; remaining instabilities in the IS are being addressed
    - Sensitivity to site management: problems in underlying services must be addressed; work on stop-gap solutions (e.g. the RB maintains state; a reliable file transfer service on top of Globus gridftp)
    - The biggest problem is the stability of sites: configuration problems due to the complexity of the middleware; fabric management at less experienced sites
    - Job efficiency is not high unless Operations/Applications select stable sites (the BDII allows an application-specific view); in large tests, selecting stable sites, well over 90% efficiency is achieved
    - Operations workshop last November to address this: a fabric management working group will write a fabric management cookbook; tighten operations control of the grid – escalation procedures, removing bad sites
    - The complexity is in the number of sites, not the number of CPUs
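As a concrete illustration of the "application-specific view" mentioned above, the sketch below queries a BDII over LDAP for computing elements and keeps only those at a whitelist of stable sites. It assumes the python-ldap module, a hypothetical BDII endpoint and whitelist, and the usual GLUE 1.x attribute and base-DN conventions; it is not the actual LCG tooling.

```python
# Minimal sketch of building an application-specific view: query a BDII over LDAP for
# computing elements and keep only a whitelist of sites considered stable. Hostname,
# base DN and attribute names follow common GLUE 1.x / LCG-2 conventions but should be
# treated as assumptions here.
import ldap

BDII_URL = "ldap://lcg-bdii.cern.ch:2170"          # hypothetical endpoint
BASE_DN = "mds-vo-name=local,o=grid"               # typical LCG-2 BDII base DN (assumed)
STABLE_SITES = {"cern.ch", "gridpp.rl.ac.uk"}      # example whitelist, not a real policy

con = ldap.initialize(BDII_URL)
results = con.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                       "(objectClass=GlueCE)", ["GlueCEUniqueID"])

stable_ces = []
for dn, attrs in results:
    for ce_id in attrs.get("GlueCEUniqueID", []):
        ce = ce_id.decode()
        # keep only CEs whose hostname ends with a whitelisted domain
        if any(ce.split(":")[0].endswith(site) for site in STABLE_SITES):
            stable_ces.append(ce)

print("\n".join(stable_ces))
```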


    Operations Structure
    - Operations Management Centre (OMC): at CERN – coordination etc.
    - Core Infrastructure Centres (CIC): manage daily grid operations – oversight, troubleshooting; run essential infrastructure services; provide 2nd-level support to the ROCs; UK/I, Fr, It, CERN, + Russia (M12); Taipei also runs a CIC
    - Regional Operations Centres (ROC): act as front-line support for user and operations issues; provide local knowledge and adaptations; one in each region, many are distributed
    - User Support Centre (GGUS): at FZK; manages the PTS; provides a single point of contact (service desk); not foreseen as such in the TA, but the need is clear


    Grid Operations
    - The grid is flat, but there is a hierarchy of responsibility – essential to scale the operation
    - The CICs act as a single Operations Centre: operational oversight (grid operator) responsibility rotates weekly between the CICs; problems are reported to the ROC/RC
    - The ROC is responsible for ensuring a problem is resolved and oversees the regional RCs
    - The ROCs are responsible for organising the operations in a region; they coordinate deployment of middleware, etc.
    - CERN coordinates sites not associated with a ROC
    (RC = Resource Centre)


    SLAs and 24x7
    - Start with service level definitions: what a site supports (apps, software, MPI, compilers, etc.); levels of support (# admins, hrs/day, on-call, operators); response time to problems
    - Define metrics to measure compliance; publish metrics and the performance of sites relative to their commitments
    - Remote monitoring/management of services can be considered for small sites
    - Middleware/services should cope with bad sites
    - Clarify what 24x7 means: the service should be available 24x7, which does not mean all sites must be available 24x7; specific crucial services may justify the cost; classify services according to the level of support required
    - Operations tools need to become more and more automated: having an operating production infrastructure should not mean having staff on shift everywhere – best-effort support; the infrastructure (and applications) must adapt to failures
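To make "define metrics to measure compliance" concrete, here is a small sketch that compares a site's measured availability (fraction of periodic functional tests passed) against its declared commitment. The data structures, site name and threshold are invented for the example; they are not the project's actual SLA metrics.

```python
# Illustrative sketch of an SLA compliance metric: compare a site's measured
# availability against the level it declared. Structures and thresholds are invented.
from dataclasses import dataclass

@dataclass
class SiteCommitment:
    name: str
    declared_availability: float   # e.g. 0.95 for a site promising 95% availability

def availability(test_results: list[bool]) -> float:
    """Fraction of test runs that passed in the reporting period."""
    return sum(test_results) / len(test_results) if test_results else 0.0

def compliance_report(site: SiteCommitment, test_results: list[bool]) -> str:
    measured = availability(test_results)
    status = "OK" if measured >= site.declared_availability else "BELOW COMMITMENT"
    return (f"{site.name}: measured {measured:.1%} vs declared "
            f"{site.declared_availability:.0%} -> {status}")

# Example: 48 half-hourly test runs in a day, 3 failures
results = [True] * 45 + [False] * 3
print(compliance_report(SiteCommitment("example-tier2.example.org", 0.95), results))
```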


    Operational Security
    - Operational Security team in place: EGEE security officer, ROC security contacts
    - Concentrating on 3 activities: incident response; best-practice advice for grid admins (creating a dedicated web site); evaluation of security service monitoring
    - Incident response: JSPG agreement on IR in collaboration with OSG; update the existing policy to guide the development of a common capability for handling and responding to cyber-security incidents on grids; basic framework for incident definition and handling; site registration process in draft – part of the basic SLA
    - CA operations: EUGridPMA – best practice, minimum standards, etc.; more and more CAs appearing
    - The security group and its work, started in LCG, was from the start a cross-grid activity; much was already in place at the start of EGEE (usage policy, registration process and infrastructure, etc.); we regard it as crucial that this activity remains broader than just EGEE


    Policy – Joint Security Group documents: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration; Application Development & Network Admin Guide – http://cern.ch/proj-lcg-security/documents.html


    User Support
    - User support (call centre/helpdesk): coordinated through GGUS; ROCs as the front line; a task force is in place to improve the service
    - VO support: was an oversight in the project and is not really provisioned; in LCG there is a team (5 FTE) to help apps integrate with the middleware, give direct 1:1 support, understand needs and act as advocate for the app; this is really missing for the other apps – adaptation to the grid environment takes expertise
    - We have found that user support has 2 distinct aspects: deployment support and application-specific support
    [Diagram: user support flow – users (the LHC experiments, non-LHC experiments, other communities e.g. Biomed) contact Global Grid User Support (GGUS), the single point of contact and coordinator of user support, which routes issues to deployment support (middleware problems), the operations centres (CIC/ROC – operations problems), the resource centres (RC – hardware problems) and application-specific user support (VO-specific problems)]


    Certification process
    - The process was decisive in improving the middleware, but it is time consuming (5 releases in 2004)
    - Many sequential steps; many different site layouts have to be tested; the format of internal and external releases differs; multiple packaging formats (tool based, generic)
    - All components are treated equally: the same level of testing for non-vital and core components; new tools and tools in use by other projects are tested to the same level
    - The process to include new components is not transparent
    - Timing for releases is difficult: users want them now, sites want them scheduled
    - Upgrades need a long time to cover all sites; some sites had problems becoming functional after an upgrade


    Additional input
    - Data Challenges: client libs need fast and frequent updates; core services need fast patches (functional/fixes); applications need a transparent release preparation; many problems only become visible during full-scale production; configuration is a major problem at smaller sites
    - Operations Workshop: smaller sites can handle major upgrades only every 3 months; sites need to give input in the selection of new packages; resolve conflicts with local policies


    Changes I – simple installation/configuration scripts
    - YAIM (Yet Another Install Method): semi-automatic, simple configuration management; based on scripts (easy to integrate into other frameworks); all configuration for a site is kept in one file
    - APT (Advanced Package Tool) based installation of middleware RPMs: simple dependency management; updates (automatic or on demand); no OS installation
    - Client libs are packaged in addition as a user-space tar-ball and can be installed like application software
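The "all configuration for a site kept in one file" idea can be illustrated with a short sketch that parses a flat KEY=value site file and hands the resulting dictionary to every configuration step. YAIM's real site file is a sourced shell script and its variable names differ; the keys below are hypothetical.

```python
# Sketch of the "one configuration file per site" idea behind YAIM: a flat KEY=value
# file parsed once and reused by every node-configuration step. The key names are
# hypothetical, not actual YAIM variables.

EXAMPLE = """\
SITE_NAME=example-tier2
CE_HOST=ce.example.org
SE_HOST=se.example.org
BDII_HOST=bdii.example.org
WN_LIST=/opt/site/wn-list.conf
"""

def load_site_config(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

cfg = load_site_config(EXAMPLE)
print(cfg["CE_HOST"])   # every configuration function reads from the same dict
```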


    Changes II

    Different frequency of separate release types: client libs (UI, WN); services (CE, SE); core services (RB, BDII, ..)
    - Major releases (configuration changes, RPMs, new services); updates (bug fixes) added any time to specific releases; non-critical components will be made available with reduced testing
    - Fixed release dates for major releases (allows planning): every 3 months; sites have to upgrade within 3 weeks
    - Minor releases every month: based on ranked components available at a specific date in the month; not mandatory for smaller RCs to follow; client libs will be installed as application-level software
    - Early access to pre-releases of new software for applications: client libs will be made available on selected sites; services with functional changes are installed on the EIS-Applications testbed; early feedback from applications


    [Diagram: certification process – components ready at cutoff feed integration and first tests (C&T), internal releases, full deployment on test clusters and ~1 week of functional/stress tests; bugs, patches and tasks (with assigned and updated cost) are tracked in Savannah; actors include C&T, EIS, GIS, GDB, applications, RCs, CICs, developers and the Head of Deployment; an internal client release is produced by C&T]


    [Diagram: deployment process – certification is run daily; EIS updates the user guides and GIS updates the release notes and installation guides; releases every month, major releases every 3 months on fixed dates; sites upgrade (with YAIM) at their own pace]


    Operations Procedures
    - Driven by experience during the 2004 Data Challenges; reflecting the outcome of the November Operations Workshop
    - Operations procedures cover: the roles of CICs, ROCs and RCs; weekly rotation of operations centre duties (CIC-on-duty); daily tasks of the operations shift; monitoring (tools, frequency); problem reporting; the problem tracking system; communication with ROCs & RCs; escalation of unresolved problems; handing over the service to the next CIC


    Implementation – evolutionary development

    Procedures: documented (constantly adapted); available at the CIC portal http://cic.in2p3.fr/; in use by the shift crews

    Portal (http://cic.in2p3.fr): access to tools and process documentation; repository for logs and FAQs; provides means of efficient communication; provides condensed monitoring information

    Problem tracking system: currently based on Savannah at CERN; moving to GGUS at FZK; exports/imports tickets to the local systems used by the ROCs

    Weekly Phone Conferences and Quarterly Meetings


    Grid operator dashboard – CIC-on-duty dashboard: https://cic.in2p3.fr/pages/cic/framedashboard.html


    [Diagram: operator procedure – (1) monitoring tools (GridIce, GOC, GIIS monitor), (2) diagnosis with the help of Wiki pages and in-depth testing, (3) report in Savannah, (4) follow-up via the CIC mailing tool to the ROC/RC, (5) severity-based escalation procedure, (6) incident closure in Savannah]


    Selection of monitoring tools


    Middleware


    Architecture & Design
    - A design team including representatives from middleware providers (AliEn, Condor, EDG, Globus, ...) and US partners produced the middleware architecture and design, taking into account input and experience from applications, operations and related projects
    - DJRA1.1 EGEE Middleware Architecture (June 2004): https://edms.cern.ch/document/476451/
    - DJRA1.2 EGEE Middleware Design (August 2004): https://edms.cern.ch/document/487871/
    - Much feedback from within the project (operations & applications) and from related projects; being used and actively discussed by OSG, GridLab, etc.; input to various GGF groups


    [Diagram: gLite services and responsible clusters (JRA3, UK, CERN, IT/CZ) – Access Services (Grid Access Service, API); Job Management Services (Job Provenance, Computing Element, Workload Management, Package Manager); Data Services (Metadata Catalog, Storage Element, Data Management, File & Replica Catalog); Security Services (Authorization, Authentication, Auditing); Information & Monitoring Services (Information & Monitoring, Application Monitoring); plus Site Proxy and Accounting]


    gLite Services for Release 1: [same service decomposition and responsible clusters as above] – focus on key services according to the gLite management taskforce


    gLite Services for Release 1 – software stack and origin (simplified)
    - Computing Element: Gatekeeper (Globus); Condor-C (Condor); CE Monitor (EGEE); local batch system (PBS, LSF, Condor)
    - Workload Management: WMS (EDG); Logging and Bookkeeping (EDG); Condor-C (Condor)
    - Storage Element: File Transfer/Placement (EGEE); glite-I/O (AliEn); GridFTP (Globus); SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
    - Catalog: File and Replica Catalog (EGEE); Metadata Catalog (EGEE)
    - Information and Monitoring: R-GMA (EDG)
    - Security: VOMS (DataTAG, EDG); GSI (Globus); authentication for C and Java based (web) services (EDG)


    Summary
    - WMS: Task Queue, pull mode, data management interface; available in the prototype; used in the testing testbed; now working on the certification testbed; submission to LCG-2 demonstrated
    - Catalog: MySQL and Oracle; available in the prototype; used in the testing testbed; delivered to SA1 but not tested yet
    - gLite I/O: available in the prototype; used in the testing testbed; basic functionality and stress tests available; delivered to SA1 but not tested yet
    - FTS: being evolved with LCG; milestone on March 15, 2005; stress tests in the service challenges
    - UI: available in the prototype; includes data management; not yet formally tested
    - R-GMA: available in the prototype; testing has shown deployment problems
    - VOMS: available in the prototype; no tests available


    Schedule
    - All of the services are available now on the development testbed; user documentation is currently being added; on a limited-scale testbed
    - Most of the services are being deployed on the LCG Pre-production Service: initially at CERN, more sites once tested/validated; scheduled in April-May
    - Deployment at major sites is scheduled by the end of May, in time to be included in the LCG service challenge that must demonstrate full capability in July, prior to operating as a stable service in 2H2005


    Migration StrategyCertify gLite components on existing LCG-2 service

    Deploy components in parallel replacing with new service once stability and functionality is demonstrated

    WN tools and libs must co-exist on same cluster nodes

    As far as possible, the transition must be smooth


    Service Challenges


    Problem Statement
    - A robust file transfer service is often seen as the goal of the LCG Service Challenges; whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect
    - Getting all sites to acquire and run the infrastructure is non-trivial (managed disk storage, tape storage, agreed interfaces, 24 x 365 service aspect, including during conferences, vacation, illnesses etc.)
    - Need to understand networking requirements and plan early
    - But transferring dummy files is not enough: we still have to show that the basic infrastructure works reliably and efficiently
    - Need to test the experiments' use cases; check for bottlenecks and limits in s/w, disk and other caches etc.
    - We can presumably write some test scripts to mock up the experiments' computing models, but the real test will be to run your s/w, which requires strong involvement from the production teams


    LCG Service Challenges – Overview

    LHC will enter production (physics) in April 2007: it will generate an enormous volume of data and require a huge amount of processing power

    The LCG solution is a world-wide Grid: many components are understood, deployed, tested..

    But: unprecedented scale; a humungous challenge of getting large numbers of institutes and individuals, all with existing, sometimes conflicting commitments, to work together

    LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now; issues include h/w acquisition, personnel hiring and training, vendor rollout schedules etc.

    It should not limit the ability of physicists to exploit the performance of the detectors nor the LHC's physics potential, whilst being stable, reliable and easy to use


    Key Principles

    The service challenges result in a series of services that exist in parallel with the baseline production service

    Rapidly and successively approach production needs of LHC

    Initial focus: core (data management) services

    Swiftly expand out to cover full spectrum of production and analysis chain

    Must be as realistic as possible, including end-to-end testing of key experiment use-cases over extended periods with recovery from glitches and longer-term outages

    Necessary resources and commitment pre-requisite to success!

    Should not be under-estimated!


    Initial Schedule (evolving)

    Q1 / Q2: up to 5 T1s, writing to disk at 100 MB/s per T1 (no expts)

    Q3 / Q4: include two experiments, tape and a few selected T2s

    2006: progressively add more T2s, more experiments, ramp up to twice nominal data rate

    2006: production usage by all experiments at reduced rates (cosmics); validation of computing models

    2007: delivery and contingency

    N.B. there is more detail in Dec / Jan / Feb GDB presentations


    Key dates for service preparation
    - SC2
    - Jun 05: Technical Design Report
    - Sep 05: SC3 service phase
    - May 06: SC4 service phase
    - Sep 06: initial LHC service in stable operation
    - Apr 07: LHC service commissioned

    SC2: reliable data transfer (disk-network-disk); 5 Tier-1s; aggregate 500 MB/sec sustained at CERN
    SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
    SC4: all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain incl. analysis; sustain the nominal final grid data throughput
    LHC Service in Operation: September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
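For scale, the throughput targets above translate into the following volumes (simple arithmetic on the numbers quoted, using decimal units).

```python
# What the service-challenge rates above amount to in volume (1 MB = 10**6 bytes).
rate_mb_s = 500                      # SC2/SC3 aggregate target at CERN
per_day_tb = rate_mb_s * 86_400 / 1e6
per_week_tb = per_day_tb * 7
print(f"{rate_mb_s} MB/s ~ {per_day_tb:.0f} TB/day ~ {per_week_tb:.0f} TB/week")
# -> 500 MB/s ~ 43 TB/day ~ 302 TB/week; twice the nominal rate doubles these figures.
```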


    FermiLab, Dec 04 / Jan 05: FermiLab demonstrated 500 MB/s for 3 days in November


    FTS stability [plot of transfer rates; slide annotation: "M not K !!!"]


    Interoperability


    Introduction: grid flavours
    - LCG-2 vs Grid3: both use the same VDT version (Globus 2.4.x); LCG-2 has components for WLM, IS, R-GMA, etc.; both use ~the same information schema (GLUE) – the Grid3 schema is not all GLUE and each has made some small extensions; both use MDS (BDII)

    - LCG-2 vs NorduGrid: NorduGrid uses a modified version of Globus 2.x; does not use the gatekeeper (different interface); very different information schema, but does use MDS

    - Canada: a gateway into GridCanada and WestGrid (Globus based) is in production

    - Catalogues: LCG-2 uses an EDG-derived catalogue (for POOL); Grid3 and NorduGrid use Globus RLS

    - Work done: with Grid3/OSG, strong contacts and many points of collaboration; with NorduGrid, discussions have started


    Common areas (with Grid3/OSG)
    Interoperation: align information systems; run jobs between LCG-2 and Grid3/NorduGrid; storage interfaces – SRM; reliable file transfer; service challenges

    Infrastructure:
    - Security: security policy (JSPG) and operational security – both are explicitly common activities across all sites
    - Monitoring: job monitoring; grid monitoring; accounting
    - Grid operations: common operations policies; problem tracking


    Interoperation
    - LCG-2 jobs on Grid3: a Grid3 site runs the LCG-developed generic info provider, which fills their site GIIS with the missing GLUE-schema information; from the LCG-2 BDII we can then see Grid3 sites. Running a job on a Grid3 site needed: Grid3 installing the full set of LCG CAs; users added into VOMS; the WN installation (very lightweight now) installs on the fly
    - Grid3 jobs on LCG-2: added the Grid3 VO to our configuration; they point directly to the site (they do not use the IS for job submission)
    - Job submission between LCG-2 and Grid3 has been demonstrated
    - NorduGrid can run the generic info provider at a site, but work is required to use the NG clusters


    Storage and file transfer
    - Storage interfaces: LCG-2, gLite and Open Science Grid all agree on SRM as the basic interface to storage; the SRM collaboration has existed for >2 years (group in GGF); SRM interoperability has been demonstrated; LHCb use SRM in their stripping phase
    - Reliable file transfer: work ongoing with Tier-1s (incl. FNAL, BNL, TRIUMF) in the service challenges; agreement that the interface is SRM, with srmcopy or gridftp as transfer protocol; reliable transfer software will run at all sites – already in place as part of the service challenges
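The core idea behind a "reliable file transfer" layer on top of gridftp is retrying failed transfers with backoff and keeping track of the outcome. The sketch below is a minimal illustration of that idea, not the LCG/gLite FTS; it shells out to globus-url-copy and assumes valid grid credentials and the hypothetical endpoints shown.

```python
# Minimal sketch of the retry/bookkeeping idea behind a reliable file transfer layer
# over gridftp (this is NOT the LCG/gLite FTS). It assumes globus-url-copy is installed
# and grid credentials are valid; the endpoints below are hypothetical.
import subprocess
import time

def reliable_copy(src: str, dst: str, max_attempts: int = 5, backoff_s: float = 30.0) -> bool:
    """Retry a gridftp transfer with increasing backoff; return True on success."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["globus-url-copy", src, dst])
        if result.returncode == 0:
            print(f"transfer OK on attempt {attempt}")
            return True
        print(f"attempt {attempt} failed (rc={result.returncode}), retrying in {backoff_s * attempt}s")
        time.sleep(backoff_s * attempt)
    return False

# Hypothetical endpoints for illustration only:
reliable_copy("gsiftp://se.cern.ch/data/file1", "gsiftp://se.example-t1.org/data/file1")
```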


    Operations – several points where collaboration will happen (started from the LCG and OSG operations workshops):
    - Operational security / incident response
    - Common site charter / service definitions possible?
    - Collaboration on operations centres (CIC-on-duty)?
    - Operations monitoring: a common schema for problem descriptions/views, allowing tools to understand both?; common metrics for performance and reliability; common site and application validation suites (for LHC apps)
    - Accounting: Grid3 and LCG-2 use the GGF schema; agree to publish into a common tool (NG could too)
    - Job monitoring: LCG-2 Logging and Bookkeeping has a well-defined set of states; agreeing a common set would allow common tools to view job states in any grid; good high-level (web) display tools are needed so a user could track jobs easily across grids (a hypothetical mapping is sketched below)
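To illustrate the "common set of job states" idea raised above, the sketch below maps EDG L&B style states and Condor-G style states onto a small common vocabulary, so that one tool could display jobs from either grid. The mapping itself is an assumption made up for this sketch, not an agreed cross-grid standard.

```python
# Illustration of the "agree a common set of job states" idea: map grid-specific
# states onto a small common vocabulary. The state names are the familiar EDG L&B and
# Condor-G ones, but the mapping is invented for this sketch.
COMMON_STATES = {"PENDING", "RUNNING", "FINISHED", "FAILED"}

LCG2_MAP = {
    "Submitted": "PENDING", "Waiting": "PENDING", "Ready": "PENDING", "Scheduled": "PENDING",
    "Running": "RUNNING",
    "Done": "FINISHED", "Cleared": "FINISHED",
    "Aborted": "FAILED", "Cancelled": "FAILED",
}

GRID3_MAP = {  # Condor-G style states
    "Idle": "PENDING", "Held": "PENDING",
    "Running": "RUNNING",
    "Completed": "FINISHED",
    "Removed": "FAILED",
}

assert set(LCG2_MAP.values()) <= COMMON_STATES and set(GRID3_MAP.values()) <= COMMON_STATES

def common_state(grid: str, state: str) -> str:
    table = LCG2_MAP if grid == "LCG-2" else GRID3_MAP
    return table.get(state, "UNKNOWN")

print(common_state("LCG-2", "Scheduled"))   # PENDING
print(common_state("Grid3", "Completed"))   # FINISHED
```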


    Outlook & Summary
    - LHC startup is very close, and the services have to be in place 6 months earlier
    - The Service Challenge programme is the ramp-up process – all aspects are really challenging!
    - Now that the experiment computing models have been published, we are trying to clarify what services LCG must provide and what their interfaces need to be (baseline services working group)
    - An enormous amount of work has been done by the various grid projects; we are already at the full complexity and scale foreseen for startup, but significant problems remain to be addressed – functionality, and stability of operations
    - Time to bring these efforts together to build the solution for LHC

    Thank you for your attention !
