LHC Computing Grid Project – LCG
Ian Bird, LCG Deployment Manager, IT Department, CERN, Geneva, Switzerland
BNL, March 2005
CERN - Computing Challenges
[email protected] Project
Overview
- LCG Project overview
- Overview of main project areas
- Deployment and operations: current LCG-2 status; operations and issues; plans for migration to gLite
- Service Challenges
- Interoperability
- Outlook & Summary
LHC Computing Grid Project
Aim of the project:
- To prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors; the Grid is just a tool towards achieving this goal
- Applications: development environment, common tools and frameworks
- Build and operate the LHC computing service
Project Areas & Management
(Joint with EGEE)
- Applications Area (Torre Wenaus): development environment; joint projects; data management; distributed analysis
- Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
- CERN Fabric Area (Bernd Panzer): large cluster management; data recording; cluster technology; networking; computing service at CERN
- Grid Deployment Area (Ian Bird): establishing and managing the Grid Service: middleware certification, security, operations, registration, authorisation, accounting
- Distributed Analysis, ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology
- Project Leader: Les Robertson; Resource Manager: Chris Eck; Planning Officer: Jürgen Knobloch; Administration: Fabienne Baud-Lavigne
- Pere Mato takes over the Applications Area from 1 March 05
Relation with EGEE
Applications Area
- All Applications Area projects have software deployed in production by the experiments: POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA, Savannah
- 400 TB of POOL data produced in 2004
- Pre-release of the Conditions Database (COOL); the 3D project (Distributed Deployment of Databases) will help POOL and COOL in terms of scalability
- Geant4 successfully used in the ATLAS, CMS and LHCb data challenges with excellent reliability
- GENSER MC generator library in production
- Progress on integrating ROOT with other Applications Area components: improved I/O package used by POOL; common dictionary and maths library with SEAL
- Pere Mato (CERN, LHCb) has taken over from Torre Wenaus (BNL, ATLAS) as Applications Area Manager
- Plan for the next phase of the applications area being developed for internal review at the end of March
The ARDA project
- ARDA is an LCG project; its main activity is to enable LHC analysis on the grid
- ARDA is contributing to EGEE NA4 and uses the entire CERN NA4-HEP resource
- Interface with the new EGEE middleware (gLite): by construction, use the new middleware; use the grid software as it matures; verify the components in an analysis environment (users!); provide early and continuous feedback
- Support the experiments in the evolution of their analysis systems
- Forum for activity within LCG/EGEE and with other projects/initiatives
ARDA activity with the experiments
The complexity of the field requires great care in the phase of middleware evolution and delivery:
- Complex (evolving) requirements; new use cases to be explored (for HEP: large-scale analysis)
- Different communities in the loop: the LHC experiments, middleware experts from the experiments, and other communities providing large middleware stacks (CMS GEOD, US OSG, LHCb Dirac, etc.)
- The complexity of the experiment-specific part is comparable to (often larger than) the general one
- The experiments do require seamless access to a set of sites (computing resources), but the real usage (and therefore the benefit for the LHC scientific programme) will come from exploiting the possibility to build their computing systems on a flexible and dependable infrastructure
How to progress? Build end-to-end prototype systems for the experiments to allow end users to perform analysis tasks
LHC prototype overview
LHC experiment prototypes (ARDA): all prototypes have been demoed within the corresponding user communities
INFSO-RI-508833
Interactive Session
Demo at Supercomputing 04
Current Status
A GANGA job submission handler for gLite has been developed; DaVinci jobs run on gLite, submitted through GANGA
Demo in Rio de Janeiro
Combined Test Beam
Example: ATLAS TRT data analysis done by PNPI St Petersburg (plot: number of straw hits per layer)
Real data processed with gLite: standard Athena for the test beam; data from CASTOR, processed on a gLite worker node
MonALISA
User Job Monitoring
Demo at Supercomputing 04
CERN Fabric
CERN Fabric
- Fabric automation has seen very good progress: the new systems for managing large farms have been in production at CERN since January
- New CASTOR mass storage system: being deployed first on the high-throughput cluster for the ongoing ALICE data-recording computing challenge
- Agreement on collaboration with Fermilab on a Linux distribution: Scientific Linux, based on Red Hat Enterprise 3, improves uniformity between the HEP sites serving the LHC and Run 2 experiments
- CERN computer centre preparations: power upgrade to 2.5 MW; computer centre refurbishment well under way; acquisition process started
Preparing for 7,000 boxes in 2008
High Throughput Prototype (openlab + LCG prototype)
[Diagram: 80 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory) and 80 IA32 CPU servers (dual 2.8 GHz P4, 2 GB memory); 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory); 36 disk servers (dual P4, IDE disks, ~1 TB each) and 24 disk servers (P4, SATA disks, ~2 TB each); 28 TB IBM StorageTank; 2 x 50 Itanium 2 nodes (dual 1.3/1.5 GHz, 2 GB memory); 12 STK 9940B tape servers; 4 Enterasys N7 10 GE switches and 2 Enterasys X-Series; 4 GE connections to the backbone; 10 GE WAN connection]
- Experience with likely ingredients in LCG: 64-bit programming; next-generation I/O (10 Gb Ethernet, InfiniBand, etc.)
- High-performance cluster used for evaluations and for data challenges with experiments
- Flexible configuration: components moved in and out of the production environment
- Co-funded by industry and CERN
ALICE Data Recording Challenge
- Target: one week sustained at 450 MB/s
- Used the new version of the CASTOR mass storage system
- Note the smooth degradation and recovery after equipment failure
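For scale, one week sustained at 450 MB/s corresponds to roughly 270 TB written. A quick back-of-the-envelope check (the rate and duration are from the slide; the arithmetic is ours):

```python
# Back-of-the-envelope data volume for the ALICE data recording challenge:
# one week sustained at 450 MB/s (figures from the slide).
RATE_MB_S = 450
SECONDS_PER_WEEK = 7 * 24 * 3600          # 604800 s

total_mb = RATE_MB_S * SECONDS_PER_WEEK   # megabytes written in one week
total_tb = total_mb / 1_000_000           # decimal units: 1 TB = 10^6 MB

print(f"{total_tb:.0f} TB sustained over one week")  # ~272 TB
```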
Deployment and Operations
[Diagram: Tier-1 and Tier-2 centres: RAL, IN2P3, FNAL, USC, Krakow, CIEMAT, Rome, Taipei, LIP, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC, NIKHEF, TRIUMF, CNAF, FZK, BNL, PIC, ICEPP, Nordic; plus Tier-2s, small centres, desktops and portables]
LHC Computing Model (simplified!!)
Tier-0: the accelerator centre
- Filter raw data; reconstruction produces event summary data (ESD)
- Record the master copy of raw data and ESD
Tier-1: managed mass storage
- Permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service
- Data-heavy (ESD-based) analysis; re-processing of raw data
- National and regional support
- Online to the data acquisition process: high availability, long-term commitment
Tier-2:
- Well-managed, grid-enabled disk storage
- End-user analysis, batch and interactive
- Simulation
Computing Resources: March 2005 (map legend: countries providing resources; countries anticipating joining)
- In LCG-2: 121 sites, 32 countries, >12,000 CPUs, ~5 PB storage
- Includes non-EGEE sites: 9 countries, 18 sites
Infrastructure metrics: countries, sites and CPUs available in the LCG-2 production service (EGEE partner regions and other collaborating sites)
Service Usage
VOs and users on the production service:
- Active HEP experiments: the 4 LHC experiments, D0, CDF, ZEUS, BaBar
- Other active VOs: Biomed, ESR (Earth Sciences), CompChem, MAGIC (astronomy), EGEODE (geophysics)
- 6 disciplines
- Registered users in these VOs: ~500
- In addition, there are many VOs local to a region, supported by their ROCs, but not yet visible across EGEE
Scale of work performed (LHC data challenges 2004):
- >1 M SI2K-years of CPU time (~1000 CPU-years)
- 400 TB of data generated, moved and stored
- 1 VO achieved ~4000 simultaneous jobs (~4 times CERN's grid capacity)
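The two CPU figures together imply an average node capacity of about 1000 SI2K, plausible for a P4-class machine of the era. A hedged sanity check (only the slide's two numbers are inputs; the interpretation is ours):

```python
# Sanity check on the 2004 data challenge figures from the slide:
# >1 M SI2K-years of normalised CPU time, quoted as ~1000 CPU-years.
si2k_years = 1_000_000     # total normalised CPU time
cpu_years = 1_000          # equivalent wall time on real CPUs

# Implied average power of one CPU in the fleet:
avg_si2k_per_cpu = si2k_years / cpu_years
print(f"implied average CPU power: {avg_si2k_per_cpu:.0f} SI2K")  # 1000 SI2K
```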
[Chart: number of jobs processed per month]
Current production software (LCG-2)
- Evolution through 2003/2004; the focus has been on making these components reliable and robust rather than on additional functionality, responding to the needs of users, admins and operators
The software stack:
- Virtual Data Toolkit: Globus (2.4.x), Condor, etc.
- Higher-level components developed by the EU DataGrid project: workload management (RB, L&B, etc.); Replica Location Service (single central catalog) and replica management tools; R-GMA as accounting and monitoring framework; VOMS being deployed now
- Components re-worked by the operations team: information system (MDS GRIS/GIIS replaced by the LCG-BDII); edg-rm tools replaced and augmented as lcg-utils
- Developments on disk pool managers (dCache, DPM), not addressed by JRA1
- Other tools as required, e.g. GridIce (EU DataTAG project)
- Maintenance agreements with: the VDT team (including Globus support); DESY/FNAL (dCache); EGEE/LCG teams (WLM, VOMS, R-GMA, data management)
Software, part 2
- Platform support was an issue: limited to Red Hat 7.3. Now ported to Scientific Linux (RHEL), Fedora, IA64, AIX, SGI
- Another problem was the heaviness of installation. Now much improved and simpler: simple installation tools allow integration with existing fabric management tools, and installation on worker nodes is much lighter (user level)
Overall status
- The production grid service is quite stable and the services are quite reliable; remaining instabilities in the IS are being addressed
- Sensitivity to site management: problems in underlying services must be addressed; work on stop-gap solutions (e.g. the RB maintains state; Globus gridftp becomes a reliable file transfer service)
- The biggest problem is the stability of sites: configuration problems due to the complexity of the middleware; fabric management at less experienced sites
- Job efficiency is not high unless operations/applications select stable sites (the BDII allows an application-specific view); in large tests, selecting stable sites, well over 90% efficiency is achieved
- The operations workshop last November addressed this: a fabric management working group will write a fabric management cookbook; tighten operations control of the grid (escalation procedures, removing bad sites)
- The complexity is in the number of sites, not the number of CPUs
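The effect of an application-specific view can be sketched numerically: overall job efficiency is a site-weighted average, so excluding a few badly configured sites lifts it sharply. A toy model (all site names and figures are invented for illustration):

```python
# Toy model: job efficiency before/after selecting stable sites via a
# filtered (application-specific) BDII view. All figures are invented.
sites = {
    # site        (jobs submitted, jobs succeeded)
    "stable-1":   (1000, 960),
    "stable-2":   (800, 776),
    "flaky-1":    (500, 150),   # badly configured site
    "flaky-2":    (300, 60),
}

def efficiency(selected):
    """Aggregate success rate over the selected sites."""
    done = sum(sites[s][1] for s in selected)
    submitted = sum(sites[s][0] for s in selected)
    return done / submitted

all_sites = list(sites)
# Keep only sites whose own success rate exceeds 90%:
stable = [s for s in all_sites if sites[s][1] / sites[s][0] > 0.9]

print(f"all sites:   {efficiency(all_sites):.0%}")   # ~75%
print(f"stable only: {efficiency(stable):.0%}")      # ~96%
```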
Operations Structure
- Operations Management Centre (OMC): at CERN; coordination, etc.
- Core Infrastructure Centres (CIC): manage daily grid operations (oversight, troubleshooting); run essential infrastructure services; provide 2nd-level support to the ROCs; UK/I, France, Italy, CERN, plus Russia (M12); Taipei also runs a CIC
- Regional Operations Centres (ROC): act as front-line support for user and operations issues; provide local knowledge and adaptations; one in each region, many distributed
- User Support Centre (GGUS): at FZK; manages the problem tracking system; provides a single point of contact (service desk); not foreseen as such in the TA, but the need is clear
Grid Operations
- The grid is flat, but there is a hierarchy of responsibility, essential to scale the operation
- The CICs act as a single Operations Centre: operational oversight (grid operator) responsibility rotates weekly between the CICs; problems are reported to the ROC/RC; the ROC is responsible for ensuring a problem is resolved
- A ROC oversees its regional RCs; ROCs are responsible for organising operations in a region and coordinating deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
(RC = Resource Centre)
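The weekly CIC-on-duty rotation can be sketched as a tiny schedule: given a start date and an ordered list of CICs, the operator for any day follows from integer division of the elapsed days. The CIC names are from the slide; the rotation order and start date are invented for illustration:

```python
from datetime import date

# Weekly CIC-on-duty rotation sketch. CIC names are from the slide;
# the ordering and the rotation start date are invented examples.
CICS = ["CERN", "UK/I", "France", "Italy", "Russia", "Taipei"]
ROTATION_START = date(2005, 1, 3)  # a Monday (arbitrary choice)

def cic_on_duty(day: date) -> str:
    """Return which CIC holds the grid-operator duty on a given day."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return CICS[weeks_elapsed % len(CICS)]

print(cic_on_duty(date(2005, 1, 5)))   # week 0 -> CERN
print(cic_on_duty(date(2005, 2, 9)))   # week 5 -> Taipei
```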
SLAs and 24x7
- Start with service level definitions: what a site supports (apps, software, MPI, compilers, etc.); levels of support (number of admins, hours/day, on-call, operators); response time to problems
- Define metrics to measure compliance; publish the metrics and the performance of sites relative to their commitments
- Remote monitoring/management of services can be considered for small sites; middleware/services should cope with bad sites
- Clarify what 24x7 means: the service should be available 24x7, but this does not mean all sites must be available 24x7; only specific crucial services justify the cost; classify services according to the level of support required
- Operations tools need to become more and more automated: having an operating production infrastructure should not mean having staff on shift everywhere (best-effort support); the infrastructure (and the applications) must adapt to failures
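Publishing performance against commitments amounts to comparing measured metrics with each site's declared service level. A minimal sketch (the metric names, thresholds and measured values are invented):

```python
# Minimal SLA compliance check: compare measured metrics against a site's
# declared commitments and list shortfalls. All values are invented.
commitments = {"availability": 0.95, "response_hours": 24}
measured = {"availability": 0.91, "response_hours": 30}

def compliance_report(commit, meas):
    report = {}
    # availability: higher is better
    report["availability"] = meas["availability"] >= commit["availability"]
    # response time: lower is better
    report["response_hours"] = meas["response_hours"] <= commit["response_hours"]
    return report

report = compliance_report(commitments, measured)
failing = [metric for metric, ok in report.items() if not ok]
print("failing metrics:", failing)  # both metrics miss their commitment
```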
Operational Security
- An operational security team is in place: EGEE security officer, ROC security contacts
- Concentrating on 3 activities: incident response; best-practice advice for grid admins (creating a dedicated web); evaluation of security service monitoring
- Incident response: JSPG agreement on IR in collaboration with OSG; updates the existing policy to guide the development of a common capability for handling and responding to cyber-security incidents on grids; basic framework for incident definition and handling
- Site registration process in draft, part of a basic SLA
- CA operations: EUGridPMA (best practice, minimum standards, etc.); more and more CAs appearing
- The security group and its work, started in LCG, was from the start a cross-grid activity; much was already in place at the start of EGEE (usage policy, registration process and infrastructure, etc.). We regard it as crucial that this activity remains broader than just EGEE
Policy: Joint Security Group
Documents: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration; Application Development & Network Admin Guide
http://cern.ch/proj-lcg-security/documents.html
User Support
- User support (call centre/helpdesk): coordinated through GGUS, with the ROCs as the front line; a task force is in place to improve the service
- VO support: was an oversight in the project and is not really provisioned. In LCG there is a team (5 FTE) to help applications integrate with the middleware: direct 1:1 support, understanding of needs, acting as an advocate for the application. This is really missing for the other applications; adaptation to the grid environment takes expertise
- We have found that user support has 2 distinct aspects: deployment support (middleware problems) and application-specific user support (VO-specific problems)
[Diagram: the LHC experiments, other communities (e.g. Biomed) and non-LHC experiments report to the Global Grid User Support (GGUS) single point of contact, which coordinates user support; operations problems go to the Operations Centres (CIC/ROC); hardware problems go to the Resource Centres (RC)]
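The support diagram is essentially a router: tickets arrive at the GGUS single point of contact and are dispatched by problem class. A sketch of that dispatch (the categories follow the slide's diagram; the function and table names are ours):

```python
# Sketch of GGUS ticket dispatch following the support diagram:
# middleware problems -> deployment support; VO-specific -> application
# support; operations -> CIC/ROC; hardware -> the resource centre.
ROUTES = {
    "middleware":  "deployment support",
    "vo-specific": "application-specific user support",
    "operations":  "operations centres (CIC/ROC)",
    "hardware":    "resource centre (RC)",
}

def route_ticket(category: str) -> str:
    """Return the support unit for a ticket category (GGUS coordinates)."""
    return ROUTES.get(category, "GGUS triage")  # unknown -> manual triage

print(route_ticket("middleware"))    # deployment support
print(route_ticket("power supply"))  # GGUS triage
```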
Certification process
- The process was decisive in improving the middleware, but it is time consuming (5 releases in 2004): many sequential steps; many different site layouts have to be tested; the formats of internal and external releases differ; multiple packaging formats (tool based, generic)
- All components are treated equally: the same level of testing for non-vital and core components; new tools, and tools in use by other projects, are tested to the same level
- The process to include new components is not transparent
- Timing for releases is difficult: users want them now, sites want them scheduled
- Upgrades need a long time to cover all sites, and some sites had problems becoming functional after an upgrade
Additional input
From the data challenges:
- client libraries need fast and frequent updates
- core services need fast patches (functional/fixes)
- applications need a transparent release preparation
- many problems only become visible during full-scale production
- configuration is a major problem at smaller sites
From the operations workshop:
- smaller sites can handle major upgrades only every 3 months
- sites need to give input in the selection of new packages
- conflicts with local policies must be resolved
Changes I: simple installation/configuration scripts
- YAIM (Yet Another Install Method): semi-automatic, simple configuration management based on scripts (easy to integrate into other frameworks); all configuration for a site is kept in one file
- APT (Advanced Package Tool) based installation of the middleware RPMs: simple dependency management; updates (automatic or on demand); no OS installation
- Client libraries additionally packaged as a user-space tarball, so they can be installed like application software
Changes II
- Different release frequencies for the separate release types: client libraries (UI, WN); services (CE, SE); core services (RB, BDII, ...)
- Major releases (configuration changes, RPMs, new services) on fixed dates, every 3 months (allows planning); sites have to upgrade within 3 weeks
- Updates (bug fixes) added at any time to specific releases; non-critical components will be made available with reduced testing
- Minor releases every month: based on ranked components available at a specific date in the month; not mandatory for smaller RCs to follow; client libraries will be installed as application-level software
- Early access to pre-releases of new software for the applications: client libraries made available on selected sites; services with functional changes installed on the EIS-Applications testbed; early feedback from the applications
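The fixed-date scheme can be sketched as a small calculator: major releases every 3 months, with a 3-week upgrade window for sites. The policy is from the slide; the dates are illustrative:

```python
from datetime import date, timedelta

# Release-schedule sketch: majors every 3 months with a 3-week upgrade
# window (policy from the slide; the concrete dates are illustrative).
def upgrade_deadline(major_release: date) -> date:
    """Sites must upgrade within 3 weeks of a major release."""
    return major_release + timedelta(weeks=3)

def next_major(major_release: date) -> date:
    """Next major release is 3 calendar months later."""
    month = major_release.month + 3
    year = major_release.year + (month - 1) // 12
    month = (month - 1) % 12 + 1
    return major_release.replace(year=year, month=month)

rel = date(2005, 4, 1)
print("upgrade by:", upgrade_deadline(rel))  # 2005-04-22
print("next major:", next_major(rel))        # 2005-07-01
```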
Certification process flow
[Diagram: components ready at the cutoff enter integration and first tests (C&T), producing internal releases; full deployment on test clusters with functional/stress tests takes ~1 week; bugs/patches/tasks are tracked in Savannah and costs are assigned and updated; internal client releases flow between C&T, EIS, GIS, the CICs, the GDB, the applications, the RCB, the developers and the head of deployment]
Deployment process
[Diagram: certification is run daily; minor releases every month (YAIM); major releases every 3 months, on fixed dates; EIS updates the user guides, GIS updates the release notes and installation guides; sites upgrade at their own pace]
Operations Procedures
- Driven by experience during the 2004 data challenges; reflecting the outcome of the November operations workshop
- The procedures define: the roles of CICs, ROCs and RCs; the weekly rotation of operations centre duties (CIC-on-duty); the daily tasks of the operations shift; monitoring (tools, frequency); problem reporting; the problem tracking system; communication with ROCs and RCs; escalation of unresolved problems; handing over the service to the next CIC
Implementation (evolutionary development)
- Procedures: documented (and constantly adapted), available at the CIC portal http://cic.in2p3.fr/, in use by the shift crews
- Portal (http://cic.in2p3.fr): access to tools and process documentation; repository for logs and FAQs; provides means of efficient communication; provides condensed monitoring information
- Problem tracking system: currently based on Savannah at CERN; moving to GGUS at FZK; exports/imports tickets to the local systems used by the ROCs
- Weekly phone conferences and quarterly meetings
Grid operator dashboard: the CIC-on-duty dashboard, https://cic.in2p3.fr/pages/cic/framedashboard.html
Operator procedure
[Diagram: (1) monitoring tools (GridIce, GOC, GIIS monitor) flag a problem; (2) in-depth testing and diagnosis, with help from the wiki pages; (3) a report is filed in Savannah; (4) follow-up via the CIC mailing tool to the ROC/RC; (5) escalation per the severity/escalation procedure; (6) incident closure in Savannah]
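The operator workflow behaves like a small state machine. A hedged sketch, with the states taken from the procedure diagram and the transition rules being our simplification:

```python
# Simplified state machine for a CIC-on-duty incident. States follow the
# operator-procedure diagram; the transition rules are our simplification.
TRANSITIONS = {
    "detected":    ["diagnosed"],             # monitoring tools flag it
    "diagnosed":   ["reported"],              # in-depth testing, wiki help
    "reported":    ["followed-up"],           # Savannah report filed
    "followed-up": ["escalated", "closed"],   # CIC mail to ROC/RC
    "escalated":   ["closed"],                # severity/escalation procedure
    "closed":      [],
}

def advance(state: str, target: str) -> str:
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

state = "detected"
for nxt in ["diagnosed", "reported", "followed-up", "closed"]:
    state = advance(state, nxt)
print(state)  # closed
```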
Selection of monitoring tools
Middleware
Architecture & Design
- A design team including representatives from the middleware providers (AliEn, Condor, EDG, Globus, ...), including US partners, produced the middleware architecture and design, taking into account input and experience from applications, operations and related projects
- DJRA1.1 EGEE Middleware Architecture (June 2004): https://edms.cern.ch/document/476451/
- DJRA1.2 EGEE Middleware Design (August 2004): https://edms.cern.ch/document/487871/
- Much feedback from within the project (operations & applications) and from related projects; being used and actively discussed by OSG, GridLab, etc.; input to various GGF groups
gLite Services and Responsible Clusters
[Diagram: access services (Grid Access Service, API); job management services (Computing Element, Workload Management, Job Provenance, Package Manager); data services (Storage Element, Data Management, Metadata Catalog, File & Replica Catalog); security services (Authentication, Authorization, Auditing); information & monitoring services (Information & Monitoring, Application Monitoring); Site Proxy; Accounting. Responsibility is split between the JRA3, UK, CERN and IT/CZ clusters]
gLite Services for Release 1
[Diagram: the same service decomposition as the previous slide, with the focus on key services according to the gLite management taskforce]
gLite Services for Release 1: software stack and origin (simplified)
- Computing Element: Gatekeeper (Globus); Condor-C (Condor); CE Monitor (EGEE); local batch system (PBS, LSF, Condor)
- Workload Management: WMS (EDG); Logging and Bookkeeping (EDG); Condor-C (Condor)
- Storage Element: File Transfer/Placement (EGEE); glite-I/O (AliEn); GridFTP (Globus); SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
- Catalog: File and Replica Catalog (EGEE); Metadata Catalog (EGEE)
- Information and Monitoring: R-GMA (EDG)
- Security: VOMS (DataTAG, EDG); GSI (Globus); authentication for C- and Java-based (web) services (EDG)
Summary of component status
- WMS (task queue, pull mode, data management interface): available in the prototype; used in the testing testbed; now working on the certification testbed; submission to LCG-2 demonstrated
- Catalog (MySQL and Oracle): available in the prototype; used in the testing testbed; delivered to SA1, but not tested yet
- gLite I/O: available in the prototype; used in the testing testbed; basic functionality and stress tests available; delivered to SA1, but not tested yet
- FTS: being evolved with LCG; milestone on March 15, 2005; stress tests in the service challenges
- UI: available in the prototype; includes data management; not yet formally tested
- R-GMA: available in the prototype; testing has shown deployment problems
- VOMS: available in the prototype; no tests available
Schedule
- All of the services are available now on the development testbed (on a limited scale); user documentation is currently being added
- Most of the services are being deployed on the LCG pre-production service: initially at CERN, more sites once tested/validated; scheduled for April-May
- Deployment at the major sites is scheduled by the end of May, in time to be included in the LCG service challenge that must demonstrate full capability in July, prior to operating as a stable service in the second half of 2005
Migration Strategy
- Certify gLite components on the existing LCG-2 service
- Deploy components in parallel, replacing them with the new service once stability and functionality are demonstrated
- WN tools and libraries must co-exist on the same cluster nodes
- As far as possible, the transition must be smooth
Service Challenges
Problem Statement
- A robust file transfer service is often seen as the goal of the LCG service challenges
- Whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect
- Getting all sites to acquire and run the infrastructure is non-trivial: managed disk storage, tape storage, agreed interfaces, a 24 x 365 service (including during conferences, vacations, illnesses, etc.)
- Need to understand the networking requirements and plan early
- But transferring dummy files is not enough: that only shows that the basic infrastructure works reliably and efficiently
- Need to test the experiments' use cases: check for bottlenecks and limits in software, disk and other caches, etc.
- We can presumably write test scripts to mock up the experiments' computing models, but the real test will be to run the experiments' software, which requires strong involvement from the production teams
LCG Service Challenges: Overview
- The LHC will enter production (physics) in April 2007; it will generate an enormous volume of data and require a huge amount of processing power
- The LCG solution is a world-wide grid; many components are understood, deployed and tested
- But: unprecedented scale, and the humungous challenge of getting large numbers of institutes and individuals, all with existing and sometimes conflicting commitments, to work together
- LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now; issues include hardware acquisition, personnel hiring and training, vendor rollout schedules, etc.
- It should not limit the physicists' ability to exploit the performance of the detectors, nor the LHC's physics potential, whilst being stable, reliable and easy to use
Key Principles
- The service challenges result in a series of services that exist in parallel with the baseline production service
- Rapidly and successively approach the production needs of the LHC
- Initial focus: core (data management) services
- Swiftly expand out to cover the full spectrum of the production and analysis chain
- Must be as realistic as possible, including end-to-end testing of key experiment use cases over extended periods, with recovery from glitches and longer-term outages
- The necessary resources and commitment are a pre-requisite to success, and should not be under-estimated!
Initial Schedule (evolving)
- Q1/Q2: up to 5 Tier-1s, writing to disk at 100 MB/s per T1 (no experiments)
- Q3/Q4: include two experiments, tape, and a few selected Tier-2s
- 2006: progressively add more Tier-2s and more experiments; ramp up to twice the nominal data rate
- 2006: production usage by all experiments at reduced rates (cosmics); validation of the computing models
- 2007: delivery and contingency
N.B. there is more detail in the Dec/Jan/Feb GDB presentations
Key dates for Service Preparation
- Jun 05: Technical Design Report
- Sep 05: SC3 service phase
- May 06: SC4 service phase
- Sep 06: initial LHC service in stable operation
- Apr 07: LHC service commissioned
The service challenge milestones:
- SC2: reliable data transfer (disk-network-disk); 5 Tier-1s; aggregate 500 MB/s sustained at CERN
- SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/s, including mass storage (~25% of the nominal final throughput for the proton period)
- SC4: all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput
- LHC service in operation from September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
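The SC3 figure pins down the target: if 500 MB/s is ~25% of the nominal proton-period throughput, the final grid data rate is about 2 GB/s, and the "twice nominal" headroom goal is about 4 GB/s. A quick derivation (only the slide's two numbers are inputs):

```python
# Implied nominal throughput from the SC3 milestone: 500 MB/s is quoted
# as ~25% of the nominal final rate for the proton period.
sc3_rate_mb_s = 500
sc3_fraction = 0.25

nominal_mb_s = sc3_rate_mb_s / sc3_fraction   # ~2000 MB/s (2 GB/s)
twice_nominal = 2 * nominal_mb_s              # headroom target, ~4 GB/s

print(f"nominal: {nominal_mb_s:.0f} MB/s, twice nominal: {twice_nominal:.0f} MB/s")
```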
FermiLab, Dec 04 / Jan 05: FermiLab demonstrated 500 MB/s for 3 days in November
FTS stability (chart annotation: M, not K!)
Interoperability
Introduction: grid flavours
LCG-2 vs Grid3:
- both use the same VDT version (Globus 2.4.x); LCG-2 adds components for WLM, IS, R-GMA, etc.
- both use approximately the same information schema (GLUE); the Grid3 schema is not all GLUE; small extensions by each
- both use MDS (BDII)
LCG-2 vs NorduGrid:
- NorduGrid uses a modified version of Globus 2.x
- it does not use the gatekeeper: a different interface
- very different information schema, but it does use MDS
Canada: a gateway into GridCanada and WestGrid (Globus based) is in production
Catalogues: LCG-2 uses the EDG-derived catalogue (for POOL); Grid3 and NorduGrid use Globus RLS
Work done: with Grid3/OSG, strong contacts and many points of collaboration; with NorduGrid, discussions have started
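Aligning information systems means both grids publish GLUE attributes that a BDII can serve. A hedged sketch of selecting CEs with free slots from GLUE 1.x CE records, as an application-specific view might (the attribute names follow the GLUE 1.x schema; the records themselves are invented examples):

```python
# Sketch: select CEs with free CPUs from GLUE 1.x CE records, as an
# application-specific BDII view might. The records are invented;
# attribute names (GlueCEUniqueID, GlueCEStateFreeCPUs) follow GLUE 1.x.
records = [
    {"GlueCEUniqueID": "ce01.example-lcg2.org:2119/jobmanager-pbs-lhcb",
     "GlueCEStateFreeCPUs": 42},
    {"GlueCEUniqueID": "ce.example-grid3.org:2119/jobmanager-condor",
     "GlueCEStateFreeCPUs": 0},
]

def ces_with_free_cpus(recs, minimum=1):
    """Return CE identifiers advertising at least `minimum` free CPUs."""
    return [r["GlueCEUniqueID"] for r in recs
            if r["GlueCEStateFreeCPUs"] >= minimum]

print(ces_with_free_cpus(records))  # only the first CE has free slots
```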
Common areas (with Grid3/OSG)
Interoperation:
- align the information systems
- run jobs between LCG-2 and Grid3/NorduGrid
- storage interfaces: SRM
- reliable file transfer
- service challenges
Infrastructure:
- security: security policy (JSPG) and operational security; both are explicitly common activities across all sites
- monitoring: job monitoring, grid monitoring
- accounting
- grid operations: common operations policies, problem tracking
Interoperation
LCG-2 jobs on Grid3:
- a G3 site runs the LCG-developed generic info provider, filling their site GIIS with the info missing from the GLUE schema
- from the LCG-2 BDII one can see G3 sites
- running a job on a Grid3 site needed: G3 installing the full set of LCG CAs; users added into VOMS; the WN installation (very lightweight now) installs on the fly
Grid3 jobs on LCG-2:
- added the Grid3 VO to our configuration
- they point directly to the site (they do not use the IS for job submission)
- job submission between LCG-2 and Grid3 has been demonstrated
NorduGrid can run the generic info provider at a site, but work is required to use the NG clusters
Storage and file transfer
Storage interfaces:
- LCG-2, gLite and Open Science Grid all agree on SRM as the basic interface to storage
- the SRM collaboration has existed for >2 years, with a group in GGF
- SRM interoperability has been demonstrated; LHCb use SRM in their stripping phase
Reliable file transfer:
- work is ongoing with the Tier-1s (including FNAL, BNL, TRIUMF) in the service challenges
- agreement that the interface is SRM, with srmcopy or gridftp as the transfer protocol
- reliable transfer software will run at all sites; it is already in place as part of the service challenges
Operations
Several points where collaboration will happen, starting from the LCG and OSG operations workshops:
- operational security / incident response
- a common site charter / service definitions, if possible
- collaboration on operations centres (CIC-on-duty)?
- operations monitoring: a common schema for problem descriptions/views would allow tools to understand both; common metrics for performance and reliability; common site and application validation suites (for the LHC applications)
- accounting: Grid3 and LCG-2 use the GGF schema; agree to publish into a common tool (NorduGrid could too)
- job monitoring: LCG-2 Logging and Bookkeeping has a well-defined set of states; agreeing on a common set will allow common tools to view job states in any grid; good high-level (web) tools are needed so a user can track jobs easily across grids
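Viewing job states across grids needs a mapping from each grid's native states onto an agreed common set. A toy sketch (the LCG-2 states follow EDG Logging and Bookkeeping; the Grid3 column and the common set are invented for illustration):

```python
# Toy mapping of native job states onto a common set, so one tool can
# display jobs from either grid. LCG-2 states follow EDG L&B; the Grid3
# states and the common set are invented for illustration.
COMMON = {"queued", "running", "finished", "failed"}

LCG2_MAP = {"Submitted": "queued", "Waiting": "queued", "Ready": "queued",
            "Scheduled": "queued", "Running": "running",
            "Done": "finished", "Aborted": "failed"}
GRID3_MAP = {"pending": "queued", "active": "running",
             "done": "finished", "error": "failed"}

def common_state(grid: str, state: str) -> str:
    """Translate a grid-native job state into the common state set."""
    table = LCG2_MAP if grid == "lcg2" else GRID3_MAP
    return table[state]

print(common_state("lcg2", "Scheduled"))  # queued
print(common_state("grid3", "active"))    # running
```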
Outlook & Summary
- LHC startup is very close, and the services have to be in place 6 months earlier
- The service challenge programme is the ramp-up process; all aspects are really challenging!
- Now that the experiment computing models have been published, we are trying to clarify what services LCG must provide and what their interfaces need to be (baseline services working group)
- An enormous amount of work has been done by the various grid projects, already at the full complexity and scale foreseen for startup
- But there are still significant problems to address in functionality and in stability of operations
- It is time to bring these efforts together to build the solution for the LHC
Thank you for your attention !