
Page 1: LCG Grid Deployment


LHC Computing Grid Deployment

Ian Bird, IT Division, CERN

LCG Deployment Area Manager

US LHC Software and Computing Review, January 14, 2003

Page 2: LCG Grid Deployment


Outline

• LCG Phase I Deployment Goals
• Deployment Organisation
• LCG Deployment Plan
• Middleware
• Test beds and Services
• Certification, Testing, Packaging, Distribution
• Operations and Security
• Support
• Collaborative projects
• Summary

Page 3: LCG Grid Deployment


LCG-1 Deployment Goals

• Production service for Data Challenges in 2H03 & 2004
  – Initially focused on batch production work
• Experience in close collaboration between the Regional Centres
  – Must have wide enough participation to understand the issues
  – Learn how to maintain and operate a global grid
• Focus on a production-quality service
  – Robustness, fault-tolerance, predictability, and supportability take precedence; additional functionality gets prioritized
• LCG should be integrated into the sites’ physics computing services – it should not be something apart
  – This requires coordination between participating sites in:
    • Policies and collaborative agreements
    • Resource planning and scheduling
    • Operations and Support

Page 4: LCG Grid Deployment


Elements of an LCG Service

• Middleware:
  – Testing and certification
  – Packaging, configuration, distribution and site validation
  – Support – problem determination and resolution; feedback to middleware developers
• Operations:
  – Grid infrastructure services
  – Site fabrics run as production services
  – Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally
• Support:
  – Experiment integration – ensure optimal use of the system
  – User support – call centres/helpdesk – global coverage; documentation; training

Page 5: LCG Grid Deployment


Deployment Organisation

• Grid Deployment Board (GDB) – chair: Mirco Mazzucato
  – Representatives from the experiments and from each country with an active Regional Centre taking part in the LCG Grid Service
  – Forges the agreements, takes the decisions, and defines the standards and policies needed to set up and manage the LCG Global Grid Services
  – Coordinates the planning of resources for physics and computing data challenges
  – First task is the detailed definition of LCG-1, which includes defining the set of grid middleware tools to be deployed
• LCG Deployment Group – coordinated by the Deployment Area Manager
  – Teams at CERN and Regional Centres implement the elements of the LCG service: certification and testing; operations; support; scheduling
  – Guided by agreements negotiated by the GDB
  – Reports to the Project Execution Board and SC2

Page 6: LCG Grid Deployment


GDB Status

• First meeting in Milano – Oct 4, 2002; four meetings since
  – Set up technical working groups:
    • WG1: Define LCG-1 functionality and services
    • WG2: Define the schedule for rolling out the infrastructure and resources; propose the process and metrics to be used for allocation, accounting, and reporting
    • WG3: Define a straightforward security and authentication model to be used in LCG-1, and identify the technical issues; set up agreements to enable implementation
    • WG4: Define operations procedures and responsibilities; make agreements to ensure coordination of these activities; define the requirements for a Grid Operations Centre to coordinate operational activities
    • WG5: Propose a support model for LCG-1, including the scope of responsibilities for the call centre/helpdesk (delayed until Feb 03)
• Status:
  – LCG-1 on track to be defined by end Jan 03 – certainly in the essentials needed to progress: functionality and services, participating sites, resources and schedules, initial operational and security models
  – Working group final reports due ~end Jan 03 – should indicate where LCG should focus effort, the process for agreements, etc.

Page 7: LCG Grid Deployment


LCG Deployment Plan: Level 1 Milestones

M1.1  – July 03:      First Global Grid Service (LCG-1) available
M1.2  – June 03:      Hybrid Event Store (Persistency Framework) available for general users
M1.3a – November 03:  LCG-1 reliability and performance targets achieved
M1.3b – November 03:  Distributed batch production using grid services
M1.4  – May 04:       Distributed end-user interactive analysis from “Tier 3” centre
M1.5  – December 04:  “50% prototype” (LCG-3) available
M1.6  – March 05:     Full Persistency Framework
M1.7  – June 05:      LHC Global Grid TDR

Page 8: LCG Grid Deployment


LCG Phase I Timescale in a Nutshell

• LCG-1 must be defined – end Jan 03
  – Two major areas to be addressed by the GDB working groups:
    • Define LCG-1 in terms of required functionality and services, and the deployment schedule
    • Set up the distributed organisational structure
      – Resources and scheduling
      – Policies – security, authentication, etc.
      – Operational agreements and responsibilities
      – Support services
• LCG-1 service must be in place – July 2003
  – 6 months for testing, integration, certification, packaging and deployment
• Need to demonstrate performance – end 2003
  – This should include adding current production services into LCG
• Provide a production service for the data challenges in 2004
• LCG-3 follows LCG-1 by one year – provides the “50% complexity” service in 2005

Page 9: LCG Grid Deployment


Activities to Achieve: Initial Availability of First Global Service

• Define LCG-1 in terms of functionality, resources, operations, security, and support
• Deploy a series of evolving pilot services for testing, with increasing resources
  – First pilot service – Feb 1, 2003
  – Incremental deployment to Tier 1 and Tier 2 centres: ~10 sites on 3 continents by June
• Testing, certification, packaging and release of software
  – Certification, testing and release process defined – January 2003
  – Packaging/configuration mechanism defined – March 2003
  – Delivery of middleware software packages – March 1, 2003
  – Iterative, incremental release cycle, with major functional releases: V1.0 – June 1, 2003

Page 10: LCG Grid Deployment


Activities – cont.

• Set up infrastructure and operational procedures
  – Certificate Authorities and VO management systems in place – May 2003
    • Based on existing EU and US inter-operating systems (a registration sketch follows this list)
  – Resource accounting and reporting procedures set up – May 2003
  – Security procedures defined, agreed and in place – June 2003
    • Incident response and security management
• Set up operations centre and help desk (call centre)
  – Identify operations and call centre locations – February 1, 2003
  – In place by June 2003
• LCG-1 commissioning and acceptance
  – 30-day commissioning period with user productions and stress tests, including a 7-day acceptance period
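To make the VO-management step above concrete: in Globus-based grids of this era, VO membership was commonly turned into local authorisation via a grid-mapfile mapping certificate subject DNs to local accounts. The following is a minimal sketch only – the DNs, account names, and output path are invented for illustration and are not LCG’s actual registration tooling.

    # A minimal sketch, not LCG's actual tooling: write a Globus-style
    # grid-mapfile mapping certificate subject DNs to local accounts.
    # All DNs and account names below are invented for illustration.
    members = [
        ('/C=CH/O=CERN/OU=IT/CN=Jane Doe', '.atlas'),        # pool account
        ('/C=US/O=DOEGrids/OU=People/CN=John Smith', '.cms'),
    ]

    with open('grid-mapfile', 'w') as f:
        for dn, account in members:
            # grid-mapfile format: quoted DN, space, local account name
            f.write('"%s" %s\n' % (dn, account))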

Page 11: LCG Grid Deployment


Activities to Achieve: LCG-1 Fully Operational

• Define LCG-1 performance goals – July 2003
  – In concert with the experiments and their data challenge requirements, set performance goals in terms of capacity, throughput, reliability, etc. (a GDB working group)
• 10 Regional Centres participating – October 2003
  – WG2 defines the implementation schedule – may be adjusted in July
• LXBatch service merged into LCG-1 – October 2003
  – All resources of LXBATCH will be grid-enabled and accessible as part of the LCG-1 service; this is a CERN activity, but it will hopefully be reflected at other sites too
• Milestone release of middleware – October 2003
  – V1.1 release with improved functionality
• Review of service – November 2003
  – The LCG-1 service level should be that required for the 2004 data challenges; determination and acceptance of achieving the target will be done in a review of the service by representatives from the experiments, the regional centres and LCG

Page 12: LCG Grid Deployment


Deployment Details

Page 13: LCG Grid Deployment


Middleware

• Deployed middleware to be based on US and EU toolkits:
  – VDT
    • Globus, Condor, GLUE schema, EDG CA and VO tools, etc.
  – EDG
    • Resource Broker
    • Reptor – WP2 (Data Management) – using RLS
    • WP4 tools will be used to manage the CERN fabric (available for others)
    • VOMS
  – Monitoring tools
    • Initially based on work done by WorldGrid (iVDGL + DataTag)
• Specifics (functionality, versions, delivery) are being firmed up now by GDB WG1 – final by the beginning of February
• This will provide the initial basic functionality and will evolve significantly
  – LCG will focus on building a robust service – changes in basic functionality are driven by the experiments
• Deployment of these components involves not only obtaining the software but also agreeing the essential support and maintenance
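As a concrete taste of one of these components: the GLUE schema is published through an LDAP-based information service (Globus MDS), so resource discovery amounts to an LDAP search. The sketch below assumes a hypothetical information-service host and the conventional MDS base DN; the attribute names follow the GLUE 1.x schema, but the endpoint details vary by deployment.

    # A minimal sketch, assuming a hypothetical MDS/GLUE endpoint:
    # list compute elements and their free CPUs via an anonymous
    # LDAP search. Requires the third-party ldap3 package.
    from ldap3 import Server, Connection, ALL

    server = Server('giis.example.org', port=2135, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind
    conn.search(
        search_base='mds-vo-name=local,o=grid',
        search_filter='(objectClass=GlueCE)',
        attributes=['GlueCEUniqueID', 'GlueCEStateFreeCPUs'],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs)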

Page 14: LCG Grid Deployment


Testbeds and Services

• The deployed systems will exist in several versions and functions:
  – Certification testbeds – both local and distributed
    • Integration of middleware components, controlled changes, in-depth application testing
    • Prepare for release
  – Production service – deployed at Regional Centres
  – Development service – deployed at Regional Centres
  – Certification testbeds run in parallel with the Production and Development services – i.e. the production release must be debugged and stabilised in parallel with development

• These are two prongs of the “Gordon trident”; the third prong is the grid projects’ development systems (iVDGL, DataGrid)

[Diagram: the three prongs – the LCG production grid, the grid developers’ testbed, and the development testbed, with iVDGL and DataGrid feeding the development side]

Page 15: LCG Grid Deployment


Timelines – LCG Phase 1

[Timeline chart, Jan 03 – July 05: LCG services (Pilot-1, Pilot-2, LCG-1 testbed, LCG-3); LCG certification & test with incremental middleware releases; regional centres added incrementally; experiment data challenges (ALICE 5% and 10%, ATLAS DC-2, CMS 5% DC04 DC05 10%, LHCb). Milestones: LCG-1 defined; LCG-1 initial service available; LCG-1 full service available; LCG-1 fulfils performance goals; LCG-3 fulfils performance goals; Computing TDR]

Page 16: LCG Grid Deployment


Certification and Testing

• Will be an ongoing major activity of LCG
  – Part of what will make LCG a production-level service
• Goals:
  – Certify/validate that the middleware behaves as advertised and provides the required functionality (HEPCAL)
  – Stabilise and robustify the middleware
  – Provide debugging, problem resolution and feedback to developers
• Testing activities at all levels (a sketch of a basic functional test follows this list):
  – Component/unit tests
  – Basic functional tests, including tests of distributed (grid) services
  – Application-level tests – based on the HEPCAL use cases
    • Driven/implemented by the experiments – GAG set up by SC2
  – Experiment beta-testing before release
  – Site configuration verification
• JTB collaborative project – LCG, Trillium, EDG
  – Gather existing tests
  – Write/obtain missing tests

Page 17: LCG Grid Deployment


Certification & Testing: Testbeds

• CERN testbed
  – Several “clusters” forming a local grid
  – Basic tests, basic grid functionality
• Distributed testbed
  – CERN testbed plus testbeds at a few other remote sites
  – Grid functionality
  – Application benchmarks and experiment beta testing
• Needs several versions
  – Current production version – for reproducing and fixing problems
  – Development version
  – Plus OS versions …

Page 18: LCG Grid Deployment


Certification & Testing: Release Strategy

• Small release cycles with incremental functionality, rather than major releases where many things change
  – Somewhat dependent on the technology suppliers and their responsiveness to LCG needs, since LCG is not in control of development
  – There will, however, be milestone functional releases in June and October 2003
• Continuous, evolutionary process
• Each release goes through the certification/test cycle
  – The only way to keep control of bugs
• The goal is stability and robustness …

Page 19: LCG Grid Deployment


Packaging and Distribution

• Obviously a major issue for a deployment project
• Joint activity started
  – Discussions among LCG, EDG, VDT, EDT, iVDGL, etc.
  – Have produced a draft discussion document
  – Will soon lead to a JTB joint project
• Want to provide a tool that satisfies the needs of the participating sites
  – Interoperates with existing tools where appropriate and necessary
  – Does not force a solution on sites with established infrastructure
  – Provides a solution for sites with nothing
• Configuration is an essential component
  – Essential to understand and validate correct site configuration
  – Effort will be devoted to providing configuration tools
  – Verification of correct configuration will be required before sites join LCG (a sketch of such a check follows)
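The verification tooling itself was still to be defined at this point; the sketch below only illustrates the kind of checks a site-validation script might perform. The file paths and gatekeeper port are conventional for Globus installations of the time, but every named requirement here is an assumption, not the agreed LCG check list.

    # A minimal sketch of a site-configuration check; the requirements
    # below are illustrative assumptions, not the agreed LCG check list.
    import os
    import socket

    def port_open(host, port):
        # True if a TCP connection to host:port succeeds within 5 s.
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            return False

    CHECKS = {
        'host certificate present':
            lambda: os.path.exists('/etc/grid-security/hostcert.pem'),
        'grid-mapfile present':
            lambda: os.path.exists('/etc/grid-security/grid-mapfile'),
        'gatekeeper listening (port 2119)':
            lambda: port_open('localhost', 2119),
    }

    failures = [name for name, check in CHECKS.items() if not check()]
    print('site OK' if not failures else 'FAILED: ' + ', '.join(failures))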

Page 20: LCG Grid Deployment


LCG Operations

• Responsible for operating and maintaining the grid infrastructure and associated services
  – Gateways, information services, resource broker, etc. – i.e. the grid-specific services
  – Will be coordinated between teams at CERN and at the Regional Centres
  – Responsible also for the VO infrastructure and the authentication and authorisation services
  – Security operations – incident response, etc.
• Build Grid Operations Centre(s) (a monitoring sketch follows this list)
  – Performance and problem monitoring; troubleshooting and coordination with site operations, user support, network operations, etc.
  – Leverage existing experience/ideas (WorldGrid – iVDGL, EDT, etc.)
    • Started discussions with DataTag about future developments
    • An Indian group provides development effort
    • An LCG site to lead this (FZK? – not certain yet)
    • Once an activity lead is in place, collaborative activities will expand
  – Assemble monitoring, reporting, performance, etc. tools
    • Start with what exists, understand what is missing and needed, and build from there
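As a flavour of the “start with what exists” approach, the sketch below polls a list of grid service endpoints and flags the ones that stop answering. The hosts and ports are invented placeholders; a real operations centre would feed such alarms into trouble-ticketing and coordinate corrective action with the site.

    # A minimal sketch of a periodic availability probe; hosts and
    # ports are invented placeholders, not real LCG endpoints.
    import socket
    import time

    SERVICES = [
        ('rb.example.org', 7772),    # resource broker (port illustrative)
        ('giis.example.org', 2135),  # information service (MDS)
    ]

    def up(host, port):
        try:
            with socket.create_connection((host, port), timeout=10):
                return True
        except OSError:
            return False

    while True:
        for host, port in SERVICES:
            state = 'up' if up(host, port) else 'DOWN - raise alarm'
            print(time.strftime('%H:%M:%S'), '%s:%d' % (host, port), state)
        time.sleep(300)  # poll every five minutes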

Page 21: LCG Grid Deployment


Grid Operation

[Diagram: the grid operation model – users and local sites are served by local operation, local user support, and a Call Centre; a Grid Operations Centre exchanges queries, monitoring & alarms, and corrective actions with the local sites, the Network Operations Centre, and the grid-wide services (grid information service, Virtual Organisation, grid logging & bookkeeping)]

Page 22: LCG Grid Deployment


Security

• GOAL: do not make exceptions for LCG services – they must run integrated into a site’s infrastructure and be subject to all the usual security and good-management procedures and policies
• BUT: initially, exceptions and compromises are certain to be needed, since until now most grid middleware has sidestepped security issues
• THUS: we must have a sound security policy and an agreed plan that provides for these exceptions in the short term, but shows a clear path to reach the state that the sites require
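One mundane example of what “integrated into a site infrastructure” means day to day is routine certificate hygiene. The sketch below checks whether a user certificate in the conventional ~/.globus location has expired; the path is an assumption and the check is only illustrative of such routine procedures.

    # A minimal sketch: check that a user certificate has not expired.
    # The ~/.globus path is conventional but assumed, not mandated.
    import os
    import subprocess

    cert = os.path.expanduser('~/.globus/usercert.pem')
    # 'openssl x509 -checkend 0' exits 0 if the certificate is still
    # valid now; any error (including a missing file) exits non-zero.
    result = subprocess.run(
        ['openssl', 'x509', '-in', cert, '-noout', '-checkend', '0'],
        capture_output=True, text=True,
    )
    print('certificate valid' if result.returncode == 0
          else 'certificate expired or unreadable')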

Page 23: LCG Grid Deployment


User Support

• Essential for a production service
• Two aspects:
  – Experiment integration/consultancy
    • Work directly with the experiments’ computing projects to ensure efficient use of LCG services and optimum use of resources
    • Act as liaison to ensure experiment-specific issues are resolved
  – User support
    • Helpdesk/call centre operation
      – Globally distributed – 24x7, ensuring a single point of contact for the user
      – Collaborative and distributed operation
    • Documentation
    • Training, tutorials, etc.

Page 24: LCG Grid Deployment


Collaborative Projects

• LCG is not a middleware development project; it can only succeed by leveraging the existing and ongoing work of the various grid development projects – and (hopefully) becoming a focus for them
• There are many opportunities for common solutions, which are being actively pursued
• HICB – JTB
  – GLUE
    • Schema definitions and interoperability work
  – New collaborative activities:
    • Validation and test suites
    • Distribution, meta-packaging, configuration
    • Grid Operations Centre
      – DataTag, iVDGL, DTF, etc.
    • Storage interfaces, e.g. SRM
    • Authentication, authorisation and security
      – Security managers are beginning to collaborate in the context of LCG
• HEPiX/LCCWS as a collaborative vehicle for RC managers and site coordinators
  – E.g. certification processes for operating environments; upgrade procedures; configuration management; helpdesk tools, etc.
• GGF – production grids area, etc.

Page 25: LCG Grid Deployment


Deployment Summary

• Deploy middleware to support the essential functionality; the goal is to evolve and incrementally add functionality
  – The added value is to robustify and support the middleware and turn it into a 24x7 production service
• How?
  – Certification & test procedure – tight feedback to developers
    • Must develop support agreements with the grid projects to ensure this
  – Define missing functionality – require it from the providers
  – Provide documentation and training
  – Provide the missing operational services
  – Provide a 24x7 Operations and Call Centre
    • Guaranteed response
    • A single point of contact for the user
  – Make the software easy to install – facilitate new centres joining
• Deployment is a major activity of LCG
  – It encompasses all the operational and practical aspects of a grid
  – There is a lot of work already done that must be leveraged
  – There are many opportunities for synergy and collaboration