
John Gordon

CCLRC eScience centre

Grid Support and Operations

John Gordon

CCLRC

GridPP9 - Edinburgh

What is support?

• Not well defined

• …or rather, defined differently in many places

• End users, sysadmins, deployers, developers

– all need support

• Some examples

Grid Support Centre

• 14 named staff at Rutherford, Daresbury, Manchester and Edinburgh.

• Operates the UK e-Science Certification Authority.
  – http://ca.grid-support.ac.uk

• Provides a helpdesk for ‘first point of call’ queries.

• Website for advertising services provided.
  – http://www.grid-support.ac.uk

• Provides technical training and evaluations of middleware.

• Supports the Level-2 Grid project.
  – National Information Server for the Core programme.
  – Publishing of site monitoring information in XML.

• Core support for the OGSA-DAI project.

European Grid Support Centre

• Collaboration between CCLRC, CERN and KTH Sweden, each providing 1 FTE

• A trusted point of contact between major projects and middleware producers.

• Communicates directly with Globus Alliance staff to ensure European issues are addressed and to assist with releases.

• Website up and running, though currently a skeleton of the final content.

• Attended the EDG meeting in Barcelona for publicity, and GGF-8 to guide the User Services Research Group work.

Global Grid User Support – GGUS

• Started 1 October at GridKa, Forschungszentrum Karlsruhe (Germany)

• Already supports 41 user groups of GridKa

• Website: http://www.ggus.org

[email protected]

Global Grid User Support – GGUS: The Model

[Diagram: the GGUS model. Service requests from the Grid User go to ESUS (Experiment Specific User Support) and GGUS (Global Grid User Support); ESUS, GGUS, the GOC (Grid Operations Centre) and local operations interact with each other, with data flowing toward the GOC and information flowing back to the user]

• First line of support: experiment-specific problems will be solved by ESUS (with Savannah) or sent to GGUS using an agreed interface.

• Grid-related problems will be solved by GGUS or sent to the GOC using the GGUS system.

GridPP TB-Support

1. Support Team
• Built from sysadmins: 4 funded by GridPP to work on EDG WP6, the rest are the usual site sysadmins.

2. Methods
• Email list, phone meetings, personal visits, job submission monitoring
• RB, VO, RC for UK use, to support non-EDG use
• Planned to verify EDG releases, but they have been too infrequent to test procedures

3. Rollout
• Experience from RAL in EDG dev testbeds, and IC and Bristol in CMS testbeds
• >10 sites have been part of the EDG app testbed at one time
• 3 in LCG1

Savannah

EGEE Operations

• Resource Centres – all sites

• Regional Operations Centres (ROC)

– At least one per region!

– RAL in UK/Ireland

• Core Infrastructure Centres (CIC)

– CERN, RAL, CNAF, CC-IN2P3

Others

• Tier1 Support
  – Role to support UK Tier2s in LCG
  – Deployment role in GridPP2

• Tier2 Specialist Posts
  – Support for various middleware areas

• Middleware Developers

Where do you go for support?

• Users go to experiment support

• Experiment support diagnoses and forwards as necessary to Grid user support, middleware, operations or applications

• Resource Centres look to their Regional Operations Centre (Tier2s to their Tier1)

• ROCs will also push problems to their RCs.

• But we know that users will go to their local sysadmin, or direct to their Tier2 or Tier1 too.
  – And some sysadmins will go to their favourite experiment expert
  – And Tier1s will go direct to middleware experts.

• In short, chaos.

• Strategy for now is to have a UK plan that is self-contained and can deliver support in the UK when and where required.
  – Interface this to the various outside bodies
  – Don’t duplicate for the sake of it, but be ready to.

• Or be prepared to roll our work into wider provision when it is proven.

John Gordon

CCLRC eScience centre

Grid Operations Centre

What is Operations?

• RAL leading development of the LCG GOC

The Vision

• GOC Processes and Activities
  – Coordinating Grid Operations
  – Defining Service Level Parameters
  – Monitoring Service Performance Levels
  – First-Level Fault Analysis
  – Interacting with Local Support Groups
  – Coordinating Security Activities
  – Operations Development

• Recent developments:

GOC - Monitoring

• Who is Involved?
  3.0 FTE (Trevor Daniels, Dave Kant, Matt Thorpe, Jason Leake)

• What are we Doing?
  Monitor Grid services, manage site information, accounting

• Developed tools to configure/integrate monitoring to make the job easier

GPPMon

Nagios

Mapcentre

• Example: Mapcentre, 30 sites ~ 500 lines in config file

• Example: Nagios, 30 sites, 12 individual config files with dependencies

Both tedious to configure

Not practical “by hand” with large numbers of nodes

GOC - Database

• Develop/maintain a database to hold site information

• Site information (contact lists, resources, site information, URLs)

• Secure access through GridSite (X.509 certificates) via a PHP web interface

• RC managers should maintain their own pages as part of the site certification process.

• Monitoring scripts read information in the database and run a set of customised tools to monitor the infrastructure.

• To be included in the monitoring, a site must register its resources (CE, SE, RB, RC, RLS, MDS, R-GMA, BDII, …)

• The BDII can be queried to check that the GOC database is up to date.
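As a rough illustration of that last pair of points, here is a minimal sketch, assuming a hypothetical `nodes` table, credentials, and BDII endpoint (the real GOC schema and LDAP layout are not shown in the talk), of how a monitoring script might read the registered resources and cross-check them against what the BDII publishes:

```python
#!/usr/bin/env python
# Sketch: read the resources registered in the GOC database and compare
# them with what the BDII publishes. The database name, table and column
# names, credentials, and the BDII endpoint are illustrative assumptions,
# not the real GOC schema.
import subprocess
import pymysql  # third-party pure-Python MySQL client

def goc_resources():
    """Return (hostname, node_type) pairs registered in the GOC DB."""
    conn = pymysql.connect(host="goc.grid-support.ac.uk", user="monitor",
                           password="...", database="gocdb")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT hostname, node_type FROM nodes")
            return cur.fetchall()
    finally:
        conn.close()

def bdii_ce_hosts(bdii="lcg-bdii.example.org"):
    """CE hosts the BDII knows about, via an anonymous LDAP search.
    Port 2170 and base 'mds-vo-name=local,o=grid' were the usual LCG
    conventions; adjust for the real deployment."""
    out = subprocess.run(
        ["ldapsearch", "-x", "-H", "ldap://%s:2170" % bdii,
         "-b", "mds-vo-name=local,o=grid", "GlueCEUniqueID"],
        capture_output=True, text=True).stdout
    return {line.split()[1].split(":")[0]
            for line in out.splitlines()
            if line.startswith("GlueCEUniqueID:")}

if __name__ == "__main__":
    published = bdii_ce_hosts()
    for host, node_type in goc_resources():
        if node_type == "CE" and host not in published:
            print("registered in GOC DB but not seen in BDII: %s" % host)
```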

GOC Monitoring Today

[Diagram: GOC monitoring today. A remote UI (EDG UI, LCG1 UI, LCG-2 UI) queries the GOC database (GridSite/MySQL) to build a list of resources, submits monitoring jobs to the EDG, LCG-1 and LCG-2 resources, and publishes the results on the WWW]
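As an illustration of that loop, here is a minimal, hypothetical sketch: build a one-line JDL that pins a probe job to a given CE and hand it to the EDG UI's edg-job-submit command. The Requirements expression and the CE identifier are assumptions, and publishing the results is omitted:

```python
#!/usr/bin/env python
# Sketch of the monitoring loop: for each CE taken from the GOC database,
# write a trivial JDL that targets that CE and submit it with the EDG UI
# command. The per-CE probe job follows the flow on the slide; the JDL
# details are illustrative.
import subprocess, tempfile

JDL_TEMPLATE = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {{"std.out", "std.err"}};
Requirements  = other.GlueCEUniqueID == "{ce}";
"""

def submit_probe(ce_id):
    """Submit a one-line probe job to a single CE; returns the CLI output."""
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(JDL_TEMPLATE.format(ce=ce_id))
        jdl = f.name
    # edg-job-submit prints the job ID on success; parsing is left out here
    return subprocess.run(["edg-job-submit", jdl],
                          capture_output=True, text=True).stdout

if __name__ == "__main__":
    for ce in ["ce.example.ac.uk:2119/jobmanager-pbs-short"]:  # from GOC DB
        print(submit_probe(ce))
```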

New GPPMon Features

• Download host certificates daily and monitor lifetimes for CEs and SEs, for LCG and EDG
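A minimal sketch of such a lifetime check (the hostnames and the 14-day warning threshold are illustrative; the real GPPMon implementation is not shown here). It fetches each host certificate over SSL and reports the days remaining:

```python
#!/usr/bin/env python
# Sketch of a certificate-lifetime check: fetch each host certificate
# over SSL and report the days until it expires. Hostnames and the
# 14-day threshold are placeholders, not the real GPPMon list.
import ssl
from datetime import datetime, timezone
from cryptography import x509  # third-party 'cryptography' package (>= 42)

def days_left(host, port=443):
    """Days until the certificate presented by host:port expires."""
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    return (cert.not_valid_after_utc - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ["ce.example.ac.uk", "se.example.ac.uk"]:  # CEs and SEs
        try:
            remaining = days_left(host)
            state = "WARNING" if remaining < 14 else "OK"
            print("%s: %s, certificate expires in %d days"
                  % (host, state, remaining))
        except OSError as exc:
            print("%s: check failed (%s)" % (host, exc))
```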

New GPPMon Features

• Reliability of service, tracked with RRDTool to show Globus and RB stats

New GPPMon Features

• Moving toward LCG-1, LCG-2 and EDG monitoring

[Screenshot: gridkap01.fzk.de, Tuesday 3/2/04 14:10]

Only RAL and FZK have updated their LCG-2 information in the GOC database.

Nagios

• Customised plugins for monitoring

• Focus on service behaviour and data consistency:

Do RBs find resources?

Do site GIISs publish the correct hostname?

Is the site running the latest stable software release?

Does the Gatekeeper authenticate?

Are the host certificates valid?

Are essential services running?
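Checks like these are implemented as Nagios plugins, which communicate through an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a single status line on stdout. A minimal sketch in that style, probing whether the gatekeeper port (conventionally 2119) accepts connections; a true "does the Gatekeeper authenticate?" check would need a full GSI handshake, which this TCP probe only approximates:

```python
#!/usr/bin/env python
# Minimal Nagios-style plugin sketch: does the gatekeeper port answer?
# Nagios reads one status line on stdout plus the exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
import socket
import sys

def main():
    if len(sys.argv) != 2:
        print("GATEKEEPER UNKNOWN - usage: check_gatekeeper <host>")
        sys.exit(3)
    host = sys.argv[1]
    try:
        sock = socket.create_connection((host, 2119), timeout=10)
        sock.close()
    except OSError as exc:
        print("GATEKEEPER CRITICAL - %s:2119 unreachable (%s)" % (host, exc))
        sys.exit(2)
    print("GATEKEEPER OK - %s:2119 accepting connections" % host)
    sys.exit(0)

if __name__ == "__main__":
    main()
```

The plugin would then be wired into Nagios with the usual command and service definitions.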

Nagios Screen Shots LCG-1

Nagios Screen Shots LCG-1

Service Summary for Gatekeeper Nodes

Nagios Screen Shots LCG-1

Host and Service Summary tables for BDII nodes

GOC Configuration

• Example: Manage a Grid-wide database
  – Provides access to site information via trusted certificate
  – Scripts to automatically configure Nagios from the GOC database
  – Provide plugins to monitor services for Nagios
  – Create configuration files for Mapcentre
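The configuration-generation step might look like the following sketch, which emits standard Nagios host definitions from rows that would, in the real tool, come from the GOC database (the rows and the generic-host template are placeholders):

```python
#!/usr/bin/env python
# Sketch: emit Nagios host definitions from GOC database rows, the kind
# of generation step the slide describes. The input is a hard-coded list
# standing in for a database query; the stanza uses standard Nagios
# object-definition syntax.
HOST_TEMPLATE = """\
define host {{
    use        generic-host
    host_name  {name}
    alias      {node_type} at {site}
    address    {name}
}}
"""

def write_nagios_config(rows, path="goc_hosts.cfg"):
    """Write one 'define host' stanza per (hostname, type, site) row."""
    with open(path, "w") as cfg:
        for name, node_type, site in rows:
            cfg.write(HOST_TEMPLATE.format(name=name,
                                           node_type=node_type,
                                           site=site))

if __name__ == "__main__":
    # In the real tool these rows would come from the GOC database.
    write_nagios_config([("ce.example.ac.uk", "CE", "RAL"),
                         ("se.example.ac.uk", "SE", "RAL")])
```

In practice the emitted file would be included from the main Nagios configuration, with matching service definitions generated alongside.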

[Diagram: the GOC server (http://goc.grid-support.ac.uk) running GridSite and MySQL. Resource Centres register their resources and site information (ce, se, bdii, rb) for EDG, LCG-1, LCG-2, …; database management is secured via HTTPS/X.509, and the GOC runs monitoring against the registered resources]

What’s in the Database?

People: Who do we notify when there are problems

What’s in the Database?Node Information (Hostname, IP Address, Group)

What’s in the Database?

Scheduled Downtimes:

Advanced warning of site maintenance resulting in reduced service availability

LCG Accounting Overview

1. PBS log processed daily on site CE to extract required data, filter acts as R-GMA DBProducer -> PbsRecords table

2. Gatekeeper log processed daily on site CE to extract required data, filter acts as R-GMA DBProducer -> GkRecords table

3. Site GIIS interrogated daily on site CE to obtain SpecInt and SpecFloat values for CE, acts as DBProducer -> SpecRecords table, one dated record per day

4. These three tables joined daily on MON to produce LcgRecords table. As each record is produced program acts as StreamProducer to send the entries to the LcgRecords table on the GOC site.

5. Site now has table containing its own accounting data; GOC has aggregated table over whole of LCG.

6. Interactive and regular reports produced by site or at GOC site as required.

Note: This is an improved design over that presented at the Jan GDB. The SOAP transport has been replaced by R-GMA.
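A sketch of the step-1 filter, assuming the usual PBS accounting-log layout (semicolon-separated fields, with job attributes as key=value pairs on "E" end-of-job records); publication through the R-GMA DBProducer into PbsRecords is stubbed out, since that API is not covered here:

```python
#!/usr/bin/env python
# Sketch of the step-1 filter: pull job-end ("E") records out of a PBS
# accounting log. PBS accounting lines look roughly like
#   04/15/2004 10:23:45;E;1234.ce.example.ac.uk;user=alice group=atlas ...
# The R-GMA DBProducer insert into the PbsRecords table is stubbed out.
def parse_pbs_log(path):
    """Yield one dict per completed job found in the accounting log."""
    with open(path) as log:
        for line in log:
            parts = line.rstrip("\n").split(";", 3)
            if len(parts) != 4 or parts[1] != "E":
                continue  # only job-end records carry usage data
            stamp, _, jobid, attrs = parts
            record = {"end_time": stamp, "local_jobid": jobid}
            for item in attrs.split():
                if "=" in item:
                    key, value = item.split("=", 1)
                    record[key] = value
            yield record

def publish(record):
    """Stub for the R-GMA DBProducer insert into PbsRecords."""
    print(record)

if __name__ == "__main__":
    # Conventional PBS accounting-log location; one file per day.
    for rec in parse_pbs_log("/var/spool/pbs/server_priv/accounting/20040203"):
        publish(rec)
```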

[Diagram: LCG Accounting Flow. On each LCG site, filters on the CE process the PBS log, the gatekeeper log and the site GIIS, publishing via the site MON box; the GOC site aggregates the records into an accounting DB, from which reports are produced]

Progress

• Status on 3 Feb 2004:
  – The code which will run on the CE to parse and process the PBS and Gatekeeper logs is written. The PbsRecords and GkRecords tables are created and are being populated.
  – The code to join these two tables and publish the new joined table (LcgRecords) is also written and working.
  – Work is in progress to write the archiver at the GOC to receive the aggregated LcgRecords table (2 days work).

• To do:
  – Write the code to interrogate the site GIIS to extract the CPU power values and populate these fields in the tables (2 days work)
  – Integration testing and debugging (5 days)
  – Packaging for deployment (3 days)
  – Write the report generators (30 days; estimate, not yet designed)

Accounting Issues

1. There is no R-GMA infrastructure LCG-wide, so most sites are not able to install and run the accounting suite at present. It is expected that R-GMA and the MON boxes will be rolled out in LCG2 soon after the storage problems are resolved. Until this happens the complete batch and gatekeeper logs will have to be copied to the GOC site for processing.

2. The VO associated with a user’s DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name. This needs to be acknowledged as an LCG requirement.

3. The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs. This global jobID cannot therefore appear in the accounting reports. The RB Events Database contains this, but that is not accessible nor is it designed to be easily processed.

4. At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is believed all sites are homogeneous.

Future Direction Towards EGEE

Distribute tools to help the ROCs monitor their RCs (database + monitoring packages)

Distribute tools to help the CICs monitor core services (Grid-wide monitoring)

Ideas on how this would work (sketched in code below):

CIC monitoring tools query ROC databases

Select core services

Run a standard set of checks on those services

Display information / Notifications …
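One possible shape for that loop, with every endpoint and format a placeholder (the ROC database interface was still an open question at this point): pull each ROC's service list, keep the core services, and run one standard check over them:

```python
#!/usr/bin/env python
# Rough sketch of the proposed CIC loop: fetch each ROC's service list
# (stubbed here as a JSON endpoint; the actual interface is assumed),
# select the core services, and run a standard check on each.
import json
import socket
from urllib.request import urlopen

ROC_DATABASES = ["https://roc.example.org/services.json"]  # hypothetical
CORE_TYPES = {"RB", "BDII", "RLS"}  # Grid-wide core services

def check(host, port):
    """Standard check: does the service port accept a TCP connection?"""
    try:
        socket.create_connection((host, port), timeout=10).close()
        return "OK"
    except OSError:
        return "DOWN"

if __name__ == "__main__":
    for url in ROC_DATABASES:
        for svc in json.load(urlopen(url)):
            if svc["type"] in CORE_TYPES:
                status = check(svc["host"], svc["port"])
                print("%s %s:%s %s" % (svc["type"], svc["host"],
                                       svc["port"], status))
```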

John Gordon

CCLRC eScience centre

UK Deployment, Support and Operations

Deployment Team

[Organogram: proposal for a UK-wide team to provide and run a UK-wide Grid. The GridPP view; there are alternative views for other stakeholders. Funded by EGEE, GridPP, JISC and Core UK]

• Production Manager

• Grid Support Centre: 5 FTE

• Core Grid Coordinator: 1 FTE

• Security Officer (RAL)

• Middleware Specialist Support: 6 FTE
  – Data and Storage Management: Glasgow, Bristol, Edinburgh
  – VO Management and Services: North (0.5 FTE)
  – Workload Management Services: London
  – Network Management: London (0.5 FTE)

• Deployment Team: 8 FTE
  – 2 Tier1 Deployment
  – 4 Tier2 UK Coordinators: LondonGrid, NorthGrid, ScotGrid, SouthGrid
  – 1 Tier2 Coordinator: Ireland
  – Applications Expert

• Grid Operations Centre
  – Manager
  – Operations (2)
  – Technical Writer
  – Network Support
  – Helpdesk
  – Network Monitoring

Resource Centres

• Tier1: Rutherford Appleton Laboratory

• Tier-2 centres are distributed over many sites.

• Sites which have signed up to LCG and deployed software (RAL, IC, Cambridge) expect to join EGEE (PM1)

London Grid: IC, QMUL, RHUL, UCL, Brunel

North Grid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield

Scot Grid: Durham, Edinburgh, Glasgow

South Grid: Birmingham, Bristol, Cambridge, Oxford, RAL-PPD

Tier 2       Number of CPUs   Total CPU [KSI2000]   Total Disk [TB]   Total Tape [TB]
London       2454             1996                  99                20
North Grid   2718             2801                  209               332
South Grid   918              930                   67                8
Scot Grid    368              318                   79                0
Total        6458             6045                  455               360

Tier-2 Centre Resources (Projected 2004)

Projected resources available in September 2004 to be applied to large-scale production Grid deployment. The total CPU at each institute is proportional to the size of the green circles. The disk storage at each site is proportional to the height of the grey vertical bars

Roles (1)

• Production Manager
  Overall manager to oversee operations and report to other groups (ROC Coordinator, OMC, …)

• Core Grid Coordinator
  Bring UK non-Particle Physics projects (applications and resources) into EGEE

Roles (2)

• Deployment Team
  Consists of about 7 people to spearhead the rollout and certification of Grid software to the Resource Centres (Tier1 & Tier2)

• Grid Operations Centre
  Similar role to the proposed CIC in EGEE. Monitor health of services and provide toolkits. Operate core Grid services. Database of RCs managed by RC site administrators.

Roles (3)

• Middleware Specialist Support
  Body of experts to provide specialist support to Resource Centres in key areas: security, data management, network, VO management and workload management.

• Grid Support Centre
  Helpdesk facility, CA. Broker requests to middleware specialists.

Team UK

• A large team in the UK (GridPP, EU, and other)

• GridPP Production Manager should orchestrate this team to deliver a production grid for GridPP
  – But interwork with as many other UK grids and projects as possible

• Meet our EGEE ROC and CIC deliverables for support and operations

• A big challenge