Page 1: John Gordon j.c.gordon@rl.ac.uk and LCG and Grid Operations John Gordon CCLRC e-Science Centre, UK LCG Grid Operations

John Gordon

j.c.gordon@rl.ac.uk

LCG and Grid Operations

John Gordon

CCLRC e-Science Centre, UK

LCG Grid Operations

Page 2

Outline

• The monitoring tools

• How we use them in operations

• What is still to be done

Page 3

Grid Operations

• Once middleware has been developed, tested and deployed, grid operations are the set of actions and procedures to keep a grid running for the users.

Page 4

The Vision

• GOC Processes and Activities
  – Coordinating Grid Operations
  – Defining Service Level Parameters
  – Monitoring Service Performance Levels
  – First-Level Fault Analysis
  – Interacting with Local Support Groups
  – Coordinating Security Activities
  – Operations Development

Page 5

Have we delivered?

• Coordinating Grid Operations: Yes, RAL, CERN & Taipei

• Defining Service Level Parameters: No

• Monitoring Service Performance Levels: up or down

• First-Level Fault Analysis: Yes

• Interacting with Local Support Groups: Yes

• Coordinating Security Activities: Policies, not operation

• Operations Development: Monitoring and accounting

Page 6

Monitoring the Grid is a Challenge!

Page 7

Monitoring Overview

Why We Monitor

• Keep systems up and running

• Notice failures; grid-wide services (MDS)

• Knowing what services a site should be running
  – no point raising an alert if the site isn’t meant to run it!
  – definition of services and which sites run them (SLA)

What Tools Do We Use

• Job Submission; GridIce; Nagios; GIIS Monitor

• How – Database

• Developments planned: Nagios

Page 8

Monitoring Challenges

• We have only fragmentary information about the services that sites are running.

• We don’t know what RBs/SEs/sites the VOs are using for data challenges.

• We don’t know what the core services are and who is running them.

• We don’t have a toolkit to test specific core services.

• We have to concentrate on the functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?

• Not all the tests that we perform are effective at finding problems, so we must take tests written by the experts and integrate them into GOC monitoring.

• We must develop tests which simulate the life cycle of real applications in a Grid environment.

• There are lots of monitoring tools available, so we need to bring them together.

• Do we spend time investigating new tools, or improve the ones we already have?

• …and probably lots more!

Page 9

Monitoring Services

• There are many frameworks which can be used to monitor distributed environments
  – MAPCENTRE http://mapcenter.in2p3.fr/
  – GPPMON http://goc.grid-support.ac.uk/
  – GRIDICE http://grid-ice.esc.rl.ac.uk
  – NAGIOS http://www.nagios.org/
  – MONALISA http://monalisa.cacr.caltech.edu/
  – GIIS Monitor http://goc.grid.sinica.edu.tw/gstat/
  – Ganglia

• Configuring them by hand is laborious
  – Example: Mapcentre, 30 sites ~ 500 lines in config file (static version)
  – Example: Nagios, 30 sites, 12 individual config files with dependencies

• Developed tools to configure these services (NAGIOS, MAPCENTER and GPPMON) to make the job easier
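The configuration burden described on this slide (hundreds of lines of hand-written config for 30 sites) is the kind of task a small generator can take over. What follows is only a minimal sketch, not the GOC's actual tooling: the `Node` record, the site and host names, and the `generic-host` template name are all illustrative assumptions. Only the `define host { ... }` object syntax is standard Nagios.

```python
from typing import Iterable, NamedTuple


class Node(NamedTuple):
    site: str   # hypothetical site name
    host: str   # fully qualified host name
    role: str   # service type, e.g. "ce", "se", "rb", "bdii"


def nagios_host_block(node: Node, template: str = "generic-host") -> str:
    """Render one Nagios host object definition for a grid node."""
    return "\n".join([
        "define host {",
        f"    use        {template}",   # assumed template name
        f"    host_name  {node.host}",
        f"    alias      {node.site} {node.role.upper()}",
        f"    address    {node.host}",
        "}",
    ])


def render_hosts(nodes: Iterable[Node]) -> str:
    """Render all nodes, sorted so regenerated files diff cleanly."""
    ordered = sorted(nodes, key=lambda n: (n.site, n.role, n.host))
    return "\n\n".join(nagios_host_block(n) for n in ordered) + "\n"


if __name__ == "__main__":
    example = [
        Node("SITE-B", "se01.site-b.example.org", "se"),
        Node("SITE-A", "ce01.site-a.example.org", "ce"),
    ]
    print(render_hosts(example))
```

Generating the file from one list of nodes means a new site is added in one place rather than in every monitoring tool's config.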

Page 10

GOC Configuration Database

[Diagram: Resource Centres (RC) connect over https to the GOC GridSite MySQL server; monitoring reads the database via SQL. Node types shown: ce, se, bdii, rb.]

• Resource Centre resources & site information (EDG, LCG-1, LCG-2, …)

• Secure Database Management via HTTPS / X.509

• People, Contact Information, Resources

• Scheduled Maintenance
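To make the role of the configuration database concrete, here is a minimal sketch of how monitoring might ask it which nodes to test. The schema, table names, column names and sample rows are all hypothetical, and an in-memory SQLite database stands in for the MySQL server so that the example is self-contained.

```python
import sqlite3
from typing import List

# Hypothetical schema: sites, their nodes, and scheduled maintenance windows.
SCHEMA = """
CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT UNIQUE, contact TEXT);
CREATE TABLE node (
    site_id   INTEGER REFERENCES site(id),
    host      TEXT,
    type      TEXT CHECK (type IN ('ce', 'se', 'bdii', 'rb')),
    monitored INTEGER DEFAULT 1
);
CREATE TABLE maintenance (
    site_id  INTEGER REFERENCES site(id),
    start_ts TEXT,   -- ISO 8601 timestamps compare correctly as text
    end_ts   TEXT
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO site VALUES (?, ?, ?)", [
        (1, "SITE-A", "admin@site-a.example.org"),
        (2, "SITE-B", "admin@site-b.example.org"),
    ])
    db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", [
        (1, "ce01.site-a.example.org", "ce", 1),
        (1, "se01.site-a.example.org", "se", 1),
        (2, "ce01.site-b.example.org", "ce", 1),
    ])
    db.execute(
        "INSERT INTO maintenance VALUES (2, '2004-07-21T08:00', '2004-07-21T12:00')"
    )
    return db


def nodes_to_test(db: sqlite3.Connection, node_type: str, now: str) -> List[str]:
    """Hosts of one type that are monitored and not in scheduled maintenance."""
    rows = db.execute(
        """
        SELECT n.host FROM node n
        WHERE n.type = ? AND n.monitored = 1
          AND NOT EXISTS (
              SELECT 1 FROM maintenance m
              WHERE m.site_id = n.site_id AND m.start_ts <= ? AND ? < m.end_ts
          )
        ORDER BY n.host
        """,
        (node_type, now, now),
    ).fetchall()
    return [host for (host,) in rows]
```

Skipping sites in scheduled maintenance follows the point made earlier in the talk: there is no value in raising an alert for a service that is not meant to be running.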

Page 11

GPPMON - 2

GOC Job Submission Flow Diagram (Dave Kant): simple job forked on CE using globus

[Flow diagram with steps numbered 1 to 5; recoverable labels:]

• GOC (UI): SQL QUERY on the SITE DB to build the list of CE and RB resources

• JOB Script GLOBUS.CE created for each CE

• globus-job-run sends the job to the CE

• CE sends acknowledgement: wget http://goc_ui/ack.cgi?GLOBUS.CE

• GOC (UI) receives the acknowledgement
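The acknowledgement flow on this slide can be sketched as below. This is an illustration under stated assumptions, not the GPPMON code: `http://goc_ui/ack.cgi` is the placeholder URL from the slide, the `/usr/bin/wget` path on the CE is assumed, and the helper names are hypothetical. The sketch only builds the `globus-job-run` command lines and interprets which acknowledgements came back; it does not submit anything.

```python
from typing import Dict, Iterable, List, Set

ACK_URL = "http://goc_ui/ack.cgi"  # placeholder host name, as on the slide


def ack_token(ce: str) -> str:
    """Token identifying one test; mirrors the slide's GLOBUS.CE label."""
    return f"GLOBUS.{ce}"


def build_command(ce: str, wget: str = "/usr/bin/wget") -> List[str]:
    """Command that forks wget on the CE, so the CE itself calls back to the GOC."""
    return ["globus-job-run", ce, wget, f"{ACK_URL}?{ack_token(ce)}"]


def site_status(ces: Iterable[str], received: Set[str]) -> Dict[str, str]:
    """A CE passes if its acknowledgement token reached the GOC web server."""
    return {ce: ("ok" if ack_token(ce) in received else "no-ack") for ce in ces}


if __name__ == "__main__":
    ces = ["ce01.site-a.example.org", "ce01.site-b.example.org"]
    for ce in ces:
        print(" ".join(build_command(ce)))
    # Pretend only the first CE called back.
    print(site_status(ces, {ack_token(ces[0])}))
```

Having the job call back rather than parsing the job's output means a test passes only if the job really ran on the CE and the CE could reach the outside network.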
Page 12

GPPMON - 3

Simple job through local jobmanager on CE via Resource Broker job matchmaking (Dave Kant)

[Flow diagram; recoverable labels:]

• GOC (UI): SQL QUERY on the SITE DB to build the list of CE and RB resources

• JOB Script RB.CE created, targeting a CE via Other.GlueCEUniqueID

• edg-job-submit sends the job to the RB, which passes it to the CE and on to a WN

• WN sends acknowledgement: wget http://goc_ui/ack.cgi?RB.CE

• GOC (UI) receives the acknowledgement
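For the Resource Broker path, the test job has to be pinned to one CE through the matchmaking requirements. The sketch below generates a JDL file of that kind. It is an assumption-laden illustration: the attribute names follow common EDG JDL usage, the CE identifier and the `/usr/bin/wget` path are made up, and the `RB.<ce>` token simply mirrors the slide's RB.CE label.

```python
from urllib.parse import quote


def make_jdl(ce_id: str, ack_url: str = "http://goc_ui/ack.cgi") -> str:
    """Build a JDL description for a job that must match exactly one CE."""
    # The CE id contains ':' and '/', so it is percent-encoded in the query string.
    token = quote(f"RB.{ce_id}", safe="")
    lines = [
        'Executable = "/usr/bin/wget";',            # assumed path on the WN
        f'Arguments = "{ack_url}?{token}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        f'Requirements = other.GlueCEUniqueID == "{ce_id}";',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical CE identifier in host:port/jobmanager-queue form.
    print(make_jdl("ce01.site-a.example.org:2119/jobmanager-pbs-short"))
```

The resulting file would be handed to `edg-job-submit`. A missing acknowledgement then means the job failed somewhere along the RB, CE, WN chain, which is why this test alone cannot say where.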
Page 13

LCG2 Site Status: 21 July 2004 10.00am

GPPMON – 1

Page 14

GRIDICE - 1

http://grid-ice.esc.rl.ac.uk/gridice

Page 16

Ganglia Monitoring - 1

• http://gridpp.ac.uk/ganglia

• Can use Ganglia to monitor a cluster

RAL Tier-1 Centre

LCG PBS Server displays Job status for each VO

Page 17

Ganglia Monitoring - 2

• Can also use Ganglia to monitor clusters of clusters

Page 18

Regional Monitoring - 1

Provide ROCs with a package to monitor the resources in the region

• Tailored Monitoring

• ROCs may upload their own maps

• JAVA GUI to automate site locations on the map

Hierarchical view of Resources

• Example: GridPP made up of virtual T2 centres

[Diagram: EGEE → France, UK/I, S.E.E; UK/I → GridPP; GridPP → LondonT2 (IMPERIAL, QMUL), ScotGrid (Edinburgh)]

Page 19

LCG2 Site Status: 21 July 2004 10.00am

GPPMON – 1

Page 20

Regional Monitoring - 2

http://goc.grid-support.ac.uk/roc_map/map.php

Active map to select individual regions

Page 21

Regional Monitoring - 3

UK/I Monitoring displays GRIDPP and NGS resources.

Page 22

Replica Manager Tests - 1

• GOC to take over the site certification testing which is done by the CERN deployment team on a daily basis (e.g. reports by Piotr Nyczyk)

• First step toward this involved running a series of replica manager tests which register files onto the grid, move them around and delete them, and make 3rd party copies from a remote SE, e.g. Castorgrid

• Demonstrates that we can integrate other people’s tools into GPPMON

• Development of a portal which will:
  – Make it easy to retrieve debug information from the job output.
  – Connect with information provided by other monitoring tools, e.g. Taipei GIIS Monitor.
  – Provide testing “on-demand” to site administrators through a secure interface.

Page 23

Replica Manager Tests - 2

http://goc.grid-support.ac.uk/gridsite/status/rmtest.php?action=table

Results of each test are shown as a coloured index on the map.

Distinguishes between jobs that have completed, have failed or are still running.

Page 24

Replica Manager Tests - 3

[Screenshot with labelled areas: Description of the tests; Job Outputs; GIIS Monitor Information]

Page 25

GIIS Monitor

• Developed by Min Tsai (GOC Taipei)

• Tool to display and check information published by the site GIIS

• http://goc.grid.sinica.edu.tw/gstat/

Page 26

Job Accounting - 1

http://goc.grid-support.ac.uk/ROC/docs/accounting/accounting.php

Program publishes PBS log file information through R-GMA to the GOC

GOC aggregates data across all sites.
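As an illustration of the first half of that pipeline, the sketch below extracts the fields an accounting record needs from a PBS accounting log line. It assumes the usual PBS layout of `date;record-type;job-id;key=value ...` and only handles job-end (`E`) records. The sample line is made up, and publishing through R-GMA is deliberately left out.

```python
from typing import Dict, Optional


def hms_to_seconds(value: str) -> int:
    """Convert PBS 'HH:MM:SS' durations (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line: str) -> Optional[Dict[str, object]]:
    """Return the accounting fields of a job-end record, or None for other records."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    timestamp, _, job_id, message = parts
    # The message is a space-separated list of key=value pairs.
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return {
        "end_time": timestamp,
        "job_id": job_id,
        "user": fields.get("user"),
        "group": fields.get("group"),
        "queue": fields.get("queue"),
        "cpu_seconds": hms_to_seconds(fields.get("resources_used.cput", "0:0:0")),
        "wall_seconds": hms_to_seconds(fields.get("resources_used.walltime", "0:0:0")),
    }


if __name__ == "__main__":
    sample = (
        "07/21/2004 10:00:05;E;1234.pbs.example.org;user=atlas001 group=atlas "
        "queue=short resources_used.cput=00:10:00 resources_used.walltime=00:12:30"
    )
    print(parse_end_record(sample))
```

Mapping the Unix group to a VO, as done implicitly here, is a simplification; a real deployment would need an explicit mapping.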

Page 27

Job Accounting - 2

• Offline testing of the program using data from the CORE sites is complete.

• Development of an accounting portal is underway to provide accounting on demand for each site, and aggregated for each EGEE region

• Challenge! Dealing with a large database: 1 row per LCGPBS job per site!

• http://goc-dev.esc.rl.ac.uk/jpg/goc_demo.php

• http://goc-dev.esc.rl.ac.uk/jpg/goc_demo3.php
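One common way to cope with a one-row-per-job table is to keep a much smaller summary keyed by site, VO and month, and serve the portal from that. The sketch below shows the idea in plain Python. The record fields (`site`, `vo`, `end_date`, `cpu_seconds`, `wall_seconds`) are hypothetical, and nothing here is claimed to be how the GOC portal actually works.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str]  # (site, vo, "YYYY-MM")


def summarise(jobs: Iterable[dict]) -> Dict[Key, Dict[str, int]]:
    """Collapse per-job rows into one summary row per site, VO and month."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0, "wall_seconds": 0})
    for job in jobs:
        month = job["end_date"][:7]  # expects ISO dates such as "2004-07-21"
        row = totals[(job["site"], job["vo"], month)]
        row["jobs"] += 1
        row["cpu_seconds"] += job["cpu_seconds"]
        row["wall_seconds"] += job["wall_seconds"]
    return dict(totals)


if __name__ == "__main__":
    jobs = [
        {"site": "SITE-A", "vo": "atlas", "end_date": "2004-07-20",
         "cpu_seconds": 600, "wall_seconds": 750},
        {"site": "SITE-A", "vo": "atlas", "end_date": "2004-07-21",
         "cpu_seconds": 400, "wall_seconds": 500},
        {"site": "SITE-A", "vo": "cms", "end_date": "2004-08-01",
         "cpu_seconds": 100, "wall_seconds": 120},
    ]
    for key, row in sorted(summarise(jobs).items()):
        print(key, row)
```

The summary grows with the number of sites, VOs and months rather than with the number of jobs, so regional totals can be computed from it without scanning the job table.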

Page 28

GridPP Accounting

Page 29

EDG-network monitoring

Page 30

Security

• Worked with Security Group

• Defined a Security Policy
  – and auditing procedures

• Have a list for security contacts
  – but not really exercised it yet
  – still need to define procedures in the event of security incidents

Page 31

Keeping the Work Flowing

• Regular monitoring of job submission
  – shows sites that have problems running jobs

• Nagios tracks individual services
  – plus certificate lifetime

• RM tests show whether data can be moved

• GridICE and Ganglia show what is running

• Limited by RB behaviour
  – we can see that jobs are not getting to sites but not why.
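The certificate-lifetime check mentioned above fits the Nagios plugin convention of returning 0 (OK), 1 (WARNING) or 2 (CRITICAL) along with a one-line message. The sketch below shows only the decision logic. The 14-day and 3-day thresholds are arbitrary assumptions, and reading the expiry date out of a real host certificate is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def check_cert_lifetime(
    not_after: datetime,
    now: datetime,
    warn_days: int = 14,   # assumed threshold
    crit_days: int = 3,    # assumed threshold
) -> Tuple[int, str]:
    """Classify the time left before a certificate's notAfter date."""
    remaining = not_after - now
    days = remaining.total_seconds() / 86400
    if remaining <= timedelta(0):
        return CRITICAL, f"CRITICAL - certificate expired {abs(days):.1f} days ago"
    if remaining <= timedelta(days=crit_days):
        return CRITICAL, f"CRITICAL - certificate expires in {days:.1f} days"
    if remaining <= timedelta(days=warn_days):
        return WARNING, f"WARNING - certificate expires in {days:.1f} days"
    return OK, f"OK - certificate valid for {days:.0f} more days"


if __name__ == "__main__":
    now = datetime(2004, 7, 21, 10, 0, tzinfo=timezone.utc)
    code, message = check_cert_lifetime(now + timedelta(days=10), now)
    print(code, message)
```

A wrapper script would print the message and exit with the code, which is how Nagios turns a plugin result into a service state.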

Page 32

What have we delivered?

• A set of monitoring tools

• A monitoring regime

• Two GOCs (RAL and Taipei)

• Security Policy

Page 33

Still to do

• Effective problem tracking
  – we see site problems and get them fixed
  – but don’t manage long-term problems

• Integration with User Support
  – we track problems we see
  – but problems users notice are not effectively dealt with

• Automatic alerts
  – Nagios does this, but EMS from Taipei looks promising

• Remote repair
  – agents until middleware can support this directly

• Security

• Deploy accounting

• Distribute monitoring to EGEE ROCs and others

Page 34

What Next ? (1)

• RSS used to send tailored streams
  – sites, ROCs, management can all decide what to subscribe to

• Accounting
  – being tested in LCG C&T testbed
  – should be in next LCG release
  – then get T2 accounts

• keep your pbs log and msgs and gatekeeper logs

Page 35

Monitoring Feeds

• GOC server generates a lot of monitoring information.

• Need a way to give this information to the right people, e.g. site administrators

• Really Simple Syndication (RSS) is an XML-based format

• Used by many sites which want to syndicate content, e.g. BBC, Slashdot

• Client pull model: GOC creates RSS-formatted documents; clients pull these feeds and render them in HTML.
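To show what such a feed might contain, here is a minimal sketch that turns test results for one site into an RSS 2.0 document using only the Python standard library. The result fields, the channel text and the `goc.example.org` link are illustrative assumptions, not the GOC's actual feed layout.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Iterable


def build_feed(site: str, results: Iterable[dict],
               link: str = "http://goc.example.org/status") -> str:
    """Render monitoring results for one site as an RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"GOC test results for {site}"
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"Monitoring test results for {site}"
    for result in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{result['test']}: {result['status']}"
        ET.SubElement(item, "description").text = result["detail"]
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(result["when"])
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    results = [{
        "test": "job-submission",
        "status": "FAILED",
        "detail": "No acknowledgement received from the CE",
        "when": datetime(2004, 7, 21, 10, 0, tzinfo=timezone.utc),
    }]
    print(build_feed("SITE-A", results))
```

Writing one such file per site, ROC or test type is what lets each reader subscribe only to the streams they care about.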

Page 36

Aggregator RSSReader (Windows Client)

GOC generates RSS feeds which clients can pull using an RSS aggregator.

Aggregators available for Linux, Windows and MacOS

The aggregator shown displays test results for the RAL CE. These results are archived and pop up on the desktop when the feed is updated.

Page 37

What next? (2)

• GGUS developments
  – operations issues forwarded to UK GSC helpdesk

• Weekly LCG GDA Operations Meeting
  – see next slide

• EGEE ROCs taking support load
  – UK ready?

• EGEE CICs taking operations load on weekly rotation

Page 38

Proposal

• 2 hour weekly meeting, with VRVS for remote participation
  – use the existing GDA slot
  – Fully open meeting

• Weekly operations reports (written in advance - previous Friday evening) from
  – Each EGEE ROC (NE should include Nordugrid ops)
  – Taipei GOC
  – Grid3 (covering FNAL and BNL Tier 1s)
  – Other LCG Tier 1 sites (where different from the above) - Triumf, Tokyo, others?
  – ROCs and Tier 1s will report on and represent the sites they support

• Weekly reports (written and submitted in advance) from customers:
  – LHC experiments
  – Bio-med
  – Others as they come on-line

• During the meeting only issues should be brought up and resolved

• Need to have good representation from ROCs and Tier 1s

• Need application reps involved in grid work to attend

• Once a month have more general discussions (presentation style), e.g.:
  – Middleware developments
  – Larger issues - batch system problems, etc.

• Minutes, attendance and problems will be public

Page 39

UK view

• RAL CIC will take on part of ongoing GOC work
  – including development for LCG/EGEE

• UK/I ROC will monitor and support UK/I sites
  – Helpdesk/DTeam/GOC
  – Maps tailored for Tier2s