Post on 24-Dec-2015


John Gordon

j.c.gordon@rl.ac.uk

LCG and Grid Operations

John Gordon

CCLRC e-Science Centre, UK

LCG Grid Operations

Outline

• The monitoring tools

• How we use them in operations

• What is still to be done

Grid Operations

• Once middleware has been developed, tested and deployed, grid operations are the set of actions and procedures to keep a grid running for the users.

The Vision

• GOC Processes and Activities

– Coordinating Grid Operations

– Defining Service Level Parameters

– Monitoring Service Performance Levels

– First-Level Fault Analysis

– Interacting with Local Support Groups

– Coordinating Security Activities

– Operations Development

Have we delivered?

• Coordinating Grid Operations: Yes, RAL, CERN & Taipei

• Defining Service Level Parameters: No

• Monitoring Service Performance Levels: up or down

• First-Level Fault Analysis: Yes

• Interacting with Local Support Groups: Yes

• Coordinating Security Activities: Policies, not operation

• Operations Development: Monitoring and accounting

Monitoring the Grid is a Challenge!

Monitoring Overview

Why We Monitor

• Keep systems up and running

• Notice failures; grid-wide services, MDS

• Knowing what services a site should be running

– no point raising an alert if the site isn’t meant to run it!

– definition of services and which sites run them (SLA)

What Tools Do We Use

• Job Submission; GridICE; Nagios; GIIS Monitor

• How – Database

• Developments planned: Nagios

Monitoring Challenges

• We have only fragmentary information about the services that sites are running.

• We don’t know what RBs/SEs/sites the VOs are using for data challenges.

• We don’t know what the core services are and who is running them.

• We don’t have a toolkit to test specific core services.

• We have to concentrate on the functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?

• Not all the tests that we perform are effective at finding problems, so we must take tests written by the experts and integrate them into GOC monitoring.

• We must develop tests which simulate the life cycle of real applications in a Grid environment.

• There are lots of monitoring tools available, so we need to bring them together.

• Do we spend time investigating new tools, or improve the ones we already have?

• …and probably lots more!

Monitoring Services

• There are many frameworks which can be used to monitor distributed environments

– MAPCENTER http://mapcenter.in2p3.fr/

– GPPMON http://goc.grid-support.ac.uk/

– GRIDICE http://grid-ice.esc.rl.ac.uk

– NAGIOS http://www.nagios.org/

– MONALISA http://monalisa.cacr.caltech.edu/

– GIIS Monitor http://goc.grid.sinica.edu.tw/gstat/

– Ganglia

• Configuring them by hand takes effort

– Example: MAPCENTER, 30 sites ~ 500 lines in config file (static version)

– Example: NAGIOS, 30 sites, 12 individual config files with dependencies

• Developed tools to configure these services (NAGIOS, MAPCENTER and GPPMON) to make the job easier
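The idea of generating monitoring configuration from a list of sites, rather than editing it by hand, can be sketched as follows. This is only an illustration and not the GOC tool: the `Site` record, the example hostnames and the `generic-host` template name are assumptions, and only the `define host { ... }` object syntax comes from Nagios itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Site:
    """One monitored host; the fields are illustrative, not the GOC schema."""
    name: str
    hostname: str
    service: str  # e.g. "ce", "se", "rb", "bdii"


def nagios_host_block(site: Site) -> str:
    """Render one Nagios 'define host' object for a site record."""
    return (
        "define host {\n"
        "    use        generic-host\n"  # assumed template name
        f"    host_name  {site.hostname}\n"
        f"    alias      {site.name} {site.service.upper()}\n"
        f"    address    {site.hostname}\n"
        "}\n"
    )


def render_hosts(sites: list[Site]) -> str:
    """Concatenate host blocks in a stable order so diffs stay readable."""
    ordered = sorted(sites, key=lambda s: (s.name, s.hostname))
    return "\n".join(nagios_host_block(s) for s in ordered)


# Hypothetical site list; in practice this would come from a database.
sites = [
    Site("Taipei", "rb.example-taipei.org", "rb"),
    Site("RAL", "ce.example-ral.org", "ce"),
]
config_text = render_hosts(sites)
print(config_text)
```

Keeping the site list as the single source and regenerating the files is what avoids maintaining many interdependent config files by hand.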

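GOC Configuration Database

[Diagram: Resource Centres (RC) connect over https to the GOC GridSite server, which is backed by a MySQL database accessed with SQL and used by the monitoring.]

• GOC GridSite, MySQL

• Resource Centre: resources & site information for EDG, LCG-1, LCG-2, … (ce, se, bdii, rb)

• Secure database management via HTTPS / X.509

• People, contact information, resources

• Scheduled maintenance

• Monitoring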
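A minimal sketch of how monitoring might query such a configuration database for the nodes it should test. Everything here is an assumption made for illustration: the table and column names are hypothetical, and Python's built-in `sqlite3` stands in for the MySQL server named on the slide.

```python
import sqlite3

# In-memory stand-in for the configuration database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT, grid TEXT);
CREATE TABLE node (site_id INTEGER REFERENCES site(id),
                   hostname TEXT, node_type TEXT, monitored INTEGER);
CREATE TABLE downtime (site_id INTEGER REFERENCES site(id),
                       starts TEXT, ends TEXT);
""")
conn.executemany("INSERT INTO site VALUES (?, ?, ?)",
                 [(1, "SiteA", "LCG-2"), (2, "SiteB", "LCG-2")])
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", [
    (1, "ce.sitea.example", "ce", 1),
    (1, "rb.sitea.example", "rb", 1),
    (1, "se.sitea.example", "se", 1),
    (2, "ce.siteb.example", "ce", 1),
])
# SiteB has declared scheduled maintenance for these dates.
conn.execute("INSERT INTO downtime VALUES (2, '2004-07-20', '2004-07-22')")


def nodes_to_test(conn, node_types, today):
    """Return (site, hostname, type) for monitored nodes not in downtime.

    Dates are ISO strings, so plain string comparison orders them correctly.
    """
    placeholders = ",".join("?" for _ in node_types)
    query = f"""
        SELECT s.name, n.hostname, n.node_type
        FROM node n JOIN site s ON s.id = n.site_id
        WHERE n.monitored = 1
          AND n.node_type IN ({placeholders})
          AND NOT EXISTS (
              SELECT 1 FROM downtime d
              WHERE d.site_id = s.id AND ? BETWEEN d.starts AND d.ends)
        ORDER BY s.name, n.hostname
    """
    return conn.execute(query, (*node_types, today)).fetchall()


print(nodes_to_test(conn, ["ce", "rb"], "2004-07-21"))
```

Recording scheduled maintenance next to the resource list lets the monitoring skip sites that are expected to be down instead of raising alerts for them.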

GPPMON - 2

GOC Job Submission Flow Diagram (Dave Kant)

Simple job forked on CE using Globus

[Data flow diagram, steps 1 to 5: the GOC (UI) builds its list of CE and RB resources with an SQL query on the site DB; a job script is created for each CE (GLOBUS.CE); the job is sent with globus-job-run CE; the job sends an acknowledgement with wget http://goc_ui/ack.cgi?GLOBUS.CE; the GOC (UI) receives the acknowledgement.]

Data Flow Diagram: a graphical means of presenting, describing or analyzing a process.
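The send-and-acknowledge pattern in this flow can be sketched as below. This is an illustration under stated assumptions rather than the GPPMON code: the `AckTracker` class, the wget path and the example CE names are hypothetical, while the `ack.cgi?GLOBUS.CE` token shape and the use of `globus-job-run` follow the slide. The command is only built, never executed.

```python
from urllib.parse import urlsplit

ACK_BASE = "http://goc_ui/ack.cgi"  # URL as written on the slide


def globus_command(ce: str) -> list:
    """Build (but do not run) the test job: wget calls back with a token."""
    ack_url = f"{ACK_BASE}?GLOBUS.{ce}"
    # The wget location on the CE is an assumption.
    return ["globus-job-run", ce, "/usr/bin/wget", "-q", "-O", "/dev/null", ack_url]


class AckTracker:
    """Remember which tokens were sent and which came back."""

    def __init__(self):
        self.pending = set()
        self.received = set()

    def sent(self, token: str) -> None:
        self.pending.add(token)

    def acknowledge(self, url: str) -> bool:
        """Handle one callback URL; the token is the whole query string."""
        token = urlsplit(url).query
        if token in self.pending:
            self.pending.remove(token)
            self.received.add(token)
            return True
        return False  # unknown or duplicate acknowledgement

    def failed(self) -> list:
        """Tokens that never acknowledged, i.e. sites with a problem."""
        return sorted(self.pending)


tracker = AckTracker()
for ce in ["ce1.example.org", "ce2.example.org"]:  # hypothetical CEs
    print(" ".join(globus_command(ce)))
    tracker.sent(f"GLOBUS.{ce}")

# Simulate only the first CE calling back.
tracker.acknowledge(f"{ACK_BASE}?GLOBUS.ce1.example.org")
print("no acknowledgement from:", tracker.failed())
```

A missing acknowledgement shows that the job did not run end to end, but not where it failed, which is the limitation the later slides return to.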

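GPPMON - 3

Simple job through local jobmanager on CE via Resource Broker job matchmaking (Dave Kant)

[Data flow diagram: the GOC (UI) builds its list of CE and RB resources with an SQL query on the site DB; a job script is created for each RB and CE pair (RB.CE), with the CE selected through Other.GlueCEUniqueID; the job is sent to the RB with edg-job-submit; the RB passes it to the CE, which runs it on a WN; the job sends an acknowledgement with wget http://goc_ui/ack.cgi?RB.CE; the GOC (UI) receives the acknowledgement.]

Data Flow Diagram: a graphical means of presenting, describing or analyzing a process.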
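Pinning a broker-submitted job to one CE is done in the job description through a requirement on `GlueCEUniqueID`. The sketch below generates such a JDL file as text; it is an assumption-laden illustration, not the GPPMON script. The `make_jdl` helper, the wget path, the output file names and the example CE identifier are hypothetical, and the requirement is written with a lowercase `other.` prefix, which is the usual JDL spelling.

```python
ACK_BASE = "http://goc_ui/ack.cgi"  # URL as written on the slide


def make_jdl(ce_id: str, ack_url: str) -> str:
    """Return JDL text for a job that must be matched to exactly one CE."""
    lines = [
        'Executable = "/usr/bin/wget";',  # assumed location on the WN
        f'Arguments = "-q -O /dev/null {ack_url}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        # Restrict matchmaking to the CE under test.
        f'Requirements = other.GlueCEUniqueID == "{ce_id}";',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical CE identifier in the host:port/jobmanager-queue form.
ce_id = "ce.example.org:2119/jobmanager-pbs-short"
jdl = make_jdl(ce_id, f"{ACK_BASE}?RB.ce.example.org")
print(jdl)
```

The text would be written to a file and handed to `edg-job-submit`; because the requirement leaves the broker only one candidate, a missing acknowledgement points at that RB and CE combination.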

GPPMON - 1

LCG2 Site Status: 21 July 2004, 10.00am

GRIDICE - 1

http://grid-ice.esc.rl.ac.uk/gridice

Ganglia Monitoring - 1

• http://gridpp.ac.uk/ganglia

• Can use Ganglia to monitor a cluster

RAL Tier-1 Centre

LCG PBS Server displays Job status for each VO

Ganglia Monitoring - 2

• Can also use Ganglia to monitor clusters of clusters

Regional Monitoring - 1

Provide ROCs with a package to monitor the resources in the region

• Tailored monitoring

• ROCs may upload their own maps

• Java GUI to automate site locations on the map

Hierarchical view of resources

• Example: GridPP made up of virtual T2 centres

[Hierarchy diagram: EGEE contains France, UK/I and S.E.E; UK/I contains GridPP; GridPP contains LondonT2 and ScotGrid; LondonT2 contains IMPERIAL and QMUL; ScotGrid contains Edinburgh.]
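A hierarchical view like this amounts to rolling site results up a tree. The sketch below is one possible way to do it, assuming the hierarchy drawn on the slide; the `rollup` helper and the pass/fail results are invented for illustration and are not real site status.

```python
# Parent -> children, following the hierarchy drawn on the slide.
HIERARCHY = {
    "EGEE": ["France", "UK/I", "S.E.E"],
    "UK/I": ["GridPP"],
    "GridPP": ["LondonT2", "ScotGrid"],
    "LondonT2": ["IMPERIAL", "QMUL"],
    "ScotGrid": ["Edinburgh"],
}


def leaf_sites(node: str) -> list:
    """Return the leaves below a node (a node without children is a leaf)."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    sites = []
    for child in children:
        sites.extend(leaf_sites(child))
    return sites


def rollup(node: str, site_ok: dict) -> tuple:
    """Return (passing, total) over leaves below node that have a result."""
    results = [site_ok[s] for s in leaf_sites(node) if s in site_ok]
    return sum(results), len(results)


# Hypothetical test outcomes, not real site status.
site_ok = {"IMPERIAL": True, "QMUL": False, "Edinburgh": True}

for level in ["LondonT2", "ScotGrid", "GridPP", "EGEE"]:
    passing, total = rollup(level, site_ok)
    print(f"{level}: {passing}/{total} sites passing")
```

Regions with no sites listed in this sketch (France, S.E.E) contribute nothing to the totals; a fuller tree would list their sites in the same way.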

Regional Monitoring - 2

http://goc.grid-support.ac.uk/roc_map/map.php

Active map to select individual regions

Regional Monitoring - 3

UK/I Monitoring displays GRIDPP and NGS resources.

Replica Manager Tests - 1

• GOC to take over site certification testing, which is done by the CERN deployment team on a daily basis (e.g. reports by Piotr Nyczyk)

• First step toward this involved running a series of replica manager tests which register files onto the grid, move them around and delete them, and make 3rd party copies from a remote SE, e.g. Castorgrid

• Demonstrates that we can integrate other people’s tools into GPPMON

• Development of a portal which will:

– Make it easy to retrieve debug information from the job output.

– Connect with information provided by other monitoring tools, e.g. Taipei GIIS Monitor.

– Provide testing “on-demand” to site administrators through a secure interface.

Replica Manager Tests - 2

http://goc.grid-support.ac.uk/gridsite/status/rmtest.php?action=table

Results of each test are shown as a coloured index on the map.

Distinguishes between jobs that have completed, have failed, or are still running.

Replica Manager Tests - 3

Description of the tests

Job Outputs

GIIS Monitor Information

GIIS Monitor

• Developed by Min Tsai (GOC Taipei)

• Tool to display and check information published by the site GIIS

• http://goc.grid.sinica.edu.tw/gstat/
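Checking what a site GIIS publishes can be as simple as a few sanity rules over each published entry. The sketch below is not the GIIS Monitor's logic: the rules and the `check_ce_entry` helper are assumptions chosen for illustration, and the entry is taken to be already parsed into a dict of strings. The attribute names are from the Glue schema.

```python
# Glue CE attributes this sketch expects every entry to publish.
REQUIRED = ("GlueCEUniqueID", "GlueCEStateStatus",
            "GlueCEInfoTotalCPUs", "GlueCEStateFreeCPUs")
NUMERIC = ("GlueCEInfoTotalCPUs", "GlueCEStateFreeCPUs")


def check_ce_entry(entry: dict) -> list:
    """Return a list of problems found in one published CE entry."""
    problems = [f"missing {attr}" for attr in REQUIRED if attr not in entry]

    numbers = {}
    for attr in NUMERIC:
        value = entry.get(attr)
        if value is None:
            continue  # already reported as missing
        if value.isdigit():
            numbers[attr] = int(value)
        else:
            problems.append(f"{attr} is not a non-negative integer: {value!r}")

    # Consistency rule: a CE cannot have more free CPUs than CPUs.
    if (len(numbers) == len(NUMERIC)
            and numbers["GlueCEStateFreeCPUs"] > numbers["GlueCEInfoTotalCPUs"]):
        problems.append("free CPUs exceed total CPUs")
    return problems


# Hypothetical published entries.
good = {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-pbs-short",
        "GlueCEStateStatus": "Production",
        "GlueCEInfoTotalCPUs": "40", "GlueCEStateFreeCPUs": "12"}
bad = {"GlueCEUniqueID": "ce.example.net:2119/jobmanager-pbs-long",
       "GlueCEInfoTotalCPUs": "4", "GlueCEStateFreeCPUs": "10"}

print(check_ce_entry(good))
print(check_ce_entry(bad))
```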

Job Accounting - 1

http://goc.grid-support.ac.uk/ROC/docs/accounting/accounting.php

Program publishes PBS log file information through R-GMA to the GOC

GOC aggregates data across all sites.
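The first half of that pipeline, turning PBS accounting log lines into per-job usage records, can be sketched as follows. This is a minimal illustration and not the GOC accounting program: it does not publish through R-GMA, the sample lines are invented, and treating the PBS `group` field as the VO is an assumption. The `date;type;job id;key=value ...` layout is that of PBS accounting logs, where type `E` marks a finished job.

```python
from collections import defaultdict


def hms_to_seconds(text: str) -> int:
    """Convert HH:MM:SS (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line: str):
    """Parse one accounting line; return a dict for 'E' records, else None."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    timestamp, _, job_id, message = parts
    # Simplification: assumes no value contains whitespace.
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return {
        "job_id": job_id,
        "ended": timestamp,
        "group": fields.get("group", "unknown"),
        "cpu_seconds": hms_to_seconds(fields.get("resources_used.cput", "00:00:00")),
        "wall_seconds": hms_to_seconds(fields.get("resources_used.walltime", "00:00:00")),
    }


def summarise(lines) -> dict:
    """Aggregate finished jobs per group."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0, "wall_seconds": 0})
    for line in lines:
        record = parse_end_record(line)
        if record is None:
            continue
        bucket = totals[record["group"]]
        bucket["jobs"] += 1
        bucket["cpu_seconds"] += record["cpu_seconds"]
        bucket["wall_seconds"] += record["wall_seconds"]
    return dict(totals)


# Invented sample lines in the PBS accounting layout.
SAMPLE_LOG = [
    "07/21/2004 09:00:00;Q;101.pbs.example.org;queue=short",
    "07/21/2004 09:12:30;E;101.pbs.example.org;user=alice group=atlas "
    "queue=short resources_used.cput=00:10:00 resources_used.walltime=00:12:30",
    "07/21/2004 10:05:00;E;102.pbs.example.org;user=bob group=atlas "
    "queue=long resources_used.cput=01:00:00 resources_used.walltime=01:01:40",
    "07/21/2004 10:30:00;E;103.pbs.example.org;user=carol group=cms "
    "queue=short resources_used.cput=00:05:00 resources_used.walltime=00:06:00",
]
summary = summarise(SAMPLE_LOG)
print(summary)
```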

Job Accounting - 2

• Offline testing of program using data from the CORE sites completed.

• Development of an accounting portal underway to provide accounting on-demand for each site, and aggregated for each EGEE region

• Challenge! Deal with a large database: one row per LCG PBS job per site!

• http://goc-dev.esc.rl.ac.uk/jpg/goc_demo.php

• http://goc-dev.esc.rl.ac.uk/jpg/goc_demo3.php
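With one row per job, per-site and per-region figures are aggregation queries over that table. The sketch below shows the shape of such a query under stated assumptions: the `job_record` table, its columns and the sample rows are hypothetical, and `sqlite3` stands in for whatever database the portal uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_record (site TEXT, region TEXT, vo TEXT,
                         cpu_seconds INTEGER, wall_seconds INTEGER);
-- An index on the grouping columns helps once the table grows large.
CREATE INDEX job_record_region_site ON job_record (region, site);
""")
# Invented rows: one per finished job.
conn.executemany("INSERT INTO job_record VALUES (?, ?, ?, ?, ?)", [
    ("SiteA", "UK/I", "atlas", 3600, 4000),
    ("SiteA", "UK/I", "cms", 1800, 2000),
    ("SiteB", "UK/I", "atlas", 7200, 7300),
    ("SiteC", "France", "lhcb", 600, 900),
])

ALLOWED = ("site", "region", "vo")


def usage_by(conn, column: str) -> list:
    """Return (key, job count, CPU hours) grouped by one allowed column."""
    if column not in ALLOWED:
        raise ValueError(f"cannot group by {column!r}")
    query = (f"SELECT {column}, COUNT(*), SUM(cpu_seconds) / 3600.0 "
             f"FROM job_record GROUP BY {column} ORDER BY {column}")
    return conn.execute(query).fetchall()


print(usage_by(conn, "site"))
print(usage_by(conn, "region"))
```

The column name is checked against a fixed list before it is placed in the SQL text, since column names cannot be passed as bound parameters.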

GridPP Accounting

EDG-network monitoring

Security

• Worked with Security Group

• Defined a Security Policy

– and auditing procedures

• Have a list for security contacts

– but not really exercised it yet

– still need to define procedures in the event of security incidents

Keeping the Work Flowing

• Regular monitoring of job submission

– shows sites that have problems running jobs

• Nagios tracks individual services

– plus certificate lifetime

• RM tests show whether data can be moved

• GridICE and Ganglia show what is running

• Limited by RB behaviour

– we can see that jobs are not getting to sites but not why.
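A certificate-lifetime check of the kind Nagios can run reduces to comparing the expiry time with two thresholds and returning the standard plugin status. The sketch below shows only that decision; the `check_lifetime` helper and the 14-day and 3-day thresholds are assumptions, and reading the expiry date out of a real certificate is left out. The exit codes 0, 1 and 2 are the Nagios plugin convention for OK, WARNING and CRITICAL.

```python
from datetime import datetime, timedelta

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def check_lifetime(not_after: datetime, now: datetime,
                   warn_days: int = 14, crit_days: int = 3) -> tuple:
    """Return (exit code, status line) for a certificate expiry time."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL, "CRITICAL: certificate has expired"
    days = remaining.days
    if remaining <= timedelta(days=crit_days):
        return CRITICAL, f"CRITICAL: certificate expires in {days} day(s)"
    if remaining <= timedelta(days=warn_days):
        return WARNING, f"WARNING: certificate expires in {days} day(s)"
    return OK, f"OK: certificate valid for {days} more day(s)"


now = datetime(2004, 7, 21, 10, 0)
for offset in (30, 10, 2, -1):  # hypothetical expiry offsets in days
    code, message = check_lifetime(now + timedelta(days=offset), now)
    print(code, message)
```

A real plugin would print the status line and exit with the code, which is how Nagios picks up the result.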

What have we delivered?

• A set of monitoring tools

• A monitoring regime

• Two GOCs (RAL and Taipei)

• Security Policy

Still to do

• Effective problem tracking

– we see site problems and get them fixed

– but don’t manage long-term problems

• Integration with User Support

– we track problems we see

– but problems users notice are not effectively dealt with

• Automatic alerts

– Nagios does this, but EMS from Taipei looks promising

• Remote repair

– agents until middleware can support this directly

• Security

• Deploy accounting

• Distribute monitoring to EGEE ROCs and others

What Next ? (1)

• RSS used to send tailored streams

– sites, ROCs, management can all decide what to subscribe to

• Accounting

– being tested in LCG C&T testbed

– should be in next LCG release

– then get T2 accounts

• Keep your PBS logs, messages and gatekeeper logs

Monitoring Feeds

• GOC server generates a lot of monitoring information.

• Need a way to give this information to the right people, e.g. site administrators

• Really Simple Syndication (RSS) is an XML-based format

• Used by many sites which want to syndicate content, e.g. BBC, Slashdot

• Client pull model: GOC creates RSS formatted documents; clients pull these feeds and render them in HTML.
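Producing such a feed needs little more than the standard library. The sketch below builds a small RSS 2.0 document from a list of test results; the `build_feed` helper, the channel link and the results themselves are assumptions for illustration and do not reflect the feeds the GOC actually publishes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_feed(title: str, link: str, results: list) -> str:
    """Return an RSS 2.0 document with one <item> per test result."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Monitoring test results"
    for result in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = (
            f"{result['site']}: {result['test']} {result['status']}")
        ET.SubElement(item, "description").text = result["detail"]
        # RSS dates use the RFC 822 style that format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(result["when"])
    return ET.tostring(rss, encoding="unicode")


# Invented results; the link is a placeholder.
when = datetime(2004, 7, 21, 10, 0, tzinfo=timezone.utc)
results = [
    {"site": "RAL", "test": "job submission", "status": "OK",
     "detail": "Acknowledgement received", "when": when},
    {"site": "SiteB", "test": "job submission", "status": "FAILED",
     "detail": "No acknowledgement received", "when": when},
]
feed_xml = build_feed("Site test results", "http://example.org/feeds/", results)
print(feed_xml)
```

Generating one feed per site or per region is what would let each audience subscribe only to the stream it cares about.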

Aggregator RSSReader (Windows Client)

GOC generates RSS feeds which clients can pull using an RSS aggregator.

Aggregators available for Linux, Windows and MacOS

The aggregator shown displays test results for the RAL CE. These results are archived and pop up on the desktop when the feed is updated.

What next? (2)

• GGUS developments

– operations issues forwarded to UK GSC helpdesk

• Weekly LCG GDA Operations Meeting

– see next slide

• EGEE ROCs taking support load

– UK ready?

• EGEE CICs taking operations load on weekly rotation

Proposal

• 2 hour weekly meeting, with VRVS for remote participation

– use the existing GDA slot

– fully open meeting

• Weekly operations reports (written in advance, previous Friday evening) from

– each EGEE ROC (NE should include Nordugrid ops)

– Taipei GOC

– Grid3 (covering FNAL and BNL Tier 1s)

– other LCG Tier 1 sites (where different from the above): Triumf, Tokyo, others?

– ROCs and Tier 1s will report on and represent the sites they support

• Weekly reports (written and submitted in advance) from customers:

– LHC experiments

– Bio-med

– others as they come on-line

• During the meeting only issues should be brought up and resolved

• Need to have good representation from ROCs and Tier 1s

• Need application reps involved in grid work to attend

• Once a month have more general discussions (presentation style), e.g.:

– middleware developments

– larger issues: batch system problems, etc.

• Minutes, attendance and problems will be public

UK view

• RAL CIC will take on part of ongoing GOC work

– including development for LCG/EGEE

• UK/I ROC will monitor and support UK/I sites

– Helpdesk/DTeam/GOC

– Maps tailored for Tier2s
