GridPP Monitoring & Accounting Dave Kant CCLRC, e-Science Centre

Post on 28-Mar-2015

GridPP Monitoring & Accounting

Dave Kant

CCLRC, e-Science Centre

EGEE’03, April 2005 - 2

Monitoring Overview

1. Overview

2. How Many Jobs on the Grid?

3. LCG/EGEE Monitoring System

4. Putting it all together for GridPP

5. Future Plans

How Many Jobs on the Grid?

• This question is a way to introduce the various tools that are in development in the LCG/EGEE Grid.

• There are different sources for estimating the number of jobs:
– Information System
– Accounting System
– Resource Brokers

How Many Jobs on the Grid?

• One source of information is the monitoring system based on R-GMA

• Tools which gather information and use the R-GMA backbone for data collection:
– GIIS Monitor
– APEL
– Site Functional Tests

• Tools which create reports:
– RB Logging & Bookkeeping data mining
– Accounting

GIIS Monitor

http://goc.grid.sinica.edu.tw/gstat/

• GIIS Monitor developed by GOC Taipei (Min Tsai)

• Tool to display and check information published by the site GIIS

• Sanity checks, fault detection of information system every 5 minutes

• Provides an instantaneous snapshot of the number of jobs

How Many Jobs on the Grid?

• Another source of information is the accounting system which, like so many sources, is not complete but covers most of the resources.

• This is not the case for GridPP resources.

• Accounting information is based on resource usage published by batch servers

How Many Jobs on the Grid?

Latest source is a data mining tool which can be used to examine RB Logging and Bookkeeping information (via R-GMA) at the user level.

https://lxn1192.cern.ch:9443/~judit/job-monitor.cgi

How Many Jobs on the Grid?

• A further source is based on the work of the EGEE QA Team
– They monitor several, but not all, resource brokers on LCG and create reports of their usage.
– http://egee-jra2.web.cern.ch/EGEE-JRA2/index.html
– Statistics based on aggregated information

• Job success and job throughput per VO and per RB

• Grid efficiency (Execution time vs Waiting Time)

How Many Jobs on the Grid?

• Job Duration showing a dominance of Dteam and LHCb jobs which are relatively short lived.

Site Functional Tests

• Installation and configuration of a site is quite a complicated procedure.

- When there is a new release, sites don’t upgrade at the same time.
- Some upgrades don’t always go smoothly
- Unexpected things happen (who turned off the power?)
- Day-to-day problems; robustness of service under load?

• SFT framework consists of a number of tests which probe a site to determine the operational status.

• The tests cover all certified sites in the EGEE/LCG infrastructure, as well as uncertified sites (for the internal certification process performed by ROCs), sites that are part of the gLite Pre-Production Service, and all other sites that use LCG or gLite middleware

SFT

Site summaries and histories

SFT used by ROCs for certification

Grid–Ireland SFT

• SFT runs every 3 hours and writes test results to a database using R-GMA

GridPP Monitoring Map

http://map.gridpp.ac.uk/

Links hourly job submission test results to SFT, GSTAT, RSS Feeds and Accounting data

GPPMon is a lightweight test which sends a simple job to GridPP resources every hour.

Future Plans for GPPMon

• GPPMon (the GridPP monitor) is to be switched off
– SFT2 runs every 3 hours and sites/ROCs can run these tests independently, so there is no real need for these jobs.

• Proposal is to link the GridPP monitoring map to the monitoring data in R-GMA and make use of changes to the grid middleware, e.g. support for longitude and latitude in the Glue Schema (LCG 2.6).
– Google Map

Google Map

http://goc03.grid-support.ac.uk/googlemaps/gridpp.html

Accounting Overview

This is a summary of the status of Accounting & Reporting following its deployment in LCG2_6

1. Overview

2. APEL Design

3. What’s New?

4. LCG Accounting (OSG, NorduGrid, EGEE)

5. Issues

Accounting Flow Diagram

Accounting Home Page

107 Sites publishing data (Sep 02 2005)

Over 3.3 Million Job records

~ 100K records per week (period June 1st – mid Aug 2005)

http://goc.grid-support.ac.uk//

What’s New?

• Added GridPP View to the reporting interface

• Requirements driven by GridPP
– Global view of entire organisation
– Tier-2 Summaries
– Detailed view at Site level
– CSV download of information
– Toggle between Normalised / Un-normalised Datasets

http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.html

GridPP Input

• GridPP Metrics and Deployment Document (J. Coles)
– Metric 10: Number of sites publishing accounting data at the end of the last quarter
– Metric 11: KSI2K hours of CPU processing delivered (per VO) over the last quarter

• We are looking for meaningful plots that allow important conclusions to be drawn without misleading people

• Is Job Efficiency meaningful?

• Sites treat their data in different ways:
– At Tier-1, WCT is scaled because of the scheduler
– At other sites, only system time is scaled
– What about Hyper-Threading?

• Perhaps we need to provide descriptive text against each plot to warn of such problems?

• Spot potential problems in resource allocation

• Identify trends
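Metric 11 above counts KSI2K hours per VO. A minimal sketch of how such a figure could be computed is given below. It assumes the common convention that CPU time is scaled by the worker node's SpecInt2000 rating divided by 1000 (1 KSI2K = 1000 SI2K units); the function names and the record layout are hypothetical and are not taken from the accounting system itself.

```python
from collections import defaultdict


def ksi2k_hours(cpu_seconds, si2k_rating):
    """Normalise CPU time to KSI2K hours.

    Assumption: a node rated at 1000 SpecInt2000 delivers exactly
    1 KSI2K hour per hour of CPU time.
    """
    return (cpu_seconds / 3600.0) * (si2k_rating / 1000.0)


def ksi2k_hours_per_vo(records):
    """Sum normalised CPU time per VO.

    `records` is an iterable of (vo, cpu_seconds, si2k_rating) tuples,
    a hypothetical layout chosen for this illustration.
    """
    totals = defaultdict(float)
    for vo, cpu_seconds, si2k_rating in records:
        totals[vo] += ksi2k_hours(cpu_seconds, si2k_rating)
    return dict(totals)


# Made-up example records, not real accounting data.
example_records = [
    ("atlas", 3600, 1000),  # 1 CPU hour on a 1000 SI2K node
    ("atlas", 7200, 500),   # 2 CPU hours on a 500 SI2K node
    ("lhcb", 3600, 2000),   # 1 CPU hour on a 2000 SI2K node
]
per_vo = ksi2k_hours_per_vo(example_records)
```

Because the scaling factor differs from site to site, the same raw CPU time can produce different normalised totals, which is one reason for the toggle between normalised and un-normalised datasets.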

GridPP View Screen Shots

Atlas and LHCb dominating

KSI2K delivered per Tier1/Tier2 per VO

Atlas dominates in Tier1

Job Efficiency = CPUT/WCT

Why is the Atlas efficiency at 60%? Why is the DZERO efficiency for MANHEP > 1?
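The efficiency definition above (CPUT/WCT) can be sketched in a few lines, together with a check that flags values above 1, which usually point to inconsistent scaling of CPU time and wall clock time rather than to real usage. The class, the field names and the numbers are illustrative assumptions, not the reporting tool's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class UsageSummary:
    site: str
    vo: str
    cpu_seconds: float   # summed CPU time (CPUT)
    wall_seconds: float  # summed wall clock time (WCT)


def job_efficiency(summary):
    """Return CPUT/WCT, or None when no wall clock time was recorded."""
    if summary.wall_seconds <= 0:
        return None
    return summary.cpu_seconds / summary.wall_seconds


def suspicious(summaries):
    """Yield (summary, efficiency) pairs whose efficiency exceeds 1."""
    for summary in summaries:
        eff = job_efficiency(summary)
        if eff is not None and eff > 1.0:
            yield summary, eff


# Made-up numbers for illustration only.
rows = [
    UsageSummary("SITE-A", "vo1", 600.0, 1000.0),
    UsageSummary("SITE-B", "vo2", 1200.0, 1000.0),
]
flagged = list(suspicious(rows))
```

A row flagged this way is a candidate for the descriptive warning text proposed for the plots.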

Tier2 View (NorthGrid)

Site View (Lancaster)

Breakdown of data per VO per month showing Njobs, CPUT, WCT, record history

Total CPU Usage per VO

Gantt Chart. NB: Gaps across all VOs are consistent with scheduled downtimes in GocDB

APEL IN LCG 2.6

• New version with better documentation

• APEL supports PBS and LSF

• Consists of a number of components

• Core module contains functionality common to all components

• Plugin components provide log parsing functionality for PBS and LSF job managers.
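The core-plus-plugin split described above can be illustrated with a small sketch. This is not APEL's code: the class names, the returned record layout and the sample log lines are assumptions made for illustration. The parser relies on the semicolon-separated layout of PBS accounting records (timestamp; record type; job id; space-separated key=value pairs) and keeps only end-of-job ("E") records.

```python
from abc import ABC, abstractmethod


class LogParser(ABC):
    """Plugin interface: one subclass per batch system (hypothetical)."""

    @abstractmethod
    def parse_line(self, line):
        """Return a usage record dict, or None if the line is not a job end."""


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


class PbsParser(LogParser):
    """Plugin for PBS-style accounting lines (illustrative only)."""

    def parse_line(self, line):
        parts = line.strip().split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            return None  # only end-of-job records carry resource usage
        job_id, message = parts[2], parts[3]
        fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
        return {
            "job_id": job_id,
            "user": fields.get("user"),
            "cpu_seconds": hms_to_seconds(fields.get("resources_used.cput", "00:00:00")),
            "wall_seconds": hms_to_seconds(fields.get("resources_used.walltime", "00:00:00")),
        }


def collect(parser, lines):
    """Core logic shared by all plugins: parse and drop non-job lines."""
    return [rec for rec in (parser.parse_line(l) for l in lines) if rec is not None]


# Made-up sample lines in the PBS accounting layout.
sample_lines = [
    "09/02/2005 10:00:00;Q;42.ce.example.org;queue=short",
    "09/02/2005 10:15:00;E;42.ce.example.org;user=alice group=vo1 "
    "resources_used.cput=00:10:00 resources_used.walltime=00:12:30",
]
records = collect(PbsParser(), sample_lines)
```

Supporting LSF under this design would mean adding a second `LogParser` subclass while `collect` stays unchanged.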

Accounting Dissemination

1. CERN Courier

2. LCG Computing Newsletter (slightly more technical)

3. AHM 2005 (more technical still)

APEL and gLite

• Is APEL integrated in gLite?
– Work currently in progress.
– We have ported the APEL code into the gLite CVS repository but need to understand functional differences, e.g. WMS and use of Condor

• What about its development plan?
– Future unclear given presence of DGAS in gLite
– Areas of possible development:

• Condor (easy or complicated)

• Reporting Tool (GridICE will most likely provide this)

LCG Accounting

Project involves combining results from all three infrastructures and presenting an aggregated view

Peer Infrastructures in LCG

• Open Science Grid (Ruth Pordes, Philippe Canal, Matteo Melani)

• NorduGrid (Per Oster)

• EGEE

• Currently, LHCView filters LHC VO data from EGEE accounting data.

Requirements

Combine results from all three infrastructures …

Ideally: Distributed queries to multiple databases

• Each peer manages an accounting database

• LHC VO filtering provided through a web services interface

Initial Implementation: Centralised Collection

• Peers publish data into a global database

• Web services or direct MySQL inserts

Common Problem: Different Grid infrastructures may use different schemas. GGF defines a schema, but it is quite flexible.

May need “translators” to convert from one schema to another (these already exist).
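The centralised collection with schema translators can be sketched as follows. The two peer schemas, their field names and the sample records are invented for illustration and do not describe the real OSG, NorduGrid or EGEE schemas; only the set of LHC VO names is factual.

```python
# The four LHC experiments' VOs, used for LHC-only filtering.
LHC_VOS = {"alice", "atlas", "cms", "lhcb"}


def translate_peer_a(rec):
    """Hypothetical peer schema A: durations already in seconds."""
    return {
        "site": rec["Site"],
        "vo": rec["VO"].lower(),
        "cpu_seconds": float(rec["CpuSeconds"]),
        "wall_seconds": float(rec["WallSeconds"]),
    }


def translate_peer_b(rec):
    """Hypothetical peer schema B: durations in hours, different names."""
    return {
        "site": rec["resource"],
        "vo": rec["vo_name"].lower(),
        "cpu_seconds": rec["cpu_hours"] * 3600.0,
        "wall_seconds": rec["wall_hours"] * 3600.0,
    }


# One translator per peer infrastructure (names are placeholders).
TRANSLATORS = {"peer_a": translate_peer_a, "peer_b": translate_peer_b}


def collect_lhc(published):
    """Translate (peer_name, raw_record) pairs into the common layout
    and keep only records that belong to an LHC VO."""
    collected = []
    for peer, raw in published:
        record = TRANSLATORS[peer](raw)
        if record["vo"] in LHC_VOS:
            collected.append(record)
    return collected


# Made-up records for illustration only.
published = [
    ("peer_a", {"Site": "SITE-A", "VO": "ATLAS", "CpuSeconds": 600, "WallSeconds": 900}),
    ("peer_b", {"resource": "SITE-B", "vo_name": "cms", "cpu_hours": 2, "wall_hours": 3}),
    ("peer_b", {"resource": "SITE-B", "vo_name": "biomed", "cpu_hours": 1, "wall_hours": 1}),
]
global_view = collect_lhc(published)
```

In the distributed variant, the same translators would sit behind each peer's web service interface instead of in front of a single global database.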