Operations structure of the INFN-GRID/Grid.it Production Grid Infrastructure. Presenter (on behalf of the authors): Cristina Vistoli, [email protected], Italian grid operation manager, INFN CNAF – Bologna - Italy


Post on 30-Dec-2015

Page 1: Operations structure of the  INFN-GRID/Grid.it  Production Grid Infrastructure

Operations structure of the INFN-GRID/Grid.it

Production Grid Infrastructure

Presenter (on behalf of the authors): Cristina Vistoli

[email protected]

Italian grid operation manager, INFN CNAF – Bologna - Italy

Page 2

Production Quality Grid Infrastructure

• Status of the infrastructure

• Operations structure and organization

• Grid monitoring and management

• Usage report and accounting

• User and operation support

Page 3

The Italian Grid Production Infrastructure

About 40 Resource Centers

The grid resources can be accessed through central or VO-specific services (e.g. Resource Brokers)

28 sites are also part of the EGEE/LCG Grid infrastructure (and are registered in the central database of the Grid Operation Center)

The other 12 sites can be accessed through the Italian grid services only

http://grid-it.cnaf.infn.it

Page 4

Production Infrastructure: Resources

Page 5

InfnGrid-2_7_0

• InfnGrid-2_7_0 customization of LCG-2_7_0:

  – Support for the following VOs:

    • egrid, babar, zeus, biomed, magic, esr, cms, atlas, lhcb, alice (managed via LDAP VO server);

    • pamela, infngrid, cdf, gridit, compchem, planck, bio, enea, theophys, ingv, inaf, virgo, argo (managed via VOMS server);

    • euchina, eumed (optional and managed via VOMS server).

  – DGAS (DataGrid Accounting System):

    • Patched WMS lcg2.1.73 on the Resource Broker to support DGAS

    • DGAS HLR (Home Location Register) server: it is responsible for keeping the accounting information for both users and grid resources.

  – Network Monitor Element, interfaced with GridIce for data presentation.
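The VO support listed on this slide can be read as a small lookup table. The sketch below is only an illustration: the VO names are copied from the slide, while the function name `membership_backend` and its return labels are assumptions and are not part of ig-yaim or LCG.

```python
# VO names are taken from the slide; everything else here is hypothetical.
LDAP_VOS = {"egrid", "babar", "zeus", "biomed", "magic", "esr",
            "cms", "atlas", "lhcb", "alice"}
VOMS_VOS = {"pamela", "infngrid", "cdf", "gridit", "compchem", "planck",
            "bio", "enea", "theophys", "ingv", "inaf", "virgo", "argo"}
OPTIONAL_VOMS_VOS = {"euchina", "eumed"}


def membership_backend(vo):
    """Return (backend, optional) for a VO named on the slide."""
    name = vo.lower()
    if name in LDAP_VOS:
        return ("ldap", False)
    if name in VOMS_VOS:
        return ("voms", False)
    if name in OPTIONAL_VOMS_VOS:
        return ("voms", True)
    raise KeyError(f"VO not listed for InfnGrid-2_7_0: {vo}")


for vo in ("cms", "virgo", "eumed"):
    print(vo, membership_backend(vo))
```

Keeping the three groups as separate sets mirrors the three bullets of the slide, so a change in how a VO is managed is a one-line move between sets.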

Page 6

InfnGrid-2_7_0

– Support for MPI jobs via home synchronization with scp with host-based authentication

– Customized tools to install and use the grid:

  • installation by a customized version of LCG yaim (ig-yaim)

  • support to interface ig-yaim with a Quattor installation;

  • UIPnP: a PlugAndPlay User Interface to access the grid as a user of any Linux system, without RPMs.

Page 7

InfnGrid-2_7_0 : deployed services

[Diagram: services deployed in INFNGRID-2_7_0: FTS, LFC, MyProxy, RB (DGAS), VOMS, GridIce, BDII, HLR]

Page 8

Operations Structure and Organization

The National Grid Central Management Team (CMT):

– Activities:

  • ‘integration’ and testing of the InfnGrid middleware release (based on LCG m/w release)

  • deployment procedures and configuration tools

  • monitoring and control of the status of the grid services and resources

– Responsibilities:

  • site registration procedure

  • middleware deployment

  • certification procedure for all InfnGrid sites

  • operation of the grid services

Page 9

Grid Central Management Team

• Deployment Plan

  – The team coordinates the installation and deployment of the grid services.

  – A plan is provided to:

    • ensure that the user support and service level provided to the grid users during the upgrade period are acceptable

    • simplify the certification activities (all resources are thoroughly tested before joining the infrastructure).

• Site registration procedure

• Site certification procedure

Page 10

Operations Support

• The Italian ROC provides local front-line support to Virtual Organizations, Users and Resource Centres

• The Italian ROC team is organized in daily shifts:

  – 2 people per shift, 2 shifts per day, from Monday to Friday.

• Activities planned during the shift:

  – log trouble tickets created, updated and closed, and problems on grid services and sites; monitor successful site certification

  – check the actions of the previous shift and the downtime page

  – check the status of production grid services and the GRIS status of production CE and SE

  – check the status of the production sites using the Site Functional Tests report

• Periodic (every 15 days) phone conferences

  – ROC/CIC teams and site managers

• Provide and write the ROC report for the weekly EGEE operation meeting
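The shift arrangement on this slide (2 people per shift, 2 shifts per day, Monday to Friday) can be sketched as a small roster generator. The operator names and the round-robin assignment are assumptions made for illustration; the slides do not say how people are assigned to shifts.

```python
from itertools import cycle

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
SHIFTS_PER_DAY = 2
PEOPLE_PER_SHIFT = 2


def build_roster(operators):
    """Assign operators round-robin to every weekday shift (hypothetical policy)."""
    if len(operators) < PEOPLE_PER_SHIFT:
        raise ValueError("need at least two operators to staff a shift")
    pool = cycle(operators)
    roster = []
    for day in DAYS:
        for shift in range(1, SHIFTS_PER_DAY + 1):
            crew = [next(pool) for _ in range(PEOPLE_PER_SHIFT)]
            roster.append((day, shift, crew))
    return roster


# Placeholder operator names, not real ROC staff.
roster = build_roster(["op-a", "op-b", "op-c", "op-d", "op-e"])
person_shifts = sum(len(crew) for _, _, crew in roster)
print(len(roster), "shifts,", person_shifts, "person-shifts per week")
```

With these numbers a week has 10 shifts and 20 person-shifts to cover, which is the staffing load the ROC team has to spread across its members.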

Page 11

Grid Monitoring

• The status of the Italian grid infrastructure is monitored using GridIce

  – It is one of the monitoring tools used by EGEE

  – It is used to control

    • the status of the submitting queues

    • process/daemon status in the services (RB, BDII)

    • VO view: list of CE and SE available for the VOs and their status and capacity

    • job monitoring

Page 12

Monitoring

Page 13

Accounting

• The DataGrid Accounting System (DGAS) has been developed within the EDG and EGEE projects.

  – It implements resource usage metering and economic accounting in a fully distributed grid environment

  – It is part of the InfnGrid middleware release and has been deployed on the Italian Grid Infrastructure

  – Grid computing resources and grid users are registered in appropriate servers, known as HLRs (Home Location Registers), which keep track of every submitted job. An arbitrary number of HLR servers can be used.

Page 14

DGAS HLR flow

Page 15

Accounting

• Accounting data can be retrieved from the HLRs with different aggregation levels:

  – single user

  – group of users

  – VO

  – resource

• A functional test has been developed and is used to monitor the stability of the service. It checks the functionality of the sensors and services running on the CE and the communication between CEs and HLRs

• DGAS data for the Italian Grid are aggregated/anonymized and provided to EGEE through an appropriate interface to APEL.

• More information at http://www.to.infn.it/grid/accounting/
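The four aggregation levels listed on this slide can be illustrated with a small grouping function. This is a sketch under stated assumptions: the record fields (`user`, `group`, `vo`, `resource`, `wall_s`, `cpu_s`) and the sample values are invented for the example and do not reflect the actual DGAS schema or query interface.

```python
from collections import defaultdict

# Hypothetical usage records; field names and values are illustrative only.
RECORDS = [
    {"user": "alice", "group": "cms-prod", "vo": "cms",
     "resource": "ce01.example.org", "wall_s": 7200, "cpu_s": 6300},
    {"user": "bob", "group": "cms-prod", "vo": "cms",
     "resource": "ce02.example.org", "wall_s": 3600, "cpu_s": 3000},
    {"user": "carol", "group": "atlas-user", "vo": "atlas",
     "resource": "ce01.example.org", "wall_s": 1800, "cpu_s": 1700},
]

LEVELS = ("user", "group", "vo", "resource")


def aggregate(records, level):
    """Sum job count, wall time and CPU time per distinct value of `level`."""
    if level not in LEVELS:
        raise ValueError(f"unknown aggregation level: {level}")
    totals = defaultdict(lambda: {"jobs": 0, "wall_s": 0, "cpu_s": 0})
    for rec in records:
        bucket = totals[rec[level]]
        bucket["jobs"] += 1
        bucket["wall_s"] += rec["wall_s"]
        bucket["cpu_s"] += rec["cpu_s"]
    return dict(totals)


for level in LEVELS:
    print(level, aggregate(RECORDS, level))
```

The same records feed every level; only the grouping key changes, which is why one set of per-job records is enough to serve per-user, per-group, per-VO and per-resource reports.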

Page 16

Jobs per week: CMS

[Chart: vertical axis from 0 to 16000]

[Chart: "WallTime @ CPUTime"; horizontal axis: Week; vertical axis: Hour, from 0 to 70000; series: SUM(wallTime/3600) and SUM(cpuTime/3600)]
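The two series named in the chart, SUM(wallTime/3600) and SUM(cpuTime/3600), convert per-job seconds into hours and sum them per week. A minimal sketch of that computation follows. The job records, including their dates and year, are made up for the example, and grouping by ISO week is an assumption; the slides do not say how weeks are delimited.

```python
from collections import defaultdict
from datetime import date

# Made-up job records: (end date, wall time in seconds, CPU time in seconds).
JOBS = [
    (date(2006, 1, 16), 7200, 3600),
    (date(2006, 1, 18), 3600, 1800),
    (date(2006, 1, 24), 1800, 900),
]


def weekly_hours(jobs):
    """Return {(iso_year, iso_week): {"jobs", "wall_h", "cpu_h"}}."""
    weeks = defaultdict(lambda: {"jobs": 0, "wall_h": 0.0, "cpu_h": 0.0})
    for day, wall_s, cpu_s in jobs:
        iso = day.isocalendar()
        bucket = weeks[(iso[0], iso[1])]
        bucket["jobs"] += 1
        bucket["wall_h"] += wall_s / 3600  # SUM(wallTime/3600)
        bucket["cpu_h"] += cpu_s / 3600    # SUM(cpuTime/3600)
    return dict(weeks)


for week, totals in sorted(weekly_hours(JOBS).items()):
    print(week, totals)
```

Comparing `cpu_h` with `wall_h` for the same week gives a rough view of how much of the occupied time was actually spent computing.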

Page 17

Jobs per site (January 15 – 31)

Total jobs = 179,310

Page 18

Jobs per site (January 15 – 31)

Page 19

Jobs per VO (January 15 – 31)

Page 20

Jobs report (January 15 – 31)

Page 21

User, Operation and VO support

• The user support system provides ticket exchange between:

  – ROC on Duty and site managers

  – Site managers and Central Management Team, and vice versa

  – Site manager and certification team during installation/upgrade

  – GGUS to ROC, ROC to GGUS

Page 22

The support system

• The Italian ROC ticketing system is built upon a suite of web-based tools written in PHP: XHelp

• The support system components are accessible from the main interface of the deployment portal (grid-it.cnaf.infn.it), which provides a certificate-based SSO point of registration/identification.

• The end user can open a request, view and follow his own tickets and related replies;

• A supporter can view tickets assigned to his own groups, add responses and solutions, and change status/priority

• While working on tickets, side content is always available to all classes of users (according to their access level):

  – Site Functional Tests

  – site downtime calendar system

  – file archive

  – net query tools

  – IRC applet, contextual questions and answers

  – reports from daily shifts

Page 23

Page 24

Interface with GGUS

• The Italian ROC support system is interfaced to the GGUS helpdesk application using web-services technologies

  – Secure methods to create and update trouble tickets in the GGUS database are provided by the GGUS application.

  – These methods are called by APIs that wrap the ticket information stored in the XHelp database into SOAP messages and send them to the WSDL contact URL.

• A trouble ticket submitted by a local user to the XHelp helpdesk that cannot be addressed locally can be escalated by the local supporter across the ROC boundaries.

• The system allows for ticket assignment to any other support unit of GGUS as well as to all other ROC helpdesks connected to GGUS via the interface.

• The ticket is shared among the databases of all the helpdesks involved in the workflow; it can be updated from every source, and any update propagates to all the other systems.
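The step of wrapping ticket information into a SOAP message can be sketched with the Python standard library. This is only an illustration of the idea: the operation name `TicketCreate`, the ticket fields, the sample values and the `urn:example:helpdesk` namespace are all placeholders, not the real GGUS interface, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
TICKET_NS = "urn:example:helpdesk"  # placeholder, not the real GGUS namespace


def ticket_to_soap(operation, ticket):
    """Wrap a ticket dict into a SOAP 1.1 envelope and return it as a string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{TICKET_NS}}}{operation}")
    for field, value in ticket.items():
        # One child element per ticket field (hypothetical field names).
        ET.SubElement(call, f"{{{TICKET_NS}}}{field}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Invented ticket content, standing in for a row of the local helpdesk database.
ticket = {"id": 42, "subject": "CE not publishing",
          "status": "open", "assigned_to": "ROC_Italy"}
message = ticket_to_soap("TicketCreate", ticket)
print(message)
```

In a real deployment the operation names and message layout would come from the WSDL published by the remote helpdesk, and the resulting envelope would be posted to the contact URL over a secured connection.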

Page 25

GGUS-ROC Basic Workflow

[Diagram labels: Web Portal; GGUS System; GGUS/TPM; ROC-1 Interface; ROC-1 Helpdesk with support units SU-1, SU-2, SU-N; ROC-X Interface; ROC-X Helpdesk with support units SU-1, SU-2, SU-N. Arrow labels: "Ticket assignment to ROC-1", "Ticket solved", "Ticket re-assigned"]

Page 26

A new ticket comes from GGUS

We assign the ticket to the site

GGUS!

Page 27

The site's support group reassigns the ticket to GGUS

…and adds a response!

Page 28

Trouble tickets statistics

Page 29

Authors

VISTOLI, Maria Cristina, INFN-CNAF

GAIDO, Luciano, INFN-Torino

SELMI, Matteo, INFN-CNAF

PAGANO, Alfredo, INFN-CNAF

AIFTIMIEI, Cristina, INFN-Padova

CUSCELA, Guido, INFN-Bari

CAVALLI, Alessandro, INFN-CNAF

FERRO, Enrico, INFN-Padova

FANZAGO, Federica, INFN-Padova

FANTINEL, Sergio, INFN-LNL

VACCAROSSA, Luca, INFN-Milano

CESINI, Daniele, INFN-CNAF

PAOLINI, Alessandro, INFN-CNAF

VERONESI, Paolo, INFN-CNAF

CAROTA, Luciana, INFN-CNAF

NEBIOLO, Federico, INFN-Torino

CALTRONI, Andrea, INFN-Padova

DONVITO, Giacinto, INFN-Bari

VERLATO, Marco, INFN-Padova

BAGNASCO, Stefano, INFN-Torino

BRUNETTI, Riccardo, INFN-Torino

DACRUZ, Marcio, INFN-Milano

BARCHIESI, Alex, INFN-Roma

FIORE, Sandro, Univ. Lecce

ARGENTATI, Sabrina, INFN-LNF

DALLA FINA, Simone, INFN-Padova

DELLE FRATTE, Cesare, INFN-Roma2

TURRISI, Rosario, INFN-Padova

GREGORETTI, Francesco, CNR-ICAR Napoli