SEE-GRID-SCI Operations Procedures and Tools

Page 1: SEE-GRID-SCI Operations Procedures and Tools

www.see-grid-sci.eu

SEE-GRID-SCI

SEE-GRID-SCI Operations Procedures and Tools

Antun Balaz

Institute of Physics Belgrade, Serbia

[email protected]

The SEE-GRID-SCI initiative is co-funded by the European Commission under the FP7 Research Infrastructures contract no. 211338

Regional SEE-GRID-SCI Training for Site Administrators

Institute of Physics Belgrade

March 5-6, 2009

Page 2: SEE-GRID-SCI Operations Procedures and Tools

Overview

• SEE-GRID operational and monitoring tools (and their relation to EGEE tools)
    • HGSM/GOCDB
    • Helpdesk/GGUS
    • BBmSAM/SAM
    • GStat
    • Nagios/CIC portal
    • Accounting portal
• Downtime procedures
• Upgrade procedures
• Grid-Operator-On-Duty (GOOD)
• Service Level Agreement (SLA)

Page 3: SEE-GRID-SCI Operations Procedures and Tools

Operational & monitoring tools

[Diagram of the operational and monitoring tools: HGSM, Helpdesk, BDII, R-GMA, SAM, GStat (Taiwan), VOMS, BBmSAM, Accounting, Nagios]

Page 4: SEE-GRID-SCI Operations Procedures and Tools

HGSM/GOCDB (1)

Page 5: SEE-GRID-SCI Operations Procedures and Tools

HGSM/GOCDB (2)

Page 6: SEE-GRID-SCI Operations Procedures and Tools

HGSM/GOCDB (3)

• Static database containing all relevant data about all SEE-GRID and AEGIS sites
• Must be kept synchronized with the real situation
    • All sheets must be properly updated
        • Site Info
        • Contacts
        • Site Nodes
        • Downtimes
• XML dumps: the easiest way to apply changes is to download an XML dump of the data, edit it appropriately, and then upload the new XML file; this also allows keeping backups
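The local part of the XML-dump workflow (keep a backup, then edit the file) could be sketched roughly as follows. The element layout used here (`<sites>`, `<site name="...">`, one child element per field) and the function name are hypothetical; the real HGSM dump format is not shown on the slides and should be checked before adapting this. Download and upload are left to the HGSM web interface.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path


def backup_and_edit(dump_path, site_name, field, new_value, backup_dir):
    """Back up an XML dump, then set one field of one site in it.

    Assumed (hypothetical) layout:
    <sites><site name="..."><field>value</field></site></sites>
    """
    dump_path = Path(dump_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Keep a timestamped copy of the untouched dump before changing anything.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup_path = backup_dir / f"{dump_path.stem}-{stamp}{dump_path.suffix}"
    backup_path.write_bytes(dump_path.read_bytes())

    tree = ET.parse(str(dump_path))
    for site in tree.getroot().iter("site"):
        if site.get("name") == site_name:
            element = site.find(field)
            if element is None:
                # Create the field if the dump does not contain it yet.
                element = ET.SubElement(site, field)
            element.text = new_value
            break
    else:
        raise KeyError(f"site {site_name!r} not found in {dump_path}")

    # Overwrite the working copy; the backup stays untouched.
    tree.write(str(dump_path), encoding="utf-8", xml_declaration=True)
    return backup_path
```

The edited file is what would then be uploaded, while the timestamped copies in the backup directory give a simple history of earlier states.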

Page 7: SEE-GRID-SCI Operations Procedures and Tools

HGSM/GOCDB (4)

• The essential fields in HGSM:
    • GIIS URL
    • Monitoring: Yes
    • Status: certified
    • Type: seegrid_production, seegrid_certified, egee_production
    • Site Commitments
• Contacts and administrators
• All fields have to have correct values!
• URL: https://hgsm.grid.org.tr/
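A site admin could run a quick self-check of these fields before saving changes. The sketch below is only an illustration: the dictionary keys (`giis_url`, `monitoring`, `status`, `type`, `contacts`) are hypothetical names, not the actual HGSM field identifiers, and it assumes the values listed above are the expected ones for a site in production.

```python
# Site types named on the slide.
ALLOWED_TYPES = {"seegrid_production", "seegrid_certified", "egee_production"}


def check_site_record(record):
    """Return a list of problems found in a site record (empty list means OK)."""
    problems = []
    if not str(record.get("giis_url", "")).strip():
        problems.append("GIIS URL is missing")
    if record.get("monitoring") != "Yes":
        problems.append("Monitoring should be set to Yes")
    if record.get("status") != "certified":
        problems.append("Status should be certified")
    if record.get("type") not in ALLOWED_TYPES:
        problems.append("Type is not one of the known site types")
    if not record.get("contacts"):
        problems.append("No contacts or administrators are listed")
    return problems


example = {
    "giis_url": "ldap://bdii.example.org:2170/mds-vo-name=EXAMPLE-SITE,o=grid",
    "monitoring": "Yes",
    "status": "certified",
    "type": "seegrid_production",
    "contacts": ["admin@example.org"],
}
print(check_site_record(example))
```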

Page 8: SEE-GRID-SCI Operations Procedures and Tools

Helpdesk/GGUS (1)

Page 9: SEE-GRID-SCI Operations Procedures and Tools

Helpdesk/GGUS (2)

Page 10: SEE-GRID-SCI Operations Procedures and Tools

Helpdesk/GGUS (3)

• Central reference point for tracking all operational and user problems
• Identified problems are reported through the Helpdesk and assigned to the appropriate supporters
• If problems cannot be solved within the SEE-GRID community, they are propagated to other projects/initiatives/support systems (e.g. GGUS)
• URL: https://helpdesk.see-grid.eu/

Page 11: SEE-GRID-SCI Operations Procedures and Tools

BBmSAM/SAM

Page 12: SEE-GRID-SCI Operations Procedures and Tools

BBmSAM History

Page 13: SEE-GRID-SCI Operations Procedures and Tools

BBmSAM

• Portal that provides access to the database of SAM test results
• Central tool for identification of operational problems
• Should be checked by each site admin on a daily basis
• Should be used to troubleshoot problems
• Also provides SLA figures
• URL: https://c01.grid.etfbl.net/

Page 14: SEE-GRID-SCI Operations Procedures and Tools

GStat (1)

Page 15: SEE-GRID-SCI Operations Procedures and Tools

GStat (2)

• Central tool for monitoring of the information system of the SEE-GRID infrastructure
• Provides useful data
• Identifies problems with sites
• Should be checked by each site admin on a daily basis and used for troubleshooting
    • Useful ldapsearch commands can be found on GStat pages!
• URL: http://goc.grid.sinica.edu.tw/gstat/seegrid/
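As an illustration of the kind of ldapsearch command meant here, the sketch below only builds the argument list for a query against a BDII; it does not run it. The port 2170 and the base DN pattern `mds-vo-name=<site>,o=grid` are assumptions based on common gLite BDII conventions, and the host and site names are placeholders. The commands shown on the GStat pages remain the authoritative ones for a given site.

```python
import shlex


def bdii_query_command(host, site_name=None, port=2170,
                       ldap_filter="(objectClass=*)", attributes=()):
    """Build an ldapsearch argument list for a site or top-level BDII.

    Assumptions: anonymous simple bind (-x), BDII listening on `port`,
    base DN 'mds-vo-name=<site>,o=grid' for a site BDII and 'o=grid'
    when no site name is given.
    """
    base = f"mds-vo-name={site_name},o=grid" if site_name else "o=grid"
    command = [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{host}:{port}",
        "-b", base,
        ldap_filter,
    ]
    # Optional list of attributes to return, placed after the filter.
    command.extend(attributes)
    return command


# Print a copy-and-paste form of the command for a placeholder site.
print(shlex.join(bdii_query_command("bdii.example.org", "EXAMPLE-SITE")))
```

The returned list can be passed to `subprocess.run` on a machine where the OpenLDAP client tools are installed.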

Page 16: SEE-GRID-SCI Operations Procedures and Tools

Nagios/CIC portal (1)

Page 17: SEE-GRID-SCI Operations Procedures and Tools

Nagios/CIC portal (2)

Page 18: SEE-GRID-SCI Operations Procedures and Tools

Nagios/CIC portal (3)

• Collection of alarms raised by various tools
• The aim is to integrate all the tools and make the life of site admins and infrastructure managers easier
• In the future, automatic creation of Helpdesk tickets will be implemented
• URL: https://portal.ipp.acad.bg:7443/seegridnagios/

Page 19: SEE-GRID-SCI Operations Procedures and Tools

Accounting portal (1)

•Accounting by site

•Accounting by countries and institutions

•Accounting by applications

Page 20: SEE-GRID-SCI Operations Procedures and Tools

EGEE Accounting portal

Page 21: SEE-GRID-SCI Operations Procedures and Tools

Accounting portal (2)

• Collects the accounting data from all SEE-GRID and AEGIS sites through the APEL accounting publisher developed by the project
• Provides aggregated accounting data by site, country, institution, application
• Each site must publish the accounting data properly
• URL: https://gserv1.ipp.acad.bg:8443/Welcome/

Page 22: SEE-GRID-SCI Operations Procedures and Tools

Downtime procedures

• Downtimes must be announced well in advance (1 week is a reasonable time)
    • There are always downtimes due to hardware and other failures that cannot be anticipated
• All downtimes must be entered properly in HGSM
    • That way they are not counted against the site’s availability
• In addition, all downtimes must be broadcast by e-mail to the GIM, APP and proper VO mailing lists
• Downtime should not exceed 10% of the total time (monthly, quarterly)
    • If it does, an explanation must be provided
    • If the explanation is not accepted by the project management, SA1 claims will be rejected
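To see how close a site is to the 10% limit, the declared downtimes can be summed over the reporting period. The following is a minimal sketch under stated assumptions: downtimes are given as (start, end) pairs, overlapping entries are merged so that no hour is counted twice, and the function name and the example dates are made up. The figures used by the project may be computed differently.

```python
from datetime import datetime, timedelta

DOWNTIME_LIMIT = 0.10  # 10% of the total time, as stated above


def downtime_fraction(downtimes, period_start, period_end):
    """Fraction of the period covered by the given (start, end) downtimes."""
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    # Clip every interval to the period, then walk through them in time order.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in downtimes
    )
    down = timedelta()
    covered_until = period_start
    for start, end in clipped:
        start = max(start, covered_until)  # skip the part already counted
        if end > start:
            down += end - start
            covered_until = end
    return down.total_seconds() / total


march = (datetime(2009, 3, 1), datetime(2009, 4, 1))
declared = [
    (datetime(2009, 3, 2), datetime(2009, 3, 4)),       # 48 h
    (datetime(2009, 3, 3, 12), datetime(2009, 3, 5)),   # overlaps, adds 24 h
]
fraction = downtime_fraction(declared, *march)
print(f"{fraction:.1%}", "over limit" if fraction > DOWNTIME_LIMIT else "within limit")
```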

Page 23: SEE-GRID-SCI Operations Procedures and Tools

Upgrade procedures

• All upgrades/updates are announced over the GIM list
• The broadcasts contain links to further instructions for upgrades for each Grid service
    • Site admins should carefully examine them before performing the update!
• In addition, possible SEE-GRID-specific instructions are given in the e-mail
• For especially important updates/changes, tickets are created for each site
• For some upgrades/updates to be performed, downtimes may be required
• OS updates must be regularly installed, to minimize security risks

Page 24: SEE-GRID-SCI Operations Procedures and Tools

Grid-Operator-On-Duty (GOOD)

• Rotating shifts on a weekly basis
• Each country’s GIM is responsible for monitoring sites during his/her shift
• Tickets are submitted to sites with problems, according to the status of sites in various monitoring tools (BBmSAM, GStat, Nagios, Accounting portal, etc.)
• Older tickets that are not resolved are escalated
• Support is given to sites that cannot resolve earlier identified operational problems
• User tickets are assigned to the appropriate supporters
• Wiki documentation is updated, or new wiki pages created if necessary
• URLs:
    • http://wiki.egee-see.org/index.php/SG_GOOD
    • http://wiki.egee-see.org/index.php/SG_Helpdesk_tickets

Page 25: SEE-GRID-SCI Operations Procedures and Tools

Usual problems and links to (possible) solutions

• BDII
    • siteBDII (GIIS) or top-level BDII is Unreachable
      http://faq.twgrid.org/faq/index.php?action=artikel&cat=14&id=11&artlang=en
    • No info published
      http://goc.grid.sinica.edu.tw/gocwiki/No_data_published_by_top_level_BDII
• CA
    • CA version test failed with error message: This CA is an old one and time allowed to upgrade is over
      http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html
• CE (Computing Element)
    • Job submission failed with error message: Brokerhelper: Cannot plan. No compatible resources
      http://goc.grid.sinica.edu.tw/gocwiki/Brokerhelper%3A_Cannot_plan._No_compatible_resources
    • Job submission failed with error message: Got a job held event, reason: Unspecified gridmanager error
      http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error
    • Job submission failed with error message: Cannot read JobWrapper output, both from Condor and from Maradona
      http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output%2e%2e%2e
    • Job submission failed with error message: 7 authentication failed
      http://goc.grid.sinica.edu.tw/gocwiki/7_authentication_failed
    • Job submission failed with error message: 10 data transfer to the server failed
      http://goc.grid.sinica.edu.tw/gocwiki/10_data_transfer_to_the_server_failed
    • 4444 Waiting jobs in the GRIS
      http://goc.grid.sinica.edu.tw/gocwiki/4444_Waiting_jobs_in_the_GRIS
• SE (Storage Element)
    • File copy and registration failed with error message: 535 535-FTPD GSSAPI error: GSS Major Status: General failure
      http://goc.grid.sinica.edu.tw/gocwiki/535_535-FTPD_GSSAPI_error%3A_GSS_Major_Status%3A_General_failure

Page 26: SEE-GRID-SCI Operations Procedures and Tools

Service Level Agreement (SLA)

• Old URL: http://wiki.egee-see.org/index.php/SG_SLA
• The current SLA differs from the old one in that the required availability is 80%, and availability is calculated on a 3-hour basis, not on a daily basis
• BBmSAM portal provides SLA figures
• Sites not fully conforming to the SLA will have reduced funding
• Sites with availability <50% will be uncertified
• Sites fully conforming to the SLA will be put into seegrid_certified status and become visible to the whole SEE region (i.e. not only SEE-GRID, but also EGEE-SEE etc.)
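The availability thresholds above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the BBmSAM algorithm: availability is taken as the share of 3-hour slots in which the site passed its tests, only availability is considered (the SLA may contain other conditions), exactly 80% is treated as meeting the requirement, and the outcome labels are made up for the example.

```python
REQUIRED_AVAILABILITY = 0.80   # required availability from the SLA
UNCERTIFY_BELOW = 0.50         # sites below this are uncertified
SLOTS_PER_DAY = 24 // 3        # availability is evaluated in 3-hour slots


def availability(slot_results):
    """Share of 3-hour slots in which the site was available.

    slot_results: sequence of booleans, one per 3-hour slot.
    """
    if not slot_results:
        raise ValueError("no slots to evaluate")
    return sum(1 for ok in slot_results if ok) / len(slot_results)


def sla_outcome(value):
    """Map an availability value to the consequence listed above."""
    if value < UNCERTIFY_BELOW:
        return "uncertified"
    if value < REQUIRED_AVAILABILITY:
        return "reduced funding"
    return "conforming"


# A 30-day month has 240 slots; here the site failed 40 of them.
slots = [True] * 200 + [False] * 40
value = availability(slots)
print(f"{value:.1%} -> {sla_outcome(value)}")
```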