CERN - IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it

Monitoring the ATLAS Distributed Data Management System

Ricardo Rocha (CERN)
on behalf of the ARDA Dashboard team

CHEP2007, Victoria, Canada [email protected]

Outline

• Dashboard project
• ATLAS DDM system
• DDM Dashboard
• Monitoring for operators
• Monitoring for end users
• Conclusion

Dashboard Project

• Started inside the ARDA group of the EGEE/LCG project in 2005
  – Initially covering only job monitoring for CMS

• Evolved into a Python framework providing a set of flexible tools that allow coverage of other grid application areas

• The framework consists of a set of different components
  – Data access layer (DAO)
  – Service configuration (agents)
  – Web application
  – Command line tools
  – APIs

• Strong focus on allowing easy access to the information
  – HTTP query interface
  – Output in HTML (web interfaces), but also XML and CSV for integration with external tools

• Applications currently cover job monitoring (for all HEP experiments + VLEMED/Biomed), data management, site efficiency / reliability, and many others
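
The multi-format output mentioned above (HTML for browsers, XML and CSV for external tools) can be illustrated with a minimal sketch. This is not the Dashboard framework's actual code: the row fields, the `render` function and the format names are assumptions made for illustration, using only the Python standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical monitoring rows; the field names are illustrative only.
ROWS = [
    {"site": "SITE_A", "transfers_ok": 120, "transfers_failed": 3},
    {"site": "SITE_B", "transfers_ok": 45, "transfers_failed": 12},
]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Serialize the rows as XML, one <site> element per row."""
    root = ET.Element("sites")
    for row in rows:
        item = ET.SubElement(root, "site", name=row["site"])
        for key, value in row.items():
            if key != "site":
                ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# One query result, several renderers: the caller picks the format.
RENDERERS = {"csv": to_csv, "xml": to_xml}


def render(rows, fmt):
    """Render the same rows in the requested output format."""
    return RENDERERS[fmt](rows)
```

Keeping the query result separate from its rendering is what lets an external script ask for CSV while a browser gets HTML from the same data.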

ATLAS DDM System

• Distributing all the data in the ATLAS infrastructure
• Data is organized in datasets – collections of files
• Users issue subscriptions on these datasets
• Different agents take care of the several tasks required for the successful movement of the data
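
The dataset and subscription concepts above can be sketched as a small data model. The class and attribute names are hypothetical and do not reflect the DQ2 implementation; the sketch only shows that a subscription ties a dataset to a destination site and is complete once every file has arrived.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset is a named collection of files."""
    name: str
    files: list = field(default_factory=list)


@dataclass
class Subscription:
    """A request to have a dataset available at a given site (hypothetical model)."""
    dataset: Dataset
    site: str
    transferred: set = field(default_factory=set)

    def mark_done(self, filename):
        """Record that one file of the dataset has reached the site."""
        if filename not in self.dataset.files:
            raise ValueError(f"{filename} is not part of {self.dataset.name}")
        self.transferred.add(filename)

    @property
    def complete(self):
        """True once every file in the dataset has been transferred."""
        return self.transferred == set(self.dataset.files)
```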

ATLAS DDM System

• Each set of agents serves one site or a group of sites (typically sites related to each other, known as a cloud)

• Initial deployment involved the setup of these services on the VO boxes at each Tier1

• Debugging was extremely difficult (still is…)
  – Involved logging into each of the machines
  – And correlating this information…

• Accounting and monitoring virtually impossible

• A central point where all the information would be made available became vital

DDM Dashboard

• Main focus on ATLAS specific services (DQ2 system), receiving information from the different agents via HTTP callbacks
  – Transfer state changes
  – Dataset complete
  – Transfer complete
  – Transfer / registration errors

• But also on grid fabric services
  – Data management related services up and running
  – Storage space availability

• Data is put together in a structured way
  – Oracle database at CERN

• Different tools (agents) responsible for generating statistics and metrics
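
How a collector might consume the callbacks described above can be sketched as follows. This is an illustration under stated assumptions, not the Dashboard's code: the JSON payload shape, the state names and the in-memory dictionaries (standing in for the Oracle database) are all hypothetical.

```python
import json

# Hypothetical state names, loosely mirroring the callback types listed above.
ERROR_STATES = {"TRANSFER_ERROR", "REGISTRATION_ERROR"}


def apply_bulk_report(store, errors, body):
    """Apply one callback body (a JSON list of events) to in-memory state.

    store  maps (dataset, file) to the latest reported state
    errors collects every event whose state is an error state
    Returns the number of events processed.
    """
    events = json.loads(body)
    for event in events:
        key = (event["dataset"], event["file"])
        store[key] = event["state"]  # later events overwrite earlier ones
        if event["state"] in ERROR_STATES:
            errors.append(event)
    return len(events)
```

Keeping only the latest state per file, plus a separate list of errors, is one simple way to support both status views and error statistics from the same stream of reports.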

DDM Dashboard

• Serves different sets of use cases, coming from different types of users

• Site / system operators
  – “How is the overall system doing?”
  – “How is site X doing?”
  – “What is the most common error, and what is triggering it?”

• End users / production coordinators
  – “What is the status of this (set of) dataset subscription(s)?”
  – “When will the data become available?”
  – Essential to have real time information

• How much data? A lot!
  – Millions of file transfers, each reporting the different steps
    • An average week sees 2 million hits (over 90% are bulk reports with up to 500 items each)
  – Especially critical when systems misbehave (more errors)
  – A lot of work went into partitioning the data and optimizing the database and the web server setup (Apache)
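
To put the load figures above in perspective, 2 million hits per week is an average of roughly 3.3 requests per second, and bulk reporting is what keeps that number manageable. The sketch below shows the arithmetic and a simple way for a reporting agent to split its events into bulks of at most 500 items; the function names are hypothetical and the 500-item limit is taken from the figure quoted above.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800


def hits_per_second(weekly_hits):
    """Average request rate implied by a weekly hit count."""
    return weekly_hits / SECONDS_PER_WEEK


def chunk(items, size=500):
    """Yield successive bulks of at most `size` items from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 2,000,000 hits spread evenly over a week is about 3.3 hits per second.
average_rate = hits_per_second(2_000_000)
```

For example, an agent holding 1200 pending events would send three reports (500, 500 and 200 items) instead of 1200 individual requests.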

Monitoring for operators

• “How is the whole system performing?”

Monitoring for operators

• “What is wrong with site X?”
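
A question like this one typically reduces to ranking the errors reported for a single site. The sketch below is an assumption about how such a ranking could be computed, not the Dashboard's actual query; the event fields and error names are hypothetical.

```python
from collections import Counter


def top_errors(events, site, n=3):
    """Return the n most frequent error messages reported for one site."""
    counts = Counter(
        event["error"]
        for event in events
        if event["site"] == site and event.get("error")
    )
    return counts.most_common(n)


# Hypothetical events; successful transfers carry no error.
SAMPLE_EVENTS = [
    {"site": "SITE_X", "error": "TIMEOUT"},
    {"site": "SITE_X", "error": "TIMEOUT"},
    {"site": "SITE_X", "error": "NO_SPACE"},
    {"site": "SITE_Y", "error": "TIMEOUT"},
    {"site": "SITE_X", "error": None},
]
```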

Monitoring for operators

• “What files are causing error ‘…’?”

Monitoring for end users

• “I subscribed to dataset X on site Y. What is the status?”
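
The answer an end user needs can be summarized from the per-file states of the subscription. The following is a minimal sketch, assuming a hypothetical mapping of file names to state strings in which "DONE" marks a finished transfer; it is not the Dashboard's own status logic.

```python
def subscription_status(file_states):
    """Summarize a subscription from a {filename: state} mapping."""
    total = len(file_states)
    done = sum(1 for state in file_states.values() if state == "DONE")
    return {
        "total": total,
        "done": done,
        "percent": 100.0 * done / total if total else 0.0,
        "complete": total > 0 and done == total,
    }
```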

Conclusion

• Essential tool for all ATLAS operations

• Currently used by a large number of people, from site and system managers to end users

• Usage goes beyond the web interface
  – Data is queried by different external tools for automating operations (catalog cleanup, consistency checks, alarms and notifications, …)

• What is coming next
  – Integration with the site-specific monitoring tools
  – More alarms and notifications
  – Automated reaction to specific events
  – More focus on the end user

• Tier 0 export: http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site

• Production: http://dashb-atlas-data.cern.ch/dashboard/request.py/site

• Homepage: http://dashboard.cern.ch

• Contact: [email protected]