ATLAS Distributed Analysis

DISTRIBUTED ANALYSIS JOBS WITH THE ATLAS PRODUCTION SYSTEM

S. González ([email protected]), D. Liko ([email protected]), A. Nairz ([email protected]), G. Mair ([email protected]), F. Orellana ([email protected]), L. Goossens ([email protected])

CERN, European Organization for Nuclear Research, 1211 Genève 23, Switzerland

Silvia Resconi ([email protected])
INFN-Istituto Nazionale di Fisica Nucleare, Sezione di Milano, Italy

Alessandro De Salvo ([email protected])
INFN-Istituto Nazionale di Fisica Nucleare, Sezione di Roma, Italy

Experience Running the Analysis

- The analysis has been run with our own supervisor and Lexor/CondorG instance.
- Delays due to data transfer are no longer an issue: the AOD input is available on-site and jobs are sent only to the sites holding the data.
- The system setup is not yet able to support long queues (simulation) and short queues (analysis) in parallel:
  - queues are filled with simulation jobs
  - long pending times for analysis jobs
- The analysis has been launched over 350,000 events for the ttbar reconstruction and over 50,000 events for the SUSY case.
- The W->jj and Z reconstructed masses, after merging the histogram files, were produced through the ATLAS production system ("à la Grid") with jobs of 100 input files each.
- With free resources, the system was able to process 10k-event jobs in about 10 minutes (total).

Two datasets were used:
- 50 events per file, for a total of 7,000 files in the first dataset and 1,000 in the second.
- 80 jobs (70 for the first dataset and 10 for the second), each with 100 input files, were defined with AtCom.
- These 80 jobs were submitted with Eowyn-Lexor/CondorG and ran at several LCG sites. Each job produced three output files (n-tuple, histogram and log) stored at Castor (CERN, IFIC-Valencia, Taiwan).
- ROOT was used to merge the histogram output files in a post-processing step, as sketched below.
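As an illustration of that post-processing step, here is a minimal sketch driving ROOT's hadd utility from Python; the directory layout and file names are assumptions, not the actual production paths.

```python
# Minimal sketch of the post-processing merge step, assuming ROOT's hadd
# utility is on the PATH and the per-job histogram files have already been
# copied from Castor to a local directory. Paths and names are illustrative.
import glob
import subprocess

histogram_files = sorted(glob.glob("retrieved/*.hist.root"))  # one file per analysis job

# hadd sums all histograms with the same name/path into a single output file;
# -f overwrites an existing output file.
subprocess.run(["hadd", "-f", "merged.hist.root", *histogram_files], check=True)
```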

Introduction

ATLAS is a detector for the study of high-energy proton-proton collisions. Its offline computing will have to deal with an output event rate of 200 Hz, i.e. about 10⁹ events per year, with an average event size of 1.6 MB.

In 2002 ATLAS computing planned a first series of Data Challenges (DCs) in order to validate its:

- Computing Model
- Software
- Data Model

The ATLAS collaboration decided to perform the DCs using the Grid middleware developed in several Grid projects (Grid flavours) like:

- LHC Computing Grid project (LCG), to which CERN is committed

- GRID3/OSG
- NorduGRID

Storage:
- Raw recording rate: 320 MBytes/sec
- Accumulating at 5-8 PetaBytes/year
- 20 PetaBytes of disk
- 10 PetaBytes of tape

Processing:
- 40,000 of today's fastest PCs
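A back-of-envelope check of how these numbers fit together (a sketch for orientation, not official computing-model figures):

```python
# Back-of-envelope check of the rates quoted above.
event_rate_hz = 200        # offline output event rate
event_size_mb = 1.6        # average event size in MB

raw_rate = event_rate_hz * event_size_mb
print(raw_rate)            # 320 MB/s, matching the quoted raw recording rate

events_per_year = 1e9
raw_volume_pb = events_per_year * event_size_mb / 1e9   # MB -> PB (decimal)
print(raw_volume_pb)       # ~1.6 PB/year of raw data alone; the quoted
                           # 5-8 PB/year presumably also covers simulated
                           # and derived data
```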

http://uimon.cern.ch/twiki/bin/view/Atlas/DA-Activities

International Conference on Computing in High Energy and Nuclear Physics

13-17 February 2006, Mumbai, India

ATLAS Production System

In order to handle the task of the ATLAS Data Challenges, an automated production system was designed. The ATLAS production system consists of four components:

- the production database, which contains the abstract job definitions;
- the Eowyn supervisor, which reads job definitions from the production database and presents them to the different Grid executors in an easy-to-parse XML format;
- the executors, one for each Grid flavour, which receive the job definitions in XML format and convert them to the job description language of that particular Grid;
- Don Quijote II, the ATLAS Distributed Data Management system (DDM), which moves files from their temporary output locations to their final destination on some Storage Element and registers them in the Replica Location Service of that Grid.

The ATLAS production system has been successfully used to run production of ATLAS jobs at an unprecedented scale: on successful days, more than 10,000 jobs were processed by the system. The experience gained operating the system, which spans several Grid flavours, is considered essential also for performing analysis using Grid resources.

[Diagram of the production system components: Don Quijote II, Eowyn, CondorG, Panda/OSG.]
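To make the executor step concrete, the sketch below shows the kind of translation an executor performs: it takes a (much simplified) XML job definition, as handed over by the supervisor, and emits a JDL-like description for one particular Grid. The XML tags and JDL attributes are illustrative placeholders, not the real ProdSys schema or the exact JDL produced by Lexor or CondorG.

```python
# Sketch of the executor step: translate a supervisor-provided XML job
# definition into the job description language of one particular Grid.
# Tag names, attribute names and values are simplified assumptions.
import xml.etree.ElementTree as ET

job_xml = """
<job id="12345">
  <transformation>generic.analysis.trf</transformation>
  <parameters>maxEvents=5000 inputType=AOD</parameters>
  <inputfile>aod.dataset1._00001.pool.root</inputfile>
  <inputfile>aod.dataset1._00002.pool.root</inputfile>
  <outputfile>user.histo.12345.root</outputfile>
</job>
"""

def to_jdl(xml_text):
    """Convert the simplified XML job definition into a JDL-like string."""
    job = ET.fromstring(xml_text)
    inputs = ", ".join('"lfn:%s"' % e.text for e in job.findall("inputfile"))
    return "\n".join([
        'Executable = "%s";' % job.findtext("transformation"),
        'Arguments  = "%s";' % job.findtext("parameters"),
        'InputData  = {%s};' % inputs,
        'StdOutput  = "job.out";',
        'StdError   = "job.err";',
    ])

print(to_jdl(job_xml))
```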

Setup for Distributed Analysis


Using the latest version of the Production System:
- Supervisor: Eowyn
- Executors: Condor-G, Lexor (currently under testing)
- Data Management: plain LCG tools, LFC catalog; DQ2 currently being integrated
- Development database (devdb); a dedicated DA database is already in place

A generic analysis transformation has been created that:
- compiles the user code/package on the worker node
- processes Analysis Object Data (AOD) input files
- produces a histogram and an n-tuple file as outputs
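A rough sketch of what this transformation does once a job lands on a worker node; the real transformation is an ATLAS job-transform script, and the commands, the job-options file name and the package layout below are illustrative assumptions.

```python
# Sketch of the generic analysis transformation on the worker node.
# Commands, paths and the job-options file name are illustrative only.
import subprocess
import sys

user_tarball = sys.argv[1]    # user analysis package shipped with the job
aod_inputs = sys.argv[2:]     # AOD files already available at the site

# 1. Unpack the user package and build it against the locally installed release
subprocess.run(["tar", "xzf", user_tarball], check=True)
subprocess.run(["cmt", "broadcast", "gmake"], cwd="UserAnalysis/cmt", check=True)

# 2. Run Athena over the AOD inputs (job-options file name is hypothetical)
with open("input_collection.txt", "w") as f:
    f.write("\n".join(aod_inputs))
subprocess.run(["athena.py", "AnalysisSkeleton_jobOptions.py"], check=True)

# 3. The histogram and n-tuple files left in the working directory are then
#    picked up by the production system and stored on a Storage Element.
```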

User Interface: AtCom4

The ATLAS Commander (AtCom) is used as the graphical user interface. It is currently used for task and job definitions:

- task: contains summary information about the jobs to be run (input/output datasets, transformation parameters, resource and environment requirements, etc.)
- job: the concrete parameters needed for running, but no Grid specifics
- both follow the ProdDB schema and XML description (see the task-to-job sketch below)
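A minimal sketch of this task-to-job expansion, using illustrative field names rather than the actual ProdDB schema; with 7,000 input files and 100 files per job it reproduces the 70 jobs quoted above for the first dataset.

```python
# Sketch of the task -> job expansion done through AtCom: a task carries the
# summary information, each job gets concrete parameters but nothing
# Grid-specific. Field names and values are illustrative assumptions.
task = {
    "input_dataset": "ttbar.AOD.dataset1",
    "output_dataset": "user.ttbar.histos",
    "transformation": "generic.analysis.trf",
    "files_per_job": 100,
    "requirements": {"input_on_site": True},
}

# 7000 input files of 50 events each, as for the first ttbar dataset above
input_files = ["ttbar.AOD.dataset1._%05d.pool.root" % i for i in range(1, 7001)]

def define_jobs(task, files):
    """Split the task's input file list into jobs of files_per_job files each."""
    n = task["files_per_job"]
    for job_id, start in enumerate(range(0, len(files), n), start=1):
        yield {
            "job_name": "%s._%05d" % (task["output_dataset"], job_id),
            "transformation": task["transformation"],
            "input_files": files[start:start + n],
            "output_files": ["%s._%05d.%s" % (task["output_dataset"], job_id, ext)
                             for ext in ("ntup.root", "hist.root", "log")],
        }

jobs = list(define_jobs(task, input_files))
print(len(jobs))   # 70 jobs of 100 input files each, matching the numbers above
```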

VO-based user authentication in AtCom:

- user certificate and proxy are used to enable DB access
- no proxy delegation, following the current ProdSys strategy (jobs still run under (a few) production managers' IDs)

AtCom communicates directly with the Oracle database and provides support for defining jobs and monitoring their status.

The two algorithms of choice have been a ttbar and a SUSY reconstruction algorithm, provided by the physics groups. For the ttbar reconstruction, the algorithm is already part of the release and was used as installed on the Grid sites. For the SUSY algorithm, the source code was transferred together with the job and the shared library was obtained by compilation on the worker node.