CMSItalia 12-14 Feb 2007

Tutorial
Dataflow/workflow with real data through Tiers
N. De Filippis
Department of Physics and INFN Bari
Outline
- Computing facilities in the control room, at Tier-0 and at Central Analysis Facilities (CAF): e.g. the Tracker Analysis Centre (TAC)
- Local storage and automatic processing at the TAC
  - how to register files in DBS/DLS
- Automatic data shipping and remote processing at Tier-1/Tier-2
  - injection in PhEDEx for the transfer
- Re-reconstruction and skimming with ProdAgent
- Data analysis in a distributed environment via CRAB
- Simulation of cosmics at a Tier-2 site
What is expected in the CMS Computing Model
Dataflow/workflow from Point 5 to Tiers:
- DAQ + Filter Farm
- disk storage (temporary, before transfer to CASTOR)
- CASTOR / publishing in DBS/DLS
- local storage and reconstruction
- DQM / visualization
- shipping to Tier-1 / Tier-2
- re-reconstruction and skimming
- end-user analysis
The CAF will support:
- diagnostics of detector problems, trigger performance services
- derivation of calibration and alignment data
- reconstruction services, interactive and batch analysis facilities

Most of these tasks have to be performed in remote Tier sites in a distributed environment.
Computing facilities in the control room, at Tier-0 and at Central Analysis Facilities
Example of a facility for Tracker
• The TAC is a dedicated Tracker control room
  – to serve the needs of collecting and analysing the data from the 25% Tracker test at the Tracker Integration Facility (TIF)
  – in use since Oct. 1st 2006 by DAQ and detector people
• Computing elements at the TAC:
  – 1 disk server: CMSTKSTORAGE
  – 1 DB server: CMSTIBDB
  – 1 wireless/wired router
  – 12 PCs:
    • 2 DAQ (CMSTAC02 and CMSTAC02)
    • 3 DQM, 1 visualization (CMSTKMON, CMSTAC04 and CMSTAC05)
    • 2 TIB/TID (CMSTAC00 and CMSTAC01)
    • 3 DCS (PCCMSTRDCS10, PCCMSTRDCS11 and PCCMSTRDCS12)
    • 2 TEC+ (CMSTAC06 and CMSTAC07) + 1 private PC
The TAC is like a control room + Tier-0 + CAF "in miniature".
Local storage and processing at TAC
• A dedicated PC (CMSTKSTORAGE) is devoted to storing the data temporarily:
  – it currently has 2.8 TB of local fast disk (no redundancy)
  – it allows local caching for about 10 days of data taking (300 GB/day expected for the 25% test)
• CMSTKSTORAGE is also used to perform the following tasks:
  a) run o2o for connection and pedestal runs to fill the offline DB
  b) convert RU files into EDM-compliant formats
  c) write files to CASTOR when ready
     Areas in CASTOR created under …/store/…:
     • /castor/cern.ch/cms/store/TAC/PIXEL
     • /castor/cern.ch/cms/store/TAC/TIB
     • /castor/cern.ch/cms/store/TAC/TOB
     • /castor/cern.ch/cms/store/TAC/TEC
  d) register files in the Data Bookkeeping Service (DBS) and Data Location Service (DLS)
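The LFN registered in DBS and the CASTOR path above differ only by a prefix; a minimal shell sketch of the mapping, assuming the /castor/cern.ch/cms prefix shown above:

```shell
# Hedged sketch: an LFN under /store/... maps to its CASTOR path by
# prepending the /castor/cern.ch/cms prefix listed above.
lfn=/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root
pfn="/castor/cern.ch/cms${lfn}"
echo "$pfn"
```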
How to register files in DBS/DLS (1)
A grid certificate with CMS Role=Production is needed:
voms-proxy-init -voms cms:/cms/Role=production
DBS and DLS API
cvs co -r DBS_0_0_3a DBS
cvs co -r DLS_0_1_2 DLS
One DBS and DLS instance: please use
MCLocal_4/Writer for DBS
prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4 for DLS
The following info about your EDM-compliant file is needed (one processed dataset per run):
--PrimaryDataset=TAC-TIB-120-DAQ-EDM
--ProcessedDataset=CMSSW_1_2_0-RAW-Run-0000505
--DataTier=RAW
--LFN=/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root
--Size=205347982
--TotalEvents=3707
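Following the naming pattern of the example above ("one processed dataset per run"), composing the per-run dataset names can be sketched in shell; the zero-padding width is inferred from the example, so treat it as an assumption:

```shell
# Hedged sketch: building the per-run processed-dataset name following
# the pattern of the example above (padding width is an assumption).
run=505
primdataset=TAC-TIB-120-DAQ-EDM
procdataset=$(printf 'CMSSW_1_2_0-RAW-Run-%07d' "$run")
echo "/$primdataset/$procdataset"
```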
How to register files in DBS/DLS (2)

-- GUID=38ACFC35-06B0-DB11-B463 (extracted with EdmFileUtil -u file:file.root)
-- CheckSum=4264158233 (extracted with the cksum command)
-- CMSSWVersion=CMSSW_1_2_0
-- ApplicationName=FUEventProcess
-- ApplicationFamily=Online
-- PSetHash=4cff1ae0-1565-43f8-b1e9-82ee0793cc8c (extracted with uuidgen)
Run the script for the registration in DBS:
python dbsCgiCHWriter.py --DBSInstance=MCLocal_4/Writer \
  --DBSURL="http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery" \
  --PrimaryDataset=$primdataset --ProcessedDataset=$procdataset \
  --DataTier=RAW --LFN=$lfn --Size=$size --TotalEvents=$nevts \
  --GUID=$guid --CheckSum=$cksum --CMSSWVersion=CMSSW_1_2_0 \
  --ApplicationName=FUEventProcess --ApplicationFamily=Online \
  --PSetHash=$psethash
Closure of blocks in DBS:
python closeDBSFileBlock.py --DBSAddress=MCLocal_4/Writer --datasetPath=$dataset
The two scripts dbsCgiCHWriter.py and closeDBSFileBlock.py can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/
How to register files in DBS/DLS (3)
Run the script for the registration of blocks of files in DLS:
python dbsread.py --datasetPath=$dataset
or, for each block of files:
dls-add -i DLS_TYPE_LFC -e prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4/TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000505#497a013d-3b49-43ad-a80f-dbc590e593d7 srm.cern.ch
where srm.cern.ch is the name of the SE.
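A hedged dry-run sketch that builds one dls-add command per file block, reusing the endpoint and SE above (drop the echo to actually run the commands):

```shell
# Hedged dry-run sketch: one dls-add command per block, using the
# DLS endpoint and SE from the slide. Echo only; no registration done.
DLS_EP=prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4
SE=srm.cern.ch
for block in \
  'TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000505#497a013d-3b49-43ad-a80f-dbc590e593d7'
do
    cmd="dls-add -i DLS_TYPE_LFC -e $DLS_EP/$block $SE"
    echo "$cmd"
done
```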
Data registered in DBS
Tracker data
MTCC data
Results in Data discovery page:
http://cmsdbs.cern.ch/discovery/expert
Automatic data shipping and remote processing at Tier-1/Tier-2
PhEDEx injection (1)

- Data published in DBS and DLS are ready to be transferred via the official CMS data movement tool, PhEDEx.
- The injection, i.e. the procedure that writes into the PhEDEx transfer database, in principle has to be run from CERN, where the data are collected, but it can also be run at a remote Tier-1/Tier-2 site hosting PhEDEx.
- At Bari it runs via an official PhEDEx agent and a component of ProdAgent modified to "close" blocks at the end of the transfer, in order to enable automatic publishing in DLS (the same procedure used for Monte Carlo data).
- Complete automation is reached with a script that watches for new tracker-related entries in DBS/DLS.
- Once data are injected into PhEDEx, any Tier-1 or Tier-2 can subscribe to them.
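The watcher script mentioned above might be sketched as follows; the local files stand in for real DBS/DLS queries, so this is only a dry-run illustration:

```shell
# Hypothetical dry-run sketch of the watcher idea: compare the current
# list of tracker datasets (a local file standing in for a DBS query)
# against those already injected, and print one inject command per new
# entry. File names are illustrative.
printf '%s\n' /TAC-TIB-120-DAQ-EDM/RAW > current.txt   # stand-in for a DBS query
: > injected.txt                                       # nothing injected yet
sort current.txt > cur.sorted
sort injected.txt > seen.sorted
new=$(comm -23 cur.sorted seen.sorted)                 # datasets not yet injected
for ds in $new; do
    echo python dbsinjectTMDB.py --datasetPath="$ds" --injectdir=logs/
done
cat current.txt >> injected.txt                        # remember what was injected
```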
PhEDEx injection (2)
ProdAgent_v0XX is needed: configure PA to use the PhEDEx dropbox /dir/state/inject-tmdb/inbox
prodAgent-edit-config --component=PhEDExInterface --parameter=PhEDExDropBox --value=/dropboxdir/
start the PhEDExInterface component of PA:
prodAgentd --start --component=PhEDExInterface
PhEDEx_2.4 is needed: configure the inject-tmdb agent in your Config file:

### AGENT LABEL=inject-tmdb PROGRAM=Toolkit/DropBox/DropTMDBPublisher
 -db ${PHEDEX_DBPARAM}
 -node TX_NON_EXISTENT_NODE

Start the inject-tmdb agent of PhEDEx:
./Master -config Config start inject-tmdb
PhEDEx injection (3)

For each datasetpath of a run:
python dbsinjectTMDB.py --datasetPath=$dataset --injectdir=logs/
(The script dbsinjectTMDB.py is in /afs/cern.ch/user/n/ndefilip/public/Registration.)
In the log of PhEDEx you will find messages like the following:
2007-01-31 07:55:05: TMDBInject[18582]: (re)connecting to database
Connecting to database
Reading file information from /home1/prodagent/state/inject-tmdb/work/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869.xml
Processing dbs http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery?instance=MCLocal_4/Writer (204)
Processing dataset /TAC-TIB-120-DAQ-EDM/RAW (1364)
Processing block /TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000520#353a3ae2-30a0-4f30-86df-e08ba9ac6869 (7634)
:+/ 1 new files, 1 new replicas PTB R C
2007-01-31 07:55:08: DropTMDBPublisher[5828]: stats: _TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09 3.04r 0.18u 0.08s success
Results in PhEDEx page:
http://cmsdoc.cern.ch/cms/aprom/phedex/prod/Data::Replicas?filter=TAC-T;view=global;dexp=1364;rows=;node=6;node=19;node=44;nvalue=Node%20files#d1364
http://cmsdoc.cern.ch/cms/aprom/phedex
"Official" reconstruction/skimming (1)

- Goal: to run reconstruction of raw data in a standard and official way, typically using code from a CMSSW release (no prerelease, no user patch).
- The ProdAgent tool was evaluated to perform reconstruction with the same procedures as for Monte Carlo samples.
- ProdAgent can be run anywhere, but preferably at a Tier-1/Tier-2.
- Running with ProdAgent ensures that RECO data are automatically registered in DBS and DLS, ready to be shipped to Tier-1 and Tier-2 sites and analysed via the computing tools.
- In the near future the standard reconstruction, calibration and alignment tasks will run on Central Analysis Facility (CAF) machines at CERN, as expected in the Computing Model.
“Official” reconstruction/skimming (2)
- Input data are processed run by run and new processed datasets are created as output, one for each run.
- ProdAgent uses the DatasetInjector component to be aware of the input files to be processed.
- The workflow file needs to be created from the cfg for reconstruction; the following example is for DIGI-RECO processing starting from GEN-SIM input files.
- No pileup, StartUp and LowLumi pileup can be set for the digitization.
- Splitting of input files can be done either by event or by file.
"Official" reconstruction/skimming (3)

Creating the workflow file for the no-pileup case:
python $PRODAGENT_ROOT/util/createProcessingWorkflow.py \
  --dataset=/TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530 \
  --cfg=DIGI-RECO-NoPU-OnSel.cfg --version=CMSSW_1_2_0 --category=mc \
  --dbs-address=MCLocal_4/Writer \
  --dbs-url=http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery \
  --dls-type=DLS_TYPE_DLI --dls-address=lfc-cms-test.cern.ch/grid/cms/DLS/MCLocal_4 \
  --same-primary-dataset --only-closed-blocks --fake-hash \
  --split-type=event --split-size=1000 \
  --pileup-files-per-job=1 \
  --pileup-dataset=/mc-csa06-111-minbias/GEN/CMSSW_1_1_1-GEN-SIM-1164410273 \
  --name=TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU

Submitting jobs:
python PRODAGENT/test/python/IntTests/InjectTestSkimLCG.py \
  --workflow=/yourpath/TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU-Workflow.xml \
  --njobs=300
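Since input data are processed run by run, a wrapper over the command above might look like this dry-run sketch (run numbers and the elided options are illustrative; the dataset and name patterns follow the example above):

```shell
# Hedged dry-run sketch: one workflow per run, echoing the commands
# instead of running them. Run numbers and "..." are placeholders.
for run in 0000505 0000530; do
    ds="/TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-$run"
    wf="TAC-TIB-120-DAQ-EDM-Run-$run-DIGI-RECO-NoPU"
    echo "createProcessingWorkflow.py --dataset=$ds ... --name=$wf"
done
```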
Data analysis via CRAB at Tiers (1)
• Data published in DBS/DLS can be processed remotely via CRAB, using the distributed-environment tools.
• Users have to edit crab.cfg and insert the dataset path of the run to be analysed, as obtained from DBS.
• Users have to provide their CMSSW cfg, set up the environment and compile their code via scramv1.
• The offline DB accessed via Frontier at Tier-1/2 was already tested during CSA06 with alignment data.
• An example cfg to perform the reconstruction chain starting from raw data can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/TACAnalysis_Run2048.cfg
• Thanks to D. Giordano for the support.
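The crab.cfg edit mentioned above might look like the following illustrative fragment; section and parameter names follow the CRAB configuration of that era, and all values are examples, not taken from the slides:

```ini
; Illustrative crab.cfg fragment (values are examples)
[CRAB]
jobtype   = cmssw
scheduler = edg

[CMSSW]
datasetpath            = /TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530
pset                   = TACAnalysis_Run2048.cfg
total_number_of_events = -1
number_of_jobs         = 10

[USER]
return_data = 1
```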
Data analysis via CRAB at Tiers (2)
• The piece of the cfg useful to access the offline DB via Frontier.
• The output files produced with CRAB are not registered in DBS/DLS (but the code to do so is under development…).
• Further details about CRAB are given in the tutorial of F. Fanzago.
“Official” Cosmics simulation (1)
Goal: to make a standard simulation of cosmics with official code in a CMSSW release (no patch, no prereleases). CMSSW_1_2_2 is needed.

Download AnalysisExamples/SiStripDetectorPerformance:
cvs co -r CMSSW_1_2_2 AnalysisExamples/SiStripDetectorPerformance

Complete geometry of CMS, no magnetic field, a cosmic filter implemented to select muons triggered by the scintillators:
AnalysisExamples/SiStripDetectorPerformance/src/CosmicTIFFilter.cc

The configuration file is: AnalysisExamples/SiStripDetectorPerformance/test/cosmic_tif.cfg

Run interactively: cmsRun cosmic_tif.cfg
or by using ProdAgent to make large-scale and fully automated productions. Thanks to L. Fanò.
“Official” Cosmics simulation (2)
ProdAgent_v012:
create the workflow from the cfg file for GEN-SIM-DIGI:
python $PRODAGENT_ROOT/util/createProductionWorkflow.py --cfg /your/path/cosmic_tif.cfg --version CMSSW_1_2_0 --fake-hash
Warning: when using createPreProdWorkflow.py, the PoolOutputModule name in the cfg should comply with the conventions, reflecting the data tier the output file contains (i.e. GEN-SIM, GEN-SIM-DIGI, FEVT).
So download the modified cfg from /afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF.cfg
the workflow can be found in:
/afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF-Workflow.xml
Submit jobs via the standard ProdAgent scripts:
python $PRODAGENT_ROOT/test/python/IntTests/InjectTestLCG.py --workflow=/your/path/COSMIC_TIF-Workflow.xml --run=30000001 --nevts=10000 --njobs=100
Pros and cons

Advantages of the CMS computing approach:
- data are officially published and processed with official tools, so results are reproducible
- access to a large number of distributed resources
- profit from the experience of the computing teams

Cons:
- initial effort to learn the official computing tools
- possible problems at remote sites: storage issues, instability of grid components (RB, CE), etc.
- concurrence of analysis jobs and production jobs
- policy/prioritization to be set at remote sites
Conclusions
First real data registered in DBS/DLS are officially available to the CMS community
Data are moved between sites and published by using official tools
Reconstruction, re-reconstruction and skimming could be “standardized” using ProdAgent
Data analysis is performed by using CRAB
Cosmic simulation for detector communities can be officially addressed
Many thanks to the people of the TAC team (Fabrizio, Giuseppe, Domenico, Livio, Tommaso, Subir, …)