Distributed Analysis System in the ATLAS Experiment
Minsuk Kim, University of Alberta
24 Jun 2008
KISTI Seminar
Minsuk Kim (Univ. of Alberta)
Outline
Large Hadron Collider (LHC)
• ATLAS Experiment and Computing
Distributed Analysis Model in ATLAS
• Grid Infrastructure
Distributed Analysis Tools
• Ganga and Pathena
Conclusions
LHC at CERN
Proton-Proton Interaction
Extracting interesting physics from this massive data sample is a big challenge
Real-time data selection process reduces the event rate to 100~200 events/s (about 10^9 events/yr)
ATLAS Experiment
37 Countries
167 Institutes
~2000 Collaborators (Canada ~4%)
[Detector figure label: (Pixels, SCT, TRT)]
ATLAS Computing
2x10^9 events/yr and 1 event ~ 1.6 MB
ATLAS will record about 3.2 Petabytes of data per year (3.2 million GB), plus 2-3 times as much simulated data
This invites comparisons like “if we wrote one year’s data on DVDs it would make a stack roughly as high as the CN Tower (553 m)”
DVD thickness: 1.2 mm
DVD capacity: 8.5 GB (1-side, 2-layer)
3.2 PB / 8.5 GB = 376470 discs = 452 m
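The DVD comparison is simple arithmetic and can be re-derived from the figures quoted on the slide. The sketch below is only a check of those numbers; it assumes decimal units (1 PB = 10^6 GB = 10^9 MB), and the function names are illustrative, not part of any ATLAS tool.

```python
# Figures quoted on the slide
EVENTS_PER_YEAR = 2e9       # recorded events per year
EVENT_SIZE_MB = 1.6         # size of one event in MB
DVD_CAPACITY_GB = 8.5       # single-sided, dual-layer DVD
DVD_THICKNESS_MM = 1.2      # thickness of one disc


def yearly_volume_pb(events=EVENTS_PER_YEAR, event_size_mb=EVENT_SIZE_MB):
    """Data volume per year in petabytes (assumes 1 PB = 1e9 MB)."""
    return events * event_size_mb / 1e9


def dvd_stack(volume_pb, capacity_gb=DVD_CAPACITY_GB, thickness_mm=DVD_THICKNESS_MM):
    """Return (number of discs, stack height in metres) for a given volume."""
    discs = volume_pb * 1e6 / capacity_gb       # assumes 1 PB = 1e6 GB
    height_m = discs * thickness_mm / 1000.0    # mm -> m
    return discs, height_m


volume = yearly_volume_pb()
discs, height = dvd_stack(volume)
print(f"{volume:.1f} PB/yr -> {discs:.0f} discs -> {height:.0f} m")
```

The stack comes out somewhat below the 553 m of the CN Tower, which is why the slide says "roughly".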
LHC Computing Grid (LCG)
One massive computing centre not possible
Farm out data around the world using GRID technology
12 Tier-1 for raw processing
• 1 in Canada: TRIUMF
>100 Tier-2 for analysis
• 5 sites in Canada
  - West: UVic, SFU, Alberta
  - East: Toronto, McGill
[Map labels: UVic, SFU, UofA, UofT, McGill; network links of 5 Gbit/s and 1 Gbit/s between CERN, TRIUMF and ALBERTA]
Distributed Analysis Model in ATLAS
ATLAS Data Replication and Distribution
[Diagram labels: Event Filter, Tier-0, CERN Analysis Facility, Tier-1, Tier-2, Many Tier-3; 1st pass calibration, Reconstruction, 24h Data Export, Data Reprocessing, MC Production, User Analysis]
ATLAS Event Data Model
Refining the data by: Add higher level info, Skim, Thin, Slim
ATLAS Grid Infrastructure
Different kinds of grid environment, based on 3 grids
• WLCG/EGEE (Enabling Grids for E-sciencE)
• OSG (Open Science Grid)
• NG (NorduGrid)
Grids have differences in
• Middle-ware
• Replica catalogs to store data
• Software tools to submit jobs
However, these differences are hidden from the ATLAS user
Distributed Analysis Model
The distributed analysis model is based on the ATLAS computing model
• Data is distributed in Tier-1/Tier-2 facilities by default
  available 24/7
• User jobs are sent to the data
  large input datasets (100 GB up to several TB)
• Results must be made available to the user
  potentially already during processing
• Data is annotated with meta-data, with bookkeeping in catalogs
Distributed Analysis Model
Need for: Distributed Data Management (DDM)
• Managed by DDM system DQ2 (Don-Quijote 2)
• System based on datasets which are collections of files
  a file exists in the context of datasets
• Automated file management, distribution and archiving throughout the whole grid using a Central Catalog, FTS, LFCs
• Random access needs a pre-filtering of data of interest
  e.g. trigger or ID streams or TAGs (event-level meta data)
Distributed Data Management
DQ2: Data management system for all distributed ATLAS data
• Supports all three ATLAS Grid flavors
• Manages all data flows (EF → Tier0 → Grid Tiers → Institutes → users)
• Moves data between grid sites, query and retrieval of data
• Data is grouped into datasets, based on meta-data, like run period
  dataset name: Project.NNNN.PhRef.ProductionStep.Format.Version
  User-defined should have prefix user.FirstnameLastname
DQ2 end-user tools (the DQ2 dataset browser)
• dq2_ls to list datasets matching a given pattern
• dq2_get to copy data from local storage or over the grid
• dq2_put to create user-defined datasets
  dq2 can see only Tier1/Tier2 SEs (castor, dCache, DPM), so files need to be copied to an SE first and then registered to the DQ2 system
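The dataset naming convention can be made concrete with a small parser. This is a minimal sketch, not part of the DQ2 tools: the function and field names are hypothetical, and it assumes a production dataset name always has exactly six dot-separated fields. The example names are the ones used in the Pathena examples of this talk.

```python
from collections import namedtuple

# Field labels are illustrative; they follow
# Project.NNNN.PhRef.ProductionStep.Format.Version
DatasetName = namedtuple(
    "DatasetName",
    ["project", "number", "physics_ref", "production_step", "data_format", "version"],
)


def parse_dataset_name(name):
    """Split a production dataset name into its six dot-separated fields."""
    fields = name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(fields)}: {name!r}")
    return DatasetName(*fields)


def is_user_dataset(name):
    """User-defined datasets carry the prefix user.FirstnameLastname."""
    parts = name.split(".")
    return len(parts) >= 3 and parts[0] == "user" and parts[1] != ""


ds = parse_dataset_name("fdr08_run1.0003067.StreamJet.merge.AOD.o1_r8_t1")
print(ds.project, ds.production_step, ds.data_format, ds.version)
print(is_user_dataset("user.MinsukKim.fdr08_run1"))
```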
Job Flow in the WLCG/EGEE Grid
Job goes to the data
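"Job goes to the data" can be illustrated by a brokering step that picks the site already holding most of the requested files. The sketch below is an assumption-laden toy: the replica catalog is a plain dictionary, and `pick_site` and `min_files` are hypothetical names, not the interface of any grid middleware.

```python
def pick_site(dataset_files, replicas, min_files=1):
    """Return the site holding the most files of the dataset, or None.

    replicas maps a site name to the file names stored there (a stand-in
    for a real replica catalog). Sites holding fewer than min_files of the
    requested files are skipped. Ties go to the alphabetically first site.
    """
    wanted = set(dataset_files)
    best_site, best_count = None, 0
    for site in sorted(replicas):
        count = len(wanted & set(replicas[site]))
        if count >= min_files and count > best_count:
            best_site, best_count = site, count
    return best_site


files = ["f1", "f2", "f3", "f4"]
replicas = {
    "CERN": ["f1", "f2"],
    "TRIUMF": ["f1", "f2", "f3", "f4"],
    "ALBERTA": ["f1", "f2", "f3"],
}
chosen = pick_site(files, replicas)
print(chosen)
```

Sending the job to the chosen site avoids moving input datasets of 100 GB or more over the network.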
Grid Job Submission
Naive assumption: Grid ≈ large batch system
• Provide complicated job configuration jdl file
• Find suitable Athena software, installed as distribution kits in the Grid
• Locate the data on different storage elements
• Job splitting, monitoring and book-keeping
• etc.
Need for automation and integration of various different components
Distributed Analysis – Current
ATLAS offers several ways to do distributed analysis
• Data from MC Production System is currently consolidated by DDM-operations team on all Tier1 and then all Tier2 sites
• Analysis model foresees Athena analysis of AODs/ESDs and interactive use of Athena-aware-ROOT tuples
Simplifying use of the Grid: easy-to-use frontends for job definition and management, implemented in Python
[Diagram labels: User with valid grid certificate, CE, SE/RLS, DQ2]
Distributed Analysis
How to combine all these: Job scheduler/manager GANGA

Distributed Analysis with Ganga & Pathena
What is Ganga (an ATLAS/LHCb joint project)
A user-friendly job definition and management tool
• Allows simple switching between testing and large-scale data processing
• Readily extended/customized to meet the needs of different users
A job is constructed from a set of building blocks
• Specify which software to be run (application)
• Specify the processing system (backend), input/output and so on
Pluggable framework!
[Diagram labels: mandatory and optional building blocks]
Plug-in based design
Ease user’s experience in switching between different technologies
Concentrate developer’s effort in specific domain
[Diagram labels: Common interface, Specific implementation]
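The split between a common interface and specific implementations can be sketched in a few lines of Python. This is a generic illustration of the plug-in idea only: the class names `Backend`, `LocalBackend`, `GridBackend` and `Job` are made up for this example and are not the Ganga classes.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Common interface: every backend plug-in implements submit()."""

    @abstractmethod
    def submit(self, executable):
        """Run or queue the executable and return a status message."""


class LocalBackend(Backend):
    """Specific implementation: run on the user's own machine."""

    def submit(self, executable):
        return f"running {executable} on the local machine"


class GridBackend(Backend):
    """Specific implementation: send the job to a list of grid sites."""

    def __init__(self, sites=None):
        self.sites = sites or []

    def submit(self, executable):
        where = ", ".join(self.sites) or "any site"
        return f"submitting {executable} to the grid ({where})"


class Job:
    """A job only talks to the common interface, never to a concrete backend."""

    def __init__(self, executable, backend):
        self.executable = executable
        self.backend = backend

    def submit(self):
        return self.backend.submit(self.executable)


job = Job("testJetReco.py", LocalBackend())
print(job.submit())                      # small local test first
job.backend = GridBackend(sites=["TRIUMF", "ALBERTA"])
print(job.submit())                      # same job, different backend
```

Switching from a local test to large-scale processing changes one attribute, while a developer adding a new backend only has to write one new subclass.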
Ganga How-to
• Each job has a jobid which specifies the location of the repository and workspace in gangadir
• By default a job is sent to complete dataset locations, but often a dataset is only incomplete at a site (0 up to all files)
  make sure a dataset is present at a site by using the option j.inputdata.min_num_files=N (N>0)
  often advisable to force a job to a particular site or a subset of sites
• Provides two user interface clients and a scripting mode (PANDA-style job submission)
  ganga athena --inDS [input dataset] --outputdata [output] --lcg --ce [nodes] testJetReco.py
Job definition within IPython shell (RDO → AOD)
  j = Job()
  j.name='test'
  j.application=Athena()
  j.application.atlas_release='13.0.30'
  #j.application.atlas_production='13.0.30.3'
  j.application.option_file='testJetReco.py'
  j.application.max_events=1000
  j.inputdata=DQ2Dataset()                 #or ATLASLocalDataset()
  j.inputdata.dataset='misal1_csc11.005012.J3_pythis_jetjet.digit.RDO.v12003103_tid016367'
  j.outputdata=ATLASOutputDataset()        #or DQ2OutputDataset()
  j.outputdata.outputdata=['AOD.pool.root']
  j.inputsandbox=['PDGTABLE.MeV']          #also j.outputsandbox
  j.backend=LCG()                          #or NG(), LSF(), Local()
  j.backend.requirements.sites=['TRIUMF','ALBERTA']   #since 4.4.x
  j.submit()
Usage of Ganga at remote sites
Used at ~70 sites for 4 months of 2008
Over 1250 unique users since 2007
Canada ~1%
• Ganga Monitoring under http://www.cern.ch/ganga
  Usage Statistics, and Usage Report on the Grid (EGEE) with GangaRobot
  Jobs monitored by Dashboard (user, site, ce, application, and so on)
Ganga Activities
Main Users
Other activities
[Figure labels: HARP, Garfield]
Pathena (a python script for access to OSG resources via the Panda system)
A user job composed of sub-jobs
• One buildJob to receive source files from the user, to compile and produce libraries which are stored to the storage
• Many runAthena's to receive the libraries and run Athena
  completion of buildJob triggers runAthena
  output files are added to an output dataset
  DDM moves the dataset to area
[Diagram labels: Job splitter; a unique PandaID]
Console output on submission:
  extracting run configuration
  ConfigExtractor > Input=POOL
  ConfigExtractor > Output=AANT AANTupleStream AANT
  archive sources
  archive InstallArea
  post sources/jobO
  query files in dataset:fdr08_run1.0003067.StreamJet.merge.AOD.o1_r8_t1
  submit
  ======================================
   JobID : 39
   Status : 0
   > build
     PandaID=8558262
   > run
     PandaID=8558263
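The build-then-run structure can be sketched as a toy job splitter. This is not the Pathena or Panda implementation: `split_files` and `make_user_job` are hypothetical helpers, the round-robin split is just one possible policy, and handing out consecutive IDs is an assumption made to mirror the build/run PandaIDs in the console output.

```python
def split_files(files, n_subjobs):
    """Distribute input files over at most n_subjobs chunks, round-robin."""
    if n_subjobs < 1:
        raise ValueError("n_subjobs must be at least 1")
    n = min(n_subjobs, len(files)) or 1   # never more chunks than files
    return [files[i::n] for i in range(n)]


def make_user_job(files, n_subjobs, first_id=1):
    """Return one build job followed by run jobs, each with its own ID.

    Every run job records the build job it depends on, so it can be
    released only after the build has completed.
    """
    jobs = [{"id": first_id, "type": "build", "inputs": []}]
    for offset, chunk in enumerate(split_files(files, n_subjobs), start=1):
        jobs.append(
            {"id": first_id + offset, "type": "run", "inputs": chunk, "depends_on": first_id}
        )
    return jobs


input_files = [f"AOD_{i}.pool.root" for i in range(5)]   # made-up file names
jobs = make_user_job(input_files, n_subjobs=2, first_id=8558262)
for job in jobs:
    print(job["id"], job["type"], job["inputs"])
```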
Pathena How-to
FDR-1@Tier-2
  #not special job-option file (can run locally)
  pathena AnalysisSkeleton_jetTrigger.py \
    --outDS user.MinsukKim.fdr08_run1 \
    --inDS fdr08_run1.0003067.StreamJet.merge.AOD.o1_r8_t1 \
    --site ALBERTA --inputFileList fdr08_run1.list \
    --libDS LAST
Pathena Options (just example)
  outDS          name of output dataset (name convention)
  inDS           name of input dataset
  site           job to OSG by default, possible to LCG (if AUTO, job to site w/ most data)
                 future: a job submission to best site based on data/CPUs availability
  split          number of sub-jobs to which an analysis job is split (i.e. how many CPUs)
  nFiles         use a limited number of files in the input dataset (if not, all w/ auto split)
  inputFileList  filename which contains a list of files to be run in the input dataset
  libDS          library dataset (e.g. “LAST” means what the last build used)
  command        one-liner (e.g. “EvtMax=3”)
  noSubmit       don’t submit a job (error checking for script/dataset/site)
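When many similar submissions are needed, the command line above can be assembled from Python. The sketch below only builds the command string and does not run anything; `build_pathena_command` is a hypothetical helper, it uses only the flags that appear in the example command, and the check on the `user.` prefix follows the naming rule for user-defined datasets. It needs Python 3.8 or later for `shlex.join`.

```python
import shlex


def build_pathena_command(job_options, out_ds, in_ds, site=None,
                          input_file_list=None, lib_ds=None):
    """Assemble a pathena command line as a single shell-safe string."""
    if not out_ds.startswith("user."):
        raise ValueError("user-defined output datasets should start with 'user.'")
    args = ["pathena", job_options, "--outDS", out_ds, "--inDS", in_ds]
    if site:
        args += ["--site", site]
    if input_file_list:
        args += ["--inputFileList", input_file_list]
    if lib_ds:
        args += ["--libDS", lib_ds]
    return shlex.join(args)   # quotes arguments only where the shell needs it


cmd = build_pathena_command(
    "AnalysisSkeleton_jetTrigger.py",
    out_ds="user.MinsukKim.fdr08_run1",
    in_ds="fdr08_run1.0003067.StreamJet.merge.AOD.o1_r8_t1",
    site="ALBERTA",
    input_file_list="fdr08_run1.list",
    lib_ds="LAST",
)
print(cmd)
```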
Panda Monitor
[Screenshot labels: Enter your PandaIDs; Job status/site; Inputs/outputs; Log/debug; ALBERTA]
Pathena vs. Ganga
                        Pathena                                           Ganga
Design                  Specialized tool                                  Extensible generic tool
Setup                   Checkout PandaTools                               Ganga must be installed
                        (set PATHENA_GRID_SETUP_SH)                       (no need to setup grid environ.)
Submit jobs: how        Shell command line                                CLIP, Script and GUI
Submit jobs: where      Grids (OSG,...CERN,TRIUMF,ALBERTA)                Grids, LSF, PBS,… and Local
Site (dataset) finding  AUTO                                              AUTO
Input datasets          DQ2                                               DQ2 or Local/castor/dCache
Get results             DQ2                                               DQ2 or Local/castor/dCache
Bookkeeping/Retry       With pathena_util                                 With ganga
Monitoring/ErrorLog     Web and local utility                             Web and local utility
LXPLUS/THOR to submit jobs to CERN/TRIUMF/ALBERTA, and to retrieve outputs
Conclusion
The generic grid job submission frameworks can be used with DDM/DQ2 to perform Distributed Analysis
• Use both Pathena and Ganga at LXPLUS and THOR clusters
• Submit jobs to CERN, TRIUMF, ALBERTA & OSG/NG sites
• Ganga available with Local, Batch, and Grid systems
Distributed analysis in ATLAS is evolving rapidly
• Many key components like the DDM system have come online (Data Management is a central issue)
• The multi-pronged approach to distributed analysis has encouraged one submission system to learn from another and ultimately produced a more robust and feature-rich distributed analysis system
Conclusion
Configure once, run anywhere
[Diagram labels: User; Ganga, Pathena; Local, Batch, Grid]
Tier-2@ALBERTA
THOR Linux Computing Cluster
• Began around 1998, 42 dual Pentium II/III machines (100Mb/s Ethernet)
• Beowulf-type, Cheaper than S-computers by more than a factor of ten
Current Hardware/Software Configuration (high-speed Gigabit link)
• 3 head-nodes for Cluster/Interactive User, 1 Torque/Maui server node
• Many server nodes for Grid Compute Element
• 74 dual processor compute nodes, 250 work nodes (200 AMD Opteron)
• 4 data storage nodes (~6 TB of RAID disk storage)
• 4 iSCSI storage arrays (~22TB) and 2 mass storage tape systems
• Scientific Linux 3, 4 (and Fedora Core 2, 3), Various applications & tools
Multi-purpose Computing Facility
• Prototype of PC-based Event Filter sub-farm (high-level trigger system)
• Multiple-serial Monte Carlo production (Tier-2)
• Fully integrated with LCG and Grid Canada projects for distributed computing
Backup
Middle-ware Features
WLCG/gLite UI (EGEE)
• Job submission via LCG RB (Resource Broker)
• Fast bulk submission with new gLite RB
• LFC (LCG File Catalog)
OSG/PanDA
• An integrated Production ANd Distributed Analysis system
• JobScheduler & Pilots: Acquisition of Grid CE resources
• LRC (Local Replica Catalog)
NorduGrid
• ARC middle-ware for job submission
• RLS (Replica Location Service) file catalog
• Now possible for distributed analysis (already in production)
Ganga Architecture
Ganga Applications and Backends
DQ2 Architecture
Three components: Central Dataset Catalog, Local Site Subscription Services, Client Tools
What is a Grid?
The key criteria:
• Coordinated distributed resources …
• Uses standard, open, general-purpose protocols and interfaces …
• Deliver non-trivial qualities of service
What is not a Grid?
• A cluster, a network attached storage device, a scientific instrument, a network, etc.
• Each is an important component of a Grid, but by itself does not constitute a Grid