Integrating DIRAC workflows in Supercomputers
Status and next steps

Alexandre Boyer - Université Clermont Auvergne, CERN
[email protected]

Virtual DIRAC Users' Workshop - Monday, 10 May 2021


Introduction

The DIRAC WMS implements the Pilot-Job paradigm:

• Able to federate a large variety of heterogeneous computing resources
• Mainly Grid Sites and Clouds, but also opportunistic resources. What about Supercomputers?

Supercomputers represent a significant amount of computing power:

• They differ from Grid Sites: integrating VO-specific workflows on such machines through DIRAC requires work
• Each machine is unique, and the landscape evolves quickly

Table of Contents

• DIRAC WMS on Grid Sites
• DIRAC WMS and Supercomputers
• Tackling the distributed computing challenges

DIRAC WMS on Grid Sites
WMS Workflow

[Diagram: the DIRAC Pilot-Factory (1) submits pilots to the site Computing Element, which runs them on the Grid Site worker nodes; the pilots then (2) fetch waiting jobs from the DIRAC services.]

DIRAC WMS on Grid Sites
"Typical" job requirements

[Diagram: 4 single-processor (SP) jobs running on a Grid Site worker node.]

• Single-core allocation
• x86 architecture
• SLC6/CC7-compatible environment
• CVMFS endpoint mounted on the nodes
• >2 GB RAM per core
• Network access (fetch jobs, upload outputs)
• LRMS accessible from outside

DIRAC WMS and Supercomputers
Presentation

Definition: "A mainframe computer that is among the largest, fastest, or most powerful of those available at a given time."

• Twice a year, top500.org releases the list of the most powerful supercomputers (SC) in the world
• #1, Fugaku, is composed of ARM processors and contains ~7 million cores
• #2 and #3 leverage IBM Power processors and Nvidia GPUs, and contain ~1.5-2 million cores
• In comparison, WLCG provides ~1 million cores (many additional parameters have to be taken into account for a fair comparison, though)

DIRAC WMS and Supercomputers
Features of Supercomputers

[Diagram: an HPC layout showing 2 Multi-Process (MP) jobs running on a fat node and 1 MP job running on a GPU.]

• Single/Multi-core allocation
• x86/Non-x86 architecture
• Fast node interconnectivity
• GPU usage
• Shared file system
• No network connectivity
• Restrictive user authentication

DIRAC WMS and Supercomputers
Challenges

Software architecture (VO):

• SC are many-core architectures
• They can include non-x86 CPUs (ARM, AMD, Power) and GPUs
• They might provide less than 2 GB of RAM per core

Distributed computing (DIRAC):

• SC policies may differ from those of HEP Grid Sites
• They might lack CVMFS, outbound connectivity, or external access to the LRMS

⇒ SC are all made differently: it is hard to build a unique solution for all of them

Tackling the distributed computing challenges
Overview

• One main variable directly affects the chosen solution (push or pull):
  + Do WNs have external connectivity? Yes (or only via the edge node) / No
• Other variables generate variations that can be added on top of the proposed solution:
  + Is CVMFS mounted on the WNs? Yes / No
  + Is the LRMS accessible from outside? Yes / No
  + What type of allocations can we make? Single-core / multi-core / multi-node

⇒ We will go through the different cases, from the easiest to the hardest one

Tackling the distributed computing challenges
Pull model: single-core allocation

Similar to a Grid Site:

• Uncommon for a SC
• Often needs collaboration with the system administrators

[Diagram: single-core allocations; CVMFS endpoint mounted on the nodes; external connectivity from the WNs (fetch jobs, upload outputs); LRMS accessible from outside (pilots pushed from the DIRAC server); shared FS and edge node available.]

Tackling the distributed computing challenges
Pull model: multi-core allocation

Integrated since v7r0.

SC often require their users to allocate many cores, or even whole nodes, to run a program (queue configuration).

Fat node partitioning [3]:

• One pilot per fat node executes several SP/MP jobs per allocation
• In the Queue configuration, add LocalCEType=Pool and NumberOfProcessors=N (see the sketch below)

[Diagram: same setup as the single-core case, but with multi-core allocations.]
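As a rough illustration only: such a queue could be described as follows in the DIRAC Configuration Service. The site, CE, and queue names are hypothetical, and only LocalCEType and NumberOfProcessors come from the slide; the value of NumberOfProcessors should match the size of the fat node.

    Resources
    {
      Sites
      {
        LCG
        {
          LCG.Example.org        # hypothetical site
          {
            CEs
            {
              ce.example.org     # hypothetical CE
              {
                Queues
                {
                  fat-node-queue
                  {
                    # one pilot per fat node; the inner Pool CE runs
                    # several SP/MP payloads within the allocation
                    LocalCEType = Pool
                    NumberOfProcessors = 64
                  }
                }
              }
            }
          }
        }
      }
    }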

Tackling the distributed computing challenges
Pull model: multi-node allocation

Almost integrated in v7r1.

Allows getting a large number of resources with a small number of allocations.

Sub-pilots (currently specific to SLURM):

• One sub-pilot per fat node allocated; the pilots share the same ID, status, and output
• In the Queue configuration, add ParallelLibrary=PL and NumberOfNodes=N<-M> (see the sketch below)

[Diagram: same setup as before, but with multi-node allocations.]
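Again as a sketch under the same assumptions (hypothetical queue name, values to adapt): the two options sit in the same queue section, with NumberOfNodes taking either a fixed number of nodes or a range N-M.

    multi-node-queue
    {
      # one SLURM allocation spanning 2 to 4 fat nodes;
      # one sub-pilot runs on each allocated node, and all
      # sub-pilots share the same ID, status, and output
      ParallelLibrary = PL       # placeholder value from the slide
      NumberOfNodes = 2-4
    }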

Tackling the distributed computing challenges
Pull model: CVMFS not mounted on the WNs

Not integrated; VO action.

SC by default do not provide CVMFS on the WNs.

cvmfsexec on the shared FS [2] (sketch below):

• Mounts CVMFS as an unprivileged user
• Purely a site-admin/VO action; we might actually need to add a parameter in DIRAC to ease the process

[Diagram: CVMFS is not mounted on the WNs; cvmfsexec provides it from the shared file system.]
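A minimal sketch of the site-admin/VO action, following the cvmfsexec README [2]; the repository name is an example:

    # build an unprivileged CVMFS distribution on the shared FS
    git clone https://github.com/cvmfs/cvmfsexec
    cd cvmfsexec
    ./makedist default

    # mount the repository and run a payload inside the environment
    ./cvmfsexec lhcb.cern.ch -- /bin/bash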

Tackling the distributed computing challenges
Pull model: no remote access to the LRMS

Integrated since v7r0; VO action.

Some SC can only be accessed via a VPN (no CE, no direct SSH).

Site Director on the edge node (sketch below):

• Directly submits pilots from the edge node
• Would need to be allowed to execute agents on the edge node
• Would need to be updated manually

[Diagram: the Pilot-Factory runs on the edge node and submits pilots to the LRMS, which is not accessible from outside.]
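A minimal sketch, assuming a DIRAC client installation is permitted on the edge node; dirac-agent is the standard command to run a DIRAC agent by hand:

    # run the pilot factory on the edge node itself, so that
    # pilots are submitted to the LRMS from inside the SC
    dirac-agent WorkloadManagement/SiteDirector -o LogLevel=INFO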

Tackling the distributed computing challenges
Pull model: external connectivity only from the edge node

Not integrated; VO action.

Some SC only provide external connectivity from the edge node. Pilots cannot directly interact with the DIRAC services in this case.

Gateway:

• Would be installed on the edge node (if possible)
• Would capture the Pilot and Job calls and redirect them

[Diagram: a Gateway service on the edge node captures the calls coming from the WNs and redirects them to the DIRAC services.]

Tackling the distributed computing challenges
Push model: no external connectivity

In progress, v7r2.

Some SC do not provide any external connectivity at all, neither on the WNs nor on the edge node.

PushJobAgent:

• Works like a pilot outside of the SC
• Fetches jobs, deals with inputs and outputs, and submits the application part to a SC
• Requires direct access to the LRMS

[Diagram: the PushJobAgent (1) fetches a job from DIRAC and gets its inputs, (2) submits the application to the SC Computing Element, and (3) gets the outputs and stores them.]

Tackling the distributed computing challenges
Push model: no external connectivity, multi-core/node

In progress, v7r2.

BundleCE:

• Would aggregate multiple applications into one allocation

[Diagram: DIRAC (1) submits applications to the BundleCE, which (2) aggregates them and (3) submits them as one multi-core job to the Computing Element.]

Tackling the distributed computing challenges
Push model: no external connectivity, no CVMFS

In progress, v7r2; VO action.

As already said, SC do not provide CVMFS by default, and cvmfsexec cannot be used in this context since it needs outbound connectivity to reach a CVMFS endpoint.

Subset-CVMFS-Builder:

• Run given jobs and extract their CVMFS dependencies
• Use cvmfs_shrinkwrap [1] to make a subset of CVMFS (sketch below)
• Test it and deploy it on the SC shared FS

[Diagram: a CVMFS Subset Builder builds a subset of CVMFS from the endpoint and deploys it on the shared FS behind the Computing Element and edge node.]
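A sketch of the middle step, assuming the job dependencies have already been extracted into a specification file; the repository, spec file, and export path are examples, and the exact options should be checked against the shrinkwrap documentation [1]:

    # export the subset of lhcb.cern.ch listed in the spec file into
    # a self-contained tree to be copied onto the SC shared FS
    cvmfs_shrinkwrap -r lhcb.cern.ch -f lhcb.cern.ch.spec \
        --dest-base /exports/cvmfs -j 16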

Conclusion

Status:

• Many SC with external connectivity are supported (multi-core allocations)
• Tools to exploit SC with no external connectivity are in progress

Next steps:

• Provide the push model solution and its variations
• Work on DB12 (CPU power computation) support for multi-core allocations
• Provide complete documentation about SC integration
• Provide side projects to minimize VO actions (Subset-CVMFS-Builder)

Thanks

Any questions? Comments?

References

[1] CVMFS. cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#cpt-shrinkwrap. Online; accessed 4 May 2021.

[2] CVMFS. cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.

[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 (June 2020). http://arxiv.org/abs/2006.13603.

Backup Slides

Real use cases we have in LHCb

LHCb-supercomputers collaboration
Available HPCs

• Piz Daint at CSCS (Switzerland)
• Marconi-A2 at CINECA (Italy) - not used anymore
• SDumont at LNCC (Brazil)
• MareNostrum at BSC (Spain)

LHCb-supercomputers collaboration
Piz Daint, CSCS

• Ranked 12th in the Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local system administrators allows a traditional Grid Site usage

⇒ No change required

LHCb-supercomputers collaboration
Marconi-A2, CINECA

• Ranked 19th in the Top500 (Nov 2019)
• 348,000 cores (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory per core

LHCb-supercomputers collaboration
Marconi-A2, CINECA: Development

• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation

[Diagram: the Pilot-Factory (1) submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); the pilots (2) fetch waiting jobs from the DIRAC services.]

LHCb-supercomputers collaboration
Marconi-A2, CINECA: Status

⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a cluster of V100 GPUs and Power9 CPUs

Done:
• Exploited 68 of the 272 cores per node; not enough memory for more jobs

To be done:
• Nothing: Marconi-A2 disappeared
• LHCb software not ready for GPUs

LHCb-supercomputers collaboration
SDumont, LNCC

• Ranked 277th in the Top500 (Nov 2020)
• 33,856 cores (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues

LHCb-supercomputers collaboration
SDumont, LNCC: Development

• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
• Multi-node allocations: use the sub-pilots variation

[Diagram: the Pilot-Factory (1) submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); the pilots (2) fetch waiting jobs from the DIRAC services.]

LHCb-supercomputers collaboration
SDumont, LNCC: Status

Done:
• Exploit 24/24 and 48/48 cores per node

To be done:
• Multi-node allocations: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time

LHCb-supercomputers collaboration
MareNostrum, BSC

• Ranked 42nd in the Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocations possible but not recommended
- No network connectivity
- CVMFS not mounted on the WNs

LHCb-supercomputers collaboration
MareNostrum, BSC: Development

• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation

[Diagram: the BundleCE (1) fetches waiting jobs from the DIRAC services and (2) submits the applications through an ARC CE to the MareNostrum worker nodes, where the CVMFS Subset Builder has deployed a subset of CVMFS.]

LHCb-supercomputers collaboration
MareNostrum, BSC: Status

Done:
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder

To be done:
• CE configuration to run the jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 2: Integrating DIRAC work ows in Supercomputers

Introduction

The DIRAC WMS implements the Pilot-Job paradigm

bull Able to federate a large variety of heterogeneous computingresources

bull Mainly Grid Sites Clouds but also opportunistic resources Whatabout Supercomputers

Supercomputers represent an important computing power

bull Different from the Grid Sites integrating VO-specific workflows onsuch machines through DIRAC requires work

bull Each machine is unique and the landscape quickly evolves

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 221

Table of Contents

bull DIRAC WMS on Grid Sites

bull DIRAC WMS and Supercomputers

bull Tackling the distributed computing challenges

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 321

DIRAC WMS on Grid Sites

WMS Workflow

ComputingElement

1 Submit pilots

Grid SiteWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 421

DIRAC WMS on Grid Sites

rdquoTypicalrdquo job requirements

Single-Coreallocation

x86architecture

SLC6CC7compatible

4 SP Jobs running on a Grid Site

CVMFS endpointmounted on the nodes

gt2Gb RAMper core

Internet

Network access(Fetch jobs upload outputs)

Worker Node

Core

LRMS accessiblefrom outside

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521

DIRAC WMS and Supercomputers

Presentation

Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time

bull Twice a year top500org releases the list of the most powerful SCof the world

bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores

bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores

bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621

DIRAC WMS and Supercomputers

Features of Supercomputers

SingleMulti-Coreallocation

x86Non-x86architecture

FastNodes-interconnectivty

2Multi-Process(MP)Jobsrunningonafatnode

1MPJobrunningonGPU

GPUusage

SharedFileSystem

HPC

Nonetworkconnectivity

Restrictiveuserauthentication

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721

DIRAC WMS and Supercomputers

Challenges

Software architecture (VO)

bull SC are many-core architecture

bull They can include non x86CPUs (ARM AMD Power)GPUs

bull They might contain less than2Gbcore

Distributed computing (DIRAC)

bull SC policies may differ fromthose of HEP Grid Sites

bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS

rArr SC are all made differently hard to build a unique solution for all ofthem

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 3: Integrating DIRAC work ows in Supercomputers

Table of Contents

bull DIRAC WMS on Grid Sites

bull DIRAC WMS and Supercomputers

bull Tackling the distributed computing challenges

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 321

DIRAC WMS on Grid Sites

WMS Workflow

ComputingElement

1 Submit pilots

Grid SiteWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 421

DIRAC WMS on Grid Sites

rdquoTypicalrdquo job requirements

Single-Coreallocation

x86architecture

SLC6CC7compatible

4 SP Jobs running on a Grid Site

CVMFS endpointmounted on the nodes

gt2Gb RAMper core

Internet

Network access(Fetch jobs upload outputs)

Worker Node

Core

LRMS accessiblefrom outside

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521

DIRAC WMS and Supercomputers

Presentation

Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time

bull Twice a year top500org releases the list of the most powerful SCof the world

bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores

bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores

bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621

DIRAC WMS and Supercomputers

Features of Supercomputers

SingleMulti-Coreallocation

x86Non-x86architecture

FastNodes-interconnectivty

2Multi-Process(MP)Jobsrunningonafatnode

1MPJobrunningonGPU

GPUusage

SharedFileSystem

HPC

Nonetworkconnectivity

Restrictiveuserauthentication

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721

DIRAC WMS and Supercomputers

Challenges

Software architecture (VO)

bull SC are many-core architecture

bull They can include non x86CPUs (ARM AMD Power)GPUs

bull They might contain less than2Gbcore

Distributed computing (DIRAC)

bull SC policies may differ fromthose of HEP Grid Sites

bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS

rArr SC are all made differently hard to build a unique solution for all ofthem

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 4: Integrating DIRAC work ows in Supercomputers

DIRAC WMS on Grid Sites

WMS Workflow

ComputingElement

1 Submit pilots

Grid SiteWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 421

DIRAC WMS on Grid Sites

rdquoTypicalrdquo job requirements

Single-Coreallocation

x86architecture

SLC6CC7compatible

4 SP Jobs running on a Grid Site

CVMFS endpointmounted on the nodes

gt2Gb RAMper core

Internet

Network access(Fetch jobs upload outputs)

Worker Node

Core

LRMS accessiblefrom outside

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521

DIRAC WMS and Supercomputers

Presentation

Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time

bull Twice a year top500org releases the list of the most powerful SCof the world

bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores

bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores

bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621

DIRAC WMS and Supercomputers

Features of Supercomputers

SingleMulti-Coreallocation

x86Non-x86architecture

FastNodes-interconnectivty

2Multi-Process(MP)Jobsrunningonafatnode

1MPJobrunningonGPU

GPUusage

SharedFileSystem

HPC

Nonetworkconnectivity

Restrictiveuserauthentication

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721

DIRAC WMS and Supercomputers

Challenges

Software architecture (VO)

bull SC are many-core architecture

bull They can include non x86CPUs (ARM AMD Power)GPUs

bull They might contain less than2Gbcore

Distributed computing (DIRAC)

bull SC policies may differ fromthose of HEP Grid Sites

bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS

rArr SC are all made differently hard to build a unique solution for all ofthem

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 5: Integrating DIRAC work ows in Supercomputers


References

[1] CVMFS. cvmfs_shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#cpt-shrinkwrap. Online; accessed 4 May 2021.
[2] CVMFS. cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 [hep-ex, physics] (June 2020). URL: http://arxiv.org/abs/2006.13603.

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 6: Integrating DIRAC work ows in Supercomputers

DIRAC WMS and Supercomputers

Presentation

Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time

bull Twice a year top500org releases the list of the most powerful SCof the world

bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores

bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores

bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621

DIRAC WMS and Supercomputers

Features of Supercomputers

SingleMulti-Coreallocation

x86Non-x86architecture

FastNodes-interconnectivty

2Multi-Process(MP)Jobsrunningonafatnode

1MPJobrunningonGPU

GPUusage

SharedFileSystem

HPC

Nonetworkconnectivity

Restrictiveuserauthentication

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721

DIRAC WMS and Supercomputers

Challenges

Software architecture (VO)

bull SC are many-core architecture

bull They can include non x86CPUs (ARM AMD Power)GPUs

bull They might contain less than2Gbcore

Distributed computing (DIRAC)

bull SC policies may differ fromthose of HEP Grid Sites

bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS

rArr SC are all made differently hard to build a unique solution for all ofthem

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 7: Integrating DIRAC work ows in Supercomputers

DIRAC WMS and Supercomputers

Features of Supercomputers

SingleMulti-Coreallocation

x86Non-x86architecture

FastNodes-interconnectivty

2Multi-Process(MP)Jobsrunningonafatnode

1MPJobrunningonGPU

GPUusage

SharedFileSystem

HPC

Nonetworkconnectivity

Restrictiveuserauthentication

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721

DIRAC WMS and Supercomputers

Challenges

Software architecture (VO)

bull SC are many-core architecture

bull They can include non x86CPUs (ARM AMD Power)GPUs

bull They might contain less than2Gbcore

Distributed computing (DIRAC)

bull SC policies may differ fromthose of HEP Grid Sites

bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS

rArr SC are all made differently hard to build a unique solution for all ofthem

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 8: Integrating DIRAC work ows in Supercomputers

DIRAC WMS and Supercomputers

Challenges

Software architecture (VO)

bull SC are many-core architecture

bull They can include non x86CPUs (ARM AMD Power)GPUs

bull They might contain less than2Gbcore

Distributed computing (DIRAC)

bull SC policies may differ fromthose of HEP Grid Sites

bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS

rArr SC are all made differently hard to build a unique solution for all ofthem

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 9: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challenges

Overview

bull 1 main variable directly affects the chosen solution (push pull)

+ Do WNs have an external connectivity yes (or only via the edgenode) no

bull Other variables generate variations that can be added up to theproposed solution

+ Is CVMFS mounted on the WNs yes no

+ Is LRMS accessible from outside yes no

+ What type of allocations can we make single-core multi-coremulti-node

rArr We will go through different cases from the easiest to the hardestone

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

[1] CVMFS: cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#cpt-shrinkwrap. Online; accessed 4 May 2021.

[2] CVMFS: cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.

[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". In: arXiv:2006.13603 [hep-ex, physics] (June 2020). URL: http://arxiv.org/abs/2006.13603.

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

Backup Slides

Real use cases we have in LHCb

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Available HPCs

• Piz Daint in CSCS (Switzerland)

• Marconi-A2 in CINECA (Italy) - not used anymore

• SDumont in LNCC (Brazil)

• MareNostrum in BSC (Spain)

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Piz Daint, CSCS

• Ranked 12th in Top500 (Nov 2020)

• 387,872 cores (Nov 2020)

+ Collaboration with the local system administrators allows a traditional Grid Site usage

⇒ No change required

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Marconi-A2, CINECA

• Ranked 19th in Top500 (Nov 2019)

• 348,000 cores (Nov 2019)

+ External connectivity from the WNs

+ CVMFS mounted on the WNs

+ Accessible via a CE

- Multi-core allocations: 272 logical cores per node (Intel KNL)

- Low memory/core

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Marconi-A2, CINECA: Development

• External connectivity: use the pull model

• Multi-core allocations: use the fat node partitioning variation

[Diagram: the DIRAC Pilot-Factory 1. submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); the pilots then 2. fetch waiting jobs from the DIRAC services]

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Marconi-A2, CINECA: Status

⇒ More details about the LHCb work on CINECA in [3]

⇒ Marconi-A2 has been replaced by Marconi-100, a cluster of V100 GPUs and Power9 CPUs

Done

• Exploited 68/272 cores per node; not enough memory for more jobs

To be done

• Nothing to do: Marconi-A2 disappeared

• LHCb software not ready for GPUs

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

SDumont, LNCC

• Ranked 277th in Top500 (Nov 2020)

• 33,856 cores (Nov 2020)

+ External connectivity from the WNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (special access)

- Multi-core allocations: 24 or 48 logical cores per node

- Multi-node allocations: 21 nodes per allocation required by some queues

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

SDumont, LNCC: Development

• External connectivity: use the pull model

• Multi-core allocations: use the fat node partitioning variation

• Multi-node allocations: use the sub-pilots variation (see the queue sketch after this slide)

[Diagram: the DIRAC Pilot-Factory 1. submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); the pilots then 2. fetch waiting jobs from the DIRAC services]

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21
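Put together, the queue description for such a site could look like the sketch below. LocalCEType=Pool, NumberOfProcessors=N and NumberOfNodes=N<-M> are the options named on the pull-model slides earlier in the deck; the ParallelLibrary value is an assumption.

SDumont queue sketch (DIRAC Configuration System, simplified):
  LocalCEType = Pool          # fat node partitioning: one pilot runs several SP/MP jobs
  NumberOfProcessors = 48     # logical cores handed to the inner Pool CE
  ParallelLibrary = Srun      # sub-pilots variation (assumed value; SLURM only for now)
  NumberOfNodes = 2-21        # multi-node allocations, N<-M> syntax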

LHCb-supercomputers collaboration

SDumont, LNCC: Status

Done

• Exploit 24/24 and 48/48 cores per node

To be done

• Multi-node allocations: should have results soon

• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Mare Nostrum, BSC

• Ranked 42nd in Top500 (Nov 2020)

• 153,216 cores (Nov 2020)

+ Accessible via a CE (ARC) and also SSH

+ Single-core allocations possible, but not recommended

- No network connectivity

- CVMFS not mounted on the WNs

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Mare Nostrum, BSC: Development

• No external connectivity: use the push model

• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation

• To get multi-core allocations: use the BundleCE variation

[Diagram: 1. the BundleCE and the CVMFS Subset Builder fetch jobs from the DIRAC services; 2. applications are submitted through an ARC CE to the MareNostrum worker nodes, which use the CVMFS subset]

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

LHCb-supercomputers collaboration

Mare Nostrum, BSC: Status

Done

• Prototype to run simple submissions ("Hello World")

• First version of the Subset-CVMFS-Builder

To be done

• CE configuration to run jobs within Singularity

• BundleCE to aggregate multiple jobs in an allocation

Virtual DIRAC Users' Workshop - Monday, 10th May 2021 - 21/21

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 10: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challengesPull model single-core allocation

Similar to a Grid Site

bull Uncommon for a SC

bull Often need tocollaborate with thesystem administrators

Single-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 11: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challengesPull model Multi-core allocation

Integrated since v7r0

SC often require their users to allocate many cores or even nodes to runa program (queue configuration)

Fat node partition [3]

bull One pilot per fat nodeexecute several SPMPjobs per allocation

bull In the Queue conf addLocalCEType=Pool

andNumberOfProcessors=N

Multi-Coreallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Shared FS

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 12: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challengesPull model Multi-node allocation

Almost integrated in v7r1

Allows to get a large number of resources with a small number ofallocations

Sub-Pilots (specific to SLURM currently)

bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output

bull In the Queue conf addParallelLibrary=PL

andNumberOfNodes=Nlt-Mgt

Shared FS

Multi-Nodeallocation

CVMFS endpointmounted on the nodes

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 13: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challengesPull model CVMFS not mounted on WNs

Not Integrated VO action

SC by default do not provide CVMFS on the WNs

CVMFS-exec on the shared FS [2]

bull Mount CVMFS as anunprivileged user

bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process

CVMFS endpoint

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS notmounted

CVMFS-exec on theshared file system

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 14: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challengesPull model No remote access to LRMS

Integrated since v7r0 VO action

Some SC can only be accessed via a VPN (No CE no direct SSH)

Site Director on the edge node

bull Directly submit pilotsfrom the edge node

bull Would need to beallowed to executeagents on the edgenode

bull Would need to beupdated manually

LRMS not accessiblefrom outside

Internet

External connectivity(Fetch jobs upload outputs)

LRMS accessible from outside(push pilots from the DIRAC server)

Pilot-Factory onthe edge node

CVMFS endpointmounted on the nodes

Shared FS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-

shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021

CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021

Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

Backup Slides

Real use-cases we have in LHCb

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Available HPCs

bull Piz Daint in CSCS (Suisse)

bull Marconi-A2 in CINECA (Italy) ndash not used anymore

bull SDumont in LNCC (Brazil)

bull MareNostrum in BSC (Spain)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Piz Daint CSCS

bull Ranked 12th in Top500 (Nov 2020)

bull 387872 cores (Nov 2020)

+ Collaboration with the local System Administrators allows atraditional Grid Site usage

rArr No change required

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA

bull Ranked 19th in Top500 (Nov2019)

+ External connectivity from theWNs

+ CVMS mounted on the WNs

+ Accessible via a CE

bull 348000 cores (Nov 2019)

- Multi-core allocations 272logical cores per node (IntelKNL)

- Low memorycore

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

HTCondorCE

1 Submit pilots

Marconi-A2Worker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Marconi-A2 CINECA Status

rArr More details about LHCb work on CINECA [3]

rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster

Done

bull Exploited 68272 cores pernode not enough memory formore jobs

To be done

bull Nothing to do Marconi-A2disappeared

bull LHCb software not ready forGPUs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 15: Integrating DIRAC work ows in Supercomputers

Tackling the distributed computing challengesPull model Ext connectivity only from the edge node

Not Integrated VO action

Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case

Gateway

bull Would be installed onthe edge node (ifpossible)

bull Would capture the Pilotand Job calls and wouldredirect them

Shared FS

External connectivityonly from the edge node

Internet

External connectivity(Fetch jobs upload outputs)

Gateway serviceon the edge node

captures calls andredirects them

LRMS accessible from outside(push pilots from the DIRAC server)

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521

Tackling the distributed computing challengesPush model No Ext connectivity

In Progress v7r2

Some SC do not provide any external connectivity at all neither on theWNs or the edge node

PushJobAgent

bull Works like a pilotoutside of the SC

bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC

bull Require a direct accessto the LRMS

Shared FS

No externalConnectivity

CVMFS endpointmounted on the nodes

Internet

1 Fetch a joband get inputs

3 Get outputsand store them

DIRAC 2 Submitthe

Application

Computing Element

Edge Node

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621

Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode

In Progress v7r2

BundleCE

bull Would aggregatemultiple applicationsinto one allocation

Shared FS

No externalConnectivity

Multi-core Multi-node

Internet

DIRAC

1 SubmitApplications

2 AggregateApplications

3 Submitthem as 1multi-core

job

Computing Element

Bundle CE

CVMFS endpointmounted on the nodes

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721

Tackling the distributed computing challenges

Push model No Ext connectivity No CVMFS

In Progress v7r2 VO action

As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context

Subset-CVMFS-Builder

bull Run amp extract CVMFSdependencies of givenjobs

bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS

bull Test it amp deploy it onthe SC shared FS

No externalConnectivity

No CVMFS

Internet

Build and deploy asubset of CVMFS on

the shared FS

CVMFS SubsetBuilder

CVMFS Subset

CVMFSendpoint

Edge NodeDIRAC

Computing Element

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821

Conclusion

Status

bull Support many SC with external connectivity (multi-core allocations)

bull Tools to exploit SC with no external connectivity are in progress

Next Steps

bull Provide the push model solution and its variations

bull Work on DB12 (CPU Power computation) support for multi-coreallocations

bull Provide a complete documentation about SC integration

bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921

Thanks

Any questions Comments

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021

References
[1] CVMFS. cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#cpt-shrinkwrap. Online; accessed 4 May 2021.
[2] CVMFS. cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". In: arXiv:2006.13603 [hep-ex, physics] (June 2020). URL: http://arxiv.org/abs/2006.13603.

Backup Slides

Real use-cases we have in LHCb


LHCb-supercomputers collaboration
Available HPCs
• Piz Daint in CSCS (Switzerland)
• Marconi-A2 in CINECA (Italy) – not used anymore
• SDumont in LNCC (Brazil)
• MareNostrum in BSC (Spain)

LHCb-supercomputers collaboration
Piz Daint, CSCS
• Ranked 12th in Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows a traditional Grid Site usage
⇒ No change required

LHCb-supercomputers collaboration
Marconi-A2, CINECA
• Ranked 19th in Top500 (Nov 2019)
• 348,000 cores (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory/core

LHCb-supercomputers collaboration
Marconi-A2, CINECA: Development
• External connectivity: Use the pull model
• Multi-core allocations: Use the fat node partitioning variation
[Diagram: (1) the DIRAC Pilot-Factory submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); (2) the pilots fetch waiting jobs from the DIRAC services.]

LHCb-supercomputers collaboration
Marconi-A2, CINECA: Status
⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a V100 GPU and Power9 CPU cluster

Done
• Exploited 68/272 cores per node: not enough memory for more jobs

To be done
• Nothing to do: Marconi-A2 disappeared
• LHCb software not ready for GPUs

LHCb-supercomputers collaboration
SDumont, LNCC
• Ranked 277th in Top500 (Nov 2020)
• 33,856 cores (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues

LHCb-supercomputers collaboration
SDumont, LNCC: Development
• External connectivity: Use the pull model
• Multi-core allocations: Use the fat node partitioning variation
• Multi-node allocations: Use the sub-pilots variation
[Diagram: (1) the DIRAC Pilot-Factory submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); (2) the pilots fetch waiting jobs from the DIRAC services.]

LHCb-supercomputers collaboration
SDumont, LNCC: Status

Done
• Exploit 24/24 and 48/48 cores per node

To be done
• Multi-node allocations: should have results soon
• Dirac Benchmark (DB12) not adapted to multi-core allocations: 20% of the jobs run out of time (see the sketch below)
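A hedged illustration of the DB12 issue, assuming the benchmark is a CPU-bound kernel: scoring a node while it is otherwise idle overstates the per-core power once every core is busy, so wallclock limits derived from the idle score come out too tight. The kernel below is a stand-in, not the actual DIRAC Benchmark 2012 code; the adaptation shown is simply to run the benchmark on every core at once and use the mean.

import time
from multiprocessing import Pool, cpu_count

def bench(_):
    """Stand-in CPU-bound kernel; returns an 'iterations per second' score."""
    t0, n = time.perf_counter(), 0
    while time.perf_counter() - t0 < 1.0:
        sum(i * i for i in range(1000))
        n += 1
    return n

if __name__ == "__main__":
    idle_score = bench(0)  # measured on an otherwise idle node
    with Pool(cpu_count()) as pool:  # every core busy, as in a fat-node pilot
        loaded = pool.map(bench, range(cpu_count()))
    loaded_score = sum(loaded) / len(loaded)
    # Job time limits are derived from the score; using idle_score on a
    # fully packed node undersizes them, so some jobs run out of time.
    print(f"idle: {idle_score}, loaded mean: {loaded_score:.0f}")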

LHCb-supercomputers collaboration
Mare Nostrum, BSC
• Ranked 42nd in Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocation possible but not recommended
- No network connectivity
- CVMFS not mounted on the WNs

LHCb-supercomputers collaboration
Mare Nostrum, BSC: Development
• No external connectivity: Use the push model
• No CVMFS mounted on the WNs: Use the Subset-CVMFS-Builder variation
• To get multi-core allocations: Use the BundleCE variation
[Diagram: (1) the BundleCE fetches waiting jobs from the DIRAC services; (2) it submits the applications through an ARC CE to the MareNostrum worker nodes, where the CVMFS Subset Builder has deployed the Subset CVMFS.]

LHCb-supercomputers collaboration
Mare Nostrum, BSC: Status

Done
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder

To be done
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation


Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 28: Integrating DIRAC work ows in Supercomputers

LHCb-supercomputers collaboration

SDumont LNCC

bull Ranked 277th in Top500 (Nov2020)

+ External connectivity from theWNs

+ CVMFS mounted on the WNs

+ Accessible via SSH (specialaccess)

bull 33856 cores (Nov 2020)

- Multi-core allocations 24 or 48logical cores per node

- Multi-node allocations 21nodes per allocation requiredby some queues

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 29: Integrating DIRAC work ows in Supercomputers

LHCb-supercomputers collaboration

SDumont LNCC Development

bull External Connectivity Use thepull model

bull Multi-core allocations Use thefat node partitioning variation

bull Multi-node allocations Use thesub-pilots variation

1 Submit pilots via ssh

SDumontWorker nodes

DIRAC

Services

Pilot-Factory

WaitingJobs

2 Fetchjobs

LRMS CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 30: Integrating DIRAC work ows in Supercomputers

LHCb-supercomputers collaboration

SDumont LNCC Status

Done

bull Exploit 2424 and 4848 coresper node

To be done

bull Multi-node allocation shouldhave results soon

bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 31: Integrating DIRAC work ows in Supercomputers

LHCb-supercomputers collaboration

Mare Nostrum BSC

bull Ranked 42nd in Top500 (Nov2020)

+ Accessible via a CE (ARC) andalso SSH

+ Single-core allocation possiblebut not recommended

bull 153216 cores (Nov 2020)

- No network connectivity

- CVMFS not mounted on theWNs

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 32: Integrating DIRAC work ows in Supercomputers

LHCb-supercomputers collaboration

Mare Nostrum BSC Development

bull No external connectivity Usethe push model

bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation

bull To get multi-core allocationsUse the BundleCE variation ARCCE

MareNostrumWorker nodes

DIRAC

Services Waiting

Jobs

1 Fetchjobs BundleCE

CVMFS SubsetBuilder

2 Submitapps

Subset CVMFS

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration
Page 33: Integrating DIRAC work ows in Supercomputers

LHCb-supercomputers collaboration

Mare Nostrum BSC Status

Done

bull Prototype to run simplesubmissions (Hello World)

bull First version of theSubset-CVMFS-Builder

To be done

bull CE configuration to run jobswithin Singularity

bull BundleCE to aggregatemultiple jobs in an allocation

Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121

  • Introduction
  • Table of Contents
  • DIRAC WMS on Grid Sites
  • DIRAC WMS and Supercomputers
  • Tackling the distributed computing challenges
  • References
  • LHCb-supercomputers collaboration