
HPC@UAntwerp introduction
Stefan Becuwe, Franky Backeljauw, Kurt Lust, Bert Tijskens, Carl Mensch, Robin Verschoren
Version Spring 2021
vscentrum.be

Introduction to the VSC


VSC – Flemish Supercomputer Center

➢ Vlaams Supercomputer Centrum (VSC)

➢ Established in December 2007

➢ Partnership between 5 university associations: Antwerp, Brussels, Ghent, Hasselt, Leuven

➢ Funded by the Flemish Government through the FWO (Research Foundation – Flanders)

➢ Goal: make HPC available to all researchers in Flanders (academic and industrial)

➢ Local infrastructure (Tier-2) in Antwerp:

HPC core facility CalcUA

VSC: An integrated infrastructure according to the European model

[Diagram: the European pyramid model of HPC infrastructure]
o Tier-0: European infrastructure (PRACE)
o Tier-1: national/regional infrastructure
o Tier-2: regional/university infrastructure (the VSC clusters)
o Tier-3: desktop
The diagram also shows the shared VSC network, storage and job management layer, and the links to BEgrid and EGI.


UAntwerp hardware

leibniz (2017)
• 152 compute nodes
  o 144 nodes with 128 GB RAM, 8 nodes with 256 GB RAM
  o 2 2.4 GHz 14-core Intel Broadwell CPUs (E5-2680v4) per node
  o no swap, 10 GB local tmp
• 2 GPU compute nodes (dual NVIDIA Tesla P100)
• 1 NEC SX-Aurora node
• 2 login nodes
• 1 visualisation node (hardware-accelerated OpenGL)

vaughan (2020)
• 104 compute nodes
  o 256 GB RAM
  o 2 2.35 GHz 32-core AMD Rome CPUs (7452) per node
  o no swap, 10 GB local tmp
• 2 login nodes

➢ The storage is a joint setup for leibniz and vaughan with over 650 TB of capacity spread over a number of volumes

The processor

                            leibniz                          vaughan
CPU                         Intel Broadwell E5-2680v4        AMD Rome 7452
Base clock speed            2.4 GHz                          2.35 GHz (-2%)
Cores/node                  28                               64 (+130%)
Vector instructions         AVX2+FMA                         AVX2+FMA
DP flops/core/cycle         16 (e.g., FMA instruction)       16 (e.g., FMA instruction) (+0%)
Memory speed                8*64b*2400MHz DDR4               16*64b*3200MHz DDR4 (+166%)
Theoretical peak/node       851/1075/1254 Gflops             2400 Gflops (base) (+152% to +182%)
                            (base/2.4GHz/turbo)
DGEMM Gflops (50k x 50k)    995 Gflops                       approximately 2 Tflops
GMPbench                    3230                             4409 (turbo mode)

➢ Note that the core clock speed no longer increases; the number of cores does

➢ Expect similar to slightly higher IPC on vaughan

➢ No major changes to the instruction set this time


Tier-1 infrastructure

➢ inaugurated October 2016

➢ 580 compute nodes with 2 14-core Broadwell processors

128 (435 nodes) or 256 (145 nodes) GB memory, EDR-IB network

o nodes are essentially the same as on Leibniz, but the interconnect is a half generation older

➢ 408 compute nodes with 2 14-core Skylake processors (621/841 Tflops theoretical peak performance, base/turbo)

➢ Approximately 1.2 PB net storage capacity, peak bandwidth 40 GB/s

➢ Successor end of 2020 – early 2021, Broadwell nodes will likely be decommissioned then

➢ BrENIAC (KU Leuven): # 196 in the Top 500 list of June 2016


Characteristics of an HPC cluster

➢ Shared infrastructure, used by multiple users simultaneously

o So you have to request the amount of resources you need

o And you may have to wait a little

➢ Expensive infrastructure

o So software efficiency matters

o Can’t afford idling for user input

➢ Therefore mostly for batch applications rather than interactive applications

➢ Built for parallel jobs. No parallelism = no supercomputing. Not meant for running a single single-core job.

➢ Remote use model

o But you’re running (a different kind of) remote applications all the time on your phone or web browser…

➢ Linux-based

o so no Windows or macOS software,

o but many standard Linux applications will run.


Getting a VSC account

Tutorial Chapter 2

SSH keys

➢ Almost all communication with the cluster happens through SSH = Secure SHell

o Protocol to log in to a remote computer

o Protocol to transfer files (sftp)

o Protocol to tunnel IP traffic to/from a remote host through firewalls

➢ We use public/private key pairs rather than passwords to enhance security

o Private key is never passed over the internet if you do things right

➢ Keys:

o The private key is the part of the key pair that stays on your local machine and that you have to keep very secure

▪ Protect with a passphrase!

o The public key is put on the computer you want to access

o A hacker can do nothing with your public key without the private key

▪ And can do nothing with your private key file without the passphrase


Required software – Options on Windows

➢ Windows 10 version 1803 and later come with a Linux-style ssh client in PowerShell

o Works as the Linux client

o In combination with other ssh ports, getting the right permissions on .ssh may be a bit tricky

o Combine with the Windows Terminal (store app)

➢ PuTTY is a popular GUI SSH client

➢ MobaXterm combines a SSH/SFTP client, X server and VNC server in one

o Free edition (“Home Edition”) with limited number of simultaneous connections

o Professional version with unlimited number of connections (€ 60/year)

o Uses PuTTY keys

➢ Cygwin Linux emulation libraries

➢ Windows Subsystem for Linux (WSL)

o Available since Windows 10 version 1709; WSL 2 is available since Windows 10 version 2004

o Enable in “Additional Windows components”

o Install your favourite (often free) Linux distribution from the Windows store

Required software – macOS and Linux

➢ macOS

o SSH client included with the OS

o Terminal package also included with the OS

o iTerm2 is a free and more sophisticated terminal emulator

o XQuartz with local xterms is also an option

➢ Linux

o SSH client included with the OS (though you may need to install it as a separate package)

o Choice of terminal emulation packages


Keep safe!

Create SSH keys

➢ On Windows: PuTTY with puttygen key generator:

o But: one file containing both the public and private key (extension .ppk), which stays on the PC

o And a second file with the public key (extension .pub)

o Can convert to and from OpenSSH key files

➢ ssh-keygen (OpenSSH) creates:

o /home/<username>/.ssh/id_rsa : this is your private key, it stays on your local machine

o /home/<username>/.ssh/id_rsa.pub : this is your public key, it is uploaded to the VSC user directory (LDAP)

Create SSH keys
A technical note

➢ We currently support two cryptography technologies for keys at the VSC

o RSA, but it should use at least 4096 bits: ssh-keygen -t rsa -b 4096 (see the example below)

o ed25519 keys

➢ Our advice is to stick to RSA 4096 bit keys. Some client software does not support ed25519 keys and even longer RSA keys are also often not supported.

➢ Some SSH implementations use a different file format (PuTTY, SSH2), and those keys may need conversion.


Register your account

➢ Web-based procedure

your VSC username is vsc20xxx

Connecting to the HPC infrastructure

Tutorial Chapter 3

Tutorial Chapter 6


A typical workflow

1. Connect to the cluster
2. Transfer your files to the clusters
3. Select software and build your environment
4. Define and submit your job
5. Wait while
   ➢ your job gets scheduled
   ➢ your job gets executed
   ➢ your job finishes
6. Move your results

Connecting to the cluster

➢ Required software:

o See slides with SSH implementations

o When connecting from outside Belgium, you need a VPN client to connect to the UAntwerpen network first (University service, not provided by CalcUA/VSC)

▪ Use the Cisco AnyConnect client from vpn.uantwerpen.be or the iOS/Android stores. OS-builtin clients fail to connect from some sites.

➢ A cluster has

o login nodes: The part of the cluster to which you connect and where you launch your jobs

o compute nodes: The part of the cluster where the actual work is done

o storage and management nodes: As a regular user, you don’t get on them

➢ Generic names of the login nodes:

o VSC standard: login.hpc.uantwerpen.be ⇒ Leibniz

o Leibniz: login-leibniz.hpc.uantwerpen.be

o Vaughan: login-vaughan.hpc.uantwerpen.be (currently limited access)


Connecting to the cluster

➢ Windows with PuTTY:

o Open PuTTY

o Follow the instructions on the VSC web site to fill in all fields

➢ OpenSSH-derived clients

o Open a terminal

▪ Windows 10: PowerShell or a bash command prompt in a terminal emulator

▪ macOS: Terminal which comes with macOS or iTerm2

▪ Ubuntu: Applications > Accessories > Terminal or CTRL+Alt+T

o Login via secure shell:

(if your private key file has the standard filename; otherwise point to it with -i)

$ ssh vsc20xxx@login.hpc.uantwerpen.be
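If the key is stored under a non-default name, a sketch of the same login might look like this (id_rsa_vsc is only an illustrative file name):

$ ssh -i ~/.ssh/id_rsa_vsc vsc20xxx@login.hpc.uantwerpen.be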

The login nodes

➢ Currently restricted access

o only accessible from within Belgium

o external access from elsewhere requires using the UA-VPN service to connect to the Campus network

➢ Use for:

o editing and submitting jobs, also via some GUI applications

o small compile jobs – submit large compiles as a regular job

o visualisation (we even have a special login node for that purpose)

Do not use the login nodes for running your jobs


The login nodes

➢ In rare cases you may need specific login names rather than the generic ones:

o login1-leibniz.hpc.uantwerpen.be (and login2-)

o viz1-leibniz.hpc.uantwerpen.be (for hardware accelerated visualisation using TurboVNC)

o login1-vaughan.hpc.uantwerpen.be (and login2-)

➢ Some cases where you may need those:

o when using VNC for GUI access

o when using the screen or tmux commands and you want to reconnect

o Eclipse remote projects

➢ These are names for external access. Once on the cluster, you can also use internal names:

o ln1.leibniz.antwerpen.vsc (and ln2)

o viz1.leibniz.antwerpen.vsc

o login1.vaughan.antwerpen.vsc (and login2)

➢ These names can also be used to connect from other VSC clusters

A typical workflow

1. Connect to the cluster
2. Transfer your files to the clusters
3. Select software and build your environment
4. Define and submit your job
5. Wait while
   ➢ your job gets scheduled
   ➢ your job gets executed
   ➢ your job finishes
6. Move your results


Transfer your files

➢ sftp protocol – ftp-like protocol based on ssh, using the same keys

➢ Graphical ssh/sftp file managers for Windows, macOS, Linux

o Windows: PuTTY, WinSCP (PuTTY keys), MobaXterm (PuTTY keys)

o Multiplatform: FileZilla (OpenSSH keys)

o macOS: CyberDuck

➢ Command line: scp - secure copy

o copy from the local computer to the cluster

o copy from the clusters to the local computer

➢ Consider Globus (globus.org) for big transfers

o Requires software on the computer from/to where you want to transfer files

o Supports the data and scratch file systems on the UAntwerp clusters

$ scp file.ext vsc20xxx@login.hpc.uantwerpen.be:

$ scp vsc20xxx@login.hpc.uantwerpen.be:file.ext .
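Two more sketches along the same lines (assuming the OpenSSH scp and sftp clients; "results" is just an illustrative directory name):

$ scp -r results vsc20xxx@login.hpc.uantwerpen.be:   # recursively copy a whole directory to the cluster
$ sftp vsc20xxx@login.hpc.uantwerpen.be              # or start an interactive sftp session using the same SSH keys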

Directories

➢ /scratch/antwerpen/2xx/vsc2xxyy ($VSC_SCRATCH)
  o fast scratch for temporary files
  o per cluster/site
  o highest performance, especially for bigger files
  o highest capacity (0.6 PB)
  o does not support hard links

➢ /data/antwerpen/2xx/vsc2xxyy ($VSC_DATA)
  o to store long-term data
  o exported to other VSC sites
  o slower and much smaller (60 TB) than /scratch
  o prefer no jobs running in this directory unless explicitly told to do so

➢ /user/antwerpen/2xx/vsc2xxyy ($VSC_HOME)
  o only for account configuration files
  o exported to other VSC sites
  o never run your jobs in this directory!
  o very small, but good performance for its purpose
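A quick illustration of using these environment variables on a login node (my_project is only an example name):

$ echo $VSC_SCRATCH                   # shows your personal scratch directory
$ mkdir -p $VSC_SCRATCH/my_project    # create a work directory on the scratch file system
$ cd $VSC_SCRATCH/my_project          # run jobs from here rather than from $VSC_HOME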


Directories (2)

➢ /scratch uses a parallel filesystem (BeeGFS), hence is more optimized for large files.

➢ /data and /user use a standard Linux file system (xfs) and have better performance for small files (a few kB), which is why we suggest installing packages for Python, Perl, R, etc. on /data.

o The underlying storage technology of /user is very expensive, hence it should only be used for what it is meant for

➢ Some fancy software installers still think it is a good idea to use a lot of hard links. These don't work on /scratch

o So Conda software should be installed on /data, as it will not work on /scratch unless you force conda to use symbolic links instead

➢ Currently we do take a backup of /data, but in the future we may exclude certain directories if the volume and number of files becomes too large

o Conda-installed pythons are an example of directories that may be excluded as every conda installation tries to install half an OS.

Directories: Quota

➢ Block quota: Limits the amount of data you can store on a file system

➢ File quota: Limits the number of files you can store on a file system

➢ Current defaults:

➢ Quota are shown when you log on to the login nodes. (Well, they will be shown again once we re-implement this functionality for the new storage.)

            block quota    file quota
  /scratch  50 GB          100,000?
  /data     25 GB          100,000
  /home     3 GB           20,000


Globus

➢ Service to transfer large amounts of data between computers

o It is possible to initiate a direct transfer between two remote computers from your laptop (no software needed on your laptop except for a recent web browser)

o It is also possible to initiate a transfer to your laptop

▪ Globus software needed on your laptop

▪ After disconnecting, the transfer will be resumed automatically

➢ Via web site: globus.org

o It is possible to sign in with your UAntwerp account

o You can also create a Globus account and link that to your UAntwerp account

➢ You do need a UAntwerp account to access data on our servers

o Data sharing features not enabled in our license

➢ Collection to look for in the Globus web app: VSC UAntwerpen Tier2

o From there access to /data and /scratch

o Note: VSC is also Vienna Scientific Cluster

Globus (2)

➢ Use for

o Transfer large amounts of data to/from your laptop/desktop

▪ Advantage over sftp: auto-restart

o Transfer large amounts of data to a server in your department

▪ Working on obtaining a campus license so that all departments can install a fully functional version

▪ Advantage over sftp: You can manage the transfer from your laptop/desktop/smartphone

o Transfer data between different scratch volumes on VSC clusters

o Transfer data between scratch volumes of other VSC clusters and your data volume

▪ Often more efficient than using Linux cp.

o Transfer data to/from external sources


Best practices for file storage

➢ The cluster is not for long-term file storage. You have to move back your results to your laptop or server in your department.

o There is a backup for the home and data file systems. We can only continue to offer this if these file systems are used in the proper way, i.e., not for very volatile data

o Old data on scratch can be deleted if scratch fills up (so far this has never happened at UAntwerp, but it is done regularly on some other VSC clusters)

➢ Remember that the cluster is optimised for parallel access to big files, not for using tons of small files (e.g., one per MPI process).

o If you run out of block quota, you'll get more without needing much justification (at least on /scratch)

o If you run out of file quota, you’ll have a lot of explaining to do before getting more.

➢ Text files are good for summary output while your job is running, or data that you’ll read into a spreadsheet, but not for storing 1000x1000-matrices. Use binary files for that!

Running batch jobs
Building your environment

Tutorial Chapter 4


A typical workflow

1. Connect to the cluster
2. Transfer your files to the clusters
3. Select software and build your environment
4. Define and submit your job
5. Wait while
   ➢ your job gets scheduled
   ➢ your job gets executed
   ➢ your job finishes
6. Move your results

System software

➢ Operating system : CentOS 7.8 on leibniz, CentOS 8.2 on vaughan

o Red Hat Enterprise Linux (RHEL) 7/8 clone

o Leibniz and vaughan will be kept in sync as much as possible. Leibniz will be upgraded to CentOS 8 in 2021

➢ Resource management and job scheduling on leibniz:

o Torque : resource manager (based on PBS)

o MOAB : job scheduler and management tools

➢ Resource management/scheduler on vaughan: SLURM

o Leibniz will follow in 2021.


Development software

➢ C/C++/Fortran compilers: Intel Parallel Studio XE Cluster Edition and GCC

o Intel compiler comes with several tools for performance analysis and assistance in vectorisation/parallelisation of code

o Fairly up-to-date OpenMP support in both compilers

o Support for other shared memory parallelisation technologies such as Cilk Plus, other sets of directives, etc. to a varying level in both compilers

o PGAS Co-Array Fortran support in Intel Compiler (and some in GNU Fortran)

o Due to some VSC policies, support for the latest versions may be lacking, but you can request them if you need them

➢ Message passing libraries: Intel MPI (MPICH-based), Open MPI (standard and Mellanox)

➢ Mathematical libraries: Intel MKL, OpenBLAS, FFTW, MUMPS, GSL, …

➢ Many other libraries, including HDF5, NetCDF, Metis, ...

➢ Python

Application software

➢ ABINIT, CP2K, OpenMX, QuantumESPRESSO, Gaussian

➢ VASP, CHARMM, GAMESS-US, Siesta, Molpro, CPMD

➢ Gromacs, LAMMPS, NAMD2

➢ Telemac, FINE-Marine

➢ OMNIS/LB

➢ SAMtools, TopHat, Bowtie, BLAST, MaSuRCA

➢ Gurobi

➢ Comsol, R, MATLAB, …

➢ Often multiple versions of the same package

Additional software can be installed on demand


Application software
The fine print

➢ Restrictions on commercial software

o 5 licenses for Intel Parallel Studio

o MATLAB: Toolboxes limited to those in the campus agreement

o Other licenses paid for and provided by the users, owner of the license decides who can use the license (see next slide)

➢ Typical user applications don’t come as part of the system software, but

o Additional software can be installed on demand

o Take into account the system requirements

o Provide working building instructions and licenses for a cluster. RPMs rarely work on a cluster.

o Since we cannot have domain knowledge in all science domains, users must do the testing with us

➢ Compilation support is limited

o Best effort, no code fixing

o An issue nowadays as too many packages are only tested with a single compiler…

➢ Installed in /apps/antwerpen

Using licensed software

➢ VSC or campus-wide license: You can just use the software

o Some restrictions may apply if you don’t work at UAntwerp but at some of the institutions we have agreements with (ITG, VITO) or at a company

➢ Restricted licenses

o Typically licenses paid for by one or more research groups, sometimes just other license restrictions that must be respected

o Talk to the owner of the license first. If no one in your research group is using a particular package, your group probably doesn’t contribute to the license cost.

o Access controlled via UNIX groups

▪ E.g., avasp : VASP users from UAntwerp

▪ To get access: Request group membership via account.vscentrum.be, “New/Join group” menu item. The group moderator will then grant or refuse access


Building your environment: The module concept

➢ Dynamic software management through modules

o Allows you to select the version that you need and avoids version conflicts

o Loading a module sets some environment variables :

▪ $PATH, $LD_LIBRARY_PATH, ...

▪ Application-specific environment variables

▪ VSC-specific variables that you can use to point to the location of a package (typically start with EB)

o Automatically pulls in required dependencies

➢ Software on our system is organised in toolchains

o System toolchain: Some packages are compiled against system libraries or installed from binaries

o Compiler toolchains: Most packages are compiled from source with up-to-date compilers for better performance

o Packages in the same compiler toolchain will interoperate and can be combined with packages from the system toolchain

Building your environment: Toolchains

➢ Compiler toolchains:

o “intel” toolchain: Intel and GNU compilers, Intel MPI and math libraries.

o “foss” toolchain: GNU compilers, Open MPI, OpenBLAS, FFTW, ...

o “GCCcore”: modules work with both a specific intel and specific foss toolchain

o VSC-wide and refreshed twice per year with more recent compiler versions: 2016b, 2017a, ...

▪ We sometimes do silent updates within a toolchain at UAntwerp

➢ intel/2018a, intel/2018b, intel/2019b and intel/2020a are now our main toolchains and kept synchronous on all systems.

o We have skipped 2017b at UAntwerp as it offers no advantages over our 2017a toolchain

o And skipped 2019a because there were too many technical problems

o Software that did not compile correctly with the 2017a or newer toolchains or showed a large performance regression, is installed in the 2016b toolchain

o 2020a is the base toolchain for CentOS 8.


Building your environment: The module command

➢ Module name is typically of the form: <name of package>/<version>-<toolchain info>-<additional info>, e.g.

o intel/2020a : The Intel toolchain package, version 2020a

o Boost/1.73.0-intel-2020a : The Boost libraries, version 1.73.0, built with the intel/2020a toolchain

o SuiteSparse/5.7.1-intel-2020a-METIS-5.1.0 : The SuiteSparse library, version 5.7.1, built with the Intel compiler, and using METIS 5.1.0.

▪ In <additional info>, we don't add all additional components, only those that matter to distinguish between different available modules for the same package.

➢ No two modules with the same name can be loaded simultaneously

Building your environment: Cluster modules and application modules

➢ Cluster modules:

o calcua/supported: Enables all currently supported application modules: up to 4 toolchain versions and the system toolchain modules

o calcua/2020a: Enables only the 2020a compiler toolchain modules and the system toolchain modules

o So it is a good practice to always load the appropriate cluster module first before loading any other module!

o On leibniz, leibniz/supported is a synonym for calcua/supported, similarly on vaughan


Building your environment: Cluster modules and application modules

➢ 3 types of application modules

o Built with a specific version of a compiler toolchain, e.g., 2020a

▪ Modules in a subdirectory that contains the toolchain name

▪ Try to support a toolchain for 2 years

o Applications installed in the system toolchain (compiled with the system compilers)

▪ Modules in subdirectory system

▪ For tools that don’t take much compute time or are needed to bootstrap the regular toolchain

o Generic binaries for 64-bit Intel-compatible systems

▪ Typically applications installed from binary packages

▪ Modules in subdirectory software-x86_64

Building your environment: Discovering software

➢ module av : list all available modules
  o Depends on the cluster modules you have loaded
➢ module av matlab : list all available modules whose name contains "matlab" (case-insensitive)
➢ module spider : list installed software packages
  o Does not depend on the cluster module you have loaded
➢ module spider matlab : search for all modules whose name contains "matlab" (not case-sensitive)
➢ module spider MATLAB/R2017a : display additional information about the MATLAB/R2017a module, including other modules you may need to load
➢ module keyword matlab : uses information in the module definition to find a match
  o Does not depend on the cluster module you have loaded
➢ module help : display help about the module command
➢ module help baselibs : show help about a given module
  o Specify a version, e.g., baselibs/2020a, to get the most specific information
➢ module whatis baselibs : shows information about the module, but less than help
  o But this is the information that module keyword uses
➢ module -t av |& sort -f : produce a sorted, case-insensitive list of all available modules
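A short sketch of how these commands are typically combined (calcua/2020a and HDF5 are just the examples used elsewhere in this text):

$ module load calcua/2020a    # enable a cluster module first
$ module av hdf5              # what is available within the loaded cluster module?
$ module spider HDF5          # what is installed overall, independent of the loaded cluster module?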


Building your environment: Enabling and disabling packages

➢ Loading cluster modules and packages:

o module load calcua/2020a : load a cluster module (list of modules)

o module load MATLAB/R2020a : load a specific version

▪ Package has to be available before it can be loaded!

o module load MATLAB : load the default version (usually the most recent one)

▪ But be prepared for changes if we change calcua/supported!

➢ Unloading packages:

o module unload MATLAB : unload MATLAB settings

▪ Does not automatically unload the dependencies

o module purge : unload all modules (incl. dependencies and cluster module)

➢ module list : list all loaded modules in the current session

➢ Shortcut:

o ml : without arguments, a shortcut for module list
o ml calcua/2020a : with arguments, loads the given module(s)
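Putting it together, a typical session might look like this (the module versions are only those used as examples above):

$ module load calcua/2020a       # always load a cluster module first
$ module load MATLAB/R2020a      # then load the applications you need
$ module list                    # check what is currently loaded
$ module purge                   # unload everything and start from a clean slate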

Modules and reproducibility

➢ On supercomputers, most software is recompiled at installation to ensure proper optimization for the system.

➢ Most supercomputer centres use a specialised installation tool to manage the installation process and the module system. The two most popular ones are EasyBuild and Spack.

o We use EasyBuild

o Those tools work with “recipes” that provide all installation instructions in a concise way

➢ What we do:

o We keep the source files that we used when installing software

o The recipe is stored with the software for all software installed through EasyBuild and accessible to users (not on all VSC systems).

▪ E.g., $EBROOTGROMACS/easybuild contains the recipe and the installation log file which will tell you which version of EasyBuild was used.

o Even after removal from the system, we keep that information on backup for a number of years.


The myth of absolute reproducibility

➢ Algorithms that rely on random numbers can only reproduce statistics

➢ Floating point computations are not 100% reproducible

o Different hardware has a different roundoff error pattern

o Different orders of operations that are mathematically equivalent are not equivalent in floating-point arithmetic, and this is an issue with all parallel software

➢ Not all algorithms are designed to be reproducible in a parallel environment and may not even produce exactly the same result twice on the same hardware

➢ The operating system may also influence the results that your software generates as all software uses some OS-provided libraries

o And there is no way that we can stop OS upgrades as this has severe security implications

➢ Even a cloud environment would not give you complete reproducibility

o You can reproduce the exact same OS environment (by storing your virtual machine)

o But even a virtual machine cannot abstract away from processor behaviour unless you would emulate the processor

o And even then, non-deterministic behaviour in a parallel environment remains an issue.

Exercises (block1)

➢ Link to the VSC tutorials: go to calcua.uantwerpen.be/support and look for the HPC tutorials

➢ Getting ready (see also §3.1):

o Log on to the cluster

o Go to your scratch directory: cd $VSC_SCRATCH

o Copy the examples for the tutorial to that directory: type cp -r /apps/antwerpen/tutorials/Intro-HPC/examples .

➢ Follow the text from the tutorial in §4.1 “Modules” and try out the module command.

➢ Load the cluster module calcua/2019b and check the difference between module av and module spider.

o Try to find modules for the HDF5 library. Did you notice the difference between module av and module spider?

➢ Which module offers support for VNC? Which command does that module offer?

➢ Which module(s) offer the CMake tool?

➢ What is the difference between the HDF5/1.8.20-intel-2018a-MPI and HDF5/1.8.20-intel-2018a-noMPI modules? (Well, you can guess it from the name, but how can you find more information?)


Running batch jobs
Defining and submitting jobs

Tutorial Chapter 4

A typical workflow

1. Connect to the cluster
2. Transfer your files to the clusters
3. Select software and build your environment
4. Define and submit your job
5. Wait while
   ➢ your job gets scheduled
   ➢ your job gets executed
   ➢ your job finishes
6. Move your results


Why batch?

➢ A cluster is a large and expensive machine

o So the cluster has to be used as efficiently as possible

o Which implies that we cannot lose time waiting for input as in an interactive program

➢ And few programs can use the whole capacity (also depends on the problem to solve)

o So the cluster is a shared resource, each simultaneous user gets a fraction of the machine depending on his/her requirements

➢ Moreover there are a lot of users, so one sometimes has to wait a little.

➢ Hence batch jobs (script with resource specifications) submitted to a queueing system with a scheduler to select the next job in a fair way based on available resources and scheduling policies.

#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load calcua/supported
module load MATLAB
cd "$PBS_O_WORKDIR"
matlab -r fibo

Job workflow with Torque & Moab
What happens behind the scenes

[Diagram] Users submit jobs and query the cluster. Torque provides the queue manager and the resource manager (a server plus a client on every compute node); Moab is the scheduler that applies the scheduling policy.

Don't worry too much who does what…


Coming up: Job workflow in Slurm
What happens behind the scenes

#!/bin/bash
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
module purge
module load calcua/supported
module load MATLAB

matlab -r fibo

[Diagram] Users submit jobs and query the cluster through SLURM, which combines the partition manager, the resource manager (a server plus a client on every compute node) and a scheduler plugin that applies the scheduling policy.

The job script

➢ Specifies the resources needed for the job and other instructions for the resource manager:

o Number and type of CPUs

o Amount of compute time

o Amount of (RAM) memory

o Output files (stdout and stderr) – optional

o Can instruct the resource manager to send mail when a job starts or ends

➢ Creates the environment

o Script runs in your home directory, so don’t forget to go to the appropriate directory

o Load the modules you need for your job

➢ Then does the actual work

o Then specify all the commands you want to execute

o And these are really all the commands that you would also do interactively or in a regular script to perform the tasks you want to perform


The job script (Torque)

#!/bin/bash
#PBS -L tasks=1:lprocs=7:memory=10gb:swap=10gb
#PBS -l walltime=1:00:00
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID

module load calcua/supported
module load MATLAB
cd "$PBS_O_WORKDIR"

matlab -r fibo

o This is a bash script (but could be, e.g., Perl)
o The #PBS lines specify the resources for the job (and some other instructions for the resource manager), but look like comments to bash
o The module and cd lines build a suitable environment for your job; the script runs in your home directory!
o The last line contains the command(s) that you want to execute in your job

The job script (Slurm)

#!/bin/bash
#SBATCH --ntasks=1 --cpus-per-task=7
#SBATCH --mem-per-cpu=1g
#SBATCH --time=1:00:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j

module purge
module load calcua/supported
module load MATLAB

matlab -r fibo

o This is a bash script (but could be, e.g., Perl)
o The #SBATCH lines specify the resources for the job (and some other instructions for the resource manager), but look like comments to bash
o The module lines build a suitable environment for your job; the script runs in the directory where you launched the batch job!
o The last line contains the command(s) that you want to execute in your job


Specifying resources in job scripts (Torque)

example.sh:

#!/bin/bash                                      ← the shell (language) in which the script is executed
#PBS -N name_of_job                              ← "name" of the job (optional)
#PBS -L tasks=1:lprocs=1:memory=1gb:swap=1gb     ← resource specifications (e.g., tasks, processors per task), amount of resident and amount of virtual memory per task
#PBS -l walltime=01:00:00                        ← requested running time (required)
#PBS -m e                                        ← mail notification at begin, end or abort (optional)
#PBS -M <your e-mail address>                    ← your e-mail address (should always be specified)
#PBS -o stdout.file                              ← redirect standard output (-o) and standard error (-e);
#PBS -e stderr.file                                optional, if omitted a generic filename will be used

cd "$PBS_O_WORKDIR"                              ← change to the current working directory (optional, if it is the directory where you submitted the job)
module load vsc-tutorial                         ← load modules you might need

./hello_world                                    ← the program itself

$ qsub example.sh

Specifying resources in job scripts (Slurm)

example.sh:

#!/bin/bash                                  ← the shell (language) in which the script is executed
#SBATCH --job-name=name_of_job               ← "name" of the job (optional)
#SBATCH --ntasks=1 --cpus-per-task=1         ← resource specifications (e.g., tasks, processors per task), amount of memory per processor
#SBATCH --mem-per-cpu=1g
#SBATCH --time=01:00:00                      ← requested running time (required)
#SBATCH --mail-type=FAIL                     ← mail notification at begin, end or abort (optional)
#SBATCH --mail-user=<your e-mail address>    ← your e-mail address (should always be specified)
#SBATCH -o stdout.file                       ← redirect standard output (-o) and standard error (-e);
#SBATCH -e stderr.file                         optional, if omitted a generic filename will be used

module purge                                 ← make sure you reload the environment, as otherwise you may see strange behaviour of MPI programs
module load calcua/supported vsc-tutorial

./hello_world                                ← the program itself

$ sbatch example.sh


Specifying resources in job scripts (Torque and Slurm)

example.sh (Torque), submit with qsub example.sh:

#!/bin/bash
#PBS -N name_of_job
#PBS -L tasks=1:lprocs=1:memory=1gb:swap=1gb
#PBS -l walltime=01:00:00
#PBS -m e
#PBS -M <your e-mail address>
#PBS -o stdout.file
#PBS -e stderr.file

cd "$PBS_O_WORKDIR"
module load vsc-tutorial

./hello_world

example.sh (Slurm), submit with sbatch example.sh:

#!/bin/bash
#SBATCH --job-name=name_of_job
#SBATCH --ntasks=1 --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your e-mail address>
#SBATCH -o stdout.file
#SBATCH -e stderr.file

module purge
module load calcua/supported vsc-tutorial

./hello_world

IntermezzoWhy do Torque commands start with PBS?

➢ In the ’90s of the previous century, there was a resource manager called Portable Batch System, developed by a contractor for NASA. This was open-sourced.

➢ But that company was acquired by another company that then sold the rights to Altair Engineering that evolved the product into the closed-source product PBSpro (now open-source again since the summer of 2016).

➢ And the open-source version was forked by what is now Adaptive Computing and became Torque. Torque remained open source. Torque = Terascale Open-source Resource and QUEue manager. The name was changed, but the commands remained. Torque has been closed source again since June 2018.

➢ Likewise, Moab evolved from MAUI, an open-source scheduler. Adaptive Computing, the company behind Moab, contributed a lot to MAUI but then decided to start over with a closed source product. They still offer MAUI on their website though.

o MAUI was widely used in large USA supercomputer centres, but most now throw their weight behind SLURM with or without another scheduler.


Specifying resources in job scripts: Two versions of the syntax

Two versions of the syntax for specifying CPU and memory:

➢ Version 1 is what you’ll find in most job scripts of colleagues: -l

o Old and buggy implementation that doesn’t reflect well on modern hardware and cluster software

o Was the only option on the previous version of the OS on our clusters

➢ Version 2 is the version we encourage to use at Antwerp: -L

o It is less buggy

o It better mirrors modern cluster hardware as it supports the hierarchy in processing elements on a cluster node (hardware thread, core, NUMA node, socket)

o It was created for use with more advanced resource control mechanisms in new Linux kernels

o Resembles more the ideas behind SLURM, our next resource manager/scheduler which we are using on vaughan and will later implement on leibniz also

Specifying resources in job scripts: Version 2 concepts

➢ Task: A job consists of a number of tasks.

o Think of it as a process (e.g., one rank of a MPI job)

o Must run in a single node (exception: the Superdome Flex shared memory system @ KU Leuven). In fact: must run in a single OS instance

➢ Logical processor: Each task uses a number of logical processors

o Hardware threads or cores

o Specified per task

o In a system with hyperthreading enabled, an additional parameter allows to specify whether the system should use all hardware threads or run one lproc per core.

➢ Placement: Set locality of hardware resources for a task

➢ Feature: Each compute node can have one or more special features. We use this, e.g., for nodes with more memory


Specifying resources in job scripts: Version 2 concepts

➢ Physical/resident memory: RAM per task

o In our setup, not enforced on a per-task but per-node basis

o Processes won’t crash if you exceed this, but will start to use swap space

➢ Virtual memory: RAM + swap space, again per task

o In our setup, not enforced on a per-task but per-node basis

o Processes will crash if you exceed this.

➢ Task group: A group of related tasks, with the same number of cores per task and same amount of memory per task

o E.g., a MPI job

o A job can consist of multiple task groups

▪ Scenario: in-situ visualisation where a simulation code and visualisation code run hand-in-hand

Specifying resources in job scripts
Version 2 syntax: Cores

➢ Specifying the number of tasks and lprocs:
  #PBS -L tasks=4:lprocs=7

o Requests 4 tasks using 7 lprocs each (cores in the current setup of hopper and leibniz)

o The scheduler should be clever enough to put these 4 tasks on a single 28-core node and assign cores to each task in a clever way

o Each -L line is a task group

➢ Some variants:

o #PBS -L tasks=1:lprocs=all:place=node : Will give you all cores on a node.

▪ Does not work without place=node!

o #PBS -L tasks=4:lprocs=7:usecores : Stresses that cores should be used instead of hardware threads


Specifying resources in job scripts
Version 2 syntax: Memory

➢ Adding physical memory:
  #PBS -L tasks=4:lprocs=7:memory=20gb

o Requests 4 tasks using 7 lprocs each and 20gb of memory per task

➢ Adding virtual memory:
  #PBS -L tasks=4:lprocs=7:memory=20gb:swap=25gb

o Confusing name! This actually requests 25GB of virtual memory, not 25GB of swap space (which would mean 20GB+25GB=45GB of virtual memory).

o We have no swap space on our clusters, so it is best to take swap=memory (at least if there is enough physical memory).

o And in this case there is no need to specify memory, as the default is memory=swap:
  #PBS -L tasks=4:lprocs=7:swap=20gb

o Available memory: 112GB per node on leibniz, 240GB for the 256GB nodes of leibniz, old hopper nodes and nodes of vaughan

Specifying resources in job scripts
Version 2 syntax: Additional features

➢ Adding additional features:
  #PBS -L tasks=4:lprocs=7:swap=50gb:feature=mem256

o Requests 4 tasks using 7 lprocs each and 50GB virtual memory per task, and instructs the scheduler to use nodes with 256GB of memory so that all 4 tasks can run on the same node rather than using two regular nodes and leave the cores half empty.

o Most important features listed on the VSC documentation website (Tier-2 hardware, UAntwerp section)

o Specifying multiple features is possible, both in an “and” or an “or” combination (see the Torque manual, section “-L NUMA Resource Request”)
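A sketch of a complete version 2 resource header combining the pieces discussed above; the values are purely illustrative, not a recommendation, and the annotations follow the style used earlier in this text:

#!/bin/bash
#PBS -L tasks=4:lprocs=7:swap=20gb:feature=mem256   ← 4 tasks, 7 cores and 20 GB of virtual memory each, on a 256 GB node
#PBS -l walltime=12:00:00                           ← requested wall time (see further)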


Specifying resources in job scripts: Requesting cores or memory

➢ What happens if you don’t use all memory and cores on a node?

o Other jobs – but only yours – can use it

▪ so that other users cannot accidentally take resources away from your job or crash your job

▪ and so that you have a clean environment for benchmarking without unwanted interference

o You’re likely wasting expensive resources! Don’t complain you have to wait too long until your job runs if you don’t use the cluster efficiently.

➢ If you really require a special setup where no other jobs of yours run on the nodes allocated to you, contact user support (hpc@uantwerpen.be).

➢ If you submit a lot of single-core jobs that use more than 4 GB each on leibniz: please specify the "mem256" feature so that they get scheduled on nodes with more memory and also fill up as many cores as possible

o And a program requiring 16 GB of memory but using just a single thread does not make any sense on today’s computers

Specifying resources in job scripts: Single node per user policy

➢ Policy motivation :

o parallel jobs can suffer badly if they don’t have exclusive access to the full node as jobs influence each other (L3 cache, memory bandwidth, communication channels, ….)

o sometimes the wrong job gets killed if a user uses more memory than requested and a node runs out of memory

o a user may accidentally spawn more processes/threads and use more cores than expected (depends on the OS and how the process was started)

▪ typical case: a user does not realize that a program is an OpenMP or hybrid MPI/OpenMP program, resulting in running 28 processes with 28 threads each.

➢ Remember: the scheduler will (try to) fill up a node with several jobs from the same user, but could use some help from time to time

if you don’t have enough work for a single node,you need a good PC/workstation and not a supercomputer


Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting cores

➢ Used at the other VSC sites (though sometimes through wrappers with restrictions on another scheduler)

➢ #PBS -l nodes=2:ppn=28 : Request 2 nodes, using 28 cores (in fact, hardware threads) per node

o Or in the terminology of the Torque manual: 2 tasks with 28 lprocs (logical processors) each.

o Different node types have a different number of cores per node. On leibniz, the regular nodes have 28 cores but the old hopper nodes integrated in the cluster have only 20.

o Trivial translation to the new syntax:
  #PBS -L tasks=2:lprocs=28
  (even though this does not really follow the concepts, and you shouldn't do it with Slurm)

➢ #PBS -l nodes=2:ppn=28:broadwell : Request 2 nodes, using 28 cores per node, and those nodes should have the property “broadwell”

➢ See the “UAntwerpen clusters” page in the “Available hardware” section of the User Portal on the VSC website for an overview of properties for various node types.

Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting memory

➢ As with the version 2 syntax: 2 kinds of memory:

o Resident memory (RAM)

o Virtual memory (RAM + swap space)

➢ But now 2 ways of requesting memory:

o Specify quantity for the job as a whole and let Torque distribute evenly across tasks/lprocs (or cores): mem for resident memory and vmem for virtual memory (Torque manual advises against its use, but it often works best...).

o Specify the amount of memory per lproc/core: pmem for resident memory and pvmem for virtual memory.

▪ You should not use this for shared memory or hybrid applications as Torque has unexpected behaviour for these applications

➢ It is best not to mix between both options (total and per core). However, if both are specified, the Torque manual promises to actually take the least restrictive of (mem, pmem) for resident memory and of (vmem, pvmem) for virtual memory. However, the implementation is buggy and Torque doesn’t keep its promises...


Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting memory

➢ Guidelines:

o If you want to run a single multithreaded process: mem and vmem will work for you (e.g., Gaussian).

o If you want to run a MPI job with single-threaded processes (#cores = # MPI ranks): Any of the above will work most of the time, but remember that mem/vmem is for the whole job, not one node!

o Hybrid MPI/OpenMP job: Best to use mem/vmem

o However, in more complicated resource requests than explained here, mem/vmem may not work as expected either

➢ Source of trouble:

o Linux has two mechanisms to enforce memory limits:

▪ ulimit : Older way, per process on most systems

▪ cgroups: The more modern and robust mechanism, whole process hierarchy

o Torque sets ulimit -m and -v often to a value which is only relevant for non-threaded jobs, or even smaller (the latter due to another bug), and smaller than the (correct) values for the cgroups.

Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting memory

➢ Examples:

o #PBS -l pvmem=2gb : This job needs 2 GB of virtual memory per core

➢ See the “UAntwerpen clusters” page in the “VSC hardware – Tier-2 hardware” section of the VSC documentation for the usable amounts of memory for each node type (which is always less than the physical amount of memory in the node)

➢ As swapping makes no sense, it is best to set pvmem = pmem or mem = vmem, but you still need to specify both parameters.


Specifying resources in job scripts: Requesting compute time

➢ Same procedure for both versions of the resource syntax

➢ #PBS -l walltime=30:25:55 : This job is expected to run for 30h25m55s.
  #PBS -l walltime=1:6:25:55 : 1 day, 6 hours, 25 minutes and 55 seconds.

➢ What happens if I underestimate the time? Bad luck, your job will be killed when the wall time has expired.

➢ What happens if I overestimate the time? Not much, but…

o Your job may have a lower priority for the scheduler or may not run because we limit the number of very long jobs that can run simultaneously.

o The estimates the scheduler produces for the start time of other jobs will become even less accurate than they already are.

➢ Maximum wall time is 3 days (1 day on special nodes)

o But remember what we said in last week’s lecture about reliability?

o Longer jobs need a checkpoint-and-restart strategy

▪ And most applications can restart from intermediate data

Submitting jobs

➢ If all resources are specified in the job script, submitting a job is as simple as

  $ qsub jobscript.pbs

  qsub returns a unique jobid consisting of a unique number that you'll need for further commands, and a cluster name.

➢ But all parameters specified in the job script in #PBS lines can also be specified on the command line (and overwrite those in the job script), e.g.,

  $ qsub -L tasks=x:lprocs=y:swap=zgb example.pbs

  o x: the number of tasks
  o y: the number of logical cores per task (maximum: 20 for hopper, 28 for leibniz until we enable hyperthreading)
  o z: the amount of virtual memory needed per task
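For instance, a concrete invocation following this pattern could look as follows (the values are just an illustration):

$ qsub -L tasks=1:lprocs=28:swap=100gb example.pbs   # overrides the resource requests given in example.pbs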


Non-resource-related #PBS parameters
Redirecting I/O

➢ The output to stdout and stderr are redirected to files by the resource manager

o Place: by default in the directory where you submitted the job ("qsub test.pbs"), when the job finishes and if you have enough quota

o Default naming convention

▪ stdout: <job-name>.o<job-ID>

▪ stderr: <job-name>.e<job-ID>

➢ The default job name is the name of the job script. To change it:
  #PBS -N my_job : Change the job name to my_job

▪ stdout: my_job.o<job-ID>

▪ stderr: my_job.e<job-ID>

➢ Rename output files:
  #PBS -o stdout.file : Redirect stdout to the file stdout.file instead
  #PBS -e stderr.file : Redirect stderr to the file stderr.file instead

➢ Other files are put in the current working directory. This is your home directory when your job starts!

Non-resource-related #PBS parameters
Notifications

➢ The cluster can notify you by mail when a job starts, ends or is aborted.

o #PBS -m abe -M <your e-mail address> : send mail when the job is aborted (a), begins (b) or ends (e) to the given mail address.

o Without -M, mail will be sent to the e-mail address linked to your VSC account

o You may be bombarded with mails if you submit a lot of jobs, but it is useful if you need to monitor a job

➢ All other qsub command line arguments can also be specified in the job script, but these are the most useful ones.

o See the qsub manual page in the Torque manual.

example job scripts in /apps/antwerpen/examples


Environment variables in job scripts
Predefined variables

➢ The resource manager passes some environment variables

o $PBS_O_WORKDIR : directory from where the script was submitted

o $PBS_JOBID : the job identifier

o $PBS_NUM_PPN : when using resource syntax 1.0: number of cores/hardware threads per node available to the job, i.e., the value of the ppn parameter; otherwise 1

o $PBS_NP : total number of cores/hardware threads for the job

o $PBS_NODES, $PBS_JOBNAME, $PBS_QUEUE, ...

o Look for “Exported batch environment variables” in the Torque manual (currently version 6.1.2)

➢ Some VSC-specific environment variables

o Variables that start with VSC: Point to several file systems that you can use (see later), name of the cluster, etc.

o Variables that start with EB: installation directory of loaded modules
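A small sketch of how these variables are typically used inside a Torque job script (my_program is a placeholder for your own executable):

cd "$PBS_O_WORKDIR"                       # go back to the directory from which the job was submitted
echo "Job $PBS_JOBID runs on $PBS_NP cores on $HOSTNAME"
./my_program > output.$PBS_JOBID          # tag the output file with the job identifier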

Specifying resources in job scripts

➢ Full list of options in the Torque manual, search for “Requesting resources” and “-L NUMA Resource Request”

o We discussed the most common ones.

➢ Remember that available resources change over time as new hardware is installed and old hardware is decommissioned.

o Share your scripts

o But understand what those scripts do (and how they do it)

o And don’t forget to check the resources from time to time!

o And remember from last week’s lecture: The optimal set of resources also depends on the size of the problem you are solving!

Share experience


Queue manager

[The same workflow diagram as before: the job script is submitted to Torque, now highlighting the queue manager]

Queue manager: Torque

➢ Jobs are submitted to the queue manager

o The queue manager provides functionality to start, delete, hold, cancel and monitor jobs on the queue

➢ Some commands :

o qsub — submit a job

o qdel — delete a submitted (queued and/or running) job

o qstat — get status of your jobs on the system

➢ Under the hood:

o There are multiple queues, but qsub will place the job in the most suitable queue

o On our system, queues are based on the requested wall time

o There are limits on the number of jobs a user can submit to a queue or have running from one queue

▪ To avoid that a user monopolises the system for a prolonged time


Queue manager: Torque
The qdel command

➢ qdel always takes the numeric jobid as an argument (no need to add .leibniz)

  $ qdel <jobid>

➢ Extra command line parameters are described in the manual: if you really need them, it may be time to contact user support

Queue manager: Torque
The qstat command

➢ Check the status of your own jobs:
  $ qstat
  $ qstat -u vsc20XXX
  The latter shows more information

➢ Show a full status display of a job:
  $ qstat -f <jobid>
  $ qstat -f1 <jobid>

o The second variant doesn't split long lines in the output to fit in a terminal window

o It also shows the nodes assigned to a job once the job is running

➢ Show the nodes assigned to a job:
  $ qstat -n
  $ qstat -n <jobid>

➢ Many more options, check the manual page


Scheduler

[The same workflow diagram as before, now highlighting the scheduler (Moab)]

Job scheduling: Moab

➢ Job scheduler

o Decides which job will run next based on (continuously evolving) priorities

o Works with the queue manager

o Works with the resource manager to check available resources

➢ Tries to optimise (long-term) resource use, keeping fairness in mind

o The scheduler associates a priority number to each job

▪ the highest priority job will (usually) be the next one to run, not first come, first served

▪ jobs with negative priority will (temporarily) be blocked

o backfill: allows some jobs to be run “out of order” with respect to the assigned priorities provided they do not delay the highest priority jobs in the queue

➢ Lots of other features : QoS, reservations, ...


Job scheduling
Scheduler priorities

➢ Currently, the priority calculation is based on :

o the time a job is queued: The longer a job is waiting in the queue, the higher the priority

o fairshare: jobs of users who have recently used a lot of compute time get a lower priority compared to other jobs

▪ And jobs may get blocked for a while if your fair share passes a critical value

[Chart: effective fairshare usage over time, computed over a fairshare depth of 7 days with a fairshare interval of 24 hours and a fairshare decay of 80%]

Getting information from the scheduler

➢ showq:

o Displays information about the queue as the scheduler sees it (i.e., taking into account the priorities assigned by the scheduler)

o 3 categories: starting/running jobs, eligible jobs and blocked jobs

o You can only see your own jobs!

➢ checkjob <jobid>: Detailed status report for specified job

o Also includes node allocation for running jobs

➢ showstart <jobid>: Show an estimate of when the job can/will start

o This estimate is notoriously inaccurate!

o A job can start earlier than expected because other jobs take less time than requested

o But it can also start later if a higher-priority job enters the queue

➢ showbf: Shows the resources that are immediately available to you (for backfill) should you submit a job that matches those resources

o Does not take into account limit on number of nodes in use, fairshare etc.
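In practice these are simply run on a login node, e.g. (1234567 stands for the numeric part of your own jobid):

$ showq                  # your active, eligible and blocked jobs
$ checkjob 1234567       # detailed status report for one job
$ showstart 1234567      # rough estimate of the start time
$ showbf                 # resources available right now for backfill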


Information shown by showq

➢ active jobs

o are running or starting and have resources assigned to them

➢ eligible jobs

o are queued and are considered eligible for scheduling and backfilling

➢ blocked jobs

o jobs that are ineligible to be run or queued

o job states in the ”STATE” column:

▪ idle: job violates a fairness policy

▪ userhold or systemhold: user or administrative hold

▪ deferred: a temporary hold when unable to start after a specified number of attempts

▪ batchhold: requested resources are not available, or repeated failure to start the job

▪ notqueued: e.g., a dependency for a job is not yet fulfilled when using job chains

Resource manager

[The same workflow diagram as before, now highlighting the resource manager within Torque]


Resource manager

➢ The resource manager manages the cluster resources (e.g., CPUs, memory, ...)

➢ This includes the actual job handling at the compute nodes

o Start, hold, cancel and monitor jobs on the compute nodes

o Should enforce resource limits (e.g., available memory for your job, ...)

o Logging and accounting of used resources

➢ Consists of a server component and a small daemon process that runs on each compute node

➢ 1 very technical but sometimes useful resource manager command:

o pbsnodes: returns information on the configuration of all nodes

(There are more commands but not accessible to regular users.)

Exercises (block 2)

➢ Tutorial §4.1: $VSC_SCRATCH/examples/Running-batch-jobs contains a short bash script (fibo.pbs) that will print Fibonacci numbers. Extend the script to request 1 core on 1 node with 2 GB of (virtual) memory and a wall time of 1 minute, and run the job. Note which files are generated

➢ Extend the script to give the job a name of your choice. Run and note which files are generated.

➢ While the job is running, try some of the q-commands

o Extend the walltime to 10 minutes and put a sleep 300 at the end because otherwise the job will finish so quickly that you don’t get the chance to see anything.
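As a hint, a minimal sketch of the resource header asked for in the first exercise; the body of fibo.pbs from the tutorial is not reproduced here, and the job name is just an example:

#PBS -L tasks=1:lprocs=1:swap=2gb     ← 1 core on 1 node with 2 GB of virtual memory
#PBS -l walltime=00:01:00             ← 1 minute of wall time
#PBS -N my_fibo                       ← a job name of your choice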


What we’ve covered so far…

➢ Linux (2 lectures) – Some bash scripting skills needed to write job scripts

➢ Cluster hardware and middleware

o CPUs have evolved, so software should evolve too, and binaries need to be compiled with the proper optimizations

o Overview of middleware used when developing parallel programs and how they may influence starting programs

▪ Options for vectorization: parallelism at the core level

▪ Options for shared memory: OpenMP, Intel TBB or other libraries, native support in languages, POSIX threads: usually you need to specify the number of threads

▪ Options for distributed memory: MPI, some languages such as Julia and Charm++, PGAS languages and libraries: Usually needs a process starter

▪ Hybrid: combine distributed and shared memory parallelism

What we’ve covered so far… (2)

➢ Logging in to the cluster: ssh and options for various OSes

➢ Transferring files: sftp, scp and Globus

➢ Building your environment: Modules

o Needed as often we need multiple versions of software on the cluster

➢ Job scripts: have three parts

o Resource specifications and other instructions for the resource manager and scheduler

o Create your environment

o Not yet covered: Start your programs
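
As an illustration, a minimal Torque job script with those three parts might look like the sketch below; ./my_program is a hypothetical executable and the resource values are just an example.

#!/bin/bash
#PBS -L tasks=1:lprocs=1:swap=1gb     ← part 1: resource specifications for the resource manager and scheduler
#PBS -l walltime=01:00:00
module load calcua/supported          ← part 2: create your environment
cd "$PBS_O_WORKDIR"
./my_program                          ← part 3: start your programs (covered in part B)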

➢ Submit and manage your jobs


Job types

Part B

Agenda part B

➢ Job types

o Working interactively – Tutorial text chapter 5 + UAntwerp-specific

o Parallel jobs – Tutorial text chapter 7

o Job workflows

o Multi-job submission – Not covered in the tutorial text (Chapter 13 discusses a system that we will not be able to support a year from now)

➢ Monitoring jobs – Various chapters

➢ Best practices – Chapter 17


Working interactively

Tutorial Chapter 5 and more

GUI support

➢ The cluster is not your primary GUI working platform, but sometimes running GUI programs is necessary:

o Licensed program with a GUI for setting parameters of a run that cannot be run locally due to licensing restrictions (e.g., NUMECA FINE-Marine)

o Visualisations that need to be done while a job runs or of data that is too large to transfer

➢ Technologies in Linux:

o X Window System: Rendering on your local computer

▪ Sends the drawing commands from the application (X client) to the display device (X server)

▪ Software needed on your PC: X server

▪ Extensions add features, e.g., GLX for OpenGL support

▪ Often too slow due to network latency

o VNC (Virtual Network Computing): Rendering on the remote computer

▪ Rendering on the remote machine by the VNC server, images compressed and sent over the network

▪ Needs a VNC client on your PC

▪ Works pretty well even over not very fast connections, but hard to set up

o NoMachine NX / X2Go : sits between those options

▪ Not supported at UAntwerp, but used on the Tier-1 and UGent


VNC

➢ Leibniz has one visualization node with GPU and a VNC & VirtualGL-based setup

o VNC is the software to create a remote desktop

▪ Acts as an X server for programs running locally

o Simple desktop environment with the xfce desktop/window manager

▪ GNOME may be more popular, but does not work well with VNC

o VirtualGL: To redirect OpenGL calls to the graphics accelerator and send the resulting image to the VNC software. Start OpenGL programs via vglrun.

➢ Same software stack minus the OpenGL libraries on all login nodes of leibniz and vaughan.

o For 2D applications or applications with built-in 3D emulation (e.g., Matlab)

o A future version may support software OpenGL emulation in the VNC server through the GLX extension, so all OpenGL applications should run (albeit slower than on the visualization node)

➢ Need a VNC client on your PC. TurboVNC gives good results with our setup, but others should also work.

VNC (2)

➢ How does VNC work?

o Basic idea: Don’t send graphics commands over the network, but render the image on the cluster and send images.

o Step 1: Load the vsc-vnc module and start the (Turbo)VNC server with the vnc-xfce command. You can then log out again.

o Step 2: Create an SSH tunnel to the port given when starting the server.

▪ Some clients offer built-in SSH support so step 2 and 3 can be combined

o Step 3: Connect to the local end of the tunnel with your VNC client

o Step 4: When you’ve finished, don’t forget to kill the VNC server (with vncserver -kill). Disconnecting will not kill the server!

▪ We’ve configured the desktop in such a way that logging out on the desktop will also kill the VNC server, at least if you started through the vnc-* commands shown in the help of the module (vnc-xfce).
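
A sketch of that workflow on the command line. The display number :1 (port 5901) and the <login node> and <node running the VNC server> placeholders are assumptions; the actual display and port are printed when the server starts and may differ.

Step 1 (on the cluster):
$ module load vsc-vnc
$ vnc-xfce                                               # starts the (Turbo)VNC server and prints the display/port to use
Step 2 (on your PC): create the SSH tunnel
$ ssh -L 5901:<node running the VNC server>:5901 <login node>
Step 3 (on your PC): point your VNC client (e.g., TurboVNC) at localhost:5901
Step 4 (on the cluster, when you are done):
$ vncserver -kill :1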

➢ Full instructions: Page “Remote visualization @ UAntwerp” on the VSC documentation site

➢ Do not use the instructions in some versions of the VSC tutorial on the UAntwerp clusters.


GUI programs

➢ Using OpenGL:

o The VNC server currently emulates an X-server without the GLX extension, so OpenGL programs will not run right away.

o On the visualisation node: Start OpenGL programs through the command vglrun

▪ vglrun will redirect OpenGL commands from the application to the GPU on the node

▪ and transfer the rendered 3D image to the VNC server

o E.g.: vglrun glxgears will run a sample OpenGL program

o Even when the VNC server with GLX support will be installed, vglrun will still be needed to use the GPU acceleration

➢ Our default desktop for VNC is xfce. GNOME is not supported due to compatibility problems.

Interactive jobs

➢ Starting an interactive job :

o qsub -I : open a shell on a compute node

▪ And add the other resource parameters (wall time, number of processors)

o qsub -I -X : open a shell with X support

▪ must log in to the cluster with X forwarding enabled (ssh -X) or start from a VNC session

▪ works only when you are requesting a single node
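
For example, a short interactive session could be requested as follows (the resource values are only an example):

$ qsub -I -L tasks=1:lprocs=1 -l walltime=01:00:00
$ qsub -I -X -L tasks=1:lprocs=1 -l walltime=01:00:00     # the same, but with X support for GUI programs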

➢ When to start an interactive job

o to test and debug your code

o to use (graphical) input/output at runtime if it is not possible to specify that input from a file or store the output in a file

o only for short duration jobs

o to compile on a specific architecture

➢ We (and your fellow users) do not appreciate idle nodes sitting and waiting for interactive commands while other jobs are queued for compute time!


Multi-core jobs

Tutorial Chapter 7 with extensions

Why parallel computing?

1. Faster time to solution

o distributing code over N cores

o hope for a speedup by a factor of N

2. Larger problem size

o distributing your code over N nodes

o increase the available memory by a factor N

o hope to tackle problems which are N times bigger

3. In practice

o gain limited due to communication and memory overhead and sequential fractions in the code

o optimal number of cores/nodes is problem-dependent

o but there is no escape: computers no longer get much faster for serial code, so parallel computing is here to stay


Starting a shared memory job

➢ Shared memory programs start like any other program, but you will likely need to tell the program how many threads it can use (unless it can use the whole node).

o Depends on the program, and the autodetect feature of a program usually only works when the program gets the whole node.

▪ e.g., Matlab: use maxNumCompThreads(N)

o Many OpenMP programs use the environment variable OMP_NUM_THREADS:
export OMP_NUM_THREADS=7
will tell the program to use 7 threads.

o MKL-based code: MKL_NUM_THREADS overrides OMP_NUM_THREADS for MKL operations

o OpenBLAS (FOSS toolchain): OPENBLAS_NUM_THREADS

o Check the manual of the program you use!

▪ Remember the samtools sort manual page, where the default number of threads is actually 1.

▪ NumPy has several options depending on how it was compiled…
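
Putting this together, a job script for a hypothetical OpenMP program ./omp_program using 7 cores could look like this sketch:

#!/bin/bash
#PBS -L tasks=1:lprocs=7:swap=1gb     ← 1 task with 7 cores on a single node
#PBS -l walltime=01:00:00
module load calcua/supported
cd "$PBS_O_WORKDIR"
export OMP_NUM_THREADS=7              ← tell the OpenMP runtime to use the 7 requested cores
./omp_program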

Starting a MPI job (Torque)

➢ Distributed memory programs are started with the help of a process starter, e.g., mpirun

➢ (Intel MPI) example script :

generic-mpi.sh
#!/bin/bash

#PBS -N mpihello

#PBS -L tasks=56:lprocs=1:swap=1gb ← 56 MPI processes, i.e., 2 nodes with 28 processes each on leibniz

#PBS -l walltime=00:05:00

module load calcua/supported ← load the cluster module

module load intel/2020a ← load the Intel toolchain (for the MPI libraries and mpirun)

cd $PBS_O_WORKDIR ← change to the current working directory

mpirun ./mpi_hello ← run the MPI program (mpi_hello)
mpirun communicates with the resource manager to acquire the number of nodes, cores per node, the node list, etc.


Starting a MPI job (Slurm)

➢ Process starter for MPI: srun or mpirun

➢ (Intel MPI) example script :

generic-mpi.sh
#!/bin/bash
#SBATCH --job-name=mpihello
#SBATCH --ntasks=56 --cpus-per-task=1 ← 56 MPI processes, i.e., 2 nodes with 28 processes each on leibniz
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:05:00
module purge
module load calcua/supported ← load the cluster module
module load intel/2020a ← load the Intel toolchain (for the MPI libraries and mpirun)
srun ./mpi_hello ← run the MPI program (mpi_hello)
srun communicates with the resource manager to acquire the number of nodes, cores per node, the node list, etc.

Starting a hybrid MPI job on Torque/Moab: torque-tools

➢ Hybrid programs are distributed memory programs where each process by itself is again a shared memory/multithreaded process.

➢ torque-tools is a module developed at UAntwerp with a number of short scripts to extract information from Torque to help with starting various applications in job scripts that use the version 2.0 syntax for resource requests.

➢ Commands include:

o torque-tasks : The number of tasks in a given or all task groups

o torque-lprocs : The number of lprocs per task in a given or all task groups

o torque-host-per-line : Prints the assigned host for each task in a given or all task groups, one per line

o torque-hoststring : As torque-host-per-line, but now as a single comma-separated string

o torque-hosts-parallel : (Experimental): returns a string suitable for use as the -S argument of GNU parallel

➢ Help for these commands can be accessed through the Linux man command after loading the module.

o Try man torque-tools as a starting point


Starting a hybrid MPI job: torque-tools
Version 2.0 (NUMA) resource requests

➢ Example scenario: Starting a hybrid OpenMP/MPI job

o In the spirit of the version 2 syntax, we set tasks equal to the number of MPI processes and lprocs equal to the number of OpenMP threads (and thus virtual cores) per MPI process.

o Torque does not pass enough information to mpirun. If we follow the procedure for starting a regular MPI job, Intel MPI will start one process on each core.

o However, we can bypass this mechanism by specifying the -machinefile option, pointing to a file containing one line per MPI process.

o Specifying the number of threads per MPI process is highly program-dependent! E.g., it may be a parameter in the input file (and it often is in computational chemistry packages).

export OMP_NUM_THREADS=$(torque-lprocs)
torque-host-per-line >machinefile.$PBS_JOBID
mpirun -machinefile machinefile.$PBS_JOBID <my program>
/bin/rm machinefile.$PBS_JOBID

Starting a hybrid MPI job on Torque/Moab: vsc-mympirun
Version 1.0 resource requests

➢ The module vsc-mympirun contains an alternative mpirun command (mympirun) that makes it easier to start hybrid programs and is the recommended way to start such programs when using the version 1.0 syntax (e.g., on the Tier 1-cluster)

o Command line option --hybrid=X to specify the number of processes per node

o Check mympirun --help

➢ The problem is that it does not work reliably if the cores you request are spread over more nodes than expected, so

o Version 1 syntax: Request full nodes, i.e., ppn=20 on hopper and ppn=28 on leibniz

o Version 2 syntax: To request full nodes, set tasks to the number of nodes and lprocs=20 on hopper or lprocs=28 on leibniz


Starting a hybrid MPI job on Slurm

➢ Contrary to Torque/Moab, we need no additional tools to start hybrid programs.

o srun does all the miracle work (or mpirun with Intel MPI, provided the environment is set up correctly)

➢ Example (next slide)

o 8 MPI processes

o Each MPI process has 7 threads => Two full nodes on Leibniz

o In fact, the OMP_NUM_THREADS line isn’t needed with most programs that use the Intel OpenMP implementation

Starting a hybrid MPI job on Slurm (2)

generic-hybrid.slurm
#!/bin/bash
#SBATCH --job-name=hybrid_demo
#SBATCH --ntasks=8 --cpus-per-task=7 ← 8 MPI processes with 7 threads each, i.e., 2 full nodes (28 cores each) on leibniz
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:05:00
module purge
module load calcua/supported ← load the cluster module
module load intel/2020a ← load the Intel toolchain (for the MPI libraries and mpirun)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ← set the number of OpenMP threads
srun ./mpi_omp_hello ← run the hybrid MPI/OpenMP program (mpi_omp_hello)
srun does all the magic, but mpirun would work too with Intel MPI


Workflows

Example workflows

➢ Some scenarios:

o Need to run a sequence of simulations, each one using the result of the previous one, but bumping into the 3-day wall time limit so not possible in a single job

▪ Or need to split a >3 day simulation in shorter ones that run one after another

o Need to run simulations using results of the previous one, but with a different optimal number of nodes

▪ E.g. in CFD: First a coarse grid computation, then refining the solution on a finer grid

o Extensive sequential pre- or postprocessing of a parallel job

o Run a simulation, then apply various perturbations to the solution and run another simulation for each of these perturbations

➢ Support in Torque/Moab

o Passing environment variables to job scripts: Can act like arguments to a procedure

o Specifying dependencies between job scripts with additional qsub arguments


Passing environment variables to job scripts

➢ With the -v argument, it is possible to pass variables from the shell to a job or define new variables.

➢ Example:
$ export myvar="This is a variable"
$ myothervar="This is another variable"
$ qsub -I -L tasks=1:lprocs=1 -l walltime=10:00 -v myvar,myothervar,multiplier=1.005
Next, on the command line once the job starts:
$ echo $myvar
This is a variable
$ echo $myothervar

$ echo $multiplier
1.005

➢ What happens?

o myvar is passed from the shell (export)

o myothervar is an empty variable since it was not exported in the shell from which qsub was called

o multiplier is a new variable, its value is defined on the qsub command line

➢ qsub doesn’t like multiple -v arguments, only the last one is used

Dependent jobs

➢ It is also possible to instruct Torque/Moab to only start a job after one or more other jobs have finished. For example,

qsub -W depend=afterany:jobid1:jobid2 jobdepend.pbs

will only start the job defined by jobdepend.pbs after the jobs with job IDs jobid1 and jobid2 have finished

➢ Many other possibilities, including

o Start another job only after a job has ended successfully

o Start another job only after a job has failed

o And many more, check the Moab Workload Manager manual (version 9.1.3) for the section “9.5 Job Dependencies”

➢ Useful to organize jobs and powerful in combination with -v



Dependent jobs: Example

(Diagram: one initial simulation whose output is used by two follow-up simulations, one per perturbation.)

job.pbs:
#!/bin/bash
#PBS -L tasks=1:lprocs=1:swap=1gb
#PBS -l walltime=30:00

cd "$PBS_O_WORKDIR"

echo "10" >outputfile ; sleep 300

multiplier=5
mkdir mult-$multiplier ; pushd mult-$multiplier
number=$(cat ../outputfile)
echo $(($number*$multiplier)) >outputfile ; sleep 300
popd

multiplier=10
mkdir mult-$multiplier ; pushd mult-$multiplier
number=$(cat ../outputfile)
echo $(($number*$multiplier)) >outputfile ; sleep 300

Dependent jobs: Example

➢ A job whose result is used by 2 other jobs

job_first.pbs:
#!/bin/bash
#PBS -L tasks=1:lprocs=1:swap=1gb
#PBS -l walltime=10:00

cd "$PBS_O_WORKDIR"

echo "10" >outputfile ; sleep 300

job_depend.pbs:
#!/bin/bash
#PBS -L tasks=1:lprocs=1:swap=1gb
#PBS -l walltime=10:00

cd "$PBS_O_WORKDIR"

mkdir mult-$multiplier ; cd mult-$multiplier
number=$(cat ../outputfile)
echo $(($number*$multiplier)) >outputfile ; sleep 300

job_launch.sh:
#!/bin/bash
first=$(qsub -N job_leader job_first.pbs)
qsub -N job_mult_5 -v multiplier=5 -W depend=afterany:$first job_depend.pbs
qsub -N job_mult_10 -v multiplier=10 -W depend=afterany:$first job_depend.pbs


Dependent jobs: Example

➢ After the first job starts (check with qstat -u vsc20259), the two dependent jobs remain on hold; some time later, once the leader job has finished, they run in turn.

➢ When everything has finished, the results are:

$ cat outputfile
10
$ cat mult-5/outputfile
50
$ cat mult-10/outputfile
100

Multi-Job submission

Different from chapter 13 in the tutorial…


Running a large batch of small jobs

➢ Many users run many serial jobs, e.g., for a parameter exploration

➢ Problem: submitting many short jobs to the queueing system puts quite a burden on the scheduler, so try to avoid it

➢ Solutions:

o Job arrays: Submit a large number of related jobs at once to the scheduler

o atools, a (VSC-developed) package that makes managing array jobs easier

o Worker framework

▪ parameter sweep : many small jobs and a large number of parameter variations. Parameters are passed to the program through command line arguments.

▪ job arrays : many jobs with different input/output files

▪ More efficient for very small jobs

▪ But no support for Slurm at the moment

o GNU parallel (module name: parallel)

▪ Will also work with Slurm
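
A minimal sketch of the GNU parallel approach within a single Torque job; ./program and the input file names are hypothetical, and -j limits the number of simultaneous runs to the number of cores requested:

#!/bin/bash
#PBS -L tasks=1:lprocs=28
#PBS -l walltime=04:00:00
module load calcua/supported parallel
cd "$PBS_O_WORKDIR"
parallel -j 28 ./program -input {} -output {.}.out ::: input_*.dat     ← one run per input file, at most 28 at a time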

Job array (Torque/Moab)

➢ program will be run once for each of the 100 input files

jobarray.pbs:
#!/bin/bash -l
#PBS -L tasks=1:lprocs=1
#PBS -l walltime=04:00:00

cd $PBS_O_WORKDIR

INPUT_FILE="input_${PBS_ARRAYID}.dat"      ← for every run, there is a separate input file
OUTPUT_FILE="output_${PBS_ARRAYID}.dat"    ← and an associated output file
program -input ${INPUT_FILE} -output ${OUTPUT_FILE}

$ qsub -t 1-100 jobarray.pbs


Job array (Slurm)

➢ the program (here test_set) will be run once for each of the 100 input files

jobarray.slurm:
#!/bin/bash -l
#SBATCH --ntasks=1 --cpus-per-task=1
#SBATCH --mem-per-cpu=512M
#SBATCH --time 15:00

INPUT_FILE="input_${SLURM_ARRAY_TASK_ID}.dat"      ← for every run, there is a separate input file
OUTPUT_FILE="output_${SLURM_ARRAY_TASK_ID}.dat"    ← and an associated output file

./test_set -input ${INPUT_FILE} -output ${OUTPUT_FILE}

$ sbatch --array 1-100 jobarray.slurm

Atools
Example: Parameter exploration

➢ Assume we have a range of parameter combinations we want to test in a .csv file (easy to make with Excel)

➢ Help offered by atools:

o How many jobs should we submit? atools has a command that will return the index range based on the .csv file

o How to get parameters from the .csv file to the program? atools offers a command to parse a line from the .csv file and store the values in environment variables.

o How to check if the code produced results for a particular parameter combination? atools provides a logging facility and commands to investigate the logs.


Atools with Torque/Moab
Example: Parameter exploration

➢ weather will be run for all data, until all computations are done

➢ Can also run across multiple nodes

data.csv (data in CSV format):
temperature, pressure, volume
293.0, 1.0e05, 87
..., ..., ...
313, 1.0e05, 75

weather.pbs:
#!/bin/bash -l
#PBS -L tasks=1:lprocs=1:swap=512mb
#PBS -l walltime=10:00

cd $PBS_O_WORKDIR
ml atools/torque

source <(aenv --data data.csv)
./weather -t $temperature -p $pressure -v $volume

$ module load atools/torque
$ qsub -t $(arange --data data.csv) weather.pbs

Atools with Slurm
Example: Parameter exploration

➢ weather will be run for all data, until all computations are done

➢ Can also run across multiple nodes

data.csv (data in CSV format):
temperature, pressure, volume
293.0, 1.0e05, 87
..., ..., ...
313, 1.0e05, 75

weather.slurm:
#!/bin/bash -l
#SBATCH --ntasks=1 --cpus-per-task=1
#SBATCH --mem-per-cpu=512m
#SBATCH --time=10:00
module purge
ml calcua/supported atools/slurm

source <(aenv --data data.csv)
./weather -t $temperature -p $pressure -v $volume

$ module load atools/slurm
$ sbatch --array $(arange --data data.csv) weather.slurm


Atools: Remarks

➢ Atools has a nice logging feature that helps to see which work items failed or did not complete and to restart those.

➢ Advanced feature of atools: Limited support for some Map-Reduce scenarios:

o Preparation phase that splits up the data in manageable chunks needs to be done on the login nodes or on separate nodes

o Parallel processing of these chunks

o Atools does offer features to aggregate the results

➢ Atools is really just a bunch of Python scripts and it does rely on the scheduler to start all work items

o It supports all types of jobs the scheduler supports

o But it is less efficient than worker for very small jobs as worker does all the job management for the work items (including starting them)

o A version of worker that can be ported to Slurm is under development

Further reading on multi-job submission

➢ atools manual on readthedocs

➢ worker manual on readthedocs

➢ Presentation used in the course @ KULeuven on Worker and atools


Exercises (block 3)

➢ See the exercise sheet and select those that are most relevant to your work on the cluster

➢ Do not forget to go to $VSC_SCRATCH where your files from the previous session are stored!

Job monitoring


FAQ

➢ How do I know how much memory my job used?

➢ How can I check if a job is running properly?

Job epilogue

➢ Your job output file will contain a prologue section with information about the start of the job, and an epilogue.

➢ Example epilogue:

===== start of epilogue =====
Date            : Fri Feb 23 16:56:50 CET 2018
Session ID      : 2377
Resources Used  : cput=04:32:11,vmem=10859988kb,walltime=00:54:29,mem=306680kb,energy_used=0
Allocated Nodes : r2c04cn3.leibniz
Job Exit Code   : 0
===== end of epilogue =======

➢ cput is the total CPU time used by your application; walltime is the wall time of the job. Ideally the CPU time is as close as possible to the wall time multiplied by the number of cores requested; otherwise your job was not using its CPU resources very well.

➢ mem is the total RAM used by all tasks together, vmem is the total virtual memory use.

➢ These values are not accurate if your script starts processes through ssh to one of the compute nodes

o E.g., GNU parallel
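
As a small worked example based on the epilogue above, assuming (hypothetically) that the job had requested 28 cores:

cput     = 04:32:11 ≈ 16331 s
walltime = 00:54:29 ≈ 3269 s
average number of busy cores ≈ 16331 / 3269 ≈ 5.0
efficiency ≈ 16331 / (3269 × 28) ≈ 18%, so such a job would be using its 28 requested cores rather poorly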


Job monitoring commands

➢ showq -r

o One of the columns shows the efficiency

o Ideally this number should be 80% or more

o It is only fairly accurate if the job has been running for a sufficiently long time (half an hour or so), especially for multi-node jobs

o It is also inaccurate if your job script starts processes (directly or indirectly) through ssh

o Note that the efficiency measured through Torque is higher than the one we talked about in the “Supercomputers for Starters” lecture as the communication overhead is also included in the CPU time

➢ qstat -f will also show the consumed cpu time and wall time for a job, but then you’d need to do the efficiency computation by hand

o Same limitations as showq -r
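
For instance, you could extract the relevant fields from the qstat -f output like this; the job ID is a placeholder, and the field names assume the usual Torque attributes (resources_used.* and Resource_List.*):

$ qstat -f 1234567 | grep -E 'resources_used|Resource_List'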

Logging on to the compute node(s)

➢ While your job is running, you can log on to the compute nodes assigned to the job through ssh from the login nodes

o Check which compute nodes a job uses: qstat -n, qstat -f or checkjob

➢ Run htop on the compute node and check the use of the cores and memory

o Quit htop: q

o Only show your own processes: Press u, select your userid from the list to the left and press the return key
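
A typical sequence, with a placeholder job ID and the node name borrowed from the epilogue example earlier:

$ qstat -n 1234567        # shows the node(s) assigned to the job
$ ssh r2c04cn3.leibniz    # log on to one of those nodes (only while your job is running there)
$ htop                    # press u to filter on your own processes, q to quit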

➢ htop also shows the load average, but this may be misleading

o load = processes that can run

o If this number is much higher than the number of cores: You are typically running several shared memory applications each using all cores without realising it.

o But a load > the number of cores does not always mean something is going wrong


Add monitoring in your job script

➢ The monitor module

o See Tutorial text §12.3.2

o Start your (shared memory) application through the monitor command:
monitor ./eat_mem 10
would start eat_mem with the command line argument 10

o Useful command line arguments:

▪ monitor -d 300 ./eat_mem 10 prints output every 300 seconds

• It does not make sense to sample a 3-day job every second

▪ monitor -d 300 -n 12 ./eat_mem 10 will only show the last 12 values (so roughly the last hour of this job)

➢ Note: eat_mem is available in the vsc-tutorial module

Policies

Tutorial Chapter 10


Some site policies

➢ Our policies on the cluster:

o Single user per node policy

o Priority-based scheduling (so not first come, first served). We reserve the right to change the parameters of the priority computation at any time to ensure the cluster is used well

o Fairshare mechanism – Make sure one user cannot monopolise the cluster

o Crediting system – Dormant, not enforced

➢ Implicit user agreement:

o The cluster is valuable research equipment

o Do not use it for other purposes than your research for the university

▪ No cryptocurrency mining or SETI@home and similar initiatives

▪ Not for private use

o And you have to acknowledge the VSC in your publications (see user documentation on the VSC web site)

➢ Don’t share your account

Credits

➢ At UAntwerp we monitor cluster use and send periodic reports to group leaders.

➢ On the Tier-1 system at KU Leuven, you get a compute time allocation (number of node days). This is enforced through project credits (Moab Accounting Manager).

o A free test ride (motivation required)

o Requesting compute time through a proposal

➢ At KU Leuven Tier-2, you need compute credits which are enforced as on the Tier-1.

o Credits can be bought via the KU Leuven

➢ Charge policy (subject to change) based upon :

o fixed start-up cost

o used resources (number and type of nodes)

o actual duration (used wall time)


Best practices

Tutorial Chapter 17

Debug & test

➢ Before starting you should always check :

o are there any errors in the script?

o are the required modules loaded?

o Is the correct executable used?

o Did you use the right process starter (mpirun) and start in the right directory?

➢ Check your jobs at runtime

o log in to a compute node and use, e.g., top, htop, vmstat, ...

o monitor command (see User Portal information on running jobs)

o showq -r (data not always 100% reliable)

o alternatively: run an interactive job (qsub -I) for the first run of a set of similar runs

➢ If you will be doing a lot of runs with a package, try to benchmark the software for scaling issues when using MPI

o and for I/O issues

▪ If you see that the CPU is idle most of the time, I/O might be the problem


Hints & tips

➢ Your job will log on to the first assigned compute node and start executing the commands in the job script

o it starts in your home directory ($VSC_HOME), so go to the submission directory with cd "$PBS_O_WORKDIR"

o don’t forget to load the software with module load

➢ Why is my job not running?

o use “checkjob”

o commands might timeout with overloaded scheduler (rather unlikely recently)

➢ Submit your job and wait (be patient) …

➢ On hopper, using $VSC_SCRATCH_NODE (/tmp on the compute node) might be beneficial for jobs with many small files (even though the bandwidth is much, much lower than on $VSC_SCRATCH).

o But that scratch space is node-specific and hence will disappear when your job ends

o So copy what you need to your data directory or to the shared scratch

o And only available on hopper, not much space on leibniz
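
A sketch of that pattern inside a job script; the directory and file names (myjob, input.dat, results) and the program ./my_program are hypothetical:

mkdir -p $VSC_SCRATCH_NODE/myjob && cd $VSC_SCRATCH_NODE/myjob   # fast node-local scratch (/tmp)
cp $VSC_SCRATCH/myjob/input.dat .                                # stage the input in from the shared scratch
./my_program input.dat
cp -r results $VSC_SCRATCH/myjob/                                # copy the results back before the job ends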

Warnings

➢ Avoid submitting many small jobs (in number of cores) by grouping them

o using a job array

o or using the atools or the Worker framework

➢ Runtime is limited by the maximum wall time of the queues

o for longer wall time, use checkpointing

o Properly written applications have built-in checkpoint-and-restart options

➢ Requesting many processors could imply long waiting times (though we’re still trying to favour parallel jobs)

what a cluster is not


User support

User support

➢ Try to get a bit self-organized

o limited resources for user support

o best effort, no code fixing (AI and bioinformatics codes tend to be a disaster to install…)

o contact [email protected], and be as precise as possible

➢ VSC documentation on ReadTheDocs (put a bookmark in your browser, the URL is terrible!)

➢ External documentation : docs.adaptivecomputing.com

o manuals of Torque and MOAB – We’re currently at version 6.1.2 of Torque and version 9.1.3 of Moab

o mailing-lists

CalcUA

calcua.uantwerpen.be

VSC

www.vscentrum.be


User support

➢ mailing lists for announcements: [email protected] (for official announcements and communications)

➢ Every now and then a more formal "HPC newsletter"
o We do expect that users read the newsletter and mailing list messages!

➢ contacts:
o e-mail: [email protected]
o phone: +32 3 265 XXXX with XXXX = 3860 (Stefan), 3855 (Franky), 3852 (Kurt), 3879 (Bert), 8980 (Carl)

o office : G.309-311 (CMI) – before the pandemic

Don’t be afraid to ask, but being as precise as possible when specifying your problem definitely helps in getting a quick answer.

Evaluation

➢ Please fill in the questionnaire at http://tiny.cc/calcua-hpc-intro-survey by November 11


Links

➢ Windows

o Windows Terminal: docs.microsoft.com/en-us/windows/terminal/

o Windows Subsystem for Linux: docs.microsoft.com/en-us/windows/wsl/install-win10

o Cygwin: www.cygwin.com

o MobaXterm: mobaxterm.mobatek.net

o PuTTY: www.chiark.greenend.org.uk/~sgtatham/putty

o WinSCP: winscp.net

Links

➢ macOS:

o iTerm2: iterm2.com

o XQuartz: www.xquartz.org

o CyberDuck: cyberduck.io

➢ All:

o FileZilla: filezilla-project.org

o TurboVNC: www.turbovnc.org


Links

➢ VSC main web site: vscentrum.be

➢ VSC documentation: Too long to type, look for "user portal" on the VSC main web site

o The section VSC hardware – Tier-2 hardware – UAntwerpen Tier-2 hardware is particularly important

➢ Local UAntwerp CalcUA web site: hpc.uantwerpen.be
