nci.org.au
@NCInews
Using Raijin
Resources
These slides:
http://nci.org.au/user-support/training/using-raijin-course/
NCI guides: http://nci.org.au → User Support
Training material: http://nci.org.au/services-support/training/
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Connecting: Basics
Raijin: Unix cluster (CentOS 6.6), 3592 compute nodes.
Interactive terminal (text only):
ssh -l abc123 raijin.nci.org.au, or
Windows ssh clients: PuTTY, MobaXterm, Xming.
Graphics-enabled session (i.e., X-Windows/X11):
ssh -X ... (PC)    ssh -Y ... (Mac)
Remote file transfer: scp/sftp/rsync commands, or graphical ftp client.
It's good practice to logout of xterm sessions (or ctrl-d, or exit).
Connecting: Login nodes
If you can't connect to raijin.nci.org.au, try raijin1-raijin6.
Use login nodes for small tasks:
Editing files, submitting jobs, small file transfers, compiling small programs, etc.
'Intensive' tasks killed automatically (>2GB ram, >30 mins cpu time).
Connecting: Data-mover nodes
Small file-transfers: raijin.nci.org.au
Small/Large transfers: r-dm.nci.org.au
Can use scp/rsync/sftp. Syntax (scp/rsync):
scp [source-file/dir] [destination-file/dir]
Server: Machine running scp server (usually r-dm.nci.org.au).
Client: Machine that initiates the copy (usually your PC).
Push: Client → Server.
Pull: Client ← Server.
Connecting: Using scp (in your own time)
Examples. Use your PC as client (run scp in a local terminal, not on Raijin).
Push, i.e., Client (your PC) → Server (Raijin):
scp myfile abc123@raijin.nci.org.au:mydir
mydir must exist in your home (~) dir.
scp myfile abc123@raijin.nci.org.au:
Copies to home dir, don't forget colon.
scp -r mydir abc123@raijin.nci.org.au:parentdir
parentdir must exist in your home dir.
Pull, i.e., Client (your PC) ← Server (Raijin):
Swap the order of the arguments, e.g.,
scp abc123@raijin.nci.org.au:myfile mydir
mydir must exist in the current dir.
Connecting: Using rsync (in your own time)
rsync uses same basic syntax as scp. Many more options:
rsync -avPS myfile abc123@raijin.nci.org.au:mydir
-a: Archive. (Recursive copy, preserves permissions/owner/group/mtime, . . . )
-P: Resume partial file transfers.
-S: Handle sparse files.
Also, -z to enable compression, --progress to show progress, etc.
Consult the manual pages: man scp and man rsync.
Connecting: Passphrase-less access
Sometimes necessary, e.g., automation of remote data-transfers.
Don't put password in file/script; use passphrase-less ssh keys.
ssh keys can be configured to. . .
. . . allow only certain commands.
. . . restrict arguments, such as directory names.
Passphrase-less file transfer: use rrsync instead of rsync (also rscp).
Strongly discouraged. Weakens security of NCI and your system.
Exercise 1: Connecting (1/2)
Login to Raijin:
Service disruptions are reported in the 'Message of the Day'.
Who/Where am I?
whoami
man hostname # 'man' is Unix's help system, q to exit.
hostname
pwd # shows name of current directory.
echo . # '.' refers to the current directory.
echo ~ # '~' is an alias for your home directory.
cat /etc/motd
You might wish to refer to Handy Unix Commands.
Exercise 1: Connecting (2/2)
Try to run the xeyes or xclock commands (they won't work).
logout (or ctrl-d, or exit), reconnect with x11 forwarding:
(Mac) ssh -Y ... (PC) ssh -X ...
Now try, e.g., xeyes (ctrl-c to stop).
Remote file transfer: (if time permits)
logout and 'push' a file from your computer to Raijin,
scp myfile abc123@raijin.nci.org.au:
(You must include the colon!)
Log back in and use ls or ls -la to check the result.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Resource Quotas
The two resources that are metered are storage and compute time.
Compute grant divided into quarterly amounts.
Resource usage is accounted against projects.
Projects implemented as Unix groups.
Users can belong to multiple projects.
Resource Quotas: Default project
Usage is 'charged' to your default project, unless otherwise specified.
Edit your .rashrc to change your default project:
setenv PROJECT c25
setenv SHELL /bin/bash
.rashrc is hidden in your home folder (ls -a lists all files).
Settings in .rashrc are applied each time you login.
Modifying .rashrc is the best way to set your project.
Resource Quotas: Overriding default project
$PROJECT: Name of active project (see environment variables).
newgrp doesn't update $PROJECT, instead use. . .
. . . switchproj. Changes active project for the current session.
. . . nfnewgrp. Changes active project for specified command.
For more info, run switchproj/nfnewgrp without arguments.
Most commands relating to project accounting/job-scheduling let you override default project.
Never put switchproj in login scripts. (More on these later).
Resource Quotas: Data grant (1/2)
Storage grant has two components:
1. Amount of data.
2. # of files/dirs ('inodes').
Project's data usage is based on file ownership.
Location of file has no bearing on quota usage.
Use chgrp to change which project owns a file:
chgrp NameOfNewProject myfile
Output files produced by jobs submitted to scheduler belong to your default project. . .
. . . unless you specify otherwise (more on jobs later).
Resource Quotas: Data grant (2/2)
Projects only have storage grant for /short, /g/data, massdata.
/home capped at 2GB and 80,000 inodes.
Project dirs: /short/projcode, /g/data/projcode, /massdata/projcode
To access massdata: mdss command or ftp to r-dm.nci.org.au.
/g/data is symbolic link to /g/data1 or /g/data2.
E.g., /g/data/c25 → /g/data2/c25
Resource Quotas: Data usage (1/2)
nci_account: Summary of all resources used by a project.
Total disk usage (data, inodes) for main filesystems.
Project (-P) and time period (-p) options:
nci_account -P c25 -p 2014.q4
Lustre filesystem stats can be ~30 minutes old.
lquota: Quotas and overall usage for Lustre filesystems (/home, /short, /g/data).
Shows usage for all of your projects.
Unlike nci_account, lquota queries filesystem directly.
To estimate amount of data in home dir: du -sh ~, find ~ | wc -l
Resource Quotas: Data usage (2/2)
short_files_report: Project's disk usage for /short only.
-G and -P options show breakdown by user.
Stats can be up to ~1 day old.
Example 1. To show where the data owned by project c25 is:
short_files_report -G c25
Locating misplaced files/files with incorrect ownership.
Determining project's overall disk usage.
Determining who is using the most space.
Example 2. To show which projects own the data in /short/c25:
short_files_report -P c25
Locating misplaced files/files with incorrect ownership.
Also gdata1_files_report and gdata2_files_report.
Resource Quotas: Compute policy (1/2)
You are only charged for cpu time used by 'batch jobs', i.e., jobs submitted to the scheduler.
This applies to both compute and copy jobs.
All other cpu usage is free. Login nodes subject to cpu/mem limits.
*ssh can issue noninteractive commands to r-dm.nci.org.au.
Resource Quotas: Compute policy (2/2)
Remote transfers to r-dm.nci.org.au ‘unlimited’.
In general, can't transfer files remotely to/from massdata.
All other large file-operations must use job scheduler, e.g.,
/g/data → massdata
Resource Quotas: Compute usage
Compute grant specified as number of ‘Service Units’ (SUs).
SU = one hour of walltime on one cpu.
walltime = real-world time.
Each cluster node has 16 cpus (‘cpu’ = core).
E.g., a job that uses 3 compute nodes for 3 hrs costs 3 × 16 × 3 = 144 SUs (excludes 'express' jobs, discussed later).
Project’s compute usage is updated after each job finishes.
Use nci_account to view project’s overall compute usage.
Also shows costs of running and queued jobs.
-v option: breakdown of compute usage by user.
Exercise 2: Accounting
Amount of data in your /home (~) dir: du -sh ~
Approx. # of files in your /home dir: find ~ | wc -l
(The pipe '|' directs output of find to input of wc. See man wc and man find. Try the find command by itself).
nci_account, also try -v and -vv options for information overload.
lquota
short_files_report -G c25
short_files_report -P c25
Also gdata1_files_report and gdata2_files_report.
Notice that c25 doesn’t have gdata1 allocation.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Software Environment: The shell
Each terminal runs a separate shell.
The shell interprets and executes commands.
(Handy Unix Commands).
Many shells to choose from.
bash is the most popular (default), followed by tcsh.
Edit your .rashrc to change your default shell:
setenv PROJECT c25
setenv SHELL /bin/bash
Shell commands can be grouped into scripts.
Each script runs in a subshell.
Append & to run a command in background (related to wait, ctrl-z, fg, bg. See, e.g., man wait).
Software Environment: Shell variables
Shell lets you define variables that can be read from the command line.
N=10 OUTPUTFILE=myfile.out (bash syntax)
set N=10 OUTPUTFILE=myfile.out (tcsh syntax)
Retrieve the value by prepending a $.
echo "The value of N is $N"
To make existing variable visible to subshells such as scripts:
(bash syntax) N=10
export N
export OUTPUTFILE=myfile.out
(tcsh syntax) setenv OUTPUTFILE myfile.out
Useful pre-defined vars: $PROJECT, $USER, $HOME, $0 (shell type).
Also see Canonical Environment Variables.
Software Environment: Shell scripts
First line of a script invokes a new shell.
Usually #!/bin/bash or #!/bin/tcsh
Next come user-commands. E.g., (contents of myscript.sh):
#!/bin/bash
N=10 # Anything after a '#' is a comment.
echo $N
To make script executable: chmod +x myscript.sh
To run script: myscript.sh or ./myscript.sh
When script finishes, original values of variables are restored.
To make changes persistent, use source (or '.'). E.g.,
. myscript.sh
source runs the script in the current shell rather than a subshell, so its changes persist.
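The subshell-vs-source distinction is easy to demonstrate. A minimal sketch you can try in a scratch directory (setvar.sh is a throwaway example name, not part of Raijin's setup):

```shell
# Create a tiny script that sets a variable.
cat > setvar.sh <<'EOF'
#!/bin/bash
N=10
EOF
chmod +x setvar.sh

N=1
./setvar.sh                       # Runs in a subshell; our N is untouched.
echo "after ./setvar.sh: N=$N"    # N is still 1.
. ./setvar.sh                     # Sourced: runs in the current shell.
echo "after sourcing:   N=$N"     # N is now 10.
```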
Software Environment: Defaults
System-scripts for setting default environment ('dot files'). . .
. . . run each time a shell is created/destroyed.
. . . hidden in your home (~) directory.
When shell is created:
Type of shell    bash         tcsh
Login            .profile     .login, .cshrc
Non-interactive  $BASH_ENV    .cshrc
Interactive      .bashrc      .cshrc
When you logout: .bash_logout (bash), .logout (tcsh).
Raijin: BASH_ENV=.bashrc. Default .profile executes .bashrc.
Keep # of commands in your dot files to a minimum.
Avoids conflicts/recursive execution of dot files.
Software Environment: Editors
Several editors are installed on Raijin.
Convenient for modifying job scripts.
Editors with text-based interfaces:
vi/vim
emacs
nano
vi/vim/emacs are powerful but not intuitive at first.
Editors with graphical interface:
nedit is a simple graphical editor.
emacs (unless -nw option is specified).
Require X-Windows enabled session (ssh -X or -Y ...).
MS Windows uses slightly different text file format. Can use dos2unix/unix2dos to convert your scripts.
Software Environment: Modules (1/2)
Many software packages available on Raijin (software catalogue).
Configuring your environment for each package isn't trivial.
Modules take care of this for you:
module load/unload SoftwareName
module avail shows list of available modules.
module avail SoftwareName shows available versions.
module show SoftwareName shows changes module makes.
module list lists the modules you have currently loaded.
Some packages require more than one module to be loaded. E.g., many packages require OpenMPI.
Prerequisites documented in software catalogue.
Software Environment: Modules (2/2)
Can put module commands in dot files.
. . . preferably in .profile/.login instead of .bashrc/.cshrc.
Default dot files contain a small number of 'core' modules.
Putting module commands in dot files can lead to conflicts.
It's better to put such commands in your job scripts instead.
Putting module purge in dot files can result in strange errors.
You can define your own modules: Module user-guide.
Exercise 3: Software environment (1/4)
As always, Handy Unix Commands.
Inspect some predefined variables. E.g., echo $HOME (compare with echo ~).
printenv shows list of defined environment variables.
Which shell are you running? echo $0
When you enter a command that isn't built-in, the shell searches directories named in $PATH:
echo $PATH
PATH=$PATH:~/mydir
echo $PATH
Inspect your login scripts:
ls -la ~ # What happens if you omit the 'a'?
cat ~/.profile # Note the 'default' modules.
cat ~/.bashrc # ...or ~/.login and ~/.cshrc if using tcsh.
Exercise 3: Software environment (2/4)
Try the following module commands:
module avail
module avail python
module show python # Shows what environment vars will be set
echo $PYTHON_BASE
module load python # Loads the default version
echo $PYTHON_BASE
which python
Use module list to check which version is loaded.
Try loading the module for a different version of python. You must unload the previous version first:
module unload python # No need to specify which version
module list
echo $PYTHON_BASE
which python
Exercise 3: Software environment (3/4)
(If time permits)
Make a simple script: nano myscript.sh
(. . . or use, e.g., nedit if X-Windows is enabled).
Insert the following lines:
#!/bin/bash
echo This script is running in a non-interactive subshell.
Save the script (ctrl-o) and exit nano (ctrl-x).
ls -l myscript.sh # Look at file permissions.
chmod +x myscript.sh # Make script executable.
Check the file permissions once more.
Exercise 3: Software environment (4/4)
Add the following to the end of .profile:
echo Starting a new login shell.
Add the following to the end of .bashrc:
if [ -z "$PS1" ]; then # Don't forget the spaces!
echo This shell is interactive.
else
echo Either default .profile was executed, or this is a non-interactive shell and BASH_ENV=.bashrc.
fi # This isn't a typo.
For each of the next three steps, which, if any, dot files are executed?
1. logout (or ctrl-d) and then log back in.
2. Run ./myscript.sh. Try echo $BASH_ENV.
3. Start an interactive subshell by typing bash.
Type exit to close the subshell and return to the parent shell.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Filesystems: Capacities
Gratuitous slide showing disk capacities.
/g/data1 and /g/data2 together have more capacity than 20,000 laptop hard drives (@700GB ea) combined.
Not too shabby!
Filesystems: Purpose and performance
The purpose of each filesystem is reflected in that FS's performance.
/short: Freq. accessed files, esp. large IO files of running/recent jobs, source-code/libs.
/g/data: Data sets that must be available on demand. Global: visible to the NCI cloud.
/home: Source-code, scripts, local packages.
JOBFS: Node-local scratch space for each job.
massdata: Archive files. Reads are slow and not immediate if file isn't in disk cache.
File IO of running jobs usually directed to /short or /jobfs.
The particular choice depends on the IO pattern (more in a moment).
Filesystems: Speeds
The real-world read/write speeds per-job are:
/short     ~1GB/s*
/g/data    ~500MB/s*
JOBFS      ≤ 100MB/s
massdata   0.5-1TB/hr (write)
*Lustre filesystems, conditions apply.
Lustre FS speeds achievable because it. . .
. . . is highly-parallel.
. . . communicates over 'Infiniband' network (56Gb/s).
massdata:
Accessed over 10GbE link.
Read speed depends on whether file is in cache or on tape.
Filesystems: Use and abuse of Lustre
Lustre filesystems (/short, /g/data, /home) are distributed:
Files ‘striped’ across multiple disks for parallel IO.
Each file potentially accessible to many users/cpus.
Consistent view maintained across 1000’s of Lustre clients.
Each file operation generates a lot of metadata.
Filesystem bandwidth is shared by all users.
IO-intensive tasks should. . .
. . .issue file operations no more than once per second.
. . .read/write in ‘large’ chunks (>1MB).
Lustre performance plummets if file ops are small and frequent.
Misusing Lustre degrades performance for everyone.
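The "large chunks" advice can be illustrated with dd (a sketch; the testfile.* names are throwaway, and on Raijin you would run this on /short or JOBFS, not /home):

```shell
# Same 64MB written twice: once as 64 × 1MB operations (Lustre-friendly),
# once as 16384 × 4KB operations (the pattern that hurts Lustre).
dd if=/dev/zero of=testfile.big   bs=1M count=64    2>/dev/null
dd if=/dev/zero of=testfile.small bs=4k count=16384 2>/dev/null
ls -l testfile.big testfile.small   # identical sizes, 256× more ops for the second
```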
Filesystems: High-frequency IO
Each cluster node has 396GB of local scratch space (JOBFS).
Only available to 'batch jobs'.
Allocated when job starts. Deleted when job finishes.
Not visible by other jobs/users.
Very frequent, small file-ops. . .
. . . can degrade Lustre performance for everyone.
. . . should be avoided, but handled much better by JOBFS.
                          JOBFS       /short
Typical speed             ≤ 100MB/s   ~1GB/s
Suitable for frequent IO  Yes         No
Filesystems: Use and abuse of massdata
Storing large numbers of small files is a misuse of massdata.
Intended for archiving large data files.
Small files increase access times. 'small' = avg. size ≤ 1MB.
Bundle small files into archives (.tar files) first.
Use archive option (-t) of netcp/netmv commands (discussed later).
Compress option (-z) is recommended.
Supplying mdss get with a large file-list facilitates parallel retrieval.
Filesystems: Data-backup policies
Only /home and massdata are backed-up automatically.
Snapshot of /home taken every ~2 days.
/home snapshots thinned-out over time.
Duplicate copy of massdata kept in separate building.
Backing-up data is otherwise your responsibility. Use tar liberally!
Many groups neglect to use massdata because. . .
. . . they don't have a plan for organising/managing data.
. . . turning large data sets into archive files is laborious.
. . . no one wants to touch other people's files, esp. large dirs.
. . . they are unsure how to use massdata.
. . . it's easier to leave files as they are on /short or /g/data.
If you are unsure, ask us for assistance.
Filesystems: Accessing (1/2)
Lustre filesystems are used like regular directories:
ls, cd, cp, mkdir, rm, etc.
JOBFS exists for duration of job.
Its location will be stored in $PBS_JOBFS (discussed later).
Accessing massdata:
mdss command. Provides put/get, rm, ls, etc.
netcp/netmv commands.
More details in a moment!
Filesystems: Accessing (2/2)
Project directories: filesystem/projcode
filesystem = /short, /g/data, /massdata, etc.
All project members have read/write/execute permissions for these dirs.
Each person has their own subdirectory:
filesystem/projcode/$USER
To check filesystem status:
/opt/rash/bin/modstatus -n opt
opt = gdata1_status, gdata2_status or mdss_status
Message of the Day (cat /etc/motd).
Emergency downtime notice.
Filesystems: Accessing massdata (1/3)
Method 1: mdss command.
Provides familiar file operations as subcommands:
mdss ls, mdss mkdir mydir, mdss rm -r mydir, etc.
Login and datamover nodes only.
Latter usually requires job scheduler.
mdss assumes filenames are relative to /massdata/projcode.
Use -P option to specify project other than default.
Filesystems: Accessing massdata (2/3)
Method 1: mdss command (cont'd).
put, get, stage subcommands use the same syntax.
E.g., mdss put myfile target, mdss put -r mydir.
target optional, and must already exist if it specifies a directory.
stage transfers files to cache for later use.
get stages and retrieves.
mdss dmls similar to mdss ls. Also indicates state:
REG = cached, OFL = on tape, DUL = both
mdss creates checksums (mdss -v to verify). See man mdss!
Filesystems: Accessing massdata (3/3)
Method 2: The netcp/netmv commands.
cp/mv commands for massdata. E.g.,
netcp myfile target
target is optional.
Can’t be used to retrieve files from massdata.
Can push files to remote ssh servers (requires passphrase-less access).
Options to archive (-t) and compress (-z):
netcp -z -t myfile.tar mydir
Implemented as copy job:
Requires familiarity with job scheduler (more later).
Uses default resource limits unless -l is specified.
Produces job summary files.
Exercise 4: Filesystems (1/4)
For the training project, /g/data is a link to /g/data2:
cd /g/data/$PROJECT # All group members can access this.
pwd
ls -la /g/data/$PROJECT
cd /short/$PROJECT/$USER; pwd # Only you can access this.
Try the ls and du subcommands for mdss (mdss assumes filenames are relative to /massdata/$PROJECT):
mdss ls
mdss ls .. # '..' is the parent directory.
mdss ls -la
mdss du -h # Also mdss du -sh
Exercise 4: Filesystems (2/4)
Create two test files, and bundle them into a tar file:
cd /short/$PROJECT/$USER
rm * # Remove existing files.
touch file1.$USER file2.$USER
tar cvf testfiles.tar file* # See man tar for c, v, f options.
ls
tar --list -f testfiles.tar # Check contents of the archive.
Create a user directory on massdata:
mdss rm -r $USER # Delete the old directory, if it exists.
mdss mkdir $USER
mdss ls
Next, put testfiles.tar into your massdata directory.
Syntax: mdss put [-r] myfile [targetname]
targetname and -r optional, -r copies directory (man mdss).
Check the result using mdss ls $USER
Exercise 4: Filesystems (3/4)
Where is the file stored?
mdss dmls -l #REG = cached, OFL = tape, DUL = both
Remove files before retrieving archived copies:
rm file1.$USER file2.$USER testfiles.tar
Use mdss get to retrieve testfiles.tar. Syntax same as
mdss put. See man mdss.
Check the result using ls.
Unpack the archive:
tar xvf testfiles.tar # extract, verbose, filename, see man tar.
ls
Exercise 4: Filesystems (4/4). If time permits...
Use netcp to copy files to massdata:
cd /short/$PROJECT/$USER
mkdir ex4; cp file* ex4
mdss rm -r ex4 # Remove the old copy from massdata.
netcp ex4 $USER/ex4
Notice that netcp returns a job ID (creates a copy job).
mdss ls # 'ex4' won't appear until job finishes.
qstat jobid # Displays job status.
watch -n 4 qstat jobid # Might help, see 'man watch'.
When the job finishes, check the result: mdss ls $USER/ex4
Inspect the contents of the job's output (.o) and error (.e) files.
Repeat the copy, this time using -t (archive) and -z (compress):
mdss rm $USER/ex4/* # Clean the directory.
netcp -z -t myfile.tar /short/$PROJECT/$USER/ex4 $USER
Check that the copy was successful, then mdss get and 'untar' the file.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Scheduling Jobs: Overview
Tasks that are too large for login nodes must be submitted to job scheduler (modified version of PBS Pro).
'large' means >30 mins cpu time or >2GB mem.
Only tasks submitted to scheduler are charged for compute time!
Scheduler optimises throughput and gives fair share to each project.
Scheduling Jobs: Cluster nodes (1/2)
Compute nodes only accessible via job scheduler.
Datamover nodes accessible remotely or via scheduler.
Remote transfers to r-dm.nci.org.au aren't charged for compute time.
Long remote transfers are permitted.
Scheduling Jobs: Cluster nodes (2/2)
Raijin: 3592 compute nodes, 6 datamover nodes.
Each node comprises dual 8-core Intel Xeon Sandy Bridge 2.6 GHz processors, i.e., 16 cores (core = 'CPU').
High-speed communication between cluster nodes (Infiniband).
Compute node memory capacities:
Mem     Hostname
32GB    r1..r2395      67% of all nodes
64GB    r2396..r3520   31% of all nodes
126GB   r3521..r3592   2% of all nodes
Scheduling Jobs: Compute jobs vs copy jobs
A job can use compute nodes, or datamover nodes, but not both.
Compute jobs (compute nodes). . .
. . . can’t access massdata filesystem.
. . . shouldn’t be used for tasks that are mostly disk-based.
. . . can’t access the internet.
Copy jobs (datamover nodes). . .
. . . disk intensive tasks: moving/compressing/tarring large data.
. . . copying input/output files to/from massdata.
. . . can only use a single CPU.
. . . can access the internet (wget, sftp, svn, git, etc.).
Scheduling Jobs: Job queues
Jobs submitted from login nodes (use qsub command, more later).
Three job queues: normal, express, copyq.
Compute jobs: normal, express
Copy jobs: copyq.
normal is the default.
Job waits in queue until resources become available. . .
. . . at which point job is executed on compute or dm nodes.
Scheduling Jobs: Which queue? (1/2)
normal (default):
Can request large # of CPUs (10,000+).
Can request any memory type (32/64/126GB nodes).
express:
High priority jobs. Often start shortly after being submitted.
50 additional, dedicated nodes.
Charged at three times the rate of other queues.
E.g., a 5 hr, 2 CPU express job costs 5 × 2 × 3 = 30 SUs.
Small per-job resource limits:
≤ 8 nodes, ≤ 32GB mem per node.
walltime ≤ 24 hours (single node).
walltime ≤ 5 hours (multinode).
Scheduling Jobs: Which queue? (2/2)
copyq:
Intended for manipulation of large files.
Only queue that can access massdata/internet.
Single CPU only.
nf_limits shows project's walltime limits for the specified # of CPUs.
Mem limit equal to maximum available.
We can extend walltime limits on a per job/user/project basis.
Scheduling Jobs: Job costs
Cost of job (in SUs, i.e., service units) calculated as:
walltime × # CPUs × W
walltime = real-world time.
normal/copyq queues: W = 1. express queue: W = 3.
Charged for walltime used, not walltime requested.
Try not to request far more than needed.
Charged for # of CPUs (i.e., cores) requested, not # used.
Project’s SU quota is updated after each job finishes.
nci_account also shows. . .
. . . project’s total SU usage.
. . . (with -v option) breakdown of SU usage by user.
. . . cost of running and queued jobs.
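The cost formula can be checked with a throwaway shell function (su_cost is purely illustrative, not an NCI command):

```shell
# su_cost HOURS NCPUS W  →  walltime × #CPUs × W, in SUs (whole hours only).
su_cost() {
    echo $(( $1 * $2 * $3 ))
}

su_cost 3 48 1   # 3 hrs × 3 nodes (48 cpus) × W=1 → 144 SUs
su_cost 5 2 3    # 5 hrs × 2 cpus, express (W=3) → 30 SUs
```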
Scheduling Jobs: Queue times (1/2)
Scheduler doesn’t use FIFO policy. However,. . .
. . . older jobs given higher priority.
Jobs wait for resources. Requesting more than needed. . .
. . . increases job’s queue time.
. . . delays other jobs.
. . . wastes resources and compute grant.
express queue jobs often start soon after being submitted.
Requesting higher-mem nodes increases queue time, esp. 126GB nodes.
Jobs won’t start if storage grant exceeded.
Scheduling Jobs: Queue times (2/2)
Large jobs (≥ 512 cpus) assigned higher priority because. . .
. . . resources would otherwise be tied-up while job is waiting.
. . . it's hard for scheduler to fit other jobs around a large job (the TETRIS effect).
Priority decreases if project has large # of running jobs.
If you use your allocation too quickly (slowly), priority decreases (increases).
Jobs run with lower priority once grant is exhausted (‘bonus’ jobs).
Detailed scheduling policy.
Load on Raijin spikes at end of each quarter.
Don't leave it to the last minute to use your quarterly SU grant!
Scheduling Jobs: Submitting jobs
Jobs are submitted using qsub options. Returns job ID #.
Use -q normal (default)/express/copyq to specify queue:
qsub -q express ...
Use -l option to specify resources:
qsub -l walltime=01:00:00 -l ncpus=32 -l mem=2GB
Licensed software requires -l software=packagename.
To override default project: -P projectcode.
Non-interactive job: qsub options scriptname
Alternatively, qsub options -- ListOfCommands
Avoid using '--' syntax. Doesn't 'source' dot files.
Interactive job: qsub -I options
Scheduling Jobs: Non-interactive jobs
Non-interactive job: qsub options scriptname.
Script will be executed on first ('head') node assigned to job.
Most options can be placed in job script.
When job ends, scheduler creates two summary files (more later).
Job scripts have fixed structure:
1. Shell invocation.
2. PBS directives (essentially qsub options).
3. User commands (must come last).
Contents of myscript.sh:
#!/bin/bash
#PBS -l walltime=20:00:00
#PBS -l mem=100MB
#PBS -l ncpus=16
#PBS (other pbs directives)
echo "This job does very little."
Scheduling Jobs: Interactive jobs
Interactive job: qsub -I options
If walltime not specified, uses queue defaults.
ctrl-c to cancel job before it starts.
Prompted when job starts. . .
. . . commands typed into terminal are executed on
compute/dm nodes.
. . . be sure to use exit command to close session.
For programs that require X-Windows, use qsub’s -X option.
NB. Scheduler won’t save output/error msgs to file.
Scheduling Jobs: CPU/Mem requests
-l ncpus=
Single-node job: ncpus ≤ 16. Job can share a node if there is enough free memory and CPUs.
Multinode job: Must request whole nodes (ncpus multiple of 16). Nodes won't be shared; can request all available memory.
-l mem=
Specifies total amount of memory required.
Nodes assigned to job will have same memory capacity.
Per-node memory request calculated as mem/#nodes.
E.g., mem=80GB, ncpus=32 (2 nodes) ⇒ 40GB mem per node.
∴ job will be assigned to the 64GB memory nodes.
Multinode job, might as well use mem=128GB, i.e., 2×64GB.
Try to choose mem and ncpus so that job uses 32GB mem nodes.
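The per-node arithmetic can be sketched as a small shell helper (per_node_mem is a hypothetical illustration of what the scheduler does internally; it assumes whole 16-cpu nodes):

```shell
# per_node_mem NCPUS MEM_GB  →  how a memory request is split across nodes.
per_node_mem() {
    nodes=$(( $1 / 16 ))                          # 16 cpus per node
    echo "${nodes} nodes, $(( $2 / nodes ))GB per node"
}

per_node_mem 32 80    # mem=80GB over 2 nodes → 40GB/node → 64GB nodes
per_node_mem 64 120   # 4 nodes → 30GB/node → fits the 32GB nodes
```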
Scheduling Jobs: Handy PBS directives
(In your own time. See man qsub for more options).
-l wd
At the start of job, working dir set to submission dir.
Job's '.o' and '.e' summary files placed in this dir.
Otherwise working directory defaults to home dir (~).
Suppresses execution of login and logout files.
-o filename, -e filename
Tells PBS where to put '.o' and '.e' summary files.
-m EmailEvent
Send email notifications for specific events:
a: job aborted, b: job began, e: job ended.
E.g., -m abe (default -m a).
-M Email1,Email2,...
Recipients for job notification emails.
Scheduling Jobs: Job environment (1/2)
When job starts, PBS. . .
. . . saves job parameters as environment variables.
. . . executes ‘dot files’, except .bash_logout/.logout.
Logout scripts executed when job ends.
-l wd option suppresses execution of login/logout files.
PBS environment variables. . .
Useful for programs/scripts that require info about execution environment.
Only visible to job script or terminal running interactive job.
-V option copies predefined environment variables to job environment.
Scheduling Jobs: Job environment (2/2)
Some useful PBS variables:
$PBS_JOBID Job identifier (!).
$PBS_NCPUS # of cpus requested, i.e., ncpus.
$PBS_NODEFILE File that lists nodes assigned to job.
$PBS_JOBFS Job's assigned JOBFS (scratch) directory.
$PBS_VMEM Memory request, i.e., mem, not Vmem.
$PBS_O_WORKDIR Name of job submission directory.
qstat -f JobId shows PBS variables for specified job.
Also see PBS Pro manual (there might be small discrepancies).
Scheduling Jobs: Postmortems (1/2)
PBS captures standard output/error produced by non-interactive jobs.
stdout: Jobname.oJobId    stderr: Jobname.eJobId
Automatically copied over to working dir when job ends.
Summary of resource usage is appended to '.o' file.
If PBS detects an error, PBS appends message to '.e' file.
Check these files if job terminates abnormally!
Sometimes OS kills job before PBS realises there's a problem, especially if mem usage spikes.
Scheduling Jobs: Postmortems (2/2)
Contents of myscript.sh.o123456:
==========================================================
Resource Usage on 2013-07-20 12:48:04.355160:
JobId: 123456.r-man2
Project: abc
Exit Status: 0 (Linux Signal 0)
Service Units: 32.00
NCPUs Requested: 32            CPUs Used: 32
CPU Time Used: 18:50:43
Memory Requested: 900mb        Memory Used: 80mb
Vmem Used: 94mb
Walltime requested: 02:00:00   Walltime Used: 01:00:00
jobfs request: 100mb           jobfs used: 1mb
==========================================================
Memory Used: Mem used by the head node.
Vmem: Ignore this.
jobfs used: JOBFS used by all nodes. Details to come.
CPU utilisation is low if CPU Time ≪ Walltime Used × NCPUS.
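That utilisation check can be sketched as a throwaway helper (utilisation is not an NCI tool; it just parses the H:MM:SS values from the sample summary above):

```shell
# utilisation CPU_TIME WALLTIME NCPUS  →  integer percent.
utilisation() {
    echo "$1 $2 $3" | awk '{
        split($1, c, ":"); cpu  = c[1]*3600 + c[2]*60 + c[3]
        split($2, w, ":"); wall = w[1]*3600 + w[2]*60 + w[3]
        printf "%d\n", 100 * cpu / (wall * $3)
    }'
}

utilisation 18:50:43 01:00:00 32   # the sample job above: ~58% utilised
```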
Scheduling Jobs: JOBFS requests (1/2)
JOBFS: Node-local scratch space, 396GB/node.
Slow. Can outperform /short, /g/data for small/frequent IO.
Only lasts for duration of job.
Don’t write checkpoint files to JOBFS.
qsub option/PBS-directive is -l jobfs=amount.
amount is the total jobfs request. E.g., 100MB,
25GB,...
Per-node jobfs request calculated as amount/#nodes.
PBS stores path to JOBFS in $PBS_JOBFS.
$PBS_JOBFS only visible to your job!
nci.org.au73/86
Scheduling Jobs: JOBFS requests (2/2)
Example JOBFS usage:
Contents of myscript.sh:
#!/bin/bash
#PBS -l ncpus=64
#PBS -l jobfs=2GB
(OTHER PBS DIRECTIVES)
echo The JOBFS directory for this job is $PBS_JOBFS
cp my_input_file $PBS_JOBFS
myprogram $PBS_JOBFS/my_input_file $PBS_JOBFS/my_output_file
cp $PBS_JOBFS/my_output_file /short/c25/$USER
The effective per-node JOBFS request is 2GB/(64/16) =
512MB.
Script executed on head node only. ∴ cp copies to/from
head node only.
mdss and netcp/netmv commands don’t work for JOBFS.
Also see ‘What is the JOBFS filesystem?’.
nci.org.au74/86
Scheduling Jobs: Other filesystems
To prevent job running if /g/data or massdata offline:
#PBS -l other=filesystem
filesystem = gdata1, gdata2, mdss (i.e., massdata).
Not mandatory, but good practice.
massdata not available to compute jobs, i.e.,
normal/express queues* .
*mdss command only works from copyq jobs and login
nodes.
You can also use modstatus to check filesystem availability:
/opt/rash/bin/modstatus -n status
status = gdata1_status, gdata2_status, or mdss_status.
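For instance, a minimal sketch for a job that reads from /g/data1 (gdata1 here is just an example; substitute the filesystem your job actually uses):

```shell
# In the job script: hold the job back while /g/data1 is offline.
#PBS -l other=gdata1

# From a login node, before submitting: check /g/data1 availability.
/opt/rash/bin/modstatus -n gdata1_status
```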
nci.org.au75/86
Scheduling Jobs: Modifying jobs
qalter JobId: Change resource request of jobs waiting to
start.
walltime, mem, ncpus, project,. . .
qdel JobId: Delete queued or running jobs.
exit: Stop currently-running interactive job.
qhold: Prevent queued job from starting, e.g., job
dependencies.
qselect: Lists jobs that meet criteria, e.g., belong to project
X.
We can increase walltime of running jobs (32, 64GB mem
nodes only).
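As a sketch (1234567 is a hypothetical job id; qrls, the standard PBS counterpart of qhold, releases a hold):

```shell
qalter -l walltime=01:00:00 1234567  # shrink walltime request of a queued job
qhold 1234567                        # prevent a queued job from starting
qrls 1234567                         # release the hold
qdel 1234567                         # delete the job altogether
```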
nci.org.au76/86
Scheduling Jobs: Job status
To display job status: qstat options JobId1 JobId2...
Some useful options (see man qstat for many more):
-u username List user’s queued/running jobs.
-q queuename Show jobs for the specified queue.
-x Include jobs that have finished in the
last day.
-f Show all information about job(s).
Resource usage is aggregate of all nodes.
-s System comments. Good for
troubleshooting.
-n List hostname of nodes assigned to job.
-w Use wider output fields.
nqstat, nqstat_anu: status of jobs belonging to your
projects*. *new jobs might not show up immediately.
nci.org.au77/86
Scheduling Jobs: Job progress
qps: Resources used by job’s processes. Same options as
ps.
qstat -n, qstat -f: Show list of nodes assigned to job.
qstat -f, nqstat_anu: Give rough indication of cpu
utilisation %.
pbs_rusage: Summary of resource usage, as given in ‘.o’
file.
qls: List contents of running job’s JOBFS dir.
qcat: Show job script or std output/error produced so far
(‘.o’ and ‘.e’ files).
qcp: Copy files to/from running job’s JOBFS dir.
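A typical monitoring session for a running job might look like this (1234567 is a hypothetical job id):

```shell
qstat -f 1234567   # full details, incl. rough cpu utilisation %
qps 1234567        # the job's processes, ps-style
qcat -o 1234567    # stdout produced so far (the eventual '.o' file)
qls 1234567        # contents of the job's JOBFS directory
```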
nci.org.au78/86
Scheduling Jobs: Checkpoints/Automation
Long run-times expose jobs to system/program instabilities.
You won’t be reimbursed for lost SUs.
Consider implementing a checkpoint mechanism.
Don’t save checkpoint files to JOBFS.
Self-submitting jobs can resume automatically if interrupted.
Job dependencies: -W depend=type:JobId1:JobId2...
Also on, before, beforeok, etc. See PBS Pro manual.
Multiple levels of dependencies can fail if jobs take too long.
type
after Start after dependencies have started.
afterok Start if dependencies finish successfully.
afternotok Start if dependencies finish with errors.
afterany Start after all dependencies finish.
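A minimal sketch of chaining two jobs with afterok (stage1.sh and stage2.sh are placeholder script names; qsub prints the id of the job it submits, which we capture):

```shell
# Submit the first stage; qsub prints its job id (e.g. 1234567.r-man2).
JOB1=$(qsub stage1.sh)
# The second stage stays queued until the first exits successfully.
qsub -W depend=afterok:"$JOB1" stage2.sh
```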
nci.org.au79/86
Scheduling Jobs: Note on parallelism (1/3)
Many packages take advantage of parallelism automatically.
Options for parallelising custom code:
Option 1. Job-script starts multiple copies of your program.
(a) ‘for’ loop to start processes in background (&), then
wait.
(b) pbsdsh, pbsdsh_anu (like ssh): Can detect multiple
nodes.
(c) pbs_tmrsh (like ssh): Flexible, but must give it node names from $PBS_NODEFILE.
pbsdsh, etc., only work from within job script (or interactive
job).
Option 1 works for serial code.
Contention when multiple processes access same file.
1000’s of simultaneous IO ops can degrade Lustre
speed.
Work and memory are replicated unnecessarily.
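Option 1(a) can be sketched as follows; sleep stands in for a real serial program:

```shell
#!/bin/bash
# Start several copies of a serial program in the background with '&',
# then 'wait' for all of them before the script continues.
for i in 1 2 3 4; do
    ( sleep 0.1; echo "task $i done" ) &
done
wait                      # returns once every background process has exited
echo "all tasks finished"
```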
nci.org.au80/86
Scheduling Jobs: Note on parallelism (2/3)
Option 2. Shared-memory parallelism via OpenMP.
(Not to be confused with OpenMPI).
CPUs must reside on same node. ∴ limited to 16 CPUs.
Imposes parallelism onto serial code via embedded
compiler directives.
Can combine with Option 1 to overcome node limit
(cumbersome).
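A single-node OpenMP job script might look like this sketch (my_omp_prog is a placeholder for a binary compiled with OpenMP support, e.g. with gcc -fopenmp):

```shell
#!/bin/bash
#PBS -l ncpus=16
#PBS -l wd
# Run one OpenMP thread per requested CPU, all on the same node.
export OMP_NUM_THREADS=$PBS_NCPUS
./my_omp_prog
```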
nci.org.au81/86
Scheduling Jobs: Note on parallelism (3/3)
Option 3. Distributed parallelism via MPI library.
Arbitrary number of CPUs/nodes.
Overcomes limitations of previous two options.
Many programs can be implemented using just the basic
MPI calls.
Highly-optimised version of OpenMPI installed on Raijin.
Once you’re accustomed to MPI you’ll never look back. . .
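A minimal MPI job-script sketch (my_mpi_prog is a placeholder for a binary built against the installed OpenMPI, e.g. with mpicc):

```shell
#!/bin/bash
#PBS -l ncpus=64
#PBS -l wd
# mpirun picks up the CPU/node allocation from PBS,
# launching one MPI process per requested CPU across 4 nodes.
module load openmpi
mpirun ./my_mpi_prog
```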
nci.org.au82/86
Exercise 5. Using job scheduler (1/3)
Create the following job script, and call it, e.g., exercise5.sh:
#!/bin/bash
#PBS -q express
#PBS -l walltime=00:04:00
#PBS -l ncpus=2,mem=10MB,jobfs=10MB
#PBS -l wd
echo "ncpus = $PBS_NCPUS, total mem = $PBS_VMEM bytes"
echo "jobfs dir = $PBS_JOBFS"
echo "Contents of node file:"
cat $PBS_NODEFILE
NUM_NODES=$(cat $PBS_NODEFILE | wc -l)
#NB. $(command) is replaced by output of command.
NODE_NAMES=$(uniq $PBS_NODEFILE) # See man uniq.
echo "# of nodes: $NUM_NODES"
echo "Hostnames of nodes: $NODE_NAMES"
sleep 300 # Sleep for 5 minutes.
echo "Some things just aren't meant to be."
nci.org.au83/86
Exercise 5. Using job scheduler (2/3)
Make the script executable and submit it to the scheduler:
chmod +x exercise5.sh
qsub exercise5.sh
Experiment with qstat/nqstat/nqstat_anu, e.g.,
qstat -Q, qstat normal, qstat -u $USER,
qstat -saw JobId, nqstat_anu -P $PROJECT
Once job starts, check progress using, e.g.,
qstat -f JobId, qps JobId, qcat -o JobId
Wait for job to finish:
watch -n 4 qstat JobId # ctrl-c to stop watching.
Did job finish successfully? Inspect .o and .e files, e.g.,
cat exercise5.sh.oJobId
nci.org.au84/86
Exercise 5. Using job scheduler (3/3)
If you like, submit interactive job with X-Windows option for
qsub (-X):
Must be connected to Raijin using ssh with -X (PC) or -Y
(Mac) option.
Then,
qsub -I -X -q express -l walltime=00:02:00,ncpus=1
(when job starts) xeyes
ctrl-c to close xeyes.
Make sure you use exit command to end the job!
Just to be certain: qdel JobId
nci.org.au85/86
Finally. . .
Raijin fun facts!
Time-lapse video of Raijin being assembled.
Watch our tape robot at work.