nci.org.au
@NCInews
Using Raijin
Resources
These slides:
http://nci.org.au/user-support/training/using-raijin-course/
NCI guides: http://nci.org.au → User Support
Training material: http://nci.org.au/services-support/training/
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Connecting: Basics
Raijin: Unix cluster (CentOS 6.6), 3592 compute nodes.
Interactive terminal (text only):
ssh -l abc123 raijin.nci.org.au, or
Windows ssh clients: PuTTY, MobaXterm, Xming.
Graphics-enabled session (i.e., X-Windows/X11):
ssh -X ... (PC)    ssh -Y ... (Mac)
Remote file transfer: scp/sftp/rsync commands, or graphical ftp client.
It's good practice to logout of xterm sessions (or ctrl-d, or exit).
Connecting: Login nodes
If you can't connect to raijin.nci.org.au, try raijin1-raijin6.
Use login nodes for small tasks:
Editing files, submitting jobs, small file transfers, compiling small programs, etc.
'Intensive' tasks killed automatically (>2GB ram, >30 mins cpu time).
Connecting: Data-mover nodes
Small file-transfers: raijin.nci.org.au
Small/Large transfers: r-dm.nci.org.au
Can use scp/rsync/sftp. Syntax (scp/rsync):
scp [source-file/dir] [destination-file/dir]
Server: Machine running scp server (usually r-dm.nci.org.au).
Client: Machine that initiates the copy (usually your PC).
Push: Client → Server.
Pull: Client ← Server.
Connecting: Using scp (in your own time)
Examples. Use your PC as client (run scp in a local terminal, not on Raijin).
Push, i.e., Client (your PC) → Server (Raijin):
scp myfile abc123@raijin.nci.org.au:mydir
mydir must exist in your home (~) dir.
scp myfile abc123@raijin.nci.org.au:
Copies to home dir, don't forget colon.
scp -r mydir abc123@raijin.nci.org.au:parentdir
parentdir must exist in your home dir.
Pull, i.e., Client (your PC) ← Server (Raijin):
Swap the order of the arguments, e.g.,
scp abc123@raijin.nci.org.au:myfile mydir
mydir must exist in the current dir.
Connecting: Using rsync (in your own time)
rsync uses same basic syntax as scp. Many more options:
rsync -avPS myfile abc123@raijin.nci.org.au:mydir
-a: Archive. (Recursive copy, preserves permissions/owner/group/mtime, . . . )
-P: Resume partial file transfers.
-S: Handle sparse files.
Also, -z to enable compression, --progress to show progress, etc.
Consult the manual pages: man scp and man rsync.
Connecting: Passphrase-less access
Sometimes necessary, e.g., automation of remote data-transfers.
Don't put password in file/script; use passphrase-less ssh keys.
ssh keys can be configured to. . .
. . . allow only certain commands.
. . . restrict arguments, such as directory names.
Passphrase-less file transfer: use rrsync instead of rsync (also rscp).
Strongly discouraged. Weakens security of NCI and your system.
Exercise 1: Connecting (1/2)
Login to Raijin:
Service disruptions are reported in the 'Message of the Day'.
Who/Where am I?
whoami
man hostname # 'man' is Unix's help system, q to exit.
hostname
pwd # shows name of current directory.
echo . # '.' refers to the current directory.
echo ~ # '~' is an alias for your home directory.
cat /etc/motd
You might wish to refer to Handy Unix Commands.
Exercise 1: Connecting (2/2)
Try to run the xeyes or xclock commands (they won't work).
logout (or ctrl-d, or exit), reconnect with x11 forwarding:
(Mac) ssh -Y ... (PC) ssh -X ...
Now try, e.g., xeyes (ctrl-c to stop).
Remote file transfer: (if time permits)
logout and 'push' a file from your computer to Raijin,
scp myfile abc123@raijin.nci.org.au:
(You must include the colon!)
Log back in and use ls or ls -la to check the result.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Resource Quotas
The two resources that are metered are storage and compute time.
Compute grant divided into quarterly amounts.
Resource usage is accounted against projects.
Projects implemented as Unix groups.
Users can belong to multiple projects.
Resource Quotas: Default project
Usage is 'charged' to your default project, unless otherwise specified.
Edit your .rashrc to change your default project:
setenv PROJECT c25
setenv SHELL /bin/bash
.rashrc is hidden in your home folder (ls -a lists all files).
Settings in .rashrc are applied each time you login.
Modifying .rashrc is the best way to set your project.
Resource Quotas: Overriding default project
$PROJECT: Name of active project (see environment variables).
newgrp doesn't update $PROJECT, instead use. . .
. . . switchproj. Changes active project for the current session.
. . . nfnewgrp. Changes active project for specified command.
For more info, run switchproj/nfnewgrp without arguments.
Most commands relating to project accounting/job-scheduling let you override default project.
Never put switchproj in login scripts. (More on these later).
Resource Quotas: Data grant (1/2)
Storage grant has two components:
1. Amount of data.
2. # of files/dirs ('inodes').
Project's data usage is based on file ownership.
Location of file has no bearing on quota usage.
Use chgrp to change which project owns a file:
chgrp NameOfNewProject myfile
Output files produced by jobs submitted to scheduler belong to your default project. . .
. . . unless you specify otherwise (more on jobs later).
Resource Quotas: Data grant (2/2)
Projects only have storage grant for /short, /g/data, massdata.
/home capped at 2GB and 80,000 inodes.
Project dirs: /short/projcode, /g/data/projcode, /massdata/projcode
To access massdata: mdss command or ftp to r-dm.nci.org.au.
/g/data is symbolic link to /g/data1 or /g/data2.
E.g., /g/data/c25 → /g/data2/c25
Resource Quotas: Data usage (1/2)
nci_account: Summary of all resources used by a project.
Total disk usage (data, inodes) for main filesystems.
Project (-P) and time period (-p) options:
nci_account -P c25 -p 2014.q4
Lustre filesystem stats can be ~30 minutes old.
lquota: Quotas and overall usage for Lustre filesystems (/home, /short, /g/data).
Shows usage for all of your projects.
Unlike nci_account, lquota queries filesystem directly.
To estimate amount of data in home dir: du -sh ~, find ~ | wc -l
Resource Quotas: Data usage (2/2)
short_files_report: Project's disk usage for /short only.
-G and -P options show breakdown by user.
Stats can be up to ~1 day old.
Example 1. To show where the data owned by project c25 is:
short_files_report -G c25
Locating misplaced files/files with incorrect ownership.
Determining project's overall disk usage.
Determining who is using the most space.
Example 2. To show which projects own the data in /short/c25:
short_files_report -P c25
Locating misplaced files/files with incorrect ownership.
Also gdata1_files_report and gdata2_files_report.
Resource Quotas: Compute policy (1/2)
You are only charged for cpu time used by 'batch jobs', i.e., jobs submitted to the scheduler.
This applies to both compute and copy jobs.
All other cpu usage is free. Login nodes subject to cpu/mem limits.
*ssh can issue noninteractive commands to r-dm.nci.org.au.
Resource Quotas: Compute policy (2/2)
Remote transfers to r-dm.nci.org.au ‘unlimited’.
In general, can't transfer files remotely to/from massdata.
All other large file-operations must use job scheduler, e.g.,
/g/data → massdata
Resource Quotas: Compute usage
Compute grant specified as number of ‘Service Units’ (SUs).
SU = one hour of walltime on one cpu.
walltime = real-world time.
Each cluster node has 16 cpus (‘cpu’ = core).
E.g., a job that uses 3 compute nodes for 3 hrs costs 3 × 16 × 3 = 144 SUs (excludes 'express' jobs, discussed later).
Project’s compute usage is updated after each job finishes.
Use nci_account to view project’s overall compute usage.
Also shows costs of running and queued jobs.
-v option: breakdown of compute usage by user.
Exercise 2: Accounting
Amount of data in your /home (~) dir: du -sh ~
Approx. # of files in your /home dir: find ~ | wc -l
(The pipe '|' directs output of find to input of wc. See man wc and man find. Try the find command by itself).
nci_account, also try -v and -vv options for information overload.
lquota
short_files_report -G c25
short_files_report -P c25
Also gdata1_files_report and gdata2_files_report.
Notice that c25 doesn’t have gdata1 allocation.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Software Environment: The shell
Each terminal runs a separate shell.
The shell interprets and executes commands.
(Handy Unix Commands).
Many shells to choose from.
bash is the most popular (default), followed by tcsh.
Edit your .rashrc to change your default shell:
setenv PROJECT c25
setenv SHELL /bin/bash
Shell commands can be grouped into scripts.
Each script runs in a subshell.
Append & to run a command in background (related to wait, ctrl-z, fg, bg. See, e.g., man wait).
Software Environment: Shell variables
Shell lets you define variables that can be read from the command line.
N=10 OUTPUTFILE=myfile.out (bash syntax)
set N=10 OUTPUTFILE=myfile.out (tcsh syntax)
Retrieve the value by prepending a $.
echo "The value of N is $N"
To make existing variable visible to subshells such as scripts:
(bash syntax) N=10
export N
export OUTPUTFILE=myfile.out
(tcsh syntax) setenv OUTPUTFILE myfile.out
Useful pre-defined vars: $PROJECT, $USER, $HOME, $0 (shell type).
Also see Canonical Environment Variables.
Software Environment: Shell scripts
First line of a script invokes a new shell.
Usually #!/bin/bash or #!/bin/tcsh
Next come user-commands. E.g., (contents of myscript.sh):
#!/bin/bash
N=10 # Anything after a '#' is a comment.
echo $N
To make script executable: chmod +x myscript.sh
To run script: myscript.sh or ./myscript.sh
When script finishes, original values of variables are restored.
To make changes persistent, use source (or '.'). E.g.,
. myscript.sh
source runs the script in the current shell rather than a subshell, so its changes persist.
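The subshell-vs-source distinction is easy to demonstrate. A minimal sketch you can try in a scratch directory (setvar.sh is a throwaway example name, not part of Raijin's setup):

```shell
# Create a tiny script that sets a variable.
cat > setvar.sh <<'EOF'
#!/bin/bash
N=10
EOF
chmod +x setvar.sh

N=1
./setvar.sh                       # Runs in a subshell; our N is untouched.
echo "after ./setvar.sh: N=$N"    # N is still 1.
. ./setvar.sh                     # Sourced: runs in the current shell.
echo "after sourcing:   N=$N"     # N is now 10.
```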
Software Environment: Defaults
System-scripts for setting default environment ('dot files'). . .
. . . run each time a shell is created/destroyed.
. . . hidden in your home (~) directory.
When shell is created:
Type of shell    bash         tcsh
Login            .profile     .login, .cshrc
Non-interactive  $BASH_ENV    .cshrc
Interactive      .bashrc      .cshrc
When you logout: .bash_logout (bash), .logout (tcsh).
Raijin: BASH_ENV=.bashrc. Default .profile executes .bashrc.
Keep # of commands in your dot files to a minimum.
Avoids conflicts/recursive execution of dot files.
Software Environment: Editors
Several editors are installed on Raijin.
Convenient for modifying job scripts.
Editors with text-based interfaces:
vi/vim
emacs
nano
vi/vim/emacs are powerful but not intuitive at first.
Editors with graphical interface:
nedit is a simple graphical editor.
emacs (unless -nw option is specified).
Require X-Windows enabled session (ssh -X or -Y ...).
MS Windows uses slightly different text file format. Can use dos2unix/unix2dos to convert your scripts.
Software Environment: Modules (1/2)
Many software packages available on Raijin (software catalogue).
Configuring your environment for each package isn't trivial.
Modules take care of this for you:
module load/unload SoftwareName
module avail shows list of available modules.
module avail SoftwareName shows available versions.
module show SoftwareName shows changes module makes.
module list lists the modules you have currently loaded.
Some packages require more than one module to be loaded. E.g., many packages require OpenMPI.
Prerequisites documented in software catalogue.
Software Environment: Modules (2/2)
Can put module commands in dot files.
. . . preferably in .profile/.login instead of .bashrc/.cshrc.
Default dot files contain a small number of 'core' modules.
Putting module commands in dot files can lead to conflicts.
It's better to put such commands in your job scripts instead.
Putting module purge in dot files can result in strange errors.
You can define your own modules: Module user-guide.
Exercise 3: Software environment (1/4)
As always, Handy Unix Commands.
Inspect some predefined variables. E.g., echo $HOME (compare with echo ~).
printenv shows list of defined environment variables.
Which shell are you running? echo $0
When you enter a command that isn't built-in, the shell searches directories named in $PATH:
echo $PATH
PATH=$PATH:~/mydir
echo $PATH
Inspect your login scripts:
ls -la ~ # What happens if you omit the 'a'?
cat ~/.profile # Note the 'default' modules.
cat ~/.bashrc # ...or ~/.login and ~/.cshrc if using tcsh.
Exercise 3: Software environment (2/4)
Try the following module commands:
module avail
module avail python
module show python # Shows what environment vars will be set
echo $PYTHON_BASE
module load python # Loads the default version
echo $PYTHON_BASE
which python
Use module list to check which version is loaded.
Try loading the module for a different version of python. You must unload the previous version first:
module unload python # No need to specify which version
module list
echo $PYTHON_BASE
which python
Exercise 3: Software environment (3/4)
(If time permits)
Make a simple script: nano myscript.sh
(. . . or use, e.g., nedit if X-Windows is enabled).
Insert the following lines:
#!/bin/bash
echo This script is running in a non-interactive subshell.
Save the script (ctrl-o) and exit nano (ctrl-x).
ls -l myscript.sh # Look at file permissions.
chmod +x myscript.sh # Make script executable.
Check the file permissions once more.
Exercise 3: Software environment (4/4)
Add the following to the end of .profile:
echo Starting a new login shell.
Add the following to the end of .bashrc:
if [ -z "$PS1" ]; then # Don't forget the spaces!
echo This shell is interactive.
else
echo Either default .profile was executed, or this is a non-interactive shell and BASH_ENV=.bashrc.
fi # This isn't a typo.
For each of the next three steps, which, if any, dot files are executed?
1. logout (or ctrl-d) and then log back in.
2. Run ./myscript.sh. Try echo $BASH_ENV.
3. Start an interactive subshell by typing bash.
Type exit to close the subshell and return to the parent shell.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Filesystems: Capacities
Gratuitous slide showing disk capacities.
/g/data1 and /g/data2 together have more capacity than 20,000 laptop hard drives (@700GB ea) combined.
Not too shabby!
Filesystems: Purpose and performance
The purpose of each filesystem is reflected in that FS's performance.
/short: Freq. accessed files, esp. large IO files of running/recent jobs, source-code/libs.
/g/data: Data sets that must be available on demand. Global: visible to the NCI cloud.
/home: Source-code, scripts, local packages.
JOBFS: Node-local scratch space for each job.
massdata: Archive files. Reads are slow and not immediate if file isn't in disk cache.
File IO of running jobs usually directed to /short or /jobfs.
The particular choice depends on the IO pattern (more in a moment).
Filesystems: Speeds
The real-world read/write speeds per-job are:
/short     ~1GB/s*
/g/data    ~500MB/s*
JOBFS      ≤ 100MB/s
massdata   0.5-1TB/hr (write)
*Lustre filesystems, conditions apply.
Lustre FS speeds achievable because it. . .
. . . is highly-parallel.
. . . communicates over 'Infiniband' network (56Gb/s).
massdata:
Accessed over 10GbE link.
Read speed depends on whether file is in cache or on tape.
Filesystems: Use and abuse of Lustre
Lustre filesystems (/short, /g/data, /home) are distributed:
Files ‘striped’ across multiple disks for parallel IO.
Each file potentially accessible to many users/cpus.
Consistent view maintained across 1000’s of Lustre clients.
Each file operation generates a lot of metadata.
Filesystem bandwidth is shared by all users.
IO-intensive tasks should. . .
. . .issue file operations no more than once per second.
. . .read/write in ‘large’ chunks (>1MB).
Lustre performance plummets if file ops are small and frequent.
Misusing Lustre degrades performance for everyone.
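The "large chunks" advice can be illustrated with dd (a sketch; the testfile.* names are throwaway, and on Raijin you would run this on /short or JOBFS, not /home):

```shell
# Same 64MB written twice: once as 64 × 1MB operations (Lustre-friendly),
# once as 16384 × 4KB operations (the pattern that hurts Lustre).
dd if=/dev/zero of=testfile.big   bs=1M count=64    2>/dev/null
dd if=/dev/zero of=testfile.small bs=4k count=16384 2>/dev/null
ls -l testfile.big testfile.small   # identical sizes, 256× more ops for the second
```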
Filesystems: High-frequency IO
Each cluster node has 396GB of local scratch space (JOBFS).
Only available to 'batch jobs'.
Allocated when job starts. Deleted when job finishes.
Not visible by other jobs/users.
Very frequent, small file-ops. . .
. . . can degrade Lustre performance for everyone.
. . . should be avoided, but handled much better by JOBFS.
                          JOBFS       /short
Typical speed             ≤ 100MB/s   ~1GB/s
Suitable for frequent IO  Yes         No
Filesystems: Use and abuse of massdata
Storing large numbers of small files is a misuse of massdata.
Intended for archiving large data files.
Small files increase access times. 'small' = avg. size ≤ 1MB.
Bundle small files into archives (.tar files) first.
Use archive option (-t) of netcp/netmv commands (discussed later).
Compress option (-z) is recommended.
Supplying mdss get with a large file-list facilitates parallel retrieval.
Filesystems: Data-backup policies
Only /home and massdata are backed-up automatically.
Snapshot of /home taken every ~2 days.
/home snapshots thinned-out over time.
Duplicate copy of massdata kept in separate building.
Backing-up data is otherwise your responsibility. Use tar liberally!
Many groups neglect to use massdata because. . .
. . . they don't have a plan for organising/managing data.
. . . turning large data sets into archive files is laborious.
. . . no one wants to touch other people's files, esp. large dirs.
. . . they are unsure how to use massdata.
. . . it's easier to leave files as they are on /short or /g/data.
If you are unsure, ask us for assistance.
Filesystems: Accessing (1/2)
Lustre filesystems are used like regular directories:
ls, cd, cp, mkdir, rm, etc.
JOBFS exists for duration of job.
Its location will be stored in $PBS_JOBFS (discussed later).
Accessing massdata:
mdss command. Provides put/get, rm, ls, etc.
netcp/netmv commands.
More details in a moment!
Filesystems: Accessing (2/2)
Project directories: filesystem/projcode
filesystem = /short, /g/data, /massdata, etc.
All project members have read/write/execute permissions for these dirs.
Each person has their own subdirectory:
filesystem/projcode/$USER
To check filesystem status:
/opt/rash/bin/modstatus -n opt
opt = gdata1_status, gdata2_status or mdss_status
Message of the Day (cat /etc/motd).
Emergency downtime notice.
Filesystems: Accessing massdata (1/3)
Method 1: mdss command.
Provides familiar file operations as subcommands:
mdss ls, mdss mkdir mydir, mdss rm -r mydir, etc.
Login and datamover nodes only.
Latter usually requires job scheduler.
mdss assumes filenames are relative to /massdata/projcode.
Use -P option to specify project other than default.
Filesystems: Accessing massdata (2/3)
Method 1: mdss command (cont'd).
put, get, stage subcommands use the same syntax.
E.g., mdss put myfile target, mdss put -r mydir.
target optional, and must already exist if it specifies a directory.
stage transfers files to cache for later use.
get stages and retrieves.
mdss dmls similar to mdss ls. Also indicates state:
REG = cached, OFL = on tape, DUL = both
mdss creates checksums (mdss -v to verify). See man mdss!
Filesystems: Accessing massdata (3/3)
Method 2: The netcp/netmv commands.
cp/mv commands for massdata. E.g.,
netcp myfile target
target is optional.
Can’t be used to retrieve files from massdata.
Can push files to remote ssh servers (requires passphrase-less access).
Options to archive (-t) and compress (-z):
netcp -z -t myfile.tar mydir
Implemented as copy job:
Requires familiarity with job scheduler (more later).
Uses default resource limits unless -l is specified.
Produces job summary files.
Exercise 4: Filesystems (1/4)
For the training project, /g/data is a link to /g/data2:
cd /g/data/$PROJECT # All group members can access this.
pwd
ls -la /g/data/$PROJECT
cd /short/$PROJECT/$USER; pwd # Only you can access this.
Try the ls and du subcommands for mdss (mdss assumes filenames are relative to /massdata/$PROJECT):
mdss ls
mdss ls .. # '..' is the parent directory.
mdss ls -la
mdss du -h # Also mdss du -sh
Exercise 4: Filesystems (2/4)
Create two test files, and bundle them into a tar file:
cd /short/$PROJECT/$USER
rm * # Remove existing files.
touch file1.$USER file2.$USER
tar cvf testfiles.tar file* # See man tar for c, v, f options.
ls
tar --list -f testfiles.tar # Check contents of the archive.
Create a user directory on massdata:
mdss rm -r $USER # Delete the old directory, if it exists.
mdss mkdir $USER
mdss ls
Next, put testfiles.tar into your massdata directory.
Syntax: mdss put [-r] myfile [targetname]
targetname and -r optional, -r copies directory (man mdss).
Check the result using mdss ls $USER
Exercise 4: Filesystems (3/4)
Where is the file stored?
mdss dmls -l #REG = cached, OFL = tape, DUL = both
Remove files before retrieving archived copies:
rm file1.$USER file2.$USER testfiles.tar
Use mdss get to retrieve testfiles.tar. Syntax same as
mdss put. See man mdss.
Check the result using ls.
Unpack the archive:
tar xvf testfiles.tar # extract, verbose, filename, see man tar.
ls
Exercise 4: Filesystems (4/4). If time permits...
Use netcp to copy files to massdata:
cd /short/$PROJECT/$USER
mkdir ex4; cp file* ex4
mdss rm -r ex4 # Remove the old copy from massdata.
netcp ex4 $USER/ex4
Notice that netcp returns a job ID (creates a copy job).
mdss ls # 'ex4' won't appear until job finishes.
qstat jobid # Displays job status.
watch -n 4 qstat jobid # Might help, see 'man watch'.
When the job finishes, check the result: mdss ls $USER/ex4
Inspect the contents of the job's output (.o) and error (.e) files.
Repeat the copy, this time using -t (archive) and -z (compress):
mdss rm $USER/ex4/* # Clean the directory.
netcp -z -t myfile.tar /short/$PROJECT/$USER/ex4 $USER
Check that the copy was successful, then mdss get and 'untar' the file.
Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
Scheduling Jobs: Overview
Tasks that are too large for login nodes must be submitted to job scheduler (modified version of PBS Pro).
'large' means >30 mins cpu time or >2GB mem.
Only tasks submitted to scheduler are charged for compute time!
Scheduler optimises throughput and gives fair share to each project.
Scheduling Jobs: Cluster nodes (1/2)
Compute nodes only accessible via job scheduler.
Datamover nodes accessible remotely or via scheduler.
Remote transfers to r-dm.nci.org.au aren't charged for compute time.
Long remote transfers are permitted.
Scheduling Jobs: Cluster nodes (2/2)
Raijin: 3592 compute nodes, 6 datamover nodes.
Each node comprises dual 8-core Intel Xeon Sandy Bridge 2.6 GHz processors, i.e., 16 cores (core = 'CPU').
High-speed communication between cluster nodes (Infiniband).
Compute node memory capacities:
Mem     Hostname
32GB    r1..r2395      67% of all nodes
64GB    r2396..r3520   31% of all nodes
126GB   r3521..r3592   2% of all nodes
Scheduling Jobs: Compute jobs vs copy jobs
A job can use compute nodes, or datamover nodes, but not both.
Compute jobs (compute nodes). . .
. . . can’t access massdata filesystem.
. . . shouldn’t be used for tasks that are mostly disk-based.
. . . can’t access the internet.
Copy jobs (datamover nodes). . .
. . . disk intensive tasks: moving/compressing/tarring large data.
. . . copying input/output files to/from massdata.
. . . can only use a single CPU.
. . . can access the internet (wget, sftp, svn, git, etc.).
Scheduling Jobs: Job queues
Jobs submitted from login nodes (use qsub command, more later).
Three job queues: normal, express, copyq.
Compute jobs: normal, express
Copy jobs: copyq.
normal is the default.
Job waits in queue until resources become available. . .
. . . at which point job is executed on compute or dm nodes.
Scheduling Jobs: Which queue? (1/2)
normal (default):
Can request large # of CPUs (10,000+).
Can request any memory type (32/64/126GB nodes).
express:
High priority jobs. Often start shortly after being submitted.
50 additional, dedicated nodes.
Charged at three times the rate of other queues.
E.g., a 5 hr, 2 CPU express job costs 5 × 2 × 3 = 30 SUs.
Small per-job resource limits:
≤ 8 nodes, ≤ 32GB mem per node.
walltime ≤ 24 hours (single node).
walltime ≤ 5 hours (multinode).
Scheduling Jobs: Which queue? (2/2)
copyq:
Intended for manipulation of large files.
Only queue that can access massdata/internet.
Single CPU only.
nf_limits shows project's walltime limits for the specified # of CPUs.
Mem limit equal to maximum available.
We can extend walltime limits on a per job/user/project basis.
Scheduling Jobs: Job costs
Cost of job (in SUs, i.e., service units) calculated as:
walltime × # CPUs × W
walltime = real-world time.
normal/copyq queues: W = 1. express queue: W = 3.
Charged for walltime used, not walltime requested.
Try not to request far more than needed.
Charged for # of CPUs (i.e., cores) requested, not # used.
Project’s SU quota is updated after each job finishes.
nci_account also shows. . .
. . . project’s total SU usage.
. . . (with -v option) breakdown of SU usage by user.
. . . cost of running and queued jobs.
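The cost formula can be checked with a throwaway shell function (su_cost is purely illustrative, not an NCI command):

```shell
# su_cost HOURS NCPUS W  →  walltime × #CPUs × W, in SUs (whole hours only).
su_cost() {
    echo $(( $1 * $2 * $3 ))
}

su_cost 3 48 1   # 3 hrs × 3 nodes (48 cpus) × W=1 → 144 SUs
su_cost 5 2 3    # 5 hrs × 2 cpus, express (W=3) → 30 SUs
```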
Scheduling Jobs: Queue times (1/2)
Scheduler doesn’t use FIFO policy. However,. . .
. . . older jobs given higher priority.
Jobs wait for resources. Requesting more than needed. . .
. . . increases job’s queue time.
. . . delays other jobs.
. . . wastes resources and compute grant.
express queue jobs often start soon after being submitted.
Requesting higher-mem nodes increases queue time, esp. 126GB nodes.
Jobs won’t start if storage grant exceeded.
Scheduling Jobs: Queue times (2/2)
Large jobs (≥ 512 cpus) assigned higher priority because. . .
. . . resources would otherwise be tied-up while job is waiting.
. . . it's hard for scheduler to fit other jobs around a large job (the TETRIS effect).
Priority decreases if project has large # of running jobs.
If you use your allocation too quickly (slowly), priority decreases (increases).
Jobs run with lower priority once grant is exhausted (‘bonus’ jobs).
Detailed scheduling policy.
Load on Raijin spikes at end of each quarter.
Don't leave it to the last minute to use your quarterly SU grant!
Scheduling Jobs: Submitting jobs
Jobs are submitted using qsub options. Returns job ID #.
Use -q normal (default)/express/copyq to specify queue:
qsub -q express ...
Use -l option to specify resources:
qsub -l walltime=01:00:00 -l ncpus=32 -l mem=2GB
Licensed software requires -l software=packagename.
To override default project: -P projectcode.
Non-interactive job: qsub options scriptname
Alternatively, qsub options -- ListOfCommands
Avoid using '--' syntax. Doesn't 'source' dot files.
Interactive job: qsub -I options
Scheduling Jobs: Non-interactive jobs
Non-interactive job: qsub options scriptname.
Script will be executed on first ('head') node assigned to job.
Most options can be placed in job script.
When job ends, scheduler creates two summary files (more later).
Job scripts have fixed structure:
1. Shell invocation.
2. PBS directives (essentially qsub options).
3. User commands (must come last).
Contents of myscript.sh:
#!/bin/bash
#PBS -l walltime=20:00:00
#PBS -l mem=100MB
#PBS -l ncpus=16
#PBS (other pbs directives)
echo "This job does very little."
Scheduling Jobs: Interactive jobs
Interactive job: qsub -I options
If walltime not specified, uses queue defaults.
ctrl-c to cancel job before it starts.
Prompted when job starts. . .
. . . commands typed into terminal are executed on
compute/dm nodes.
. . . be sure to use exit command to close session.
For programs that require X-Windows, use qsub’s -X option.
NB. Scheduler won’t save output/error msgs to file.
Scheduling Jobs: CPU/Mem requests
-l ncpus=
Single-node job: ncpus ≤ 16. Job can share a node if there is enough free memory and CPUs.
Multinode job: Must request whole nodes (ncpus multiple of 16). Nodes won't be shared; can request all available memory.
-l mem=
Specifies total amount of memory required.
Nodes assigned to job will have same memory capacity.
Per-node memory request calculated as mem/#nodes.
E.g., mem=80GB, ncpus=32 (2 nodes) ⇒ 40GB mem per node.
∴ job will be assigned to the 64GB memory nodes.
Multinode job, might as well use mem=128GB, i.e., 2×64GB.
Try to choose mem and ncpus so that job uses 32GB mem nodes.
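The per-node arithmetic can be sketched as a small shell helper (per_node_mem is a hypothetical illustration of what the scheduler does internally; it assumes whole 16-cpu nodes):

```shell
# per_node_mem NCPUS MEM_GB  →  how a memory request is split across nodes.
per_node_mem() {
    nodes=$(( $1 / 16 ))                          # 16 cpus per node
    echo "${nodes} nodes, $(( $2 / nodes ))GB per node"
}

per_node_mem 32 80    # mem=80GB over 2 nodes → 40GB/node → 64GB nodes
per_node_mem 64 120   # 4 nodes → 30GB/node → fits the 32GB nodes
```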
Scheduling Jobs: Handy PBS directives
(In your own time. See man qsub for more options).
-l wd
At the start of job, working dir set to submission dir.
Job's '.o' and '.e' summary files placed in this dir.
Otherwise working directory defaults to home dir (~).
Suppresses execution of login and logout files.
-o filename, -e filename
Tells PBS where to put '.o' and '.e' summary files.
-m EmailEvent
Send email notifications for specific events:
a: job aborted, b: job began, e: job ended.
E.g., -m abe (default -m a).
-M Email1,Email2,...
Recipients for job notification emails.
Scheduling Jobs: Job environment (1/2)
When job starts, PBS. . .
. . . saves job parameters as environment variables.
. . . executes ‘dot files’, except .bash_logout/.logout.
Logout scripts executed when job ends.
-l wd option suppresses execution of login/logout files.
PBS environment variables. . .
Useful for programs/scripts that require info about execution environment.
Only visible to job script or terminal running interactive job.
-V option copies predefined environment variables to job environment.
Scheduling Jobs: Job environment (2/2)
Some useful PBS variables:
$PBS_JOBID Job identifier (!).
$PBS_NCPUS # of cpus requested, i.e., ncpus.
$PBS_NODEFILE File that lists nodes assigned to job.
$PBS_JOBFS Job's assigned JOBFS (scratch) directory.
$PBS_VMEM Memory request, i.e., mem, not Vmem.
$PBS_O_WORKDIR Name of job submission directory.
qstat -f JobId shows PBS variables for specified job.
Also see PBS Pro manual (there might be small discrepancies).
Scheduling Jobs: Postmortems (1/2)
PBS captures standard output/error produced by non-interactive jobs.
stdout: Jobname.oJobId    stderr: Jobname.eJobId
Automatically copied over to working dir when job ends.
Summary of resource usage is appended to '.o' file.
If PBS detects an error, PBS appends message to '.e' file.
Check these files if job terminates abnormally!
Sometimes OS kills job before PBS realises there's a problem, especially if mem usage spikes.
Scheduling Jobs: Postmortems (2/2)
Contents of myscript.sh.o123456:
==========================================================
Resource Usage on 2013-07-20 12:48:04.355160:
JobId: 123456.r-man2
Project: abc
Exit Status: 0 (Linux Signal 0)
Service Units: 32.00
NCPUs Requested: 32            CPUs Used: 32
CPU Time Used: 18:50:43
Memory Requested: 900mb        Memory Used: 80mb
Vmem Used: 94mb
Walltime requested: 02:00:00   Walltime Used: 01:00:00
jobfs request: 100mb           jobfs used: 1mb
==========================================================
Memory Used: Mem used by the head node.
Vmem: Ignore this.
jobfs used: JOBFS used by all nodes. Details to come.
CPU utilisation is low if CPU Time ≪ Walltime Used × NCPUS.
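That utilisation check can be sketched as a throwaway helper (utilisation is not an NCI tool; it just parses the H:MM:SS values from the sample summary above):

```shell
# utilisation CPU_TIME WALLTIME NCPUS  →  integer percent.
utilisation() {
    echo "$1 $2 $3" | awk '{
        split($1, c, ":"); cpu  = c[1]*3600 + c[2]*60 + c[3]
        split($2, w, ":"); wall = w[1]*3600 + w[2]*60 + w[3]
        printf "%d\n", 100 * cpu / (wall * $3)
    }'
}

utilisation 18:50:43 01:00:00 32   # the sample job above: ~58% utilised
```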
Scheduling Jobs: JOBFS requests (1/2)
JOBFS: Node-local scratch space, 396GB/node.
Slow. Can outperform /short, /g/data for small/frequent IO.
Only lasts for duration of job.
Don’t write checkpoint files to JOBFS.
qsub option/PBS-directive is -l jobfs=amount.
amount is the total jobfs request. E.g., 100MB,
25GB,...
Per-node jobfs request calculated as amount/#nodes.
PBS stores path to JOBFS in $PBS_JOBFS.
$PBS_JOBFS only visible to your job!
nci.org.au73/86
Scheduling Jobs: JOBFS requests (2/2)
Example JOBFS usage:
Contents of myscript.sh:
#!/bin/bash
#PBS -l ncpus=64
#PBS -l jobfs=2GB
(OTHER PBS DIRECTIVES)
echo The JOBFS directory for this job is $PBS_JOBFS
cp my_input_file $PBS_JOBFS
myprogram $PBS_JOBFS/my_input_file $PBS_JOBFS/my_output_file
cp $PBS_JOBFS/my_output_file /short/c25/$USER
The effective per-node JOBFS request is 2GB/(64/16) =
512MB.
Script executed on head node only. ∴ cp copies to/from
head node only.
mdss and netcp/netmv commands don’t work for JOBFS.
Also see ‘What is the JOBFS filesystem?’.
nci.org.au74/86
Scheduling Jobs: Other filesystems
To prevent job running if /g/data or massdata offline:
#PBS -l other=filesystem
filesystem = gdata1, gdata2, mdss (i.e., massdata).
Not mandatory, but good practice.
massdata not available to compute jobs, i.e.,
normal/express queues* .
*mdss command only works from copyq jobs and login
nodes.
You can also use modstatus to check filesystem availability:
/opt/rash/bin/modstatus -n status
status = gdata1_status, gdata2_status, or mdss_status.
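For instance, a minimal sketch for a job that reads from /g/data1 (gdata1 here is just an example; substitute the filesystem your job actually uses):

```shell
# In the job script: hold the job back while /g/data1 is offline.
#PBS -l other=gdata1

# From a login node, before submitting: check /g/data1 availability.
/opt/rash/bin/modstatus -n gdata1_status
```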
nci.org.au75/86
Scheduling Jobs: Modifying jobs
qalter JobId: Change resource request of jobs waiting to
start.
walltime, mem, ncpus, project,. . .
qdel JobId: Delete queued or running jobs.
exit: Stop currently-running interactive job.
qhold: Prevent queued job from starting, e.g., job
dependencies.
qselect: Lists jobs that meet criteria, e.g., belong to project
X.
We can increase walltime of running jobs (32, 64GB mem
nodes only).
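As a sketch (1234567 is a hypothetical job id; qrls, the standard PBS counterpart of qhold, releases a hold):

```shell
qalter -l walltime=01:00:00 1234567  # shrink walltime request of a queued job
qhold 1234567                        # prevent a queued job from starting
qrls 1234567                         # release the hold
qdel 1234567                         # delete the job altogether
```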
nci.org.au76/86
Scheduling Jobs: Job status
To display job status: qstat options JobId1 JobId2...
Some useful options (see man qstat for many more):
-u username List user’s queued/running jobs.
-q queuename Show jobs for the specified queue.
-x Include jobs that have finished in the
last day.
-f Show all information about job(s).
Resource usage is aggregate of all nodes.
-s System comments. Good for
troubleshooting.
-n List hostname of nodes assigned to job.
-w Use wider output fields.
nqstat, nqstat_anu: status of jobs belonging to your
projects*. *new jobs might not show up immediately.
nci.org.au77/86
Scheduling Jobs: Job progress
qps: Resources used by job’s processes. Same options as
ps.
qstat -n, qstat -f: Show list of nodes assigned to job.
qstat -f, nqstat_anu: Give rough indication of cpu
utilisation %.
pbs_rusage: Summary of resource usage, as given in ‘.o’
file.
qls: List contents of running job’s JOBFS dir.
qcat: Show job script or std output/error produced so far
(‘.o’ and ‘.e’ files).
qcp: Copy files to/from running job’s JOBFS dir.
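A typical monitoring session for a running job might look like this (1234567 is a hypothetical job id):

```shell
qstat -f 1234567   # full details, incl. rough cpu utilisation %
qps 1234567        # the job's processes, ps-style
qcat -o 1234567    # stdout produced so far (the eventual '.o' file)
qls 1234567        # contents of the job's JOBFS directory
```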
nci.org.au78/86
Scheduling Jobs: Checkpoints/Automation
Long run-times expose jobs to system/program instabilities.
You won’t be reimbursed for lost SUs.
Consider implementing a checkpoint mechanism.
Don’t save checkpoint files to JOBFS.
Self-submitting jobs can resume automatically if interrupted.
Job dependencies: -W depend=type:JobId1:JobId2...
Also on, before, beforeok, etc. See PBS Pro manual.
Multiple levels of dependencies can fail if jobs take too long.
type
after Start after dependencies have started.
afterok Start if dependencies finish successfully.
afternotok Start if dependencies finish with errors.
afterany Start after all dependencies finish.
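A minimal sketch of chaining two jobs with afterok (stage1.sh and stage2.sh are placeholder script names; qsub prints the id of the job it submits, which we capture):

```shell
# Submit the first stage; qsub prints its job id (e.g. 1234567.r-man2).
JOB1=$(qsub stage1.sh)
# The second stage stays queued until the first exits successfully.
qsub -W depend=afterok:"$JOB1" stage2.sh
```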
nci.org.au79/86
Scheduling Jobs: Note on parallelism (1/3)
Many packages take advantage of parallelism automatically.
Options for parallelising custom code:
Option 1. Job-script starts multiple copies of your program.
(a) ‘for’ loop to start processes in background (&), then
wait.
(b) pbsdsh, pbsdsh_anu (like ssh): Can detect multiple
nodes.
(c) pbs_tmrsh (like ssh): Flexible, but must give it node names from $PBS_NODEFILE.
pbsdsh, etc., only work from within job script (or interactive
job).
Option 1 works for serial code.
Contention when multiple processes access same file.
1000’s of simultaneous IO ops can degrade Lustre
speed.
Work and memory are replicated unnecessarily.
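Option 1(a) can be sketched as follows; sleep stands in for a real serial program:

```shell
#!/bin/bash
# Start several copies of a serial program in the background with '&',
# then 'wait' for all of them before the script continues.
for i in 1 2 3 4; do
    ( sleep 0.1; echo "task $i done" ) &
done
wait                      # returns once every background process has exited
echo "all tasks finished"
```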
nci.org.au80/86
Scheduling Jobs: Note on parallelism (2/3)
Option 2. Shared-memory parallelism via OpenMP.
(Not to be confused with OpenMPI).
CPUs must reside on same node. ∴ limited to 16 CPUs.
Imposes parallelism onto serial code via embedded
compiler directives.
Can combine with Option 1 to overcome node limit
(cumbersome).
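A single-node OpenMP job script might look like this sketch (my_omp_prog is a placeholder for a binary compiled with OpenMP support, e.g. with gcc -fopenmp):

```shell
#!/bin/bash
#PBS -l ncpus=16
#PBS -l wd
# Run one OpenMP thread per requested CPU, all on the same node.
export OMP_NUM_THREADS=$PBS_NCPUS
./my_omp_prog
```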
nci.org.au81/86
Scheduling Jobs: Note on parallelism (3/3)
Option 3. Distributed parallelism via MPI library.
Arbitrary number of CPUs/nodes.
Overcomes limitations of previous two options.
Many programs can be implemented using just the basic
MPI calls.
Highly-optimised version of OpenMPI installed on Raijin.
Once you’re accustomed to MPI you’ll never look back. . .
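A minimal MPI job-script sketch (my_mpi_prog is a placeholder for a binary built against the installed OpenMPI, e.g. with mpicc):

```shell
#!/bin/bash
#PBS -l ncpus=64
#PBS -l wd
# mpirun picks up the CPU/node allocation from PBS,
# launching one MPI process per requested CPU across 4 nodes.
module load openmpi
mpirun ./my_mpi_prog
```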
nci.org.au82/86
Exercise 5. Using job scheduler (1/3)
Create the following job script, and call it, e.g., exercise5.sh:
#!/bin/bash
#PBS -q express
#PBS -l walltime=00:04:00
#PBS -l ncpus=2,mem=10MB,jobfs=10MB
#PBS -l wd
echo "ncpus = $PBS_NCPUS, total mem = $PBS_VMEM bytes"
echo "jobfs dir = $PBS_JOBFS"
echo "Contents of node file:"
cat $PBS_NODEFILE
NUM_NODES=$(cat $PBS_NODEFILE | wc -l)
#NB. $(command) is replaced by output of command.
NODE_NAMES=$(uniq $PBS_NODEFILE) # See man uniq.
echo "# of nodes: $NUM_NODES"
echo "Hostnames of nodes: $NODE_NAMES"
sleep 300 # Sleep for 5 minutes.
echo "Some things just aren't meant to be."
nci.org.au83/86
Exercise 5. Using job scheduler (2/3)
Make the script executable and submit it to the scheduler:
chmod +x exercise5.sh
qsub exercise5.sh
Experiment with qstat/nqstat/nqstat_anu, e.g.,
qstat -Q, qstat normal, qstat -u $USER,
qstat -saw JobId, nqstat_anu -P $PROJECT
Once job starts, check progress using, e.g.,
qstat -f JobId, qps JobId, qcat -o JobId
Wait for job to finish:
watch -n 4 qstat JobId # ctrl-c to stop watching.
Did job finish successfully? Inspect .o and .e files, e.g.,
cat exercise5.sh.oJobId
nci.org.au84/86
Exercise 5. Using job scheduler (3/3)
If you like, submit interactive job with X-Windows option for
qsub (-X):
Must be connected to Raijin using ssh with -X (PC) or -Y
(Mac) option.
Then,
qsub -I -X -q express -l walltime=00:02:00,ncpus=1
(when job starts) xeyes
ctrl-c to close xeyes.
Make sure you use exit command to end the job!
Just to be certain: qdel JobId
nci.org.au85/86
Finally. . .
Raijin fun facts!
Time-lapse video of Raijin being assembled.
Watch our tape robot at work.