Page 1: Submitting Jobs to the Sun Grid Engine at  Sheffield and Leeds (Node1)

Submitting Jobs to the Sun Grid Engine at

Sheffield and Leeds (Node1)

Deniz Savas

Corporate Information and Computing Services

The University of Sheffield

Email [email protected]

Presentation Outline

• Introducing the grid and batch concepts
• Job submission scripts
• How to submit batch jobs
• How to monitor the progress of the jobs
• Starting up interactive jobs
• Cancelling already submitted jobs

Types of Computing Grids

• Cluster Grid
  – (titans)
• Enterprise Grid
  – (CiCS and non-CiCS machines)
• Global Grid
  – WRG (i.e. Sheffield, Leeds & York)
  – Other organisations, if they join us

Mapping the terminology to reality

• Compute Resources/Nodes: the computers available
  – In Sheffield: titania, titan00, titan01, …, titan08
  – In Leeds: Maxima
  – In York: Pascali
• Jobs: batch and interactive jobs requested to run on these compute nodes.
• SGE: Sun Grid Engine job scheduler.
• WRG: White Rose Grid: all the above compute nodes and the infrastructure which makes them appear as a unified resource.

Objectives of a Resource Management Scheme

- Fair sharing of resources amongst the users

Managed by means of the SGE resource management and policy administration components.

- Optimal use of resources

Managed by careful definition of job queues, and in real time by the SGE queue, job and share management components.

- Utilisation policy can be functional, share_based or deadline_based, with manual override if needed. We use the share_based policy at WRG, whereby past usage is taken into account.

Current WR Grid Resources

• At Sheffield University
  – Titania + nine workers, the 'titans' (titan00, titan01, …, titan08). Each machine has 8 UltraSPARC 900 MHz processors and 32 GBytes of shared memory.
• At Leeds University
  – Maxima: four 8-processor UltraSPARC workers with 24 GBytes of shared memory + one 20-processor machine with 44 GBytes of shared memory.
  – Snowdon: 256-node Intel Beowulf cluster (distributed memory), each node configured with dual 2.2 GHz Xeon processors and 1 GByte or 2 GBytes of memory.
• At York University
  – Pascali: 8 UltraSPARC processors with 24 GBytes of shared memory + fimbrata, a 20-processor machine with 44 GBytes of shared memory.
  – Nevada: 40 workers (1 Intel processor each) in a Beowulf cluster.

Share Policy at Sheffield WRG node

• At Sheffield University there are two major groups:
  – Sheffield University users
  – The rest of WRG (i.e. Leeds and York users)
• Sheffield University users get 75% of the resources.
• The rest get the remaining 25% of the resources.
• Within each group everyone has equal shares.
• Priority of jobs is decided according to share and past usage.
• Leeds and York have the same shares policy, i.e. a 75-25 split, but they also have further local groups such as departments and research groups.
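
As a rough illustration of the 75-25 split, the sketch below works out the nominal share of one user when everyone in a group has equal shares. The user counts are made-up numbers and per_user_share is a hypothetical helper. The real SGE share calculation also weighs past usage, which is not modelled here.

```shell
#!/bin/sh
# Nominal per-user share of the machine, assuming equal shares within a group.
# group_percent: 75 for Sheffield users, 25 for the rest of WRG (from the slide).
# n_users: number of users in the group (hypothetical figure).
per_user_share() {
    group_percent=$1
    n_users=$2
    awk -v p="$group_percent" -v n="$n_users" 'BEGIN { printf "%.2f\n", p / n }'
}

# Hypothetical populations: 150 Sheffield users, 100 Leeds and York users.
echo "Sheffield user:  $(per_user_share 75 150)% of resources"
echo "Leeds/York user: $(per_user_share 25 100)% of resources"
```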

How the SGE System Operates

• Users submit an interactive job (qsh, qrsh, qlogin) or a batch job (qsub) to the Sun Grid Engine.
• For an interactive job (qsh, qrsh, qlogin):
  – If resources are immediately available, the job is started.
  – Otherwise the user is informed about the lack of resources and the job is abandoned.
• For a batch job (qsub):
  – If resources are immediately available, the job is started.
  – Otherwise the job is kept in a queue until the resources to execute it become available.
• Jobs are always passed on to the available execution hosts.
• Records of each job's progress through the system are kept and reported when requested.

Components of the SGE

• Hosts
  – Master (coordinator of activities, holder of queues)
  – Execution (workers)
  – Administration (sets up the system, queues etc.)
  – Submit (users can submit jobs from these)
  Usually the master host and the administration host are the same machine (titan00).
• Queues (defined by the administrator)
• User and administrator commands
• Daemons

Progress of an Interactive Job from Submission to Completion at Sheffield

1. The user asks to run an interactive job (qsh, qrsh, qlogin) from a submit node.

2. SGE checks whether there are resources available to start the job immediately on an execution node.

• If so, the interactive session is started under the control/monitoring of SGE on an execution node.

• If resources are not available, the request is simply rejected and the user notified. This is because, by its very nature, an interactive session is not something users can wait for.

3. The user terminates the job by typing exit or logout, or the job is terminated when the queue limits are reached.

Progress of a Batch Job from Submission to Completion at Sheffield

1. The user initiates a batch job on a submit node by means of the qsub or qmon command, or by using a locally provided script such as runfluent or runmatlab, which issues a qsub command.

2. SGE analyses the job parameters to determine the resources needed to run the job.

3. The job is transferred to the master node's queue.

4. If sufficient resources are available, the job is started immediately on an execution node. Otherwise the job waits in the queue until the resources become available.

Table of SGE commands

Command(s) | Description | User/System
qsub, qresub, qmon | Submit batch jobs | USER
qsh, qlogin, qrsh | Submit interactive jobs | USER
qstat, qhost, qdel, qmon | Status of queues and jobs in queues, list of execute nodes, remove jobs from queues | USER
qacct, qmon, qalter, qdel, qmod | Monitor/manage accounts, queues, jobs etc. | SYSTEM

Submitting Batch Jobs: the qsub command

In its simplest form, any script file can be submitted to SGE by typing qsub scriptfile. The script file is then queued to be executed by SGE under default conditions and using the default amount of resources.

Such use is not desirable, as the default conditions may not be appropriate for that job. Also, providing a good estimate of the resources needed helps SGE to schedule tasks more efficiently.

There are two distinct mechanisms for specifying the environment and resources:
1) Via parameters to the qsub command
2) Via special SGE comments (#$) in the script file that is submitted.

The meaning of the parameters is the same for both methods, and they control such things as:
- CPU time required
- number of processors needed (for multi-processor jobs)
- output file names
- notification of job activity.

Method 1: Using qsub command parameters

Format:
qsub [qsub_params] script_file [-- script_arguments]

Examples:
qsub myjob
qsub -cwd myjob
qsub -l h_cpu=00:05:00 myjob -- test1 -large

Note that the -- token provides a mechanism for passing parameters to the script file: they follow the token.
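
The command-line form can be tried out safely with a small wrapper such as the one sketched below. submit_job is a hypothetical helper name, not part of SGE. It calls qsub when the command is installed and otherwise only prints the command line that would have been issued.

```shell
#!/bin/sh
# Print the qsub command that would be issued; submit only if qsub exists.
# submit_job is a hypothetical helper name, not part of SGE.
submit_job() {
    if command -v qsub >/dev/null 2>&1; then
        qsub "$@"
    else
        echo "dry run: qsub $*"
    fi
}

# Five-minute CPU limit, run in the current directory, pass two script arguments.
submit_job -cwd -l h_cpu=00:05:00 myjob -- test1 -large
```

On a machine without SGE this only echoes the command, which makes it easy to check the parameter order before submitting for real.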

Method 2: Special comments in script files

A script file is a file containing a set of Unix commands written in a scripting language, usually the Bourne shell or C-shell. When the job runs, these script files are executed as if their contents were typed at the keyboard.

In a script file, any line beginning with # is normally treated as a comment line and ignored.

However, SGE treats comment lines in the submitted script that start with the special sequence #$ in a special way: it expects to find declarations of qsub parameters in them. At the time of job submission, SGE determines the job's resource requirements from these comment lines.

If there is any conflict between the actual qsub command-line parameters and the special comment (#$) parameters, the command-line parameters always override the special comment parameters.
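
To see which lines of a script would be read as qsub parameters, the #$ lines can be picked out with standard tools. The sketch below only mimics the convention. list_directives is a hypothetical helper and the demo script is a throwaway example, not one of the WRG scripts.

```shell
#!/bin/sh
# Show which lines of a job script SGE would read as qsub parameters.
# list_directives is a hypothetical helper; it only mimics the "#$" convention.
list_directives() {
    # Keep lines starting with "#$" and strip that prefix plus leading blanks.
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}

# A throwaway example script written to a temporary file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/sh
# an ordinary comment, ignored by both the shell and SGE
#$ -cwd
#$ -l h_cpu=00:05:00
echo "hello from the job"
EOF

list_directives "$demo"
rm -f "$demo"
```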

Examples of special #$ comments in a script file

# force the shell to be the Bourne shell
# At WRG the default shell is the C-shell
#$ -S /bin/sh

# specify myresults as the output file
#$ -o myresults

# start the job in the current directory
#$ -cwd
# we compile the program
f90 test.for -o mytestprog
# we run the program and read the data that program
# would have read from the keyboard from file mydata
mytestprog < mydata

An example OpenMP job script

#$ -S /bin/tcsh
#$ -cwd
#$ -pe openmp 4
#$ -l h_cpu=01:30:00
# setenv OMP_NUM_THREADS 4 : this is another way
# of setting the threads.
setenv PARALLEL 4
myprog

An example MPI job script

#$ -S /bin/tcsh
#$ -cwd
# parallel environment is MPI
#$ -pe mpi_pe 4
# limit run to 1 hour actual clock time
#$ -l h_rt=1:00:00
mprun -x sge my_mpi_program

The progress of your batch job

• The user submits a batch job as described above, e.g. qsub myscript_file
• The job is placed in the queue and given a unique job number <nnnn>.
• The user is informed immediately of the job number <nnnn>.
• The user can check the progress of the job by using the qstat command. The status of the job is shown as qw (waiting), t (transferring) or r (running).
• The user can abort the job by using the qdel command at this stage.
• When the job runs, the standard output and error messages are placed in files named <myscript_file>.o<nnnn> and <myscript_file>.e<nnnn> respectively.
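
The job number reported at submission is all that is needed to predict the output file names. The sketch below assumes a confirmation line of the form shown in msg. That wording is an assumption for illustration and may differ between Grid Engine versions, and the job number 4711 is made up.

```shell
#!/bin/sh
# Extract the job number from qsub's confirmation line and derive the
# names of the output and error files described above.
# The message text below is an assumed example, not captured output.
msg='Your job 4711 ("myscript_file") has been submitted'

# The first run of digits after "Your job " is taken as the job number.
jobid=$(printf '%s\n' "$msg" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p')
script=myscript_file

echo "job number:      $jobid"
echo "standard output: $script.o$jobid"
echo "error messages:  $script.e$jobid"
```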

qsub command parameters (resource related)

• -l h_cpu=hh:mm:ss : define maximum CPU time
• -l h_rt=hh:mm:ss : define maximum wall clock time
• -l h_vmem=memory : define maximum memory (Leeds)
• -pe openmp m : OpenMP parallel environment
• -pe openmp n-m : range of processors; -m means 1 to m, n- means at least n processors
• -pe cre : parallel environment (any) (Sheffield)
• -pe mpi_pe n-m : MPI job (Leeds)

qsub command parameters continued (notification related)

• -M email_address : address to which notification emails are sent
• -m b e a s : send an email when the job begins, ends, is aborted or is suspended, e.g. -m be
• -now : start running now or, if the job cannot run, exit with an error code
• -verify : do not submit the job, but check and report on the submission

qsub command parameters continued (output files related)

When a job is started it takes its job name from the script_file that was submitted. The standard output and error output are sent to files named jobname.onnnn and jobname.ennnn respectively, where nnnn is the job number. The following parameters modify this behaviour:

• -e path : error output file
• -o path : standard output file
• -j y : merge the error and standard output
• -N jobname : name to be used for the job

qsub command parameters continued (environment related)

• -cwd : change working directory. The job will run in the directory from which it was submitted. Without this option, jobs are started in the user's login (home) directory.

• -S shell_path : e.g. -S /bin/sh
  SGE uses the tcsh shell as the default environment for executing job scripts. If the job script was written for another shell such as sh, as is usually the case, then the -S parameter must be used to make sure the right environment is used.

• -v variable[=value] or -V (use all current variables)
  It is possible to set environment variables for the execution shell to ensure the correct running of programs which rely on the existence of certain environment variables.

Hints

• Once you have prepared your job script you can test it by simply running it, if possible for a very small problem. Note that the qsub parameters defined using the #$ sequence will be treated as comments during this run.

• Q: Should I define the qsub parameters in the script file or as parameters at the time of issuing qsub?
  A: The choice is yours. I prefer to define any parameter which is not likely to alter between runs within the script file, to save myself having to remember it at each submission.
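
A quick way to see the first hint in action is to run a job script directly, outside SGE. In the sketch below the script name myjob and its contents are made up for illustration. The #$ lines are plain comments to the shell, so only the echo line does anything.

```shell
#!/bin/sh
# Run a job script directly, outside SGE, as a quick test.
# The "#$" lines are plain comments to the shell, so they have no effect here.
workdir=$(mktemp -d)
cat > "$workdir/myjob" <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -l h_cpu=00:05:00
# Stand-in for the real work: a very small problem.
echo "sum of 1..10 is $(( (10 * 11) / 2 ))"
EOF

# This is the same file that would later be handed to qsub.
result=$(sh "$workdir/myjob")
echo "$result"
rm -rf "$workdir"
```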

SGE related Environment Variables

Apart from the specific environment variables passed via the -v or -V options, the following environment variables are also available during the execution of a batch job, to help build unique or customised filenames, messages etc.

• $HOME
• $USER
• $JOB_NAME
• $HOSTNAME
• $SGE_TASK_ID

Array jobs and the $SGE_TASK_ID variable

Example:

#$ -S /bin/tcsh
#$ -l h_cpu=01:00:00
#$ -t 1-10:1
#$ -cwd
myprog > results.$SGE_TASK_ID

This will run 10 jobs. The jobs are considered to be independent of each other and hence may run in parallel, depending on the availability of resources.

It is possible to make these jobs dependent on each other, so as to impose an order of execution, by means of the -hold_jid parameter.
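
The effect of the -t 1-10:1 range can be imitated locally with a plain loop, which is a handy way to check that the file names built from SGE_TASK_ID come out as intended. This is only a sketch: it runs the tasks one after another on the local machine, and an echo stands in for myprog.

```shell
#!/bin/sh
# Imitate what SGE does for "#$ -t 1-10:1": run the same commands once per
# task, with SGE_TASK_ID set to a different value each time.
# This loop runs the tasks sequentially; under SGE they may run in parallel.
outdir=$(mktemp -d)
SGE_TASK_ID=1
while [ "$SGE_TASK_ID" -le 10 ]; do
    export SGE_TASK_ID
    # Stand-in for: myprog > results.$SGE_TASK_ID
    echo "output of task $SGE_TASK_ID" > "$outdir/results.$SGE_TASK_ID"
    SGE_TASK_ID=$((SGE_TASK_ID + 1))
done

count=$(ls "$outdir" | wc -l | tr -d ' ')
echo "created $count result files"
rm -rf "$outdir"
```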

Submitting Batch Jobs via the qmon command

If you are using an X terminal (such as that provided by Exceed), a GUI named qmon can also be used to make job submission easier.

This command also provides an easier way of setting the job parameters.

Job submission panel of QMON

[Screenshot of the QMON job submission panel. Callouts: click on the Job Submission icon; click to browse for the job script.]

Job queues

• Unlike traditional batch queue systems, users do not need to select the queue they are submitting to. Instead, SGE uses the resource needs specified by the user to determine the best queue for the job.

• In Sheffield and Leeds the underlying queues are set up according to memory size and CPU time requirements, and also the number of CPUs needed (for MPI and OpenMP jobs).

• qstat -F displays full queue information. qmon (Task-Queue_Control) also allows information about the queue limits to be extracted.

Job queue configuration

Normally you will not need to know the details of each queue, as the Grid Engine selects a suitable queue for your job. If you need to find out how the job queues are configured, perhaps to help you specify appropriate resources, you can do so with the qconf system administrator command.

• qconf -sql lists all the queues
• qconf -sq queue_name lists the details of a specific queue's configuration

Current queue groups at Sheffield

• Fast: up to 2 hours
• Medium: up to 96 hours
• Large: up to 168 hours
• Xtra_Large: up to 240 hours
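
As a simplified picture of how a requested run time maps onto these groups, the sketch below picks the first group whose limit covers the request. queue_group is a hypothetical helper. The real SGE scheduler also takes memory and processor counts into account, so this is not how the scheduler itself is implemented.

```shell
#!/bin/sh
# Which Sheffield queue group a requested run time (in whole hours) falls into.
# Limits are the ones listed above; queue_group is a hypothetical helper,
# and real SGE scheduling also considers memory and processor counts.
queue_group() {
    hours=$1
    if   [ "$hours" -le 2 ];   then echo "Fast"
    elif [ "$hours" -le 96 ];  then echo "Medium"
    elif [ "$hours" -le 168 ]; then echo "Large"
    elif [ "$hours" -le 240 ]; then echo "Xtra_Large"
    else echo "none (exceeds 240 hours)"
    fi
}

for h in 1 48 100 200 300; do
    echo "$h hours -> $(queue_group "$h")"
done
```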

Current queue groups at Leeds

• Fast1: up to 15 minutes
• Fast2: up to 30 minutes
• Medium: up to 96 hours
• Large: up to 168 hours and 8 processors

Monitoring the progress of your jobs

• The qstat command and the X Windows based qmon can be used to check on the progress of your jobs through the system.

• We recommend that you use the qmon command if your terminal has X capability, as this makes it easier to view your job's progress and to cancel or abort it if necessary.

Checking the progress of jobs with QMON

[Screenshot of QMON job control. Callouts: click on the Job Control icon; click on the Running Jobs tab.]

qstat command

• The qstat command lists all the jobs in the system that are either waiting to be run or running.
• qstat -f : full listing
• qstat -u username, or Qstat
• qstat -f -u username

The status of the job is indicated by letters:

qw    waiting
t     transferring
r     running
s, S  suspended
R     restarted
T     threshold

Starting Interactive Jobs

• Sun Grid Engine also allows interactive jobs to be run on the execution hosts.

• The basic accounting and execution strategies that apply to batch jobs also apply to interactive jobs. The only difference is that a job submitted to run interactively will not be queued if there are no resources to run it. It will simply be rejected and the user informed. This is because, by its very nature, an interactive session is not something users can be made to wait for.

Commands for Submitting Interactive Jobs

• qlogin : starts a telnet session to one of the execution hosts.
• qrsh : starts a remote command shell and optionally executes a shell script.
• qsh : starts an xterm session.

In Sheffield all interactive jobs are put into the small queues, which limit the clock time to 8 hours and the CPU time to 2 hours.
BEWARE: As soon as either of these two time limits is exceeded, the job will terminate without any warning.

qlogin command

• Starts up an interactive session on one of the execute nodes under the control of SGE.

• The interactive job uses the same terminal (via telnet) as the submitting process, hence taking control of the terminal until a logout or exit command terminates the job.

qrsh command

qrsh [parameters]

• If no parameters are given, it behaves exactly like qlogin.

• If there are parameters, a remote shell is started on one of the execution nodes and the parameters are passed to the shell for execution. For example, if a script file name is given as a parameter, the commands in the script file are executed and the job terminates when the end of the script file is reached. Example: qrsh myscript

qsh and Qsh

• qsh -display display_specifier

qsh starts up an xterm within which the interactive job is started. It is possible to pass any xterm parameters via the -- construct. Example: qsh -- -title myjob1

Type man xterm for a list of parameters.

• Qsh : a Sheffield-only version of qsh with nicer xterm parameters.

Deleting jobs from the queue: qdel command

It is always possible to cancel or terminate jobs submitted to SGE before they are completed. This applies to queuing, transferring and running jobs. Format of the qdel command:

qdel job_id_number

You can also delete all jobs belonging to you by typing:

qdel -u your_user_name

Further documentation and help

• man (manual pages): man sge, man qsh and so on.

• On titania: the docs command, which starts up the Netscape browser. See the section on Sun Grid Engine.

• http://www.shef.ac.uk/wrgrid/packages/sge/sgeindex.html

The End