![Page 1: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/1.jpg)
An overview of Torque/Moab queuing
![Page 2: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/2.jpg)
Topics
- ARC topology
- Authentication
- Architecture of the queuing system
- Workflow
- Job Scripts
- Some queuing strategies
![Page 3: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/3.jpg)
Network Topology
![Page 4: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/4.jpg)
![Page 5: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/5.jpg)
ARC Authentication
![Page 6: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/6.jpg)
Accounts
- Your account is your VT PID
- Your password is your VT PID password
- Contact 4help to change your password
![Page 7: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/7.jpg)
Architecture
- Resource Manager: Torque
- Scheduler: Moab
- Allocation Manager: Gold
![Page 8: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/8.jpg)
Account Requests
- To request an account: http://www.arc.vt.edu/arc/UserAccounts.php
- System X accounts: https://portal.arc.vt.edu/allocation/alloc_request.html
- To add users to a Hat/Project for System X, the PI should email [email protected] to ask to have that person added
![Page 9: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/9.jpg)
Queue Architecture
![Page 10: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/10.jpg)
Resource Manager
- Torque (Terascale Open-source Resource and QUEue manager)
- A fork of OpenPBS
- 2 parts:
  - pbs_mom: daemon on each compute node. Handles job start-up and keeps track of the node's state
  - pbs_server: server that jobs are submitted to. Keeps track of all nodes and jobs
![Page 11: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/11.jpg)
Moab Scheduler
- Takes state information from the resource manager and then schedules jobs to run
- "The Brains"
- Implements and manages:
  - Scheduling policies
  - Dynamic priorities
  - Reservations
  - Fairshare
![Page 12: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/12.jpg)
Allocation Manager
- Gold
  - Keeps track of cpu-hours
![Page 13: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/13.jpg)
![Page 14: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/14.jpg)
Workflow
From the queuing system's point of view
- When a scheduling interval starts:
  - Moab asks pbs_server for the state of the nodes and of any jobs
  - Moab attempts to schedule any eligible jobs if enough resources are free
  - Moab tells pbs_server to start any jobs that can be started
  - pbs_server contacts the pbs_mom on the first node assigned to the job (that pbs_mom is called the mother superior)
  - The mother superior executes the job script submitted by the user
- When a PBS heartbeat happens:
  - pbs_server contacts the pbs_mom and asks for the status of its node
![Page 15: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/15.jpg)
Workflow
From a user's point of view
- Submit a job script to the queuing system
- Wait for the job to be scheduled and run
- Get the results
![Page 16: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/16.jpg)
The Queue
Queue divided into 3 subqueues:
- Active – running
- Eligible – idle, but waiting to run
- Blocked – idle, held, deferred
![Page 17: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/17.jpg)
Blocked jobs
A job can be "blocked" for several reasons:
- Requested resources not available
- Reserved nodes offline
- User already has the maximum number of eligible jobs in the queue
- User places intentional hold

Moab supports four distinct types of holds: user, system, batch, and deferred
![Page 18: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/18.jpg)
Job Scripts
- The job script has a few definitions to inform the queuing system of your job requirements and who you are
- Includes environment variables and commands to run your application
![Page 19: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/19.jpg)
Script Definitions
- Walltime request
  - `#PBS -lwalltime=hh:mm:ss`
- CPU request
  - For System X: `#PBS -lnodes=X:ppn=2`
    - X number of nodes with 2 processors per node
  - For Cauldron: `#PBS -lncpus=X`
    - X number of cores
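As a worked example of the System X request, the sketch below turns a desired processor count into a `#PBS -lnodes=X:ppn=2` line by rounding up to whole nodes. The function name `sysx_nodes_directive` is hypothetical; the figure of 2 processors per node is the one given on the slide.

```bash
#!/bin/bash
# Hypothetical helper: print the System X node request for a processor count.
# Assumes 2 processors per node, as stated on the slide.
sysx_nodes_directive() {
    procs=$1
    ppn=2
    # Round up so that an odd processor count still gets a whole node.
    nodes=$(( (procs + ppn - 1) / ppn ))
    printf '#PBS -lnodes=%d:ppn=%d\n' "$nodes" "$ppn"
}

sysx_nodes_directive 112    # prints: #PBS -lnodes=56:ppn=2
```

The 112-processor case matches the `checkjob` listing later in the deck, where 112 requested tasks map to 56 requested nodes.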
![Page 20: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/20.jpg)
Script Definitions
- Which queue you want to use
  - `#PBS -q <queue name>`
- Queues available now:
  - System X OS X partition: production_q
  - System X Linux partition: linux_q
  - Cauldron: cauldron_q
  - Inferno2: inferno2_q
  - Ithaca: ithaca_q
  - Ithaca parallel matlab: pmatlab_q
![Page 21: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/21.jpg)
Script Definitions
- Some information about who you are
  - Your submission group: `#PBS -W group_list=<group>`
    - For System X it is tcf_user
    - For Cauldron it is sgiusers
    - Type `groups` when logged into a head node to check that you belong to the group of the machine you wish to submit to
  - Your cpu-hour hat: `#PBS -A <hat>`
    - On Cauldron it is sgim0000
    - System X users were told their hat in their welcome letters.
![Page 22: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/22.jpg)
Job Script Template

```bash
#!/bin/bash
#PBS -lwalltime=01:00:00
#PBS -lncpus=8
#PBS -q cauldron_q
#PBS -W group_list=sgiusers
#PBS -A sgim0000
```
![Page 23: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/23.jpg)
Job Script
- After the PBS definitions, put in the commands to start your job
- There are example job scripts found in /apps/doc(s)
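Putting the pieces together, a complete Cauldron job script might look like the sketch below, which writes the script to a file with a here-document and submits nothing. The file name `jobscript` follows the `qsub ./jobscript` example in this deck. The job body is an assumption for illustration: `./my_app` is a hypothetical program name, and `PBS_O_WORKDIR` is the variable Torque sets to the directory `qsub` was run from.

```bash
#!/bin/bash
# Write an example job script to ./jobscript (nothing is submitted here).
# The quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded until the job runs.
cat > jobscript <<'EOF'
#!/bin/bash
#PBS -lwalltime=01:00:00
#PBS -lncpus=8
#PBS -q cauldron_q
#PBS -W group_list=sgiusers
#PBS -A sgim0000

# Commands to start the job go after the PBS definitions.
# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
cd "$PBS_O_WORKDIR"
./my_app    # hypothetical application name
EOF

# Show the directives that the queuing system will read.
grep '^#PBS' jobscript
```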
![Page 24: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/24.jpg)
Running Your Job
- Use qsub to submit your job to the queue
  - `qsub ./jobscript`
- To check on your job's status
  - `qstat -a <queue name>`
  - `showq -p <partition name>` (OSX, LINUX, or CAULDRON)
  - `checkjob <job id number>`
  - `cstat` (on Cauldron)
- To delete a job, use qdel
  - `qdel <job id number>`
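`qsub` prints the full job identifier, while `checkjob` and `qdel` are shown here taking the job id number. A small sketch, assuming identifiers of the form `176956.queue.arc-int.vt.edu` as seen in the `checkjob` listing in this deck; the function names `job_number` and `submit_and_check` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: strip the server suffix from a full job identifier,
# e.g. 176956.queue.arc-int.vt.edu -> 176956.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical wrapper: submit a script, then ask Moab about the new job.
# It is only defined here; call it on a head node where qsub and checkjob exist.
submit_and_check() {
    full_id=$(qsub "$1") || return 1
    checkjob "$(job_number "$full_id")"
}

job_number 176956.queue.arc-int.vt.edu    # prints: 176956
```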
![Page 25: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/25.jpg)
Check Status
To display jobs currently in the queue:

```
-bash-3.1$ showq -p LINUX

active jobs------------------------
JOBID   USERNAME  STATE    PROCS    REMAINING  STARTTIME

176882  jalemkul  Running     24     23:46:56  Mon Aug 2 07:11:24
176885  jalemkul  Running     24   1:01:37:59  Mon Aug 2 09:02:27
176889  jalemkul  Running     24   1:02:21:27  Mon Aug 2 09:45:55
176918  kmsong    Running     44   6:14:25:16  Mon Aug 2 16:49:44
176897  kmsong    Running     88  15:17:01:30  Tue Aug 3 11:25:58

5 active jobs   118 of 118 processors in use by local jobs (100.00%)
                50 of 59 nodes active (84.75%)

eligible jobs----------------------
JOBID   USERNAME  STATE    PROCS      WCLIMIT  QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID   USERNAME  STATE    PROCS      WCLIMIT  QUEUETIME

176956  kmsong    Idle       112  33:08:00:00  Tue Aug 3 15:15:13

1 blocked job

Total jobs: 6
```
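The REMAINING and WCLIMIT columns appear to use a `days:hours:minutes:seconds` layout once a time exceeds one day. This reading is inferred from the slides themselves: the blocked job's WCLIMIT of 33:08:00:00 corresponds to the `walltime=800:00:00` it was submitted with. A sketch that converts such a value to seconds, with the hypothetical function name `moab_to_seconds`:

```bash
#!/bin/bash
# Hypothetical helper: convert [DD:]HH:MM:SS to seconds.
# awk is used so that fields such as "08" are read as decimal numbers.
moab_to_seconds() {
    echo "$1" | awk -F: '{
        secs = 0
        off = 0
        # A 4-field value starts with a day count.
        if (NF == 4) { secs = $1 * 86400; off = 1 }
        secs += $(1 + off) * 3600 + $(2 + off) * 60 + $(3 + off)
        printf "%d\n", secs
    }'
}

# 33 days 8 hours is 800 hours, i.e. 2880000 seconds.
moab_to_seconds 33:08:00:00
```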
![Page 26: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/26.jpg)
Check Status
With qstat:

```
-bash-3.1$ qstat linux_q
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
176882.queue        yt42_md1         jalemkul        1229:52: R linux_q
176885.queue        yt42_md3         jalemkul        1185:20: R linux_q
176889.queue        yt42_md2         jalemkul        1168:07: R linux_q
176897.queue        DNS              kmsong          00:00:00 R linux_q
176918.queue        Re1200_2sec      kmsong          1828:24: R linux_q
176956.queue        LDNS             kmsong                 0 Q linux_q
```

Note: status is given by R – running and Q – queued
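To summarise a long listing, the state column can be tallied with `awk`. This is a sketch that assumes the column layout shown on the slide, where the state is the second-to-last field; the function name `count_states` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: count jobs per state from `qstat <queue>` output on stdin.
# Job rows start with a numeric job id; the state is the second-to-last column.
count_states() {
    awk '$1 ~ /^[0-9]+\./ { n[$(NF-1)]++ }
         END { printf "R=%d Q=%d\n", n["R"], n["Q"] }'
}

# Sample rows copied from the slide. On a head node: qstat linux_q | count_states
count_states <<'EOF'
Job id Name User Time Use S Queue
------------------- ---------------- --------------- -------- - -----
176882.queue yt42_md1 jalemkul 1229:52: R linux_q
176897.queue DNS kmsong 00:00:00 R linux_q
176956.queue LDNS kmsong 0 Q linux_q
EOF
```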
![Page 27: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/27.jpg)
checkjob -v

```
-bash-3.1$ checkjob -v 176956
job 176956 (RM job '176956.queue.arc-int.vt.edu')

AName: LDNS
State: Idle
Creds: user:kmsong group:tcf_user account:engr1003 class:linux_q qos:sysx_qos
WallTime: 00:00:00 of 33:08:00:00
SubmitTime: Tue Aug 3 15:15:13
(Time Queued Total: 1:23:40:11 Eligible: 00:00:19)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 112
Total Requested Nodes: 56

Req[0] TaskCount: 112 Partition: ALL
NodeAccess: SINGLEJOB
TasksPerNode: 2

UMask: 0000
OutputFile: sysx2.arc-int.vt.edu:/home/kmsong/Turb_channel/Simulation/Re600/176103.queue.arc-int.vt.edu/LDNS.o176956
ErrorFile: sysx2.arc-int.vt.edu:/home/kmsong/Turb_channel/Simulation/Re600/176103.queue.arc-int.vt.edu/LDNS.e176956

BypassCount: 305
Partition List: LINUX,SHARED
SrcRM: SystemX DstRM: SystemX DstRMJID: 176956.queue.arc-int.vt.edu
Submit Args: -l walltime=800:00:00 -l nodes=56:ppn=2 -Wgroup_list -Aengr1003 -NLDNS -q linux_q -I
Flags: INTERACTIVE
Attr: INTERACTIVE,checkpoint
StartPriority: 200
PE: 112.00
```
![Page 28: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/28.jpg)
```
NOTE: job violates constraints for partition OSX (partition OSX not in job partition mask)

Node Availability for Partition LINUX --------
available for 2 tasks - n[925,951-958]
rejected for State - n[833-1024]
NOTE: job req cannot run in dynamic partition LINUX now (insufficient procs available: 18 < 112)

NOTE: job violates constraints for partition CAULDRON (partition CAULDRON not in job partition mask)
NOTE: job violates constraints for partition INFERNO2 (partition INFERNO2 not in job partition mask)
NOTE: job violates constraints for partition TT (partition TT not in job partition mask)
NOTE: job violates constraints for partition PECOS (partition PECOS not in job partition mask)
NOTE: job violates constraints for partition ITHACA (partition ITHACA not in job partition mask)

BLOCK MSG: job 176956 violates active SOFT MAXJOB limit of 2 for class linux_q user (Req: 1 InUse: 2) (recorded at last scheduling iteration)
```
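In a listing this long, the line that explains why a job is blocked is easy to miss. The sketch below filters `checkjob -v` output down to its `BLOCK MSG` lines. It assumes the message format shown on the slide, and the function name `block_reason` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print only the BLOCK MSG text from checkjob output on stdin.
block_reason() {
    sed -n 's/^ *BLOCK MSG: *//p'
}

# Sample lines copied from the slide. On a head node: checkjob -v 176956 | block_reason
block_reason <<'EOF'
NOTE: job violates constraints for partition OSX (partition OSX not in job partition mask)
BLOCK MSG: job 176956 violates active SOFT MAXJOB limit of 2 for class linux_q user (Req: 1 InUse: 2) (recorded at last scheduling iteration)
EOF
```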
![Page 29: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/29.jpg)
Queuing Strategies
- Queue early, queue often
  - Queue your jobs up! Jobs can't run if they aren't in the queue
  - Don't wait for the queue to get smaller: the job has to wait either way, so it may as well wait in the queue
  - Smaller jobs may get backfilled
- Have an accurate walltime
  - Accurate walltimes help the scheduler backfill smaller jobs in between runs of larger jobs, but only if doing so won't affect the start time of the next job
- Try to queue large jobs before downtimes
  - If you have a large job that can never seem to get enough cpus, queue it up before a downtime.
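One way to arrive at an accurate walltime is to time a representative run and request slightly more than that. The sketch below formats such a request as `hh:mm:ss`. The function name `walltime_request` is hypothetical, and the default 20% margin is an arbitrary assumption for illustration, not a site recommendation.

```bash
#!/bin/bash
# Hypothetical helper: turn a measured runtime in seconds into an hh:mm:ss
# walltime request, padded by a safety margin in percent (default 20).
walltime_request() {
    secs=$1
    margin=${2:-20}
    total=$(( secs + secs * margin / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A run measured at 3000 s plus 20% gives a one-hour request.
printf '#PBS -lwalltime=%s\n' "$(walltime_request 3000 20)"
```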
![Page 30: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/30.jpg)
Queuing Strategies
- The command `showbf`
  - Shows cpus available right now, and for how long
- `showstart`
  - Estimated start time of a job
- `checkjob -v`
- Checkpointing
  - If your code does checkpointing, you can exploit backfill by queuing jobs that fit into the small gaps, even if they may not run to completion
  - Good idea in general, in case of hardware failure
![Page 31: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/31.jpg)
showbf

```
-bash-3.1$ showbf
Partition  Tasks  Nodes  StartOffset   Duration      StartDate
---------  -----  -----  ------------  ------------  --------------
ALL          146     43      00:00:00      INFINITY  09:29:01_08/10
OSX            4      2      00:00:00      INFINITY  09:29:01_08/10
LINUX         62     31      00:00:00      INFINITY  09:29:01_08/10
PECOS          8      1      00:00:00      INFINITY  09:29:01_08/10
ITHACA        72      9      00:00:00      INFINITY  09:29:01_08/10
```
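To find where a job of a given size could start right away, the Tasks column can be filtered with `awk`. This sketch assumes the column order shown on the slide and ignores the Duration column, so it does not check whether the free window is long enough for the job. The function name `partitions_with_room` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list partitions whose free task count is at least $1,
# reading showbf output on stdin (Tasks is the second column).
# The aggregate ALL row is skipped, as are header and separator rows.
partitions_with_room() {
    awk -v need="$1" '$2 ~ /^[0-9]+$/ && $2 >= need && $1 != "ALL" { print $1 }'
}

# Sample rows copied from the slide. On a head node: showbf | partitions_with_room 62
partitions_with_room 62 <<'EOF'
Partition Tasks Nodes StartOffset Duration StartDate
--------- ----- ----- ------------ ------------ --------------
ALL 146 43 00:00:00 INFINITY 09:29:01_08/10
OSX 4 2 00:00:00 INFINITY 09:29:01_08/10
LINUX 62 31 00:00:00 INFINITY 09:29:01_08/10
PECOS 8 1 00:00:00 INFINITY 09:29:01_08/10
ITHACA 72 9 00:00:00 INFINITY 09:29:01_08/10
EOF
```

With the slide's data and a request for 62 tasks, this lists LINUX and ITHACA.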
![Page 32: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/32.jpg)
showstart

```
-bash-3.1$ showstart 177165
job 177165 requires 64 procs for 12:00:00

Estimated Rsv based start in 8:47:53 on Tue Aug 10 18:11:40
Estimated Rsv based completion in 20:47:53 on Wed Aug 11 06:11:40

Best Partition: OSX
```
![Page 33: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/33.jpg)
showstart

```
-bash-3.1$ showstart 64@12:00:00
job 64@12:00:00 requires 64 procs for 12:00:00

Estimated Rsv based start in 8:44:19 on Tue Aug 10 18:11:40
Estimated Rsv based completion in 20:44:19 on Wed Aug 11 06:11:40

Best Partition: OSX
```
![Page 34: An overview of Torque/Moab queuing. Topics ARC topology Authentication Architecture of the queuing system Workflow Job Scripts Some queuing strategies](https://reader037.vdocuments.site/reader037/viewer/2022102906/56649d095503460f949dc07f/html5/thumbnails/34.jpg)
Documentation
Torque/PBS and Moab scheduler and job submission documentation:
http://www.clusterresources.com/pages/resources/documentation.php