TRANSCRIPT
CERN Batch Service: HTCondor
8/15/2019 Document reference 2
Agenda
• Batch Service
• What is HTCondor?
• Job Submission
• Multiple Jobs & Requirements
• File Transfer
Batch Service
• IT-CM-IS Mandate: "Provide high-level compute services to the CERN Tier-0 and WLCG"
• HTCondor: our production batch service.
• Service used for both “grid” and “local” submission
• Local means open to all CERN users: Kerberos, shared filesystem, managed submission nodes
• ~218k cores in HTCondor
• Over a million jobs a day
• Service Element: Batch Service
What is HTCondor?
Part of the content adapted from: “An introduction to using HTCondor”
by Christina Koch, HTCondor Week 2016 & HTCondor Week 2018
What is HTCondor?
• Open Source batch system developed at the CHTC at the University of Wisconsin
• “High Throughput Computing”
• Long history in HEP and elsewhere (including previously at CERN)
• Used extensively in OSG, and in things like the CMS global pool (200k+ cores)
• System of symmetric matching of job requests to resources using ClassAds of job requirements and machine resources
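As a sketch of what this symmetric matching means, a job ad can carry an expression constraining machines while a machine ad carries one constraining jobs. The attribute values below are illustrative, not the CERN pool's actual policy:

```
# Job ClassAd fragment: the job's requirements on a machine
Requirements = (TARGET.OpSysAndVer == "CentOS7") && (TARGET.Memory >= RequestMemory)

# Machine ClassAd fragment: the machine's requirements on jobs
# (e.g. only accept jobs asking for at most 4 CPUs)
START = (TARGET.RequestCpus <= 4)
```

A match is made only when both expressions evaluate to true against the other side's ad.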
HTCondor elements
[Diagram: Submit Side | Broker | Execute Side]
• Submit side: Schedds hold the submitted jobs; the broker pulls the list of jobs from them.
• Broker: the Collector receives machine properties (ClassAds) from the execute side; the Negotiator matches jobs & machines.
• Execute side: Startds run the jobs; after a match, jobs are sent to the reserved slot.
Execute Side
• Slot: 1 CPU / 2 GB RAM / 20 GB disk
• CPU / memory requests are scaled to reflect the slot ratio: ask for 2 CPUs, get 4 GB RAM
• Mostly CentOS7 at this point
• CentOS8 is in the works, but CentOS7 will likely remain the platform for the next run
• Docker & Singularity are available for containers
Jobs
• A single computing task is called a “job”
• The three main pieces of a job are the input, executable and output
Job Example
[Diagram: compare_states takes wi.dat and us.dat as input and produces wi.dat.out]
$ compare_states wi.dat us.dat wi.dat.out
• The executable must be runnable from the command line without any interactive input
Job Translation
• Submit file: communicates everything about your job(s) to HTCondor
• The main goal of this training is to show you how to properly represent your job in a submit file
executable = compare_states
arguments = wi.dat us.dat wi.dat.out
should_transfer_files = YES
transfer_input_files = us.dat, wi.dat
when_to_transfer_output = ON_EXIT
log = job.log
output = job.out
error = job.err
request_cpus = 1
request_disk = 20MB
request_memory = 20MB
queue 1
CERN HTCondor Service
HTCondor Pool – "CERN Condor Share"
[Diagram: local users authenticate with Kerberos, typically from lxplus.cern.ch, and submit via local Schedds (bigbirdXY.cern.ch); grid submission authenticates with grid certificates via CE Schedds (ce5XY.cern.ch); the Central Manager (tweetybirdXY.cern.ch) matches jobs to workers of different flavours (SLC6 / Mix CC7 / SLC6 Short)]
• Different flavours, same config: afs, cvmfs, eos, root, …
• "It's like lxplus"
Ex. 1: Job Submission: submit file
lxplus ~$ vi ex1.sub
universe = vanilla
executable = ex1.sh
arguments = "training 2018"
output = output/ex1.out
error = error/ex1.err
log = log/ex1.log
queue
• universe: an HTCondor execution environment. Vanilla is the default and should cover 90% of cases.
• executable: the program or script to run.
• arguments: any options passed to the executable on the command line.
• output/error: capture stdout & stderr.
• log: file created by HTCondor to track job progress.
• queue: keyword indicating "create a job".
Ex. 1: Job Submission: script
lxplus ~$ vi ex1.sh
#!/bin/sh
echo 'Date: ' $(date)
echo 'Host: ' $(hostname)
echo 'System: ' $(uname -spo)
echo 'Home: ' $HOME
echo 'Workdir: ' $PWD
echo 'Path: ' $PATH
echo "Program: $0"
echo "Args: $*"
• The shebang (#!) is mandatory when submitting script files in HTCondor: "#!/bin/sh", "#!/bin/bash", "#!/bin/env python"
• A malformed or invalid shebang is silently ignored and no error is reported (yet)
lxplus ~$ chmod +x ex1.sh
Ex. 1: Job Submission
lxplus ~$ condor_submit ex1.sub
Submitting job(s).
1 job(s) submitted to cluster 162.
universe = vanilla
executable = ex1.sh
arguments = "training 2018"
output = output/ex1.out
error = error/ex1.err
log = log/ex1.log
queue
To submit a job/jobs
condor_submit <submit_file>
To monitor submitted jobs:
condor_q
lxplus ~$ condor_q
-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fernandl CMD: ex1.sh 11/19 20:49 _ _ 1 1 162.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
More about condor_q
• By default condor_q shows:
• The user's jobs only
• Jobs summarized in batches: same cluster, same executable, or same batch name
lxplus ~$ condor_q
-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fernandl CMD: /bin/hostname 11/19 20:49 _ _ 1 1 162.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
JobId = ClusterId.ProcId
More about condor_q
• To see individual job information, use: condor_q -nobatch
• We will use the -nobatch option in the following slides to see extra detail about what is happening with a job
lxplus ~$ condor_q -nobatch
-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:32
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
162.0 fernandl 11/19 20:49 0+00:00:00 I 0 0.0 hostname
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Job States
[Diagram: condor_submit → Idle (I) → Running (R) → Completed (C)]
• Idle → Running: the executable and input are transferred to the execute node
• Running → Completed: the output is transferred back to the submit node
• Idle and Running jobs are in the queue; Completed jobs are leaving the queue
Log File
000 (168.000.000) 11/20 11:34:25 Job submitted from host:
<137.138.120.138:9618?addrs=137.138.120.138-9618&noUDP&sock=1069_d2d4_3>
...
001 (168.000.000) 11/20 11:37:26 Job executing on host:
<188.185.217.222:9618?addrs=188.185.217.222-9618+[--1]-9618&noUDP&sock=3285_211b_3>
...
006 (168.000.000) 11/20 11:37:30 Image size of job updated: 15
0 - MemoryUsage of job (MB)
0 - ResidentSetSize of job (KB)
...
005 (168.000.000) 11/20 11:37:30 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
19 - Run Bytes Sent By Job
15768 - Run Bytes Received By Job
19 - Total Bytes Sent By Job
15768 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 31 15 1841176
Memory (MB) : 0 2000 2000
...
Class Ad & Matchmaking
The Central Manager
• HTCondor matches jobs with computers via a “central manager”
[Diagram: one submit machine and several execute machines, coordinated by the central manager]
Class Ads
• HTCondor stores a list of information about each job and each computer.
• This information is stored as a “Class Ad”
• Class Ads have the format: AttributeName = value
• value can be a Boolean, a number or a string
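For instance, one attribute of each type might look like this (the names and values are illustrative, not taken from a real ad):

```
IsDedicated = true       # Boolean
RequestCpus = 2          # number
Owner = "alice"          # string
```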
Job Class Ads
RequestCpus = 1
Err = "job.err"
WhenToTransferOutput = "ON_EXIT"
TargetType = "Machine"
Cmd = "/afs/cern.ch/user/f/fernandl/condor/exe"
Arguments = "x y z"
JobUniverse = 5
Iwd = "/afs/cern.ch/user/f/fernandl/condor"
RequestDisk = 20480
NumJobStarts = 0
WantRemoteIO = true
OnExitRemove = true
MyType = "Job"
Out = "job.out"
UserLog = "/afs/cern.ch/user/f/fernandl/condor/job.log"
RequestMemory = 20
...
(The Job ClassAd above = this submit file + HTCondor configuration*)
executable = exe
arguments = "x y z"
log = job.log
output = job.out
error = job.err
queue 1
Machine Class Ads
HasFileTransfer = true
DynamicSlot = true
TotalSlotDisk = 4300218.0
TargetType = "Job"
TotalSlotMemory = 2048
Mips = 17902
Memory = 2048
UtsnameSysname = "Linux"
MAX_PREEMPT = ( 3600 * 72 )
Requirements = ( START ) && (
IsValidCheckpointPlatform ) && (
WithinResourceLimits )
OpSysMajorVer = 6
TotalMemory = 9889
HasGluster = true
OpSysName = "SL"
HasDocker = true
...
(The Machine ClassAd above = machine properties + HTCondor configuration)
Job Matching
• On a regular basis, the central manager reviews Job and Machine Class Ads and matches jobs to computers
[Diagram: the central manager matches jobs on the submit machine to execute machines]
Job Execution
• After the central manager makes the match, the submit and execute points communicate directly
[Diagram: after the match, the submit machine talks directly to the execute machine]
Class Ads for People
• Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators
Finding Job Attributes
• Use the "long" option for condor_q: condor_q -l <JobId>
$ condor_q -l 128.0
Arguments = ""
Cmd = "/bin/hostname"
Err = "error/hostname.err"
Iwd = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs"
JobUniverse = 5
OnExitRemove = true
Out = "output/hostname.out"
RequestMemory = 2000
Requirements = ( TARGET.Hostgroup =?= "bi/condor/gridworker/share/mixed" ||
TARGET.Hostgroup =?= "bi/condor/gridworker/shareshort" || TARGET.Hostgroup =?=
"bi/condor/gridworker/share/singularity" || TARGET.Hostgroup =?=
"bi/condor/gridworker/sharelong" ) && VanillaRequirements
TargetType = "Machine"
UserLog = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs/log/hostname.log"
WantRemoteIO = true
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
...
Resource Request
• Jobs use a part of the computer, not the whole thing.
• It is important to size job requirements appropriately: memory, CPUs and disk.
• CERN HTCondor defaults:
• 1 CPU
• 2 GB RAM
• 20 GB disk
[Diagram: your request as a slice of the whole computer]
Resource Request (II)
• Even if the system sets default CPU, memory and disk requests, they may be too small.
• It is important to run the job and use the information from the log to request the right amount of resources:
• Requesting too little causes problems for your jobs and others'; jobs might be held by HTCondor or killed by the system.
• Requesting too much: jobs will match fewer slots and will waste resources.
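For example, after checking the usage block that the job log reports, the requests can be set explicitly in the submit file. The values here are illustrative, not recommendations:

```
request_cpus   = 1
request_memory = 4GB
request_disk   = 10GB
```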
Time to start running
• As we have seen, jobs don’t start to run immediately after the submission.
• Many factors are involved:
• Negotiation cycle: the central managers don't perform matchmaking continuously; it is an expensive operation (~5 min).
• User priority: user priority is dynamic and recalculated according to usage.
• Availability of resources: there are many worker flavours, and the machines matching your job requirements might be busy.
• More info: BatchDocs (Fairshare) & Manual (User Priorities)
Ex. 4: Multiple Jobs (queue)
lxplus ~$ vi ex4.sub
universe = vanilla
executable = ex4.sh
arguments = $(ClusterId) $(JobId)
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
queue 5
• queue: controls how many instances of the job are submitted (default 1). It supports dynamic input.
• Pre-defined macros: we can use the $(ClusterId) and $(ProcId) variables to provide unique values to the job files.
Ex. 5: Multiple Jobs (queue)
lxplus ~$ vi ex5.sub
universe = vanilla
executable = $(filename)
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
queue filename matching files ex5/*.sh
• The resulting jobs point to different executables, but they will belong to the same ClusterId with different ProcIds.
• The use of wildcard patterns in queue allows us to submit several different jobs at once.
Queue Statement Comparison
• multiple queue statements: not recommended. Can be useful when submitting job batches where a single (non-file/argument) characteristic is changing.
• matching .. pattern: natural nested looping, minimal programming; use the optional "files" and "dirs" keywords to match only files or only directories. Requires good naming conventions.
• in .. list: supports multiple variables; all information is contained in a single file; reproducible. Harder to automate submit file creation.
• from .. file: supports multiple variables; highly modular (easy to use one submit file for many job batches); reproducible. An additional file is needed.
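The data-driven queue variants compared above can be sketched as follows; the file names, variable names and values are hypothetical:

```
# in .. list: iterate a variable over an explicit list
queue arg in (alpha, beta, gamma)

# from .. file: read one or more variables per row from an external file
queue input,seed from params.txt

# matching .. pattern: one job per file matching the glob pattern
queue input matching files input/*.dat
```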
CERNism: JobFlavour
• Set of pre-defined run times to bucket jobs easily (default: espresso)
universe = vanilla
executable = training.sh
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
+JobFlavour = "microcentury"
queue
espresso = 20 min
microcentury = 1 hour
longlunch = 2 hours
workday = 8 hours
tomorrow = 1 day
testmatch = 3 days
nextweek = 1 week
Exceeding MaxRuntime
• What happens if we set MaxRuntime to less than the job needs in order to complete?
universe = vanilla
executable = training.sh
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
+MaxRuntime = 120
queue
The job will be removed by the system.
[fprotops@lxplus088 training]$ condor_q -af MaxRuntime
120
[fprotops@lxplus088 training]$ condor_history -l <job id> | grep -i remove
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to wall time exceeded allowed max."
Debug (I): condor_status
• It displays the status of machines in the pool:
$ condor_status -avail
$ condor_status -schedd
$ condor_status <hostname>
$ condor_status -l <hostname>
• It supports filtering based on ClassAds:
[fprotops@lxplus071 ssh]$ condor_status -const 'OpSysAndVer =?= "CentOS7"'
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
Debug (II): condor_ssh_to_job
• Creates an ssh session to a running job:
$ condor_ssh_to_job <job id>
$ condor_ssh_to_job -auto-retry <job id>
• This will get us access to the contents of our sandbox in the worker node: output, temp files, credentials…
[fprotops@lxplus071 ssh]$ condor_ssh_to_job -auto-retry <job id>
[email protected]: Rejecting request, because the job execution environment is not yet ready.
Waiting for job to start...
Welcome to [email protected]!
Your condor job is running with pid(s) 18694.
[fprotops@b626c4b230 dir_18443]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout  fprotops.cc  test.txt  tmp  var
Debug (III): condor_tail
• It displays the tail of the job files:
$ condor_tail -follow <job id>
• The output can be controlled via flags:
$ condor_tail -follow -no-stdout -stderr <job id>
[fprotops@lxplus052 training]$ condor_tail -follow <job id>
Welcome to the HTCondor training!
Debug (IV): Hold & Removed
• Apart from Idle, Running and Completed, HTCondor defines two more states: Hold and Removed
• Jobs can get into the Hold or Removed state either by the user or by the system
• Related commands:
$ condor_q -hold
$ condor_q -af HoldReason <job_id>
$ condor_hold
$ condor_release
$ condor_rm
$ condor_history -limit 1 <job_id> -af RemoveReason
[fprotops@lxplus052 training]$ condor_q -af HoldReason 155.0
via condor_hold (by user fprotops)
File Transfer
• A job will need input and output data.
• There are several ways to get data in or out of the batch system, so we need to know a little about the trade-offs.
• Do you want to use a shared filesystem? Do you want to have condor transfer data for you? Should you handle input or output in the job payload itself?
15/08/2019 condor data transfer 41
Infrastructure
Adding Input files
• In order to add input files, we just need to add "transfer_input_files" to our submit file
• It's a list of files to take from the working directory and send to the job sandbox
• This example produces one output file, "merge.out"
executable = merge.sh
arguments = a.txt b.txt merge.out
transfer_input_files = a.txt, b.txt
log = job.log
output = job.out
error = job.err
+JobFlavour = "longlunch"
queue 1
Transferring output back
• By default condor will transfer everything in your sandbox
• To transfer back only the files you need, use transfer_output_files
• Adding a file to transfer_output_files also adds it to the list that "condor_tail" can see
executable = merge.sh
arguments = a.txt b.txt merge.out
transfer_input_files = a.txt, b.txt
transfer_output_files = merge.out
log = job.log
output = job.out
error = job.err
+JobFlavour = "longlunch"
queue 1
Important considerations
• Even when using a shared filesystem, files are transferred to a scratch space on the workers, the "sandbox".
• Remember the impact on the filesystem! The most efficient use of network filesystems is typically to write once, at the end of a job.
• You have 20GB of sandbox per CPU.
• There are limits to the amount of data that we allow to be transferred using condor file transfer: the limit is currently 1GB per job.
• The job itself can do file transfer, both input and output.
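The "write once, at the end" pattern can be sketched as a job script that does all its work in the local sandbox and performs a single copy at the end. Here plain cp stands in for the real transfer tool (e.g. xrdcp to EOS), and all paths are hypothetical:

```shell
#!/bin/sh
# Do all the work in a local scratch directory (the sandbox in a real job).
WORKDIR=$(mktemp -d)
RESULT="$WORKDIR/result.txt"

# ... the payload: write output locally, not to the network filesystem ...
echo "analysis output" > "$RESULT"

# One transfer at the end of the job. In a real job this would be
# something like: xrdcp "$RESULT" root://eosuser.cern.ch//eos/user/...
DEST=$(mktemp -d)               # stand-in for the final destination
cp "$RESULT" "$DEST/result.txt"
echo "stored result in $DEST/result.txt"
```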
condor_submit -spool
• You may not want condor to create files in your shared filesystem, particularly if you are submitting tens of thousands of jobs
• condor_submit -spool transfers files to the Schedd
• Important notes:
• This makes the system asynchronous: to get any files back you need to run condor_transfer_data
• The spool on the Schedd is limited!
• Best practice for this mode: spool, but write data out to its end location within the job; use spool only for stdout/err
Note on AFS & EOS
• The shared filesystem is used a lot for batch jobs
• Current best practices:
• AFS, EOS FUSE and EOS via xrdcp are all available on the worker node
• Between the submit node and the Schedd, only AFS is currently supported:
• No exe, log, stdout or err in EOS in your submit file
• With all network filesystems, it is best to write at the end of the job, not to do constant I/O while the job is running
• AFS is supported for as long as it's available
• EOS FUSE will be supported when it is performant