TRANSCRIPT
CERN Batch Service: HTCondor
8/15/2019 Document reference 2
Agenda
• Batch Service
• What is HTCondor?
• Job Submission
• Multiple Jobs & Requirements
• File Transfer
Batch Service
• IT-CM-IS Mandate: "Provide high-level compute services to the CERN Tier-0 and WLCG"
• HTCondor: our production batch service.
• Service used for both “grid” and “local” submission
• Local means open to all CERN users: Kerberos, shared filesystem, managed submission nodes
• ~218k cores in HTCondor
• Over a million jobs a day
• Service Element: Batch Service
What is HTCondor?
Part of the content adapted from: “An introduction to using HTCondor”
by Christina Koch, HTCondor Week 2016 & HTCondor Week 2018
What is HTCondor?
• Open Source batch system developed at the CHTC at the University of Wisconsin
• “High Throughput Computing”
• Long history in HEP and elsewhere (including previously at CERN)
• Used extensively in OSG, and in things like the CMS global pool (200k+ cores)
• System of symmetric matching of job requests to resources using ClassAds of job requirements and machine resources
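As a sketch of what this symmetric matching means, a job ad can carry an expression constraining machines while a machine ad carries one constraining jobs. The attribute values below are illustrative, not the CERN pool's actual policy:

```
# Job ClassAd fragment: the job's requirements on a machine
Requirements = (TARGET.OpSysAndVer == "CentOS7") && (TARGET.Memory >= RequestMemory)

# Machine ClassAd fragment: the machine's requirements on jobs
# (e.g. only accept jobs asking for at most 4 CPUs)
START = (TARGET.RequestCpus <= 4)
```

A match is made only when both expressions evaluate to true against the other side's ad.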
HTCondor elements
[Diagram: Submit Side | Broker | Execute Side]
• Submit side: Schedds hold the submitted jobs; the broker pulls the list of jobs from them.
• Broker: the Collector receives machine properties (ClassAds) from the execute side; the Negotiator matches jobs & machines.
• Execute side: Startds run the jobs; after a match, jobs are sent to the reserved slot.
Execute Side
• Slot: 1 CPU / 2 GB RAM / 20 GB disk
• CPU / memory requests are scaled to reflect the slot ratio: ask for 2 CPUs, get 4 GB RAM
• Mostly CentOS7 at this point
• CentOS8 is in the works, but CentOS7 will likely remain the platform for the next run
• Docker & Singularity are available for containers
Jobs
• A single computing task is called a “job”
• The three main pieces of a job are the input, executable and output
Job Example
[Diagram: compare_states takes wi.dat and us.dat as input and produces wi.dat.out]
$ compare_states wi.dat us.dat wi.dat.out
• The executable must be runnable from the command line without any interactive input
Job Translation
• Submit file: communicates everything about your job(s) to HTCondor
• The main goal of this training is to show you how to properly represent your job in a submit file
executable = compare_states
arguments = wi.dat us.dat wi.dat.out
should_transfer_files = YES
transfer_input_files = us.dat, wi.dat
when_to_transfer_output = ON_EXIT
log = job.log
output = job.out
error = job.err
request_cpus = 1
request_disk = 20MB
request_memory = 20MB
queue 1
CERN HTCondor Service
HTCondor Pool – "CERN Condor Share"
[Diagram: local users authenticate with Kerberos, typically from lxplus.cern.ch, and submit via local Schedds (bigbirdXY.cern.ch); grid submission authenticates with grid certificates via CE Schedds (ce5XY.cern.ch); the Central Manager (tweetybirdXY.cern.ch) matches jobs to workers of different flavours (SLC6 / Mix CC7 / SLC6 Short)]
• Different flavours, same config: afs, cvmfs, eos, root, …
• "It's like lxplus"
Ex. 1: Job Submission: submit file
lxplus ~$ vi ex1.sub
universe = vanilla
executable = ex1.sh
arguments = "training 2018"
output = output/ex1.out
error = error/ex1.err
log = log/ex1.log
queue
• universe: an HTCondor execution environment. Vanilla is the default and should cover 90% of cases.
• executable: the program or script to run.
• arguments: any options passed to the executable on the command line.
• output/error: capture stdout & stderr.
• log: file created by HTCondor to track job progress.
• queue: keyword indicating "create a job".
Ex. 1: Job Submission: script
lxplus ~$ vi ex1.sh
#!/bin/sh
echo 'Date: ' $(date)
echo 'Host: ' $(hostname)
echo 'System: ' $(uname -spo)
echo 'Home: ' $HOME
echo 'Workdir: ' $PWD
echo 'Path: ' $PATH
echo "Program: $0"
echo "Args: $*"
• The shebang (#!) is mandatory when submitting script files in HTCondor: "#!/bin/sh", "#!/bin/bash", "#!/bin/env python"
• A malformed or invalid shebang is silently ignored and no error is reported (yet)
lxplus ~$ chmod +x ex1.sh
Ex. 1: Job Submission
lxplus ~$ condor_submit ex1.sub
Submitting job(s).
1 job(s) submitted to cluster 162.
universe = vanilla
executable = ex1.sh
arguments = "training 2018"
output = output/ex1.out
error = error/ex1.err
log = log/ex1.log
queue
To submit a job/jobs
condor_submit <submit_file>
To monitor submitted jobs:
condor_q
lxplus ~$ condor_q
-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fernandl CMD: ex1.sh 11/19 20:49 _ _ 1 1 162.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
More about condor_q
• By default condor_q shows:
• The user's jobs only
• Jobs summarized in batches: same cluster, same executable, or same batch name
lxplus ~$ condor_q
-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fernandl CMD: /bin/hostname 11/19 20:49 _ _ 1 1 162.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
JobId = ClusterId.ProcId
More about condor_q
• To see individual job information, use: condor_q -nobatch
• We will use the -nobatch option in the following slides to see extra detail about what is happening with a job
lxplus ~$ condor_q -nobatch
-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:32
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
162.0 fernandl 11/19 20:49 0+00:00:00 I 0 0.0 hostname
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Job States
[Diagram: condor_submit → Idle (I) → Running (R) → Completed (C)]
• Idle → Running: the executable and input are transferred to the execute node
• Running → Completed: the output is transferred back to the submit node
• Idle and Running jobs are in the queue; Completed jobs are leaving the queue
Log File
000 (168.000.000) 11/20 11:34:25 Job submitted from host:
<137.138.120.138:9618?addrs=137.138.120.138-9618&noUDP&sock=1069_d2d4_3>
...
001 (168.000.000) 11/20 11:37:26 Job executing on host:
<188.185.217.222:9618?addrs=188.185.217.222-9618+[--1]-9618&noUDP&sock=3285_211b_3>
...
006 (168.000.000) 11/20 11:37:30 Image size of job updated: 15
0 - MemoryUsage of job (MB)
0 - ResidentSetSize of job (KB)
...
005 (168.000.000) 11/20 11:37:30 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
19 - Run Bytes Sent By Job
15768 - Run Bytes Received By Job
19 - Total Bytes Sent By Job
15768 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 31 15 1841176
Memory (MB) : 0 2000 2000
...
Class Ad & Matchmaking
The Central Manager
• HTCondor matches jobs with computers via a “central manager”
[Diagram: one submit machine and several execute machines, coordinated by the central manager]
Class Ads
• HTCondor stores a list of information about each job and each computer.
• This information is stored as a “Class Ad”
• Class Ads have the format: AttributeName = value
• value can be a Boolean, a number or a string
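For instance, one attribute of each type might look like this (the names and values are illustrative, not taken from a real ad):

```
IsDedicated = true       # Boolean
RequestCpus = 2          # number
Owner = "alice"          # string
```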
Job Class Ads
RequestCpus = 1
Err = "job.err"
WhenToTransferOutput = "ON_EXIT"
TargetType = "Machine"
Cmd = "/afs/cern.ch/user/f/fernandl/condor/exe"
Arguments = "x y z"
JobUniverse = 5
Iwd = "/afs/cern.ch/user/f/fernandl/condor"
RequestDisk = 20480
NumJobStarts = 0
WantRemoteIO = true
OnExitRemove = true
MyType = "Job"
Out = "job.out"
UserLog = "/afs/cern.ch/user/f/fernandl/condor/job.log"
RequestMemory = 20
...
(The Job ClassAd above = this submit file + HTCondor configuration*)
executable = exe
arguments = "x y z"
log = job.log
output = job.out
error = job.err
queue 1
Machine Class Ads
HasFileTransfer = true
DynamicSlot = true
TotalSlotDisk = 4300218.0
TargetType = "Job"
TotalSlotMemory = 2048
Mips = 17902
Memory = 2048
UtsnameSysname = "Linux"
MAX_PREEMPT = ( 3600 * 72 )
Requirements = ( START ) && (
IsValidCheckpointPlatform ) && (
WithinResourceLimits )
OpSysMajorVer = 6
TotalMemory = 9889
HasGluster = true
OpSysName = "SL"
HasDocker = true
...
(The Machine ClassAd above = machine properties + HTCondor configuration)
Job Matching
• On a regular basis, the central manager reviews Job and Machine Class Ads and matches jobs to computers
[Diagram: the central manager matches jobs on the submit machine to execute machines]
Job Execution
• After the central manager makes the match, the submit and execute points communicate directly
[Diagram: after the match, the submit machine talks directly to the execute machine]
Class Ads for People
• Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators
Finding Job Attributes
• Use the "long" option for condor_q: condor_q -l <JobId>
$ condor_q -l 128.0
Arguments = ""
Cmd = "/bin/hostname"
Err = "error/hostname.err"
Iwd = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs"
JobUniverse = 5
OnExitRemove = true
Out = "output/hostname.out"
RequestMemory = 2000
Requirements = ( TARGET.Hostgroup =?= "bi/condor/gridworker/share/mixed" ||
TARGET.Hostgroup =?= "bi/condor/gridworker/shareshort" || TARGET.Hostgroup =?=
"bi/condor/gridworker/share/singularity" || TARGET.Hostgroup =?=
"bi/condor/gridworker/sharelong" ) && VanillaRequirements
TargetType = "Machine"
UserLog = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs/log/hostname.log"
WantRemoteIO = true
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
...
Resource Request
• Jobs use a part of the computer, not the whole thing.
• It is important to size job requirements appropriately: memory, CPUs and disk.
• CERN HTCondor defaults:
• 1 CPU
• 2 GB RAM
• 20 GB disk
[Diagram: your request as a slice of the whole computer]
Resource Request (II)
• Even if the system sets default CPU, memory and disk requests, they may be too small.
• It is important to run the job and use the information from the log to request the right amount of resources:
• Requesting too little causes problems for your jobs and others'; jobs might be held by HTCondor or killed by the system.
• Requesting too much: jobs will match fewer slots and will waste resources.
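For example, after checking the usage block that the job log reports, the requests can be set explicitly in the submit file. The values here are illustrative, not recommendations:

```
request_cpus   = 1
request_memory = 4GB
request_disk   = 10GB
```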
Time to start running
• As we have seen, jobs don’t start to run immediately after the submission.
• Many factors are involved:
• Negotiation cycle: the central managers don't perform matchmaking continuously; it is an expensive operation (~5 min).
• User priority: user priority is dynamic and recalculated according to usage.
• Availability of resources: there are many worker flavours, and the machines matching your job requirements might be busy.
• More info: BatchDocs (Fairshare) & Manual (User Priorities)
Ex. 4: Multiple Jobs (queue)
lxplus ~$ vi ex4.sub
universe = vanilla
executable = ex4.sh
arguments = $(ClusterId) $(JobId)
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
queue 5
• queue: controls how many instances of the job are submitted (default 1). It supports dynamic input.
• Pre-defined macros: we can use the $(ClusterId) and $(ProcId) variables to provide unique values to the job files.
Ex. 5: Multiple Jobs (queue)
lxplus ~$ vi ex5.sub
universe = vanilla
executable = $(filename)
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
queue filename matching files ex5/*.sh
• The resulting jobs point to different executables, but they will belong to the same ClusterId with different ProcIds.
• The use of wildcard patterns in queue allows us to submit several different jobs at once.
Queue Statement Comparison
• multiple queue statements: not recommended. Can be useful when submitting job batches where a single (non-file/argument) characteristic is changing.
• matching .. pattern: natural nested looping, minimal programming; use the optional "files" and "dirs" keywords to match only files or only directories. Requires good naming conventions.
• in .. list: supports multiple variables; all information is contained in a single file; reproducible. Harder to automate submit file creation.
• from .. file: supports multiple variables; highly modular (easy to use one submit file for many job batches); reproducible. An additional file is needed.
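The data-driven queue variants compared above can be sketched as follows; the file names, variable names and values are hypothetical:

```
# in .. list: iterate a variable over an explicit list
queue arg in (alpha, beta, gamma)

# from .. file: read one or more variables per row from an external file
queue input,seed from params.txt

# matching .. pattern: one job per file matching the glob pattern
queue input matching files input/*.dat
```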
CERNism: JobFlavour
• Set of pre-defined run times to bucket jobs easily (default: espresso)
universe = vanilla
executable = training.sh
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
+JobFlavour = "microcentury"
queue
espresso = 20 min
microcentury = 1 hour
longlunch = 2 hours
workday = 8 hours
tomorrow = 1 day
testmatch = 3 days
nextweek = 1 week
Exceeding MaxRuntime
• What happens if we set MaxRuntime to less than the job needs in order to complete?
universe = vanilla
executable = training.sh
output = output/$(ClusterId).$(ProcId).out
error = error/$(ClusterId).$(ProcId).err
log = log/$(ClusterId).log
+MaxRuntime = 120
queue
The job will be removed by the system.
[fprotops@lxplus088 training]$ condor_q -af MaxRuntime
120
[fprotops@lxplus088 training]$ condor_history -l <job id> | grep -i remove
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to wall time exceeded allowed max."
Debug (I): condor_status
• It displays the status of machines in the pool:
$ condor_status -avail
$ condor_status -schedd
$ condor_status <hostname>
$ condor_status -l <hostname>
• It supports filtering based on ClassAds:
[fprotops@lxplus071 ssh]$ condor_status -const 'OpSysAndVer =?= "CentOS7"'
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
[email protected] LINUX X86_64 Claimed Busy
Debug (II): condor_ssh_to_job
• Creates an ssh session to a running job:
$ condor_ssh_to_job <job id>
$ condor_ssh_to_job -auto-retry <job id>
• This will get us access to the contents of our sandbox in the worker node: output, temp files, credentials…
[fprotops@lxplus071 ssh]$ condor_ssh_to_job -auto-retry <job id>
[email protected]: Rejecting request, because the job execution environment is not yet ready.
Waiting for job to start...
Welcome to [email protected]!
Your condor job is running with pid(s) 18694.
[fprotops@b626c4b230 dir_18443]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout  fprotops.cc  test.txt  tmp  var
Debug (III): condor_tail
• It displays the tail of the job files:
$ condor_tail -follow <job id>
• The output can be controlled via flags:
$ condor_tail -follow -no-stdout -stderr <job id>
[fprotops@lxplus052 training]$ condor_tail -follow <job id>
Welcome to the HTCondor training!
Debug (IV): Hold & Removed
• Apart from Idle, Running and Completed, HTCondor defines two more states: Hold and Removed
• Jobs can get into the Hold or Removed state either by the user or by the system
• Related commands:
$ condor_q -hold
$ condor_q -af HoldReason <job_id>
$ condor_hold
$ condor_release
$ condor_rm
$ condor_history -limit 1 <job_id> -af RemoveReason
[fprotops@lxplus052 training]$ condor_q -af HoldReason 155.0
via condor_hold (by user fprotops)
File Transfer
• A job will need input and output data.
• There are several ways to get data in or out of the batch system, so we need to know a little about the trade-offs.
• Do you want to use a shared filesystem? Do you want to have condor transfer data for you? Should you handle input or output in the job payload itself?
15/08/2019 condor data transfer 41
Infrastructure
Adding Input files
• In order to add input files, we just need to add "transfer_input_files" to our submit file
• It's a list of files to take from the working directory and send to the job sandbox
• This example produces one output file, "merge.out"
executable = merge.sh
arguments = a.txt b.txt merge.out
transfer_input_files = a.txt, b.txt
log = job.log
output = job.out
error = job.err
+JobFlavour = "longlunch"
queue 1
Transferring output back
• By default condor will transfer everything in your sandbox
• To transfer back only the files you need, use transfer_output_files
• Adding a file to transfer_output_files also adds it to the list that "condor_tail" can see
executable = merge.sh
arguments = a.txt b.txt merge.out
transfer_input_files = a.txt, b.txt
transfer_output_files = merge.out
log = job.log
output = job.out
error = job.err
+JobFlavour = "longlunch"
queue 1
Important considerations
• Even when using a shared filesystem, files are transferred to a scratch space on the workers, the "sandbox".
• Remember the impact on the filesystem! The most efficient use of network filesystems is typically to write once, at the end of a job.
• You have 20GB of sandbox per CPU.
• There are limits to the amount of data that we allow to be transferred using condor file transfer: the limit is currently 1GB per job.
• The job itself can do file transfer, both input and output.
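The "write once, at the end" pattern can be sketched as a job script that does all its work in the local sandbox and performs a single copy at the end. Here plain cp stands in for the real transfer tool (e.g. xrdcp to EOS), and all paths are hypothetical:

```shell
#!/bin/sh
# Do all the work in a local scratch directory (the sandbox in a real job).
WORKDIR=$(mktemp -d)
RESULT="$WORKDIR/result.txt"

# ... the payload: write output locally, not to the network filesystem ...
echo "analysis output" > "$RESULT"

# One transfer at the end of the job. In a real job this would be
# something like: xrdcp "$RESULT" root://eosuser.cern.ch//eos/user/...
DEST=$(mktemp -d)               # stand-in for the final destination
cp "$RESULT" "$DEST/result.txt"
echo "stored result in $DEST/result.txt"
```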
condor_submit -spool
• You may not want condor to create files in your shared filesystem, particularly if you are submitting tens of thousands of jobs
• condor_submit -spool transfers files to the Schedd
• Important notes:
• This makes the system asynchronous: to get any files back you need to run condor_transfer_data
• The spool on the Schedd is limited!
• Best practice for this mode: spool, but write data out to its end location within the job; use spool only for stdout/err
Note on AFS & EOS
• The shared filesystem is used a lot for batch jobs
• Current best practices:
• AFS, EOS FUSE and EOS via xrdcp are all available on the worker node
• Between the submit node and the Schedd, only AFS is currently supported:
• No exe, log, stdout or err in EOS in your submit file
• With all network filesystems, it is best to write at the end of the job, not to do constant I/O while the job is running
• AFS is supported for as long as it's available
• EOS FUSE will be supported when it is performant