CERN Batch Service: HTCondor

Page 1: CERN Batch Service: HTCondor

Page 2: CERN Batch Service: HTCondor

CERN Batch Service: HTCondor


Page 3: CERN Batch Service: HTCondor

Agenda

• Batch Service

• What is HTCondor?

• Job Submission

• Multiple Jobs & Requirements

• File Transfer


Page 4: CERN Batch Service: HTCondor

Batch Service

• IT-CM-IS Mandate: "Provide high-level compute services to the CERN Tier-0 and WLCG"

• HTCondor: our production batch service.

• Service used for both “grid” and “local” submission

• Local means open to all CERN users, kerberos, shared filesystem, managed submission nodes

• ~218k cores in HTCondor

• Over a million jobs a day

• Service Element: Batch Service

Page 5: CERN Batch Service: HTCondor

What is HTCondor?

Part of the content adapted from: “An introduction to using HTCondor”

by Christina Koch, HTCondor Week 2016 & HTCondor Week 2018

Page 6: CERN Batch Service: HTCondor

What is HTCondor?

• Open Source batch system developed at the CHTC at the University of Wisconsin

• “High Throughput Computing”

• Long history in HEP and elsewhere (including previously at CERN)

• Used extensively in OSG, and in things like the CMS global pool (200k+ cores)

• System of symmetric matching of job requests to resources using ClassAds of job requirements and machine resources

Page 7: CERN Batch Service: HTCondor

HTCondor elements

[Diagram: Submit Side / Broker / Execute Side. Schedds sit on the submit side; a Collector and Negotiator act as the broker; Startds sit on the execute side. Startds send machine properties (ClassAds) to the Collector, the Negotiator pulls the list of jobs and matches jobs & machines, and jobs are then sent to the reserved slot.]

Page 8: CERN Batch Service: HTCondor

Execute Side

• Slot: 1 CPU / 2 GB RAM / 20 GB disk

• CPU / memory will be scaled in requests to reflect the slot

• Ask for 2 CPUs, get 4 GB RAM

• Mostly CentOS7 at this point

• CentOS8 in the works, but likely CentOS7 platform for the next run

• Docker & Singularity are available for containers

Page 9: CERN Batch Service: HTCondor

Jobs

• A single computing task is called a "job"

• The three main pieces of a job are the input, the executable and the output

Page 10: CERN Batch Service: HTCondor

Job Example

[Diagram: the inputs wi.dat and us.dat go into the executable compare_states, which produces the output wi.dat.out.]

$ compare_states wi.dat us.dat wi.dat.out

• The executable must be runnable from the command line without any interactive input

Page 11: CERN Batch Service: HTCondor

Job Translation

• Submit file: communicates everything about your job(s) to HTCondor

• The main goal of this training is to show you how to properly represent your job in a submit file

executable = compare_states

arguments = wi.dat us.dat wi.dat.out

should_transfer_files = YES

transfer_input_files = us.dat, wi.dat

when_to_transfer_output = ON_EXIT

log = job.log

output = job.out

error = job.err

request_cpus = 1

request_disk = 20MB

request_memory = 20MB

queue 1

Page 12: CERN Batch Service: HTCondor

CERN HTCondor Service

HTCondor Pool: "CERN Condor Share"

[Diagram: Local users (typically from lxplus.cern.ch) authenticate with Kerberos or a grid certificate and submit to local Schedds (bigbirdXY.cern.ch); GRID submission authenticates with a grid certificate and arrives via CE Schedds (ce5XY.cern.ch). The Central Managers (tweetybirdXY.cern.ch) match the jobs to the worker nodes. The workers come in different flavours (SLC6 / Mix, CC7, SLC6 / Short) but share the same config: afs, cvmfs, eos, root, ... "It's like lxplus".]

Page 13: CERN Batch Service: HTCondor

Ex. 1: Job Submission: submit file

lxplus ~$ vi ex1.sub

universe = vanilla

executable = ex1.sh

arguments = "training 2018"

output = output/ex1.out

error = error/ex1.err

log = log/ex1.log

queue

universe: an HTCondor execution environment. Vanilla is the default and it should cover 90% of cases.

executable: the script or binary to run.

arguments: any options passed to the executable on the command line.

output/error: capture stdout & stderr.

log: file created by HTCondor to track job progress.

queue: keyword indicating "create a job".

Page 14: CERN Batch Service: HTCondor

Ex. 1: Job Submission: script

lxplus ~$ vi ex1.sh

#!/bin/sh

echo 'Date: ' $(date)

echo 'Host: ' $(hostname)

echo 'System: ' $(uname -spo)

echo 'Home: ' $HOME

echo 'Workdir: ' $PWD

echo 'Path: ' $PATH

echo "Program: $0"

echo "Args: $*"

The shebang (#!) is mandatory when submitting script files in HTCondor:

"#!/bin/sh"   "#!/bin/bash"   "#!/bin/env python"

A malformed or invalid shebang is silently ignored and no error is reported (yet).

lxplus ~$ chmod +x ex1.sh

Page 15: CERN Batch Service: HTCondor

Ex. 1: Job Submission

lxplus ~$ condor_submit ex1.sub

Submitting job(s).

1 job(s) submitted to cluster 162.

universe = vanilla

executable = ex1.sh

arguments = "training 2018"

output = output/ex1.out

error = error/ex1.err

log = log/ex1.log

queue

To submit a job/jobs

condor_submit <submit_file>

To monitor submitted jobs:

condor_q

lxplus ~$ condor_q

-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42

OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS

fernandl CMD: ex1.sh 11/19 20:49 _ _ 1 1 162.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Page 16: CERN Batch Service: HTCondor

More about condor_q

• By default condor_q shows:

• the user's jobs only

• jobs summarized in batches: same cluster, same executable or same batch name

lxplus ~$ condor_q

-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42

OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS

fernandl CMD: /bin/hostname 11/19 20:49 _ _ 1 1 162.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

JobId = ClusterId.ProcId
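
condor_q also accepts a ClusterId or a full JobId as an argument; for example, reusing the cluster ID from the output above (purely illustrative):

lxplus ~$ condor_q 162        (all jobs in cluster 162)
lxplus ~$ condor_q 162.0      (only job 162.0)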

Page 17: CERN Batch Service: HTCondor

More about condor_q

• To see individual job information, use: condor_q -nobatch

• We will use the -nobatch option in the following slides to see extra detail about what is happening with a job

lxplus ~$ condor_q -nobatch

-- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:32

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

162.0 fernandl 11/19 20:49 0+00:00:00 I 0 0.0 hostname

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
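
condor_q can also filter on job ClassAd expressions with -constraint; a small sketch (JobStatus == 2 means Running):

lxplus ~$ condor_q -nobatch -constraint 'JobStatus == 2'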

Page 18: CERN Batch Service: HTCondor

Job States

condor_submit → Idle (I) → Running (R) → Completed (C)

(The executable and input are transferred to the execute node when the job goes from Idle to Running; the output is transferred back to the submit node when it goes from Running to Completed. Idle and Running jobs are in the queue; Completed jobs are leaving the queue.)

Page 19: CERN Batch Service: HTCondor

Log File

000 (168.000.000) 11/20 11:34:25 Job submitted from host:

<137.138.120.138:9618?addrs=137.138.120.138-9618&noUDP&sock=1069_d2d4_3>

...

001 (168.000.000) 11/20 11:37:26 Job executing on host:

<188.185.217.222:9618?addrs=188.185.217.222-9618+[--1]-9618&noUDP&sock=3285_211b_3>

...

006 (168.000.000) 11/20 11:37:30 Image size of job updated: 15

0 - MemoryUsage of job (MB)

0 - ResidentSetSize of job (KB)

...

005 (168.000.000) 11/20 11:37:30 Job terminated.

(1) Normal termination (return value 0)

Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

19 - Run Bytes Sent By Job

15768 - Run Bytes Received By Job

19 - Total Bytes Sent By Job

15768 - Total Bytes Received By Job

Partitionable Resources : Usage Request Allocated

Cpus : 1 1

Disk (KB) : 31 15 1841176

Memory (MB) : 0 2000 2000

...
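
As a side note, the job log can also be used to wait for a job from the command line; a minimal sketch (log file name as in the earlier submit file examples, timeout value illustrative):

lxplus ~$ condor_wait job.log               (blocks until every job in the log has finished)
lxplus ~$ condor_wait -wait 3600 job.log    (gives up after one hour)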

Page 20: CERN Batch Service: HTCondor

The Central Manager

Class Ad & Matchmaking

Page 21: CERN Batch Service: HTCondor

The Central Manager

• HTCondor matches jobs with computers via a “central manager”

[Diagram: one submit node, several execute nodes, and the central manager.]

Page 22: CERN Batch Service: HTCondor

Class Ads

• HTCondor stores a list of information about each job and each computer.

• This information is stored as a “Class Ad”

• Class Ads have the format: AttributeName = value

• value can be Boolean, number or string
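
For example, using attribute names that appear on the next slides:

RequestCpus = 1           (number)
Out = "job.out"           (string)
OnExitRemove = true       (Boolean)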

Page 23: CERN Batch Service: HTCondor

Job Class Ads

RequestCpus = 1
Err = "job.err"
WhenToTransferOutput = "ON_EXIT"
TargetType = "Machine"
Cmd = "/afs/cern.ch/user/f/fernandl/condor/exe"
Arguments = "x y z"
JobUniverse = 5
Iwd = "/afs/cern.ch/user/f/fernandl/condor"
RequestDisk = 20480
NumJobStarts = 0
WantRemoteIO = true
OnExitRemove = true
MyType = "Job"
Out = "job.out"
UserLog = "/afs/cern.ch/user/f/fernandl/condor/job.log"
RequestMemory = 20
...

[The Job ClassAd above is the result of the submit file below plus the HTCondor configuration:]

executable = exe
arguments = "x y z"
log = job.log
output = job.out
error = job.err
queue 1

Page 24: CERN Batch Service: HTCondor

Machine Class Ads

HasFileTransfer = true

DynamicSlot = true

TotalSlotDisk = 4300218.0

TargetType = "Job"

TotalSlotMemory = 2048

Mips = 17902

Memory = 2048

UtsnameSysname = "Linux"

MAX_PREEMPT = ( 3600 * 72 )

Requirements = ( START ) && (

IsValidCheckpointPlatform ) && (

WithinResourceLimits )

OpSysMajorVer = 6

TotalMemory = 9889

HasGluster = true

OpSysName = "SL"

HasDocker = true

...

[machine resources + HTCondor configuration = Machine ClassAd]

Page 25: CERN Batch Service: HTCondor

Job Matching

• On a regular basis, the central manager reviews Job and Machine Class Ads and matches jobs to computers

[Diagram: the central manager matches jobs from the submit node to the execute nodes.]

Page 26: CERN Batch Service: HTCondor

Job Execution

• After the central manager makes the match, the submit and execute points communicate directly

[Diagram: after the match, the submit node and the execute node communicate directly.]

Page 27: CERN Batch Service: HTCondor

Class Ads for People

• Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators

Page 28: CERN Batch Service: HTCondor

Finding Job Attributes

• Use the "long" option for condor_q: condor_q -l <JobId>

$ condor_q -l 128.0

Arguments = ""

Cmd = "/bin/hostname"

Err = "error/hostname.err"

Iwd = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs"

JobUniverse = 5

OnExitRemove = true

Out = "output/hostname.out"

RequestMemory = 2000

Requirements = ( TARGET.Hostgroup =?= "bi/condor/gridworker/share/mixed" ||

TARGET.Hostgroup =?= "bi/condor/gridworker/shareshort" || TARGET.Hostgroup =?=

"bi/condor/gridworker/share/singularity" || TARGET.Hostgroup =?=

"bi/condor/gridworker/sharelong" ) && VanillaRequirements

TargetType = "Machine"

UserLog = "/afs/cern.ch/user/f/fernandl/temp/htcondor-

training/module_single_jobs/log/hostname.log"

WantRemoteIO = true

WhenToTransferOutput = "ON_EXIT_OR_EVICT"

...

Page 29: CERN Batch Service: HTCondor

Resource Request

• Jobs use a part of the computer, not the whole thing.

• It is important to size job requirements appropriately: memory, CPUs and disk.

• CERN HTCondor defaults (see the example submit lines below):

• 1 CPU

• 2 GB RAM

• 20 GB disk

[Diagram: your request is a small fraction of the whole computer.]
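
To ask for more than the defaults, add the standard request_* commands to the submit file; the values here are purely illustrative:

request_cpus   = 4
request_memory = 8 GB
request_disk   = 40 GB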

Page 30: CERN Batch Service: HTCondor

Resource Request (II)

• Even if the system sets default CPU, memory and disk requests, they may be too small.

• It is important to run the job and use the information from the log to request the right amount of resources:

• Requesting too little: causes problems for your jobs and for other jobs; jobs might be held by HTCondor or killed by the system.

• Requesting too much: jobs will match fewer slots and will waste resources.

Page 31: CERN Batch Service: HTCondor

Time to start running

• As we have seen, jobs don't start to run immediately after submission.

• Many factors are involved:

• Negotiation cycle: the central managers don't perform matchmaking continuously; it is an expensive operation (~5 min).

• User priority: user priority is dynamic and is recalculated according to usage.

• Availability of resources: there are many worker flavours, and the machines matching your job requirements might be busy.

• More info: BatchDocs (Fairshare) & Manual (User Priorities); see also the command sketched below.
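
HTCondor ships a command for inspecting user priorities; whether it is exposed on the CERN schedds is not covered here, but a minimal sketch is (a lower effective priority value means a better priority):

lxplus ~$ condor_userprio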

Page 32: CERN Batch Service: HTCondor

Ex. 4: Multiple Jobs (queue)

lxplus ~$ vi ex4.sub

universe = vanilla

executable = ex4.sh

arguments = $(ClusterId) $(JobId)

output = output/$(ClusterId).$(ProcId).out

error = error/$(ClusterId).$(ProcId).err

log = log/$(ClusterId).log

queue 5

queue: controls how many instances of the job are submitted (default 1). It supports dynamic input.

Pre-defined macros: we can use the $(ClusterId) and $(ProcId) variables to provide unique values to the job files.

Page 33: CERN Batch Service: HTCondor

Ex. 5: Multiple Jobs (queue)

lxplus ~$ vi ex5.sub

universe = vanilla

executable = $(filename)

output = output/$(ClusterId).$(ProcId).out

error = error/$(ClusterId).$(ProcId).err

log = log/$(ClusterId).log

queue filename matching files ex5/*.sh

The resulting jobs point to different executables, but they will belong to the same ClusterId with different ProcIds.

Using file-matching patterns in queue allows us to submit several different jobs at once.

Page 34: CERN Batch Service: HTCondor

Queue Statement Comparison

• multiple queue statements: not recommended; can be useful when submitting job batches where a single (non-file/argument) characteristic is changing.

• matching .. pattern: natural nested looping, minimal programming; use the optional "files" and "dirs" keywords to only match files or directories. Requires good naming conventions.

• in .. list: supports multiple variables, all information contained in a single file, reproducible. Harder to automate submit file creation.

• from .. file: supports multiple variables, highly modular (easy to use one submit file for many job batches), reproducible. Additional file needed. (See the sketch below.)
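
As an illustration of the "from .. file" form, a sketch with hypothetical file names and variables:

lxplus ~$ vi ex6.sub

universe   = vanilla
executable = analyze.sh
arguments  = $(state) $(year)
output     = output/$(ClusterId).$(ProcId).out
error      = error/$(ClusterId).$(ProcId).err
log        = log/$(ClusterId).log
queue state, year from states.txt

lxplus ~$ cat states.txt
wi, 2017
wi, 2018
ca, 2017
ca, 2018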

Page 35: CERN Batch Service: HTCondor

CERNism: JobFlavour

• A set of pre-defined run times to bucket jobs easily (default: espresso)

universe = vanilla

executable = training.sh

output = output/$(ClusterId).$(ProcId).out

error = error/$(ClusterId).$(ProcId).err

Log = log/$(ClusterId).log

+JobFlavour = "microcentury"

queue

espresso = 20 min

microcentury = 1 hour

longlunch = 2 hours

workday = 8 hours

tomorrow = 1 day

testmatch = 3 days

nextweek = 1 week

Page 36: CERN Batch Service: HTCondor

Exceeding MaxRuntime

• What happens if we set MaxRuntime to less than the job needs in order to finish?

universe = vanilla

executable = training.sh

output = output/$(ClusterId).$(ProcId).out

error = error/$(ClusterId).$(ProcId).err

log = log/$(ClusterId).log

+MaxRuntime = 120

queue

The job will be removed by the system.

[fprotops@lxplus088 training]$ condor_q -af MaxRuntime
120

[fprotops@lxplus088 training]$ condor_history -l <job id> | grep -i remove
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to wall time exceeded allowed max."

Page 37: CERN Batch Service: HTCondor

Debug (I): condor_status

• It displays the status of machines in the pool:

$ condor_status -avail
$ condor_status -schedd
$ condor_status <hostname>
$ condor_status -l <hostname>

• It supports filtering based on ClassAd expressions:

[fprotops@lxplus071 ssh]$ condor_status -const 'OpSysAndVer =?= "CentOS7"'
[output scrubbed in this transcript: one line per matching slot, each showing <slot>@<worker>.cern.ch  LINUX  X86_64  Claimed  Busy  <load>, e.g. 11.510]

Page 38: CERN Batch Service: HTCondor

Debug (II): condor_ssh_to_job

• Creates an ssh session to a running job:

$ condor_ssh_to_job <job id>
$ condor_ssh_to_job -auto-retry <job id>

• This gives us access to the contents of our sandbox on the worker node: output, temp files, credentials…

[fprotops@lxplus071 ssh]$ condor_ssh_to_job -auto-retry <job id>
<worker>: Rejecting request, because the job execution environment is not yet ready.
Waiting for job to start...
Welcome to <slot>@<worker>!
Your condor job is running with pid(s) 18694.
[fprotops@b626c4b230 dir_18443]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout  fprotops.cc  test.txt  tmp  var

Page 39: CERN Batch Service: HTCondor

Debug (III): condor_tail

• It displays the tail of the job files:

$ condor_tail -follow <job id>

• The output can be controlled via flags:

$ condor_tail -follow -no-stdout -stderr <job id>

[fprotops@lxplus052 training]$ condor_tail -follow <job id>
Welcome to the HTCondor training!

Page 40: CERN Batch Service: HTCondor

Debug (IV): Hold & Removed

• Apart from Idle, Running and Completed, HTCondor defines two more states: Hold and Removed

• Jobs can get into Hold or Removed status either by the user or by the system.

• Related commands:

$ condor_q -hold
$ condor_q -af HoldReason <job_id>
$ condor_hold
$ condor_release
$ condor_rm
$ condor_history -limit 1 <job_id> -af RemoveReason

[fprotops@lxplus052 training]$ condor_q -af HoldReason 155.0
via condor_hold (by user fprotops)
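
A typical hold/inspect/release cycle might look like this (job ID illustrative, HoldReason as in the output above):

$ condor_hold 155.0
$ condor_q -af HoldReason 155.0
via condor_hold (by user fprotops)
$ condor_release 155.0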

Page 41: CERN Batch Service: HTCondor

File Transfer

• A job will need input and output data.

• There are several ways to get data in or out of the batch system, so we need to know a little about the trade-offs.

• Do you want to use a shared filesystem? Do you want to have condor transfer data for you? Should you do input or output in the job payload itself?

Page 42: CERN Batch Service: HTCondor

Infrastructure

Page 43: CERN Batch Service: HTCondor

Adding Input files

• In order to add input files, we just need to add "transfer_input_files" to our submit file

• It is a list of files to take from the working directory and send to the job sandbox

• This example produces one output file, "merge.out"

executable = merge.sh

arguments = a.txt b.txt merge.out

transfer_input_files = a.txt, b.txt

log = job.log

output = job.out

error = job.err

+JobFlavour = "longlunch"

queue 1

Page 44: CERN Batch Service: HTCondor

Transferring output back

• By default condor will transfer back everything in your sandbox

• To only transfer back the files you need, use transfer_output_files

• Adding a file to transfer_output_files also adds it to the list that condor_tail can see

executable = merge.sh

arguments = a.txt b.txt merge.out

transfer_input_files = a.txt, b.txt

transfer_output_files = merge.out

log = job.log

output = job.out

error = job.err

+JobFlavour = "longlunch"

queue 1

Page 45: CERN Batch Service: HTCondor

Important considerations

• Even when using a shared filesystem, files are transferred to a scratch space on the workers, the "sandbox".

• Remember the impact on the filesystem! The most efficient use of network filesystems is typically to write once, at the end of a job.

• You have 20 GB of sandbox per CPU.

• There are limits to the amount of data that we allow to be transferred using condor file transfer: the limit is currently 1 GB per job.

• The job itself can do file transfer, both input and output.

Page 46: CERN Batch Service: HTCondor

condor_submit -spool

• You may not want condor to create files in your shared filesystem, particularly if you are submitting 10s of 1000s of jobs.

• condor_submit -spool transfers files to the Schedd.

• Important notes:

• This makes the system asynchronous: to get any files back you need to run condor_transfer_data.

• The spool on the Schedd is limited!

• Best practice for this mode: spool, but write data out to its end location within the job, and use the spool only for stdout/err (see the sketch below).
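
A minimal sketch of the spool workflow; the submit file name and cluster ID are illustrative:

lxplus ~$ condor_submit -spool job.sub
Submitting job(s).
1 job(s) submitted to cluster 170.

(later, once the job has completed, fetch the spooled output back from the Schedd)
lxplus ~$ condor_transfer_data 170.0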


Page 47: CERN Batch Service: HTCondor

Note on AFS & EOS

• The shared filesystems are used a lot for batch jobs

• Current best practices:

• AFS, EOS FUSE and EOS via xrdcp are all available on the worker node

• Between the submit node and the Schedd, only AFS is currently supported

• No exe, log, stdout or err in EOS in your submit file

• With all network filesystems it is best to write at the end of the job, not to do constant I/O whilst the job is running (see the sketch below)

• AFS is supported for as long as it is available

• EOS FUSE will be supported when it is performant
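
A sketch of the "write once at the end of the job" pattern inside the job script; the EOS instance and destination path are placeholders:

#!/bin/sh
# do all the work in the local sandbox
./merge.sh a.txt b.txt merge.out

# copy the result out once, at the end of the job
xrdcp -f merge.out root://<eos-instance>.cern.ch//eos/user/<u>/<username>/merge.out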
