using condor an introduction condor week 2008

149
1 http://www.cs.wisc.edu/condor Using Condor An Introduction Condor Week 2008

Upload: wyman

Post on 09-Feb-2016

81 views

Category:

Documents


0 download

DESCRIPTION

Using Condor An Introduction Condor Week 2008. The Condor Project (Established ‘85). Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students. The Condor Project (Established ‘85). - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Using Condor  An Introduction Condor Week 2008

1http://www.cs.wisc.edu/condor

Using Condor An Introduction

Condor Week 2008

Page 2: Using Condor  An Introduction Condor Week 2008

2http://www.cs.wisc.edu/condor

The Condor Project (Established ‘85)

Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students.

Page 3: Using Condor  An Introduction Condor Week 2008

3http://www.cs.wisc.edu/condor

The Condor Project (Established ‘85)Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students who:

face software engineering challenges in a distributed UNIX/Linux/NT environment

are involved in national and international grid collaborations,

actively interact with academic and commercial users,

maintain and support large distributed production environments,

and educate and train students.

Page 4: Using Condor  An Introduction Condor Week 2008

4http://www.cs.wisc.edu/condor

Some free software produced by the Condor

Project› Condor System› VDT› Metronome› ClassAd Library› DAGMan› Hawkeye› GCB

› MW› NeST / LotMan› Stork› And others… all as

open source

Page 5: Using Condor  An Introduction Condor Week 2008

5http://www.cs.wisc.edu/condor

Full featured system› Flexible scheduling policy engine via ClassAds

Preemption, suspension, requirements, preferences, groups, quotas, settable fair-share, system hold…

› Facilities to manage BOTH dedicated CPUs (clusters) and non-dedicated resources (desktops)

› Transparent Checkpoint/Migration for many types of serial jobs

› No shared file-system required› Federate clusters w/ a wide array of Grid Middleware› Workflow management (inter-dependencies)› Support for many job types – serial, parallel, etc.› Fault-tolerant: can survive crashes, network outages, no

single point of failure.› Development APIs: via SOAP / web services, DRMAA (C), Perl

package, GAHP, flexible command-line tools, MW› Platforms: Linux i386/IA64, Windows 2k/XP, MacOS, Solaris,

IRIX, HP-UX, Compaq Tru64, … lots.

Page 6: Using Condor  An Introduction Condor Week 2008

6http://www.cs.wisc.edu/condor

Meet Frieda.

She is a scientist

with a big problem.

Page 7: Using Condor  An Introduction Condor Week 2008

7http://www.cs.wisc.edu/condor

Frieda’s Application …Run a Parameter Sweep of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z 20×10×3 = 600 combinations F takes on the average 6 hours to compute on a

“typical” workstation (total = 600 × 6 = 3600 hours) F requires a “moderate” (256MB) amount of

memory F performs “moderate” I/O - (x,y,z) is 5 MB and

F(x,y,z) is 50 MB

Page 8: Using Condor  An Introduction Condor Week 2008

8http://www.cs.wisc.edu/condor

I have 600simulations to run.

Where can I get help?

Page 9: Using Condor  An Introduction Condor Week 2008

9http://www.cs.wisc.edu/condor

While sharing a beverage with some colleagues, she shares her problem. Somebody asks “Have you tried Condor? It’s free.”

Page 10: Using Condor  An Introduction Condor Week 2008

10http://www.cs.wisc.edu/condor

Getting Condor› Available as a free download from

http://www.cs.wisc.edu/condor› Download Condor for your

operating system Available for most UNIX (including

Linux and Apple’s OS/X) platforms Also for Windows XP / 2003 / Vista

Page 11: Using Condor  An Introduction Condor Week 2008

11http://www.cs.wisc.edu/condor

Condor Releases› Stable / Developer Releases

Version numbering scheme similar to that of the (pre 2.6) Linux kernels …

Major.minor.release• Minor is even (a.b.c): Stable

– Examples: 6.6.3, 6.8.4, 6.8.5– Very stable, mostly bug fixes

• Minor is odd (a.b.c): Developer– New features, may have some bugs– Examples: 6.7.11, 6.9.1, 6.9.2

Page 12: Using Condor  An Introduction Condor Week 2008

12http://www.cs.wisc.edu/condor

Frieda Installs a “Personal Condor” on

her machine…› What do we mean by a “Personal” Condor? Condor on your own workstation No root / administrator access required No system administrator intervention

needed› After installation, Frieda submits her

jobs to her Personal Condor…

Page 13: Using Condor  An Introduction Condor Week 2008

13http://www.cs.wisc.edu/condor

yourworkstation

personalCondor

600 Condorjobs

Page 14: Using Condor  An Introduction Condor Week 2008

14http://www.cs.wisc.edu/condor

Personal Condor?!

What’s the benefit of a Condor “Pool” with just

one user and one machine?

Page 15: Using Condor  An Introduction Condor Week 2008

15http://www.cs.wisc.edu/condor

Your Personal Condor will ...› Keep an eye on your jobs and will keep

you posted on their progress› Implement your policy on the execution

order of the jobs› Keep a log of your job activities› Add fault tolerance to your jobs› Implement your policy on when the jobs

can run on your workstation

Page 16: Using Condor  An Introduction Condor Week 2008

16http://www.cs.wisc.edu/condor

Definitions› Job

The Condor representation of your work› Machine

The Condor representation of computers and that can perform the work

› Match Making Matching a job with a machine

“Resource”

Page 17: Using Condor  An Introduction Condor Week 2008

17http://www.cs.wisc.edu/condor

JobJobs state their requirements and

preferences:I need a Linux/x86 platformI want the machine with the most

memoryI prefer a machine in the chemistry

department

Page 18: Using Condor  An Introduction Condor Week 2008

18http://www.cs.wisc.edu/condor

MachineMachines state their requirements

and preferences:Run jobs only when there is no keyboard

activityI prefer to run Frieda’s jobsI am a machine in the physics

departmentNever run jobs belonging to Dr. Smith

Page 19: Using Condor  An Introduction Condor Week 2008

19http://www.cs.wisc.edu/condor

The Magic of Matchmaking

› Jobs and machines state their requirements and preferences

› Condor matches jobs with machinesbased on requirements and

preferences

Page 20: Using Condor  An Introduction Condor Week 2008

20http://www.cs.wisc.edu/condor

Getting Started:Submitting Jobs to

Condor› Overview: Choose a “Universe” for your job Make your job “batch-ready” Create a submit description file Run condor_submit to put your job in

the queue

Page 21: Using Condor  An Introduction Condor Week 2008

21http://www.cs.wisc.edu/condor

1. Choose the “Universe”› Controls how Condor handles jobs› Choices include:

Vanilla Standard Grid Java Parallel VM

Page 22: Using Condor  An Introduction Condor Week 2008

22http://www.cs.wisc.edu/condor

Using the Vanilla Universe

• The Vanilla Universe:– Allows running almost

any “serial” job– Provides automatic

file transfer, etc.– Like vanilla ice cream

• Can be used in just about any situation

Page 23: Using Condor  An Introduction Condor Week 2008

23http://www.cs.wisc.edu/condor

2. Make your job batch-ready

Must be able to run in the background

• No interactive input

• No GUI/window clicks

• No music ;^)

Page 24: Using Condor  An Introduction Condor Week 2008

24http://www.cs.wisc.edu/condor

Make your job batch-ready (continued)…

Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices

Similar to UNIX shell:•$ ./myprogram <input.txt >output.txt

Page 25: Using Condor  An Introduction Condor Week 2008

25http://www.cs.wisc.edu/condor

3. Create a Submit Description File

› A plain ASCII text file› Condor does not care about file extensions› Tells Condor about your job:

Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)

› Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.

Page 26: Using Condor  An Introduction Condor Week 2008

26http://www.cs.wisc.edu/condor

Simple Submit Description File

# Simple condor_submit input file# (Lines beginning with # are comments)# NOTE: the words on the left side are not# case sensitive, but filenames are!Universe = vanillaExecutable = my_jobOutput = output.txt Queue

Page 27: Using Condor  An Introduction Condor Week 2008

27http://www.cs.wisc.edu/condor

4. Run condor_submit› You give condor_submit the name of

the submit file you have created: condor_submit my_job.submit

› condor_submit: Parses the submit file, checks for errors Creates a “ClassAd” that describes

your job(s) Puts job(s) in the Job Queue

Page 28: Using Condor  An Introduction Condor Week 2008

28http://www.cs.wisc.edu/condor

ClassAd ?› Condor’s internal data

representation Similar to classified ads (as the name

implies) Represent an object & its attributes

• Usually many attributes Can also describe what an object

matches with

Page 29: Using Condor  An Introduction Condor Week 2008

29http://www.cs.wisc.edu/condor

ClassAd Details› ClassAds can contain a lot of details

The job’s executable is analysis.exe The machine’s load average is 5.6

› ClassAds can specify requirements I require a machine with Linux

› ClassAds can specify preferences This machine prefers to run jobs from

the physics group

Page 30: Using Condor  An Introduction Condor Week 2008

30http://www.cs.wisc.edu/condor

ClassAd Details (continued)

› ClassAds are: semi-structured user-extensible schema-free Attribute = Expression

Page 31: Using Condor  An Introduction Condor Week 2008

31http://www.cs.wisc.edu/condor

Example:MyType = "Job"TargetType = "Machine"ClusterId = 1377Owner = "roy"Cmd = "sim.exe"Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024)>=ImageSize)…

ClassAd ExampleString

Number

Boolean

Page 32: Using Condor  An Introduction Condor Week 2008

32http://www.cs.wisc.edu/condor

The Dog ClassAd Type = “Dog” Color = “Brown” Price = 12

ClassAd for the “Job”

. . . Requirements = (type == “Dog”) && (color == “Brown”) && (price <= 15) . . .

Page 33: Using Condor  An Introduction Condor Week 2008

33http://www.cs.wisc.edu/condor

The Job Queue› condor_submit sends your job’s

ClassAd(s) to the schedd› The schedd (more details later):

Manages the local job queue Stores the job in the job queue

• Atomic operation, two-phase commit• “Like money in the bank”

› View the queue with condor_q

Page 34: Using Condor  An Introduction Condor Week 2008

34http://www.cs.wisc.edu/condor

Examplecondor_submit and

condor_q% condor_submit my_job.submitSubmitting job(s).1 job(s) submitted to cluster 1.

% condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> : ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

1.0 frieda 6/16 06:52 0+00:00:00 I 0 0.0 my_job

1 jobs; 1 idle, 0 running, 0 held

%

Page 35: Using Condor  An Introduction Condor Week 2008

35http://www.cs.wisc.edu/condor

Input, output & error files› Controlled by submit file settings

› You can define the job’s standard input, standard output and standard error: Read job’s standard input from “input_file”:

• Input = input_file• Shell equivalent: program <input_file

Write job’s standard ouput to “output_file”:• Output = output_file• Shell equivalent: program >output_file

Write job’s standard error to “error_file”:• Error = error_file• Shell equivalent: program 2>error_file

Page 36: Using Condor  An Introduction Condor Week 2008

36http://www.cs.wisc.edu/condor

Email about your job• Condor sends email about job

events to the submitting user• Specify “notification” in your

submit file to control which events:

Notification = completeNotification = neverNotification = errorNotification = always

Default

Page 37: Using Condor  An Introduction Condor Week 2008

37http://www.cs.wisc.edu/condor

Feedback on your job› Create a log of job events› Add to submit description file:

log = sim.log

› Becomes the Life Story of a Job Shows all events in the life of a job Always have a log file

Page 38: Using Condor  An Introduction Condor Week 2008

38http://www.cs.wisc.edu/condor

Sample Condor User Log

000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>...001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>...005 (0001.000.000) 05/25 19:13:06 Job terminated.

(1) Normal termination (return value 0)...

Page 39: Using Condor  An Introduction Condor Week 2008

39http://www.cs.wisc.edu/condor

Example Submit Description File With

Logging# Example condor_submit input file# (Lines beginning with # are comments)# NOTE: the words on the left side are not# case sensitive, but filenames are!Universe = vanillaExecutable = /home/frieda/condor/my_job.condorLog = my_job.log ·Job log (from Condor)Input = my_job.in ·Program’s standard inputOutput = my_job.out ·Program’s standard outputError = my_job.err ·Program’s standard errorArguments = -a1 -a2 ·Command line argumentsInitialDir = /home/frieda/condor/runQueue

Page 40: Using Condor  An Introduction Condor Week 2008

40http://www.cs.wisc.edu/condor

“Clusters” and “Processes”› If your submit file describes multiple jobs, we call

this a “cluster”› Each cluster has a unique “cluster number”› Each job in a cluster is called a “process”

Process numbers always start at zero› A Condor “Job ID” is the cluster number, a period,

and the process number (i.e. 2.1) A cluster can have a single process

• Job ID = 20.0 ·Cluster 20, process 0 Or, a cluster can have more than one process

• Job ID: 21.0, 21.1, 21.2 ·Cluster 21, process 0, 1, 2

Page 41: Using Condor  An Introduction Condor Week 2008

41http://www.cs.wisc.edu/condor

Submit File for a Cluster# Example submit file for a cluster of 2 jobs

# with separate input, output, error and log filesUniverse = vanillaExecutable = my_jobArguments = -x 0log = my_job_0.logInput = my_job_0.inOutput = my_job_0.outError = my_job_0.errQueue ·Job 2.0 (cluster 2, process 0)Arguments = -x 1log = my_job_1.logInput = my_job_1.inOutput = my_job_1.outError = my_job_1.errQueue ·Job 2.1 (cluster 2, process 1)

Page 42: Using Condor  An Introduction Condor Week 2008

42http://www.cs.wisc.edu/condor

% condor_submit my_job.submit-file

Submitting job(s).

2 job(s) submitted to cluster 2.% condor_q-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 frieda 4/15 06:52 0+00:02:11 R 0 0.0 my_job –a1 –a2

2.0 frieda 4/15 06:56 0+00:00:00 I 0 0.0 my_job –x 0 2.1 frieda 4/15 06:56 0+00:00:00 I 0 0.0 my_job –x 13 jobs; 2 idle, 1 running, 0 held

%

Submitting The Job

Page 43: Using Condor  An Introduction Condor Week 2008

43http://www.cs.wisc.edu/condor

Back to our 600 jobs…› We could put all input, output, error

& log files in the one directory One of each type for each job That’d be 2400 files (4 files × 600 jobs) Difficult to sort through

› Better: Create a subdirectory for each run

Page 44: Using Condor  An Introduction Condor Week 2008

44http://www.cs.wisc.edu/condor

Organize your files and directories for big runs

› Create subdirectories for each “run” run_0, run_1, … run_599

› Create input files in each of these run_0/simulation.in run_1/simulation.in … run_599/simulation.in

› The output, error & log files for each job will be created by Condor from your job’s output

Page 45: Using Condor  An Introduction Condor Week 2008

45http://www.cs.wisc.edu/condor

Frieda’s simulation directorysim.exesim.sub run_0

run_599

simulation.insimulation.outsimulation.errsimulation.log

simulation.insimulation.outsimulation.errsimulation.log

Page 46: Using Condor  An Introduction Condor Week 2008

46http://www.cs.wisc.edu/condor

Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directoriesUniverse = vanillaExecutable = simLog = simulation.log...Arguments = -x 0InitialDir = run_0 ·Log, input, output & error files -> run_0Queue ·Job 3.0 (Cluster 3, Process 0)

Arguments = -x 1InitialDir = run_1 ·Log, input, output & error files -> run_1Queue ·Job 3.1 (Cluster 3, Process 1)

·Do this 598 more times…………

Page 47: Using Condor  An Introduction Condor Week 2008

47http://www.cs.wisc.edu/condor

Submit File for a Big Cluster of Jobs

› We just submitted 1 cluster with 600 processes

› All the input/output files will be in different directories

› The submit file is pretty unwieldy (over 1200 lines)

› Isn’t there a better way?

Page 48: Using Condor  An Introduction Condor Week 2008

48http://www.cs.wisc.edu/condor

Submit File for a Big Cluster of Jobs (the better

way) #1› We can queue all 600 in 1 “Queue” command Queue 600

› Condor provides $(Process) and $(Cluster) $(Process) will be expanded to the

process number for each job in the cluster• 0, 1, … 599

$(Cluster) will be expanded to the cluster number• Will be 4 for all jobs in this cluster

Page 49: Using Condor  An Introduction Condor Week 2008

49http://www.cs.wisc.edu/condor

Submit File for a Big Cluster of Jobs (the better

way) #2› The initial directory for each job can

be specified using $(Process) InitialDir = run_$(Process) Condor will expand these to “run_0”,

“run_1”, … “run_599” directories› Similarly, arguments can be variable

Arguments = -x $(Process) Condor will expand these to “-x 0”, “-x 1”, … “-x 599”

Page 50: Using Condor  An Introduction Condor Week 2008

50http://www.cs.wisc.edu/condor

Better Submit File for 600 Jobs

# Example condor_submit input file that defines# a cluster of 600 jobs with different directoriesUniverse = vanillaExecutable = my_jobLog = my_job.logInput = my_job.inOutput = my_job.outError = my_job.errArguments = –x $(Process) ·–x 0, -x 1, … -x 599InitialDir = run_$(Process) ·run_0 … run_599Queue 600 ·Jobs 4.0 … 4.599

Page 51: Using Condor  An Introduction Condor Week 2008

51http://www.cs.wisc.edu/condor

Now, we submit it…$ condor_submit my_job.submitSubmitting

job(s) ...............................................................................................................................................................................................................................................................

Logging submit event(s) ...............................................................................................................................................................................................................................................................

600 job(s) submitted to cluster 4.

Page 52: Using Condor  An Introduction Condor Week 2008

52http://www.cs.wisc.edu/condor

And, Check the queue$ condor_q-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.eduID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD4.0 frieda 4/20 12:08 0+00:00:05 R 0 9.8 my_job -arg1 –x 04.1 frieda 4/20 12:08 0+00:00:03 I 0 9.8 my_job -arg1 –x 14.2 frieda 4/20 12:08 0+00:00:01 I 0 9.8 my_job -arg1 –x 24.3 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 –x 3...4.598 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 –x 5984.599 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 –x 599

600 jobs; 599 idle, 1 running, 0 held

Page 53: Using Condor  An Introduction Condor Week 2008

53http://www.cs.wisc.edu/condor

Removing jobs› If you want to remove a job from

the Condor queue, you use condor_rm

› You can only remove jobs that you own

› Privileged user can remove any jobs “root” on UNIX “administrator” on Windows

Page 54: Using Condor  An Introduction Condor Week 2008

54http://www.cs.wisc.edu/condor

Removing jobs (continued)

› Remove an entire cluster: condor_rm 4 ·Removes the whole cluster

› Remove a specific job from a cluster: condor_rm 4.0 ·Removes a single job

› Or, remove all of your jobs with “-a” condor_rm -a ·Removes all jobs / clusters

Page 55: Using Condor  An Introduction Condor Week 2008

55http://www.cs.wisc.edu/condor

The Boss says Frieda can add her

co-workers’ desktop machines

into her Condor pool as well…

but only if they can also submit jobs.

(Boss Fat Cat)

Good News

Page 56: Using Condor  An Introduction Condor Week 2008

56http://www.cs.wisc.edu/condor

Adding nodes› Frieda installs Condor on the desktop

machines, and configures them with her machine as the central manager The central manager:

• Central repository for the whole pool• Performs job / machine matching, etc.

› These are “non-dedicated” nodes, meaning that they can't always run Condor jobs

Page 57: Using Condor  An Introduction Condor Week 2008

57http://www.cs.wisc.edu/condor

yourworkstation

personalCondor

600 Condorjobs

Condor Pool

Page 58: Using Condor  An Introduction Condor Week 2008

58http://www.cs.wisc.edu/condor

condor_status% condor_statusName OpSys Arch State Activ LoadAv Mem ActvtyTimeantipholus.cs LINUX INTEL Unclaimed Idle 0.020 511 0+02:28:42coral.cs.wisc LINUX INTEL Claimed Busy 0.990 511 0+01:27:21doc.cs.wisc.e LINUX INTEL Unclaimed Idle 0.260 511 0+00:20:04dsonokwa.cs.w LINUX INTEL Claimed Busy 0.810 511 0+00:01:45ferdinand.cs. LINUX INTEL Claimed Suspe 1.130 511 0+00:00:55vm1@pinguino. LINUX INTEL Unclaimed Idle 0.000 255 0+01:03:28vm2@pinguino. LINUX INTEL Unclaimed Idle 0.190 255 0+01:03:29

Page 59: Using Condor  An Introduction Condor Week 2008

59http://www.cs.wisc.edu/condor

How can my jobs access their data

files?

Page 60: Using Condor  An Introduction Condor Week 2008

60http://www.cs.wisc.edu/condor

Access to Data in Condor

› Use shared filesystem if available› No shared filesystem?

Condor can transfer files• Can automatically send back changed

files• Atomic transfer of multiple files• Can be encrypted over the wire

Remote I/O Socket Standard Universe can use remote

system calls (more on this later)

Page 61: Using Condor  An Introduction Condor Week 2008

61http://www.cs.wisc.edu/condor

Condor File Transfer› ShouldTransferFiles = YES

Always transfer files to execution site› ShouldTransferFiles = NO

Rely on a shared filesystem› ShouldTransferFiles = IF_NEEDED

Will automatically transfer the files if the submit and execute machine are not in the same FileSystemDomain

Universe = vanillaExecutable = my_jobLog = my_job.logShouldTransferFiles = IF_NEEDEDTransfer_input_files = dataset.$(Process), common.dataTransfer_output_files = TheAnswer.datQueue 600

Page 62: Using Condor  An Introduction Condor Week 2008

62http://www.cs.wisc.edu/condor

We Always Want MoreCondor is managing and

running our jobs, but Our CPU requirements

are greater than our resources

Jobs are preempted more often than we like

Page 63: Using Condor  An Introduction Condor Week 2008

63http://www.cs.wisc.edu/condor

Happy Day! Frieda’s organization purchased a

Dedicated Cluster!› Frieda Installs Condor on all

the dedicated Cluster nodes› Frieda also adds a dedicated

central manager› She configures her entire

pool with this new host as the central manager…

Page 64: Using Condor  An Introduction Condor Week 2008

64http://www.cs.wisc.edu/condor

Condor Pool

With the additional resources, Frieda

and her co-workers can get their jobs completed even

faster.

600 Condorjobs

Frieda’s Condor Pool

Dedicated Cluster

Page 65: Using Condor  An Introduction Condor Week 2008

65http://www.cs.wisc.edu/condor

Another Universe

Page 66: Using Condor  An Introduction Condor Week 2008

66http://www.cs.wisc.edu/condor

Frieda submits some MPI jobs

universe = parallel executable = mp1script arguments = my_mpich_linked_executable arg1

arg2 machine_count = 4 should_transfer_files = yeswhen_to_transfer_output = on_exittransfer_input_files = my_mpich_linked_executable queue

Page 67: Using Condor  An Introduction Condor Week 2008

67http://www.cs.wisc.edu/condor

What Condor Daemons are

running on my machine, and what

do they do?

Page 68: Using Condor  An Introduction Condor Week 2008

68http://www.cs.wisc.edu/condor

condor_master› Starts up all other Condor daemons› If there are any problems and a daemon

exits, it restarts the daemon and sends email to the administrator

› Acts as the server for many Condor remote administration commands: condor_reconfig, condor_restart,

condor_off, condor_on, condor_config_val, etc.

Page 69: Using Condor  An Introduction Condor Week 2008

69http://www.cs.wisc.edu/condor

Condor Daemon LayoutPersonal Condor / Central Manager

Master

collector

negotiator

schedd

startd

= Process Spawned

Page 70: Using Condor  An Introduction Condor Week 2008

70http://www.cs.wisc.edu/condor

Central Manager:condor_collector

› Central manager: central repository and match maker for whole pool

› Collects information from all other Condor daemons in the pool “Directory Service” / Database for a Condor pool

› Each daemon sends a periodic update called a “ClassAd” to the collector

› Services queries for information: Queries from other Condor daemons Queries from users (condor_status)

› Only on the Central Manager› At least one collector per pool

Page 71: Using Condor  An Introduction Condor Week 2008

71http://www.cs.wisc.edu/condor

Condor Pool Layout: Collector

= ClassAd Communication Pathway

= Process Spawned Central Manager

Master

Collector

Page 72: Using Condor  An Introduction Condor Week 2008

72http://www.cs.wisc.edu/condor

Central Manager:condor_negotiator

› Performs “matchmaking” in Condor› Each “Negotiation Cycle” (typically 5 minutes):

Gets information from the collector about all available machines and all idle jobs

Tries to match jobs with machines that will serve them

Both the job and the machine must satisfy each other’s requirements

› Only one negotiator per pool› Only on the Central Manager

Page 73: Using Condor  An Introduction Condor Week 2008

73http://www.cs.wisc.edu/condor

Condor Pool Layout: Negotiator

= ClassAd Communication Pathway

= Process Spawned Central Manager

Master

Collector

negotiator

Page 74: Using Condor  An Introduction Condor Week 2008

74http://www.cs.wisc.edu/condor

Execute Hosts:condor_startd

› Execute host: machines that run user jobs

› Represents a machine to the Condor system

› Responsible for starting, suspending, and stopping jobs

› Enforces the wishes of the machine owner (the owner’s “policy”… more on this in the administrator’s tutorial)

› Creates a “starter” for each running job› One startd runs on each execute node

Page 75: Using Condor  An Introduction Condor Week 2008

75http://www.cs.wisc.edu/condor

Condor Pool Layout: startd= ClassAd Communication Pathway

= Process Spawned Cluster NodeMaster

startd

Cluster NodeMaster

startd

Central Manager

Master

Collector

negotiator

Page 76: Using Condor  An Introduction Condor Week 2008

76http://www.cs.wisc.edu/condor

Submit Hosts:condor_schedd

› Submit hosts: machines that users can submit jobs on

› Maintains the persistent queue of jobs› Responsible for contacting available machines

and sending them jobs› Services user commands which manipulate the

job queue: condor_submit,condor_rm, condor_q,

condor_hold, condor_release, condor_prio, …› Creates a “shadow” for each running job› One schedd runs on each submit host

Page 77: Using Condor  An Introduction Condor Week 2008

77http://www.cs.wisc.edu/condor

Condor Pool Layout: schedd

= ClassAd Communication Pathway

= Process Spawned Cluster NodeMaster

startd

Cluster NodeMaster

startdDesktopMaster

startd

DesktopMaster

startd

Central Manager

Master

Collector

negotiator

schedd

negotiator

schedd schedd

Page 78: Using Condor  An Introduction Condor Week 2008

78http://www.cs.wisc.edu/condor

Condor Pool Layout: master

= ClassAd Communication Pathway

= Process Spawned Central Manager

Master

Collector

Cluster NodeMaster

startd

Cluster NodeMaster

startdDesktopMaster

startdschedd

DesktopMaster

startdschedd

negotiator

schedd

negotiator

schedd

Master Master

Master Master

Master

Page 79: Using Condor  An Introduction Condor Week 2008

79http://www.cs.wisc.edu/condor

Now what?› Some of the machines in the pool

can’t run my jobs Not enough RAM Not enough scratch disk space Required software not installed Etc.

Page 80: Using Condor  An Introduction Condor Week 2008

80http://www.cs.wisc.edu/condor

Specify Requirements› An expression (syntax similar to C or Java)› Must evaluate to True for a match to be

made

Universe = vanillaExecutable = my_jobLog = my_job.logInitialDir = run_$(Process)Requirements = Memory >= 256 && Disk > 10000Queue 600

Page 81: Using Condor  An Introduction Condor Week 2008

81http://www.cs.wisc.edu/condor

Advanced Requirements› Requirements can match custom attributes in

your Machine Ad Can be added by hand to each machine Or, automatically using the “Hawkeye” mechanism

Universe = vanillaExecutable = $$(MyProgPath)Log = my_job.logInitialDir = run_$(Process)Requirements = Memory >= 256 && Disk > 10000 \

&& ( MyProgPath =!= UNDEFINED) )Queue 600

Page 82: Using Condor  An Introduction Condor Week 2008

82http://www.cs.wisc.edu/condor

And, Specify Rank› All matches which meet the

requirements can be sorted by preference with a Rank expression.

› Higher the Rank, the better the matchUniverse = vanillaExecutable = my_jobLog = my_job.logArguments = -arg1 –arg2InitialDir = run_$(Process)Requirements = Memory >= 256 && Disk > 10000Rank = (KFLOPS*10000) + MemoryQueue 600

Page 83: Using Condor  An Introduction Condor Week 2008

83http://www.cs.wisc.edu/condor

My jobs aren’t running!!

Page 84: Using Condor  An Introduction Condor Week 2008

84http://www.cs.wisc.edu/condor

Check the queue› Check the queue with condor_q:bash-2.05a$ condor_q-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> :x.cs.wisc.eduID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD5.0 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 05.1 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 15.2 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 25.3 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 35.4 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 45.5 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 55.6 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 65.7 frieda 4/20 12:23 0+00:00:00 I 0 9.8 my_job -arg1 –n 76.0 frieda 4/20 13:22 0+00:00:00 H 0 9.8 my_job -arg1 –arg28 jobs; 8 idle, 0 running, 1 held

Page 85: Using Condor  An Introduction Condor Week 2008

85http://www.cs.wisc.edu/condor

Look at jobs on hold% condor_q –hold-- Submiter: x.cs.wisc.edu :

<128.105.121.53:510> :x.cs.wisc.edu ID OWNER HELD_SINCE HOLD_REASON 6.0 frieda 4/20 13:23 Error from starter

on skywalker.cs.wisc.edu

9 jobs; 8 idle, 0 running, 1 held

Or, See full details for a job% condor_q –l 6.0

Page 86: Using Condor  An Introduction Condor Week 2008

86http://www.cs.wisc.edu/condor

Check machine status› Verify that there are idle machines with condor_status:bash-2.05a$ condor_statusName OpSys Arch State Activity LoadAv Mem [email protected] LINUX INTEL Claimed Busy 0.000 501 0+00:00:[email protected] LINUX INTEL Claimed Busy 0.000 501 0+00:00:[email protected] LINUX INTEL Claimed Busy 0.040 501 0+00:00:[email protected] LINUX INTEL Claimed Busy 0.000 501 0+00:00:05

Total Owner Claimed Unclaimed Matched PreemptingINTEL/LINUX 4 0 4 0 0 0 Total 4 0 4 0 0 0

Page 87: Using Condor  An Introduction Condor Week 2008

87http://www.cs.wisc.edu/condor

Look in Job Log› Look in your job log for clues:bash-2.05a$ cat my_job.log000 (031.000.000) 04/20 14:47:31 Job submitted from host:

<128.105.121.53:48740>...007 (031.000.000) 04/20 15:02:00 Shadow exception! Error from starter on gig06.stat.wisc.edu: Failed

to open '/scratch.1/frieda/workspace/v67/condor-test/test3/run_0/my_job.in' as standard input: No such file or directory (errno 2)

0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job...

Page 88: Using Condor  An Introduction Condor Week 2008

88http://www.cs.wisc.edu/condor

Still not running?Exercise a little patience

› On a busy pool, it can take a while to match and start your jobs

› Wait at least a negotiation cycle or two (typically a few minutes)

Page 89: Using Condor  An Introduction Condor Week 2008

89http://www.cs.wisc.edu/condor

Look to condor_q for help:condor_q -analyze

bash-2.05a$ condor_q -ana 29---029.000: Run analysis summary. Of 1243 machines, 1243 are rejected by your job's requirements0 are available to run your job

WARNING: Be advised: No resources matched request's constraints Check the Requirements expression below:Requirements = ((Memory > 8192)) && (Arch == "INTEL") &&

(OpSys == "LINUX") && (Disk >= DiskUsage) && (TARGET.FileSystemDomain == MY.FileSystemDomain)

Page 90: Using Condor  An Introduction Condor Week 2008

90http://www.cs.wisc.edu/condor

Better analysis (Linux only):

condor_q –better-analyzebash-2.05a$ condor_q -better-ana 29The Requirements expression for your job is:

( ( target.Memory > 8192 ) ) && ( target.Arch == "INTEL" ) &&( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&( TARGET.FileSystemDomain == MY.FileSystemDomain )Condition Machines Matched Suggestion--------- ---------------- ----------1 ( ( target.Memory > 8192 ) ) 0 MODIFY TO 40002 ( TARGET.FileSystemDomain == "cs.wisc.edu" )5843 ( target.Arch == "INTEL" ) 10784 ( target.OpSys == "LINUX" ) 11005 ( target.Disk >= 13 ) 1243

Page 91: Using Condor  An Introduction Condor Week 2008

91http://www.cs.wisc.edu/condor

Learn about available resources:

bash-2.05a$ condor_status –const 'Memory > 8192'(no output means no matches)bash-2.05a$ condor_status -const 'Memory > 4096'Name OpSys Arch State Activ LoadAv Mem [email protected]. LINUX X86_64 Unclaimed Idle 0.000 5980 1+05:35:[email protected]. LINUX X86_64 Unclaimed Idle 0.000 5980 13+05:37:[email protected]. LINUX X86_64 Unclaimed Idle 0.000 7988 1+06:00:[email protected]. LINUX X86_64 Unclaimed Idle 0.000 7988 13+06:03:47

Total Owner Claimed Unclaimed Matched Preempting X86_64/LINUX 4 0 0 4 0 0 Total 4 0 0 4 0 0

Page 92: Using Condor  An Introduction Condor Week 2008

92http://www.cs.wisc.edu/condor

Job Policy Expressions› User can supply job policy expressions

in the submit file.› Can be used to describe a successful

run.on_exit_remove = <expression>on_exit_hold = <expression>periodic_remove = <expression>periodic_hold = <expression>

Page 93: Using Condor  An Introduction Condor Week 2008

93http://www.cs.wisc.edu/condor

Job Policy Examples› Do not remove if exits with a signal:

on_exit_remove = ExitBySignal == False

› Place on hold if exits with nonzero status or ran for less than an hour: on_exit_hold = ( (ExitBySignal==False) && (ExitSignal != 0) ) || ( (ServerStartTime - JobStartDate) < 3600)

› Place on hold if job has spent more than 50% of its time suspended: periodic_hold = CumulativeSuspensionTime >

(RemoteWallClockTime / 2.0)

Page 94: Using Condor  An Introduction Condor Week 2008

94http://www.cs.wisc.edu/condor

Insert ClassAd attributes

› Special purpose usage› In the submit description file,

introduce an attribute for the job+Department = “biochemistry”

causes the ClassAd to containDepartment = “biochemistry”

Page 95: Using Condor  An Introduction Condor Week 2008

95http://www.cs.wisc.edu/condor

We’ve seen how Condor can:

Keep an eye on your jobs and will keep you posted on their progress

Implement your policy on the execution order of the jobs

Keep a log of your job activities

Page 96: Using Condor  An Introduction Condor Week 2008

96http://www.cs.wisc.edu/condor

Another Universe

Page 97: Using Condor  An Introduction Condor Week 2008

97http://www.cs.wisc.edu/condor

My new jobs run for 20 days…

› What happens when a job is forced off it’s CPU? Preempted by higher priority

user or job Vacated because of user activity

› How can I add fault tolerance to my jobs?

Page 98: Using Condor  An Introduction Condor Week 2008

98http://www.cs.wisc.edu/condor

Condor’s Standard Universe to the rescue!› Support for transparent process

checkpoint and restart› Remote system calls (remote

I/O) Your job can read / write files as if they were local

Page 99: Using Condor  An Introduction Condor Week 2008

99http://www.cs.wisc.edu/condor

Remote System Calls inthe Standard Universe

› I/O system calls are trapped and sent back to the submit machineExamples: open a file, write to a file

› No source code changes typically required

› Programming language independent

Page 100: Using Condor  An Introduction Condor Week 2008

100http://www.cs.wisc.edu/condor

Process Checkpointing in the

Standard Universe› Condor’s process checkpointing provides a mechanism to automatically save the state of a job

› The process can then be restarted from right where it was checkpointed After preemption, crash, etc.

Page 101: Using Condor  An Introduction Condor Week 2008

101http://www.cs.wisc.edu/condor

Checkpointing:Process Starts

checkpoint: the entire state of a program, saved in a file CPU registers, memory image, I/O

time

Page 102: Using Condor  An Introduction Condor Week 2008

102http://www.cs.wisc.edu/condor

Checkpointing:Process Checkpointed

time

1 2 3

Page 103: Using Condor  An Introduction Condor Week 2008

103http://www.cs.wisc.edu/condor

Checkpointing:Process Killed

time

3

3

Killed!

Page 104: Using Condor  An Introduction Condor Week 2008

104http://www.cs.wisc.edu/condor

Checkpointing:Process Resumed

time

3

3

goodput badput goodput

Page 105: Using Condor  An Introduction Condor Week 2008

105http://www.cs.wisc.edu/condor

When will Condor checkpoint your job?

› Periodically, if desired For fault tolerance

› When your job is preempted by a higher priority job

› When your job is vacated because the execution machine becomes busy

› When you explicitly run condor_checkpoint, condor_vacate, condor_off or condor_restart command

Page 106: Using Condor  An Introduction Condor Week 2008

106http://www.cs.wisc.edu/condor

Making the Standard Universe Work

› The job must be relinked with Condor’s standard universe support library

› To relink, place condor_compile in front of the command used to link the job:% condor_compile gcc -o myjob myjob.c

- OR -% condor_compile f77 -o myjob filea.f fileb.f

- OR -% condor_compile make –f MyMakefile

Page 107: Using Condor  An Introduction Condor Week 2008

107http://www.cs.wisc.edu/condor

Limitations of the Standard Universe

› Condor’s checkpointing is not at the kernel level. Standard Universe the job may not:

• Fork()• Use kernel threads• Use some forms of IPC, such as pipes and shared

memory› Must have access to source code to

relink› Many typical scientific jobs are OK

Page 108: Using Condor  An Introduction Condor Week 2008

108http://www.cs.wisc.edu/condor

Connecting Condors

› Frieda knows people with their own Condor pools, and gets permission to use their computing resources…

› How can Condor help her do this?

Page 109: Using Condor  An Introduction Condor Week 2008

109http://www.cs.wisc.edu/condor

Connect Condorswith Flocking

› Frieda configures her Condor pool to “flockflock” to her friend’s pool.

› Flocking is a Condor-specific technology.

Page 110: Using Condor  An Introduction Condor Week 2008

110http://www.cs.wisc.edu/condor

Condor Pool

600 Condorjobs

Frieda’s Condor Pool

Friendly Condor Pool

Page 111: Using Condor  An Introduction Condor Week 2008

111http://www.cs.wisc.edu/condor

Frieda meets The Grid› Frieda also has access to grid resources

she wants to use She has certificates and access to Globus or

other resources at remote institutions› But Frieda wants Condor’s queue

management features for her jobs!› She installs Condor so she can submit

“Grid Universe” jobs to Condor

Page 112: Using Condor  An Introduction Condor Week 2008

112http://www.cs.wisc.edu/condor

“Grid” Universe› All handled in your submit file› Supports a number of “back end” types:

Globus: GT2, GT4 NorduGrid UNICORE Condor PBS LSF EC2 NQS

Page 113: Using Condor  An Introduction Condor Week 2008

113http://www.cs.wisc.edu/condor

Grid Universe & Globus 2› Used for a Globus GT2 back-end

“Condor-G”› Format:Grid_Resource = gt2 Head-NodeGlobus_rsl = <RSL-String>› Example:Universe = gridGrid_Resource = gt2 beak.cs.wisc.edu/jobmanagerGlobus_rsl = (queue=long)(project=atom-smasher)

Page 114: Using Condor  An Introduction Condor Week 2008

114http://www.cs.wisc.edu/condor

Grid Universe & Globus 4› Used for a Globus GT4 back-

end› Format:Grid_Resource = gt4 <Head-Node> <Scheduler-Type>Globus_XML = <XML-String>

› Example:Universe = gridGrid_Resource = gt4 beak.cs.wisc.edu CondorGlobus_xml = <queue>long</queue><project>atom-

smasher</project>

Page 115: Using Condor  An Introduction Condor Week 2008

115http://www.cs.wisc.edu/condor

Grid Universe & Condor› Used for a Condor back-end

“Condor-C”› Format:Grid_Resource = condor <Schedd-Name> <Collector-Name>Remote_<param> = <value>

“Remote_” part is stripped off› Example:Universe = gridGrid_Resource = condor beak condor.cs.wisc.eduRemote_Universe = standard

Page 116: Using Condor  An Introduction Condor Week 2008

116http://www.cs.wisc.edu/condor

Grid Universe & NorduGrid

› Used for a NorduGrid back-endGrid_Resource = nordugrid <Host-Name>

› Example:Universe = gridGrid_Resource = nordugrid ngrid.cs.wisc.edu

Page 117: Using Condor  An Introduction Condor Week 2008

117http://www.cs.wisc.edu/condor

Grid Universe & UNICORE

› Used for a UNICORE back-end› Format:Grid_Resource = unicore <USite> <VSite>

› Example:Universe = gridGrid_Resource = unicore uhost.cs.wisc.edu vhost

Page 118: Using Condor  An Introduction Condor Week 2008

118http://www.cs.wisc.edu/condor

Grid Universe & PBS› Used for a PBS back-end› Format:Grid_Resource = pbs

› Example:Universe = gridGrid_Resource = pbs

Page 119: Using Condor  An Introduction Condor Week 2008

119http://www.cs.wisc.edu/condor

Grid Universe & LSF› Used for a LSF back-end› Format:Grid_Resource = lsf› Example:Universe = gridGrid_Resource = lsf

Page 120: Using Condor  An Introduction Condor Week 2008

120http://www.cs.wisc.edu/condor

Credential Management› Condor will do The Right Thing™ with

your X509 certificate and proxy› Override default proxy:

X509UserProxy = /home/frieda/other/proxy› Proxy may expire before jobs finish

executing Condor can use MyProxy to renew your proxy When a new proxy is available, Condor will

forward the renewed proxy to the job This works for non-grid jobs, too

Page 121: Using Condor  An Introduction Condor Week 2008

121http://www.cs.wisc.edu/condor

My jobs have have dependencies…

Can Condor help solve my dependency problems?

Page 122: Using Condor  An Introduction Condor Week 2008

122http://www.cs.wisc.edu/condor

Condor Universes:Scheduler and Local

› Scheduler Universe Plug in a meta-scheduler Developed for DAGMan (more later) Similar to Globus’s fork job manager

› Local Very similar to vanilla, but jobs run on

the local host Has more control over jobs than

scheduler universe

Page 123: Using Condor  An Introduction Condor Week 2008

123http://www.cs.wisc.edu/condor

Frieda learns DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., “Don’t run job “B” until job “A” has completed successfully.”)

Page 124: Using Condor  An Introduction Condor Week 2008

124http://www.cs.wisc.edu/condor

What is a DAG?› A DAG is the data structure

used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!

Job A

Job B

Job C

Job D

Page 125: Using Condor  An Introduction Condor Week 2008

125http://www.cs.wisc.edu/condor

Defining a DAG› A DAG is defined by a .dag file, listing each of

its nodes and their dependencies:# diamond.dagJob A a.subJob B b.subJob C c.subJob D d.subParent A Child B CParent B C Child D

› each node will run the Condor job specified by its accompanying Condor submit file

Job A

Job B Job C

Job D

Page 126: Using Condor  An Introduction Condor Week 2008

126http://www.cs.wisc.edu/condor

Submitting a DAG› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon which to begin running your jobs:% condor_submit_dag diamond.dag

› condor_submit_dag is run by the schedd DAGMan daemon itself is “watched” by

Condor, so you don’t have to

Page 127: Using Condor  An Introduction Condor Week 2008

127http://www.cs.wisc.edu/condor

DAGMan

Running a DAG› DAGMan acts as a “meta-scheduler”,

managing the submission of your jobs to Condor based on the DAG dependencies.

CondorJobQueue

B C

D

A

A.dagFile

Page 128: Using Condor  An Introduction Condor Week 2008

128http://www.cs.wisc.edu/condor

DAGMan

Running a DAG (cont’d)› DAGMan holds & submits jobs to the

Condor queue at the appropriate times.

CondorJobQueue D

B

C

B

A

C

Page 129: Using Condor  An Introduction Condor Week 2008

129http://www.cs.wisc.edu/condor

Running a DAG (cont’d)› In case of a job failure, DAGMan continues

until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

CondorJobQueue DAGMan

X

D

A

BRescue

File

Page 130: Using Condor  An Introduction Condor Week 2008

130http://www.cs.wisc.edu/condor

Recovering a DAG› Once the failed job is ready to be re-run,

the rescue file can be used to restore the prior state of the DAG.

CondorJobQueue

RescueFile

CDAGMan D

A

B C

Page 131: Using Condor  An Introduction Condor Week 2008

131http://www.cs.wisc.edu/condor

DAGMan

Recovering a DAG (cont’d)

› Once that job completes, DAGMan will continue the DAG as if the failure never happened.

CondorJobQueue

C

D

A

B

D

Page 132: Using Condor  An Introduction Condor Week 2008

132http://www.cs.wisc.edu/condor

DAGMan

Finishing a DAG› Once the DAG is complete, the DAGMan

job itself is finished, and exits.

CondorJobQueue

C

D

A

B

Page 133: Using Condor  An Introduction Condor Week 2008

133http://www.cs.wisc.edu/condor

Additional DAGMan Features

› Provides other handy features for job management… nodes can have PRE & POST scripts failed nodes can be automatically re-

tried a configurable number of times job submission can be “throttled”

Page 134: Using Condor  An Introduction Condor Week 2008

134http://www.cs.wisc.edu/condor

General User Commands› condor_status View Pool Status

› condor_q View Job Queue› condor_submit Submit new Jobs› condor_rm Remove Jobs› condor_prio Intra-User Prios› condor_history Completed Job Info› condor_submit_dag Submit new DAG› condor_checkpoint Force a checkpoint› condor_compile Link Condor library

Page 135: Using Condor  An Introduction Condor Week 2008

135http://www.cs.wisc.edu/condor

Condor Job Universes• Vanilla Universe• Standard Universe• Grid Universe• Scheduler

Universe• Local Universe• Virtual Machine

Universe• Java Universe

• Parallel Universe• MPICH-1• MPICH-2• LAM• …

Page 136: Using Condor  An Introduction Condor Week 2008

136http://www.cs.wisc.edu/condor

Why have a special Universe for Java jobs?

› Java Universe provides more than just inserting “java” at the start of the execute line of a vanilla job: Knows which machines have a JVM installed Knows the location, version, and

performance of JVM on each machine Knows about jar files, etc. Provides more information about Java job

completion than just JVM exit code• Program runs in a Java wrapper, allowing Condor

to report Java exceptions, etc.

Page 137: Using Condor  An Introduction Condor Week 2008

137http://www.cs.wisc.edu/condor

Universe Java Job› Example Java Universe Submit file:Universe = javaExecutable = Main.classjar_files = MyLibrary.jarInput = infileOutput = outfileArguments = Main 1 2 3Queue

Page 138: Using Condor  An Introduction Condor Week 2008

138http://www.cs.wisc.edu/condor

Java support, cont.bash-2.05a$ condor_status –java Name JavaVendor Ver State Actv LoadAv Memabulafia.cs Sun Microsy 1.5.0_ Claimed Busy 0.180 503acme.cs.wis Sun Microsy 1.5.0_ Unclaimed Idle 0.000 503adelie01.cs Sun Microsy 1.5.0_ Claimed Busy 0.000 1002adelie02.cs Sun Microsy 1.5.0_ Claimed Busy 0.000 1002… Total Owner Claimed Unclaimed Matched Preempting INTEL/LINUX 965 179 516 250 20 0 INTEL/WINNT50 102 6 65 31 0 0SUN4u/SOLARIS28 1 0 0 1 0 0 X86_64/LINUX 128 2 106 20 0 0

Total 1196 187 687 302 20 0

Page 139: Using Condor  An Introduction Condor Week 2008

139http://www.cs.wisc.edu/condor

Frieda wants Condor features on remote

resources› She wants to run standard

universe jobs on Grid-managed resources For matchmaking and dynamic

scheduling of jobs For job checkpointing and migration For remote system calls

Page 140: Using Condor  An Introduction Condor Week 2008

140http://www.cs.wisc.edu/condor

Condor GlideIn› Frieda can use the Grid Universe to run

Condor daemons on Grid resources› When the resources run these GlideIn

jobs, they will temporarily join her Condor Pool

› She can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the remote resources

› Currently only supports Globus GT2 We hope to fix this limitation

Page 141: Using Condor  An Introduction Condor Week 2008

141http://www.cs.wisc.edu/condor

yourworkstation

Friendly Condor Pool

personalCondor

600 Condorjobs

Globus Grid

PBS LSF

Condor

Condor Pool

glide-in jobs

Page 142: Using Condor  An Introduction Condor Week 2008

143http://www.cs.wisc.edu/condor

GlideIn Concerns› What if the remote resource kills my GlideIn

job? That resource will disappear from your pool and your

jobs will be rescheduled on other machines Standard universe jobs will resume from their last

checkpoint like usual› What if all my jobs are completed before a

GlideIn job runs? If a GlideIn Condor daemon is not matched with a job

in 10 minutes, it terminates, freeing the resource

Page 143: Using Condor  An Introduction Condor Week 2008

144http://www.cs.wisc.edu/condor

In ReviewWith Condor’s help, Frieda can:

Manage her compute job workload Access local machines Access remote Condor Pools via flocking Access remote compute resources on

the Grid via “Grid Universe” jobs Carve out her own personal Condor Pool

from the Grid with GlideIn technology

Page 144: Using Condor  An Introduction Condor Week 2008

145http://www.cs.wisc.edu/condor

Administrator Commands› condor_vacate Leave a machine now

› condor_on Start Condor› condor_off Stop Condor› condor_reconfig Reconfig on-the-fly› condor_config_val View/set config› condor_userprio User Priorities› condor_stats View detailed usage

accounting stats

Page 145: Using Condor  An Introduction Condor Week 2008

146http://www.cs.wisc.edu/condor

My boss wants to watch what Condor is doing

Page 146: Using Condor  An Introduction Condor Week 2008

147http://www.cs.wisc.edu/condor

Use CondorView!› Provides visual graphs of current and past

utilization› Data is derived from Condor's own accounting

statistics› Interactive Java applet› Quickly and easily view:

How much Condor is being used How many cycles are being delivered Who is using them Utilization by machine platform or by user

Page 147: Using Condor  An Introduction Condor Week 2008

148http://www.cs.wisc.edu/condor

CondorView Usage Graph

Page 148: Using Condor  An Introduction Condor Week 2008

149http://www.cs.wisc.edu/condor

I could also talk lots about…› GCB: Living with firewalls & private networks

› Federated Grids/Clusters› APIs and Portals› MW› Database Support (Quill)› High Availability Fail-over› Compute On-Demand (COD)› Dynamic Pool Creation (“Glide-in”)› Role-based prioritization and accounting› Strong security, incl privilege separation› Data movement scheduling in workflows› …

Page 149: Using Condor  An Introduction Condor Week 2008

150http://www.cs.wisc.edu/condor

Thank you!Check us out on the Web:

http://www.condorproject.org

Email:[email protected]