Transcript
Page 1: Condor DAGMan: Introduction & Update

Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

Condor DAGMan: Introduction & Update

Page 2: Condor DAGMan: Introduction & Update

DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)

Page 3: Condor DAGMan: Introduction & Update

Why is This Important?

› Most real science involves complex sequences of tasks – on many resources at many sites.
• E.g., move data, compute, check, move back, etc.

› … and many types of jobs working together.
• Condor, Grid (Condor-G), MPI, shell scripts, etc.

› Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.

Page 4: Condor DAGMan: Introduction & Update

What is a DAG?

› A DAG is the data structure used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!

[Diagram: a diamond-shaped DAG. Job A is the parent of Job B and Job C, which are both parents of Job D.]
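The "no loops" rule can be illustrated with a minimal Python sketch. It is not DAGMan's code; the function name `topological_order` and the dictionary layout are assumptions made for illustration. It orders the nodes so that every parent comes before its children, and reports an error if the graph contains a loop.

```python
from collections import deque

def topological_order(children):
    """Return nodes in an order where every parent precedes its children.

    `children` maps each node name to the list of its child nodes
    (a hypothetical layout chosen for this sketch).
    Raises ValueError if the graph contains a loop.
    """
    # Count how many parents each node is still waiting on.
    pending_parents = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending_parents[kid] = pending_parents.get(kid, 0) + 1

    # Nodes with no parents can go first.
    ready = deque(sorted(n for n, count in pending_parents.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in children.get(node, []):
            pending_parents[kid] -= 1
            if pending_parents[kid] == 0:
                ready.append(kid)

    # Any node left unordered is stuck behind a loop.
    if len(order) != len(pending_parents):
        raise ValueError("graph has a loop, so it is not a DAG")
    return order

diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(topological_order(diamond))  # ['A', 'B', 'C', 'D']
```

Passing a graph such as `{"A": ["B"], "B": ["A"]}` raises `ValueError`, because neither node can ever become ready.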

Page 5: Condor DAGMan: Introduction & Update

Defining a DAG

› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

› Each node will run the Condor or Grid job specified by its accompanying Condor submit file.

[Diagram: the diamond DAG. Job A is the parent of Job B and Job C, which are both parents of Job D.]
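To make the file format concrete, here is a small Python sketch that reads the two kinds of lines shown on this slide ("Job" and "Parent ... Child ..."). The function name `parse_dag` and its return layout are assumptions for illustration; it is not DAGMan's parser and it rejects every other keyword.

```python
def parse_dag(text):
    """Parse the 'Job' and 'Parent ... Child ...' lines shown on the slide.

    Returns (submit_files, children): node -> submit file, node -> child list.
    Other DAGMan keywords are deliberately not handled in this sketch.
    """
    submit_files = {}
    children = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            children.setdefault(name, [])
        elif keyword == "parent":
            # Everything before "Child" is a parent, everything after is a child.
            lowered = [w.lower() for w in words]
            split_at = lowered.index("child")
            parents, kids = words[1:split_at], words[split_at + 1:]
            for parent in parents:
                children.setdefault(parent, []).extend(kids)
        else:
            raise ValueError(f"unsupported line in this sketch: {line!r}")
    return submit_files, children

DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

submit_files, children = parse_dag(DIAMOND_DAG)
print(children)  # {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
```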

Page 6: Condor DAGMan: Introduction & Update

Submitting a DAG

› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

% condor_submit_dag diamond.dag

› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor… so DAGMan itself will be robust in case of failure, machine reboots, etc.

Page 7: Condor DAGMan: Introduction & Update

Running a DAG

› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.

[Diagram: the .dag file, DAGMan holding nodes A, B, C, and D, and the Condor job queue.]

Page 8: Condor DAGMan: Introduction & Update

Running a DAG (cont’d)

› DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[Diagram: DAGMan and the Condor job queue, with nodes A, B, C, and D.]
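The "meta-scheduler" idea can be sketched in a few lines of Python: hold every node back until all of its parents have completed successfully, then hand it over. This is an illustration under simplifying assumptions, not DAGMan's implementation. The names `run_dag` and `run_job` are hypothetical, and jobs run one after another here rather than through a real Condor queue.

```python
def run_dag(parents, run_job):
    """Submit each node once all of its parents have completed successfully.

    `parents` maps node -> list of parent nodes.
    `run_job(node)` is a hypothetical callback that returns True on success.
    Returns (done, failed) as sets of node names.
    """
    done, failed = set(), set()
    waiting = set(parents)
    while True:
        # A node is ready when every one of its parents finished successfully.
        ready = sorted(n for n in waiting if all(p in done for p in parents[n]))
        if not ready:
            break  # either everything ran, or the rest is blocked by a failure
        for node in ready:
            waiting.remove(node)
            if run_job(node):
                done.add(node)
            else:
                failed.add(node)
    return done, failed

diamond_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
submitted = []

def always_succeeds(node):
    submitted.append(node)  # record the order in which nodes were handed over
    return True

done, failed = run_dag(diamond_parents, always_succeeds)
print(submitted)  # ['A', 'B', 'C', 'D']
```

Job D is only handed over in the third pass, after both B and C are in `done`.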

Page 9: Condor DAGMan: Introduction & Update

Running a DAG (cont’d)

› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

[Diagram: DAGMan, the Condor job queue, and the rescue file, with nodes A, B, and D; one node is marked X.]
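The failure and recovery behaviour can be sketched as follows. The "rescue state" here is just a Python set of completed node names, a hypothetical stand-in and not DAGMan's actual rescue file format; `run_until_stuck` and `run_job` are also names invented for this example.

```python
def run_until_stuck(parents, run_job, already_done=()):
    """Run every node whose parents succeeded; stop when no progress is possible.

    `already_done` plays the role of the rescue state: nodes listed there
    are treated as complete and are not re-run.
    Returns (done, failed) as sets of node names.
    """
    done = set(already_done)
    failed = set()
    waiting = set(parents) - done
    progress = True
    while progress:
        progress = False
        for node in sorted(waiting):
            if all(p in done for p in parents[node]):
                waiting.remove(node)
                progress = True
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
    return done, failed

diamond_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First run: job C fails, B still runs, D can never start.
done, failed = run_until_stuck(diamond_parents, lambda node: node != "C")
print(sorted(done), sorted(failed))  # ['A', 'B'] ['C']

# Recovery: restart from the saved state; only C and D are run.
rerun = []
def fixed_job(node):
    rerun.append(node)
    return True

done_after, failed_after = run_until_stuck(diamond_parents, fixed_job, already_done=done)
print(rerun)  # ['C', 'D']
```

Nodes A and B are not repeated on the second run, which is the point of saving the state of the whole sequence rather than of individual jobs.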

Page 10: Condor DAGMan: Introduction & Update

Recovering a DAG

› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[Diagram: DAGMan, the rescue file, and the Condor job queue, with nodes A, B, C, and D.]

Page 11: Condor DAGMan: Introduction & Update

Finishing a DAG

› Once the DAG is complete, the DAGMan job itself is finished, and exits.

[Diagram: DAGMan and the Condor job queue, with nodes A, B, C, and D.]

Page 12: Condor DAGMan: Introduction & Update

Additional DAGMan Features

› Provides other knobs handy for job management…
• nodes can have PRE & POST scripts
• job submission can be “throttled”
• NEW: failed nodes can be automatically re-tried a configurable number of times

Page 13: Condor DAGMan: Introduction & Update

PRE & POST Scripts

› Executes locally on the submit host before or after job submission…

› Example:

# diamond.dag
PRE A prepare-A.sh
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
POST D double-check.sh
Parent A Child B C
Parent B C Child D

› PRE/POST scripts are part of the node.

[Diagram: the diamond DAG, with a PRE script attached to Job A and a POST script attached to Job D.]

Page 14: Condor DAGMan: Introduction & Update

DAG “Throttling”

› You can tell DAGMan to limit the maximum number of jobs it submits at any one time.
• condor_submit_dag -maxjobs N
• Useful for managing resource limitations (e.g., licenses)

› You can also limit the number of simultaneous PRE or POST scripts.
• Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
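The effect of a throttle can be illustrated with a short Python sketch that caps how many tasks are in flight at once. It uses a thread pool as a stand-in for job submission, which is an assumption of this example and not how DAGMan works internally; `run_throttled` and `max_jobs` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_throttled(jobs, max_jobs):
    """Run callables with at most `max_jobs` in flight; return the peak concurrency seen."""
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def tracked(job):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            job()
        finally:
            with lock:
                in_flight -= 1

    # The pool size is the throttle: extra jobs wait until a worker is free.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        list(pool.map(tracked, jobs))
    return peak

# Twenty short tasks, never more than three running at the same time.
jobs = [lambda: time.sleep(0.01) for _ in range(20)]
peak = run_throttled(jobs, max_jobs=3)
print(peak <= 3)  # True
```

Without the cap, all twenty tasks would start together, which is the situation the 7000 PRE scripts anecdote describes.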

Page 15: Condor DAGMan: Introduction & Update

Node RETRY

› Tells DAGMan to re-run a node multiple times if necessary…

› Example:

# diamond.dag
Job A a.sub
Job B b.sub
RETRY B 5
Job C c.sub
RETRY C 5
Job D d.sub
Parent A Child B C
Parent B C Child D

[Diagram: the diamond DAG. Job A is the parent of Job B and Job C, which are both parents of Job D.]
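A retry policy like this can be sketched in Python as a simple loop. The sketch assumes that "RETRY B 5" means up to five re-runs after the first attempt; `run_with_retry` and `run_job` are hypothetical names, not part of DAGMan.

```python
def run_with_retry(run_job, retries):
    """Run `run_job` once, then up to `retries` more times until it succeeds.

    Assumption of this sketch: `retries` counts re-runs after the first attempt.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        if run_job():
            return True, attempts
    return False, attempts

# A flaky job that fails twice and then succeeds.
outcomes = iter([False, False, True])
result = run_with_retry(lambda: next(outcomes), retries=5)
print(result)  # (True, 3)
```

A job that never succeeds would return `(False, 6)` under the same assumption, after which the node counts as failed.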

Page 16: Condor DAGMan: Introduction & Update

DAGMan Progress

› Testing… lots of testing.
• 10,000+ node DAGs run smoothly
• Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
• Lots of bugs fixed

Page 17: Condor DAGMan: Introduction & Update

DAGMan Progress (cont’d)

› New features
• Improved logging (timestamps, etc.)
• More efficient recovery
• Node RETRY capability
• DAG info in condor_q (with -dag flag)
• Robust in more failure cases
• Recursive DAGs for conditional execution

› DAGMan for Windows (Ray Pingree)

Page 18: Condor DAGMan: Introduction & Update

DAGMan Success

› DAGMan is becoming part of the common framework for running on the grid.
• Particle Physics Data Grid (PPDG)
• Grid Physics Network (GriPhyN)
• Many Supercomputing 2001 demos
• more…

Page 19: Condor DAGMan: Introduction & Update

DAGMan in the GriPhyN Architecture

[Diagram: GriPhyN architecture. Components: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Labels: DAG; DAGMAN, Kangaroo; GRAM; GridFTP; SRM; GSI, CAS; MDS; MCAT; GriPhyN catalogs; GDMP; Globus.]

diagram by Ian Foster (Argonne)

Page 20: Condor DAGMan: Introduction & Update

DAGMan in PPDG Tools

diagram by Jim Amundson (Fermilab)

Page 21: Condor DAGMan: Introduction & Update

What’s Next?

› More flexible control of node execution
• Currently implicit: “all my parents returned 0”.
• Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”?

› 1st step: represent DAG nodes internally as ClassAds
• Allows DAGMan to decide when to run nodes based on arbitrary requirements
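The three conditions quoted above can be written as small predicates over the parents' results. The sketch uses plain Python functions as a stand-in for the proposed ClassAd requirements; it is not ClassAd syntax, and `ParentResult`, its fields, and the rule names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ParentResult:
    exit_code: int        # value the parent job returned
    runtime_hours: float  # how long the parent job ran

def default_rule(results):
    """Today's implicit rule: all parents returned 0."""
    return all(r.exit_code == 0 for r in results.values())

def long_running_rule(results):
    """All parents returned 0 AND ran for more than two hours."""
    return all(r.exit_code == 0 and r.runtime_hours > 2 for r in results.values())

def mixed_rule(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"].exit_code == 0 and results["B"].exit_code == 42

results = {"A": ParentResult(0, 3.5), "B": ParentResult(42, 0.2)}
print(default_rule(results))       # False: parent B did not return 0
print(long_running_rule(results))  # False
print(mixed_rule(results))         # True
```

With per-node requirements like these, the same parent outcomes can release one child and hold back another.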

Page 22: Condor DAGMan: Introduction & Update

What’s Next? (cont’d)

› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.

[Diagram: DAGMan, Condor-G, Condor, and the DaP Scheduler.]

Page 23: Condor DAGMan: Introduction & Update

Thank You!

› Interested in seeing more?

Come to the DAGMan BoF
• Wednesday 9am - noon
• Room 3393, Computer Sciences (1210 W. Dayton St.)

Email us:
• [email protected]

Try it!
• http://www.cs.wisc.edu/condor
