Rochester Institute of Technology
Job Submission
Andrew Pangborn & Myles Maxfield
01/19/09 Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu
The Grid
• Virtual organizations spanning multiple administrative domains
  – Different organizations and administrators
  – Different hardware
  – Different queuing systems
• How do we make sense of it all?
The Problem
• At one end are computing resources (the grid fabric) managed by batch queuing systems and middleware
• At the other end are end users and their jobs/applications
• We need software and protocols for submitting jobs to the computing resources
• We also want to monitor jobs after submission and schedule them efficiently to achieve high throughput
Grid Architecture
[Figure: grid architecture diagram, labeled "Job Submission". Image from Ian Foster paper (The Anatomy of the Grid)]
Batch Queuing Systems
• Submitting a job directly to the batch queuing system
• One or more queues
  – Priorities
• Two common architectures
  – Client/server
  – Dynamic offloading
• User credential (delegation)
• Jobs have states (e.g. Pending, Running)
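The ideas of a prioritized queue and per-job states can be sketched in a few lines of Python. This is a toy model, not how any particular batch system is implemented: the `BatchQueue` class, the "higher number runs first" rule, and the extra `Done` state are assumptions made for illustration.

```python
import heapq
from enum import Enum
from itertools import count


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    DONE = "Done"        # assumed extra state, not listed on the slide


class BatchQueue:
    """Toy single-queue scheduler: higher priority runs first, ties are FIFO."""

    def __init__(self):
        self._heap = []          # entries are (-priority, sequence, job name)
        self._seq = count()      # tie-breaker that preserves submission order
        self.states = {}         # job name -> JobState

    def submit(self, name, priority=0):
        heapq.heappush(self._heap, (-priority, next(self._seq), name))
        self.states[name] = JobState.PENDING

    def start_next(self):
        """Move the best pending job to Running and return its name."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        self.states[name] = JobState.RUNNING
        return name

    def finish(self, name):
        self.states[name] = JobState.DONE


queue = BatchQueue()
queue.submit("low", priority=1)
queue.submit("high", priority=5)
first = queue.start_next()
print(first, queue.states["low"].value)
```

Real systems add multiple queues, resource requests and credentials on top of this basic cycle.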
Batch Queuing Systems
• Important examples:
  – Portable Batch System
  – TORQUE
  – Xgrid
  – Sun Grid Engine
  – Load Sharing Facility
  – Condor
Portable Batch System (PBS)
• Originally developed for NASA
• Client/server architecture
  – Server: pbs_server
  – Client: pbs_mom (the execution daemon on each compute node)
• Works with MPI through built-in shell script variables (e.g. PBS_NODEFILE)
PBS Example
litherum@gras:~$ cat test.sh
#!/bin/sh
#testpbs
echo This is a test
echo today is `date`
echo This is `hostname`
echo The current working directory is `pwd`
ls -alF /home
uptime
PBS Example
litherum@gras:~$ qsub test.sh
6.gras.carrion.rit.edu
litherum@gras:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gras                    test.sh          litherum        00:00:00 C batch
litherum@gras:~$ cat test.sh.o6
This is a test
today is Sat Jan 17 18:20:20 EST 2009
This is carrion02
The current working directory is /home/litherum
total 20
drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
 18:20:20 up 131 days, 21:20,  0 users,  load average: 0.00, 0.00, 0.00
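For monitoring from a script, the `qstat` table above can be turned into records. The sketch below is a minimal Python example that assumes the default six-column layout shown in the session and job names without spaces; `parse_qstat` is a hypothetical helper, not part of PBS.

```python
QSTAT_OUTPUT = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gras                    test.sh          litherum        00:00:00 C batch
"""


def parse_qstat(text):
    """Return one dict per job row in default `qstat` output."""
    jobs = []
    for line in text.splitlines()[2:]:      # skip the header and the dashes
        fields = line.split()
        if len(fields) != 6:
            continue                        # ignore blank or unexpected lines
        job_id, name, user, time_use, state, queue = fields
        jobs.append({"id": job_id, "name": name, "user": user,
                     "time_use": time_use, "state": state, "queue": queue})
    return jobs


jobs = parse_qstat(QSTAT_OUTPUT)
print(jobs[0]["id"], jobs[0]["state"])
```

In the session above the state column reads `C`, which TORQUE uses for a job that has completed.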
Torque
• Built on top of PBS
• Supports reservations, where you can reserve specific resources for specific times.
• Supports partitions, where you can partition a cluster into smaller sub-clusters.
Torque
litherum@gras:~$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

     0 Active Jobs       0 of    4 Processors Active (0.00%)
                         0 of    2 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0
Xgrid
• Apple
• Essentially the same as Condor
• GUI! =)
• Client/server model
Xgrid image from: http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg
Sun Grid Engine
• Open source, like everything new Sun puts out
• Supports
  – Reservations
  – Job dependencies
  – Checkpointing
  – Multiple scheduling algorithms
  – Web interface
• Professional!
Middleware
• These queuing systems are hard to use
• There may be many systems employed in a given grid
• Wouldn’t it be nice if all this were unified in a single implementation?
• Middleware that handles job submission in a virtual organization across resources spread throughout multiple administrative domains would be useful!
Condor
• A tool for pooling and “scavenging” computing resources and distributing jobs
• Similar to a batch queuing system [2]
  – job management
  – scheduling policy
  – priority scheme
  – resource monitoring
  – resource management
• Also focuses on high throughput and “opportunistic computing” [2]
  – Utilize computing resources whenever they are available
Condor image from: http://www.cs.wisc.edu/condor/
Condor Universes [1]
• Standard
  – Checkpointing, fault tolerance
  – Link job against Condor libraries
• Vanilla
  – Simpler, can run universal binaries (do not need to be “condor compiled”)
  – No support for partial execution or job relocation
• Others
  – PVM
  – MPI
  – Java
Condor Submission File Example [1]
#hello.sub
#condor job file example
Universe = Vanilla
Executable = hello
Output = hello.out
Input = hello.in
Error = hello.err
Log = hello.log
Queue
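When many similar jobs are needed, a submit file like hello.sub can be generated rather than written by hand. The Python sketch below is one possible way to do that; `render_submit_file` and its naming scheme (deriving the `.out`, `.in`, `.err` and `.log` names from the executable) are assumptions for illustration, not a Condor tool.

```python
def render_submit_file(executable, universe="Vanilla", count=1):
    """Build the text of a submit description like the hello.sub example."""
    lines = [
        f"Universe = {universe}",
        f"Executable = {executable}",
        f"Output = {executable}.out",
        f"Input = {executable}.in",
        f"Error = {executable}.err",
        f"Log = {executable}.log",
        # "Queue" alone queues one job; "Queue N" queues N copies.
        "Queue" if count == 1 else f"Queue {count}",
    ]
    return "\n".join(lines) + "\n"


text = render_submit_file("hello")
print(text)
```

The resulting text would be saved to a file and passed to condor_submit.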
Some Condor Commands [5]
• condor_submit <job_file.sub>
  – Submit a Condor job
• condor_q
  – View the Condor job queue
• condor_status
  – Check the status of the machines (resources) in the Condor pool
• condor_compile
  – Relinks jobs for use in the standard universe
Condor job structures
Master-Worker
• Single master process coordinates all the independent tasks
• Collects results as workers finish, distributes new jobs to workers
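The master-worker pattern can be illustrated locally with Python's standard thread pool. This is only a sketch of the pattern: local threads stand in for remote Condor workers, and the `work` function (squaring a number) is a placeholder for a real task.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(task):
    """Stand-in for an independent unit of work (here: squaring a number)."""
    return task * task


def master(tasks, workers=2):
    """Hand tasks to a pool of workers and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(work, t): t for t in tasks}
        for done in as_completed(futures):
            results[futures[done]] = done.result()
    return results


results = master([1, 2, 3, 4])
print(sorted(results.items()))
```

The pool plays the role the slide assigns to the master: it gives an idle worker the next task and gathers each result as soon as it is ready.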
DAG (Directed Acyclic Graph)
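In a DAG workflow, a job may start only after all the jobs it depends on have finished. The sketch below uses Python's standard `graphlib` module to compute one valid run order. The job names are hypothetical, and this is not Condor's own DAG tooling (DAGMan), only an illustration of dependency ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key lists the jobs it depends on.
dependencies = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "merge": {"simulate_a", "simulate_b"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two simulate jobs have no dependency on each other, so they may appear in either order and could run in parallel.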
Programming models for larger-scale jobs using the Condor agent
GRAM [4]
• Globus Resource Allocation Manager (GRAM)
  – Resource allocation
  – Process creation
  – Monitoring
  – Management
  – Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers.
GRAM
• Pluggable!
• Can’t make up their mind how to describe jobs
• Will submit jobs to:
  – Condor
  – LSF
  – PBS/Torque
  – ???
• Unified interface, identifier for which cluster/service to use
• Job submission file
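The "unified interface plus an identifier" idea can be sketched as a lookup from scheduler name to the local submit command. This is a deliberately simplified, hypothetical adapter and does not reflect GRAM's actual implementation. It covers only the two submit commands that appear elsewhere in these slides (`qsub` and `condor_submit`).

```python
# Hypothetical lookup table: scheduler identifier -> local submit command.
SUBMIT_COMMANDS = {
    "PBS": "qsub",
    "Condor": "condor_submit",
}


def build_submit_command(scheduler, job_file):
    """Translate a (scheduler, job file) pair into a local command line."""
    try:
        program = SUBMIT_COMMANDS[scheduler]
    except KeyError:
        raise ValueError(f"no adapter for scheduler {scheduler!r}") from None
    return [program, job_file]


print(build_submit_command("PBS", "test.sh"))
print(build_submit_command("Condor", "hello.sub"))
```

Supporting another scheduler then means adding one adapter entry, which is the sense in which such a design is pluggable.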
GRAM Example
maxfield@tg-login1:~> globusrun-ws -submit -factory https://tg-login.ornl.teragrid.org:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
Termination time: 01/18/2009 23:57 GMT
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
GRAM Input Example
<job>
    <executable>/bin/echo</executable>
    <argument>this is an example string </argument>
    <argument>Globus was here</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
</job>
http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
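A job description like the one above can also be produced programmatically. The Python sketch below rebuilds the same elements with the standard `xml.etree.ElementTree` module. `build_job_description` is a hypothetical helper, and it emits only the elements shown in the example, not the full GRAM schema.

```python
import xml.etree.ElementTree as ET


def build_job_description(executable, arguments, stdout, stderr):
    """Return a <job> document with the same elements as the example above."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    for arg in arguments:
        ET.SubElement(job, "argument").text = arg
    ET.SubElement(job, "stdout").text = stdout
    ET.SubElement(job, "stderr").text = stderr
    return ET.tostring(job, encoding="unicode")


xml_text = build_job_description(
    "/bin/echo",
    ["this is an example string ", "Globus was here"],
    "${GLOBUS_USER_HOME}/stdout",
    "${GLOBUS_USER_HOME}/stderr",
)
print(xml_text)
```

Generating the XML this way leaves escaping of special characters in arguments to the library.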
Condor-G [4]
• Condor-G is a Globus-enabled version of the Condor scheduler.
• It uses Globus to handle inter-organizational problems like:
  – Security
  – Resource management for supercomputers
  – Executable staging
• The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
• It communicates with these resources and transfers files to and from these resources using Globus mechanisms, such as:
  – GSI for security
  – GRAM protocol for job submission
  – GASS for file transfer
• Condor-G can be used to submit jobs to systems managed by Globus.
• Globus tools can be used to submit jobs to systems managed by Condor.
Condor-G
Using Condor-G
• Set universe = globus in the Condor submit file
• Also need to specify the Globus scheduler hostname, for example:
    globusscheduler = example.org/jobmanager
• Still use the condor_submit command
• TeraGrid Condor-G example here:
  – http://www.teragrid.org/userinfo/jobs/condorg.php
UNICORE
• Alternative to Globus
• Primarily used in Europe
• Uses web services, similar to GT4
• GUI
• Abstract Job Objects
• User -> Server -> Virtual Site
• X.509 and SSL
UNICORE GUI
Upperware
• Abstract Job Objects? Workflows? What is all this nonsense?!
• Scientist (primary user) doesn’t care about this stuff
• Shouldn’t have to deal with writing XML description files or creating a complicated workflow
• Simply let them run their program
GridShell
• Unified command line interface
• Defer to resident experts
References
1. http://www.linuxjournal.com/node/9058/print - Getting started with Condor
2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience.
3. http://grid.rit.edu/seminar/lib/exe/fetch.php/users:jeremy_espenshade:condorjobsubmission.ppt
4. http://iag.iucc.ac.il/presentations/front2.ppt
5. http://www.cs.wisc.edu/condor/manual/v7.2/
6. http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
7. http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg
8. Wikipedia
9. http://www.isgtw.org/images/Rudolph_expert_client_screenshot2.jpg
10. http://upload.wikimedia.org/wikipedia/commons/a/a4/Double_curvature_steel_lattice_Shell_by_Shukhov_in_Vyksa_1897_shell.jpg