htcondor architecture htcondor week 2020 · 2020. 5. 19. · take three years to complete my...
![Page 1: HTCondor Architecture HTCondor Week 2020 · 2020. 5. 19. · take three years to complete my analysis, and I want to submit a paper in three ... computers for three days to make a](https://reader036.vdocuments.site/reader036/viewer/2022062612/61446d32b5d1170afb43ddb6/html5/thumbnails/1.jpg)
HTCondor Architecture
HTCondor Week 2020
Todd Tannenbaum
Center for High Throughput Computing
Start with People
People have Problems
“My laptop will take three years to complete my analysis, and I want to submit a paper in three weeks”
“1,000x more compute could revolutionize my field”
“Some of my jobs need a lot of memory, others a lot of cores”
“We pay a lot of money for research computing. I want these computers always busy, helping research”
“If Physics invests twice what Chemistry does in computers, they should get 2x the computing”
“If an important group needs all the computers for three days to make a paper deadline, I’m ok with that”
Constraints
Constraints
HTCondor manages these constraints
Not even that easy
In the real world: many users, many resource providers
This is a distributed problem.
Distributed because of *people*, not because of machines.
Our goal is to satisfy all these constraints.
The Philosophy on 1 slide
To reliably run as much work as possible on as many machines as possible, subject to all constraints
The other side: administrator’s
To maximize machine utilization *subject to constraints*
High Throughput is also High Utilization Computing!
computing
The Unstated Assumption
“Work” can be broken up into smaller jobs
• Smaller the better (up to a point)
• Files as IPC
• Any interdependencies via DAGs
Optimize time-to-finish, not time-to-run
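The “smaller the better (up to a point)” trade-off can be illustrated with a toy model. This sketch assumes equal-sized jobs, a fixed per-job overhead, and jobs that run in waves across identical machines. The function name, pool size, and overhead value are hypothetical and are not taken from HTCondor.

```python
import math

def time_to_finish(total_hours, n_jobs, n_machines, overhead_hours):
    """Wall-clock hours until the last job ends (toy model).

    Assumes equal-sized jobs run in waves on identical machines,
    and every job pays the same fixed startup/transfer overhead.
    """
    time_to_run = total_hours / n_jobs + overhead_hours  # one job
    waves = math.ceil(n_jobs / n_machines)               # rounds needed
    return waves * time_to_run

# Three years of laptop time, as in the user quote earlier in the talk.
work = 3 * 365 * 24   # hours
machines = 1000       # hypothetical pool size
overhead = 0.05       # hypothetical: 3 minutes of overhead per job

for n in (1, 100, 1000, 1_000_000):
    print(n, round(time_to_finish(work, n, machines, overhead), 1))
```

In this model, splitting the work shortens time-to-finish until every machine is busy; past that point the per-job overhead starts to dominate, even though each job's time-to-run keeps shrinking.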
Overview of Condor: 3 sides
Submit
Execute
Central Manager
We are going to fill in the boxes!
Submit Machine
Execute Machine
Central Manager
ClassAds: The lingua franca of HTCondor
What are ClassAds?
ClassAds is a language for objects (jobs and machines) to
• Express attributes about themselves
• Express what they require/desire in a “match” (similar to personal classified ads)
Structure: Set of attribute name/value pairs, where the value can be a literal or an expression. Semi-structured, no fixed schema.
ClassAd Values
› Literals
• Strings ( “RedHat6” ), integers, floats, boolean (true/false), …
› Expressions
• Similar look to C/C++ or Java: operators, references, functions
• References: to other attributes in the same ad, or attributes in an ad that is a candidate for a match
• Operators: +, -, *, /, <, <=, >, >=, ==, !=, &&, and || all work as expected
• Built-in Functions: if/then/else, string manipulation, regular expression pattern matching, list operations, dates, randomization, math (ceil, floor, quantize, …), time functions, eval, …
Simple Example
Job Ad
Type = "Job"
Requirements = HasMatlabLicense == True && Memory >= 1024
Rank = kflops + 1000000 * Memory
Cmd = "/bin/sleep"
Args = "3600"
Owner = "gthain"
NumJobStarts = 8
KindOfJob = "simulation"
Department = "Math"
Machine Ad
Type = "Machine"
Cpus = 40
Memory = 2048
Requirements = (Owner == "gthain") || (KindOfJob == "simulation")
Rank = Department == "Math"
HasMatlabLicense = true
MaxTries = 4
kflops = 41403
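To see how these two ads evaluate against each other, here is a minimal sketch in plain Python. It is a toy model, not the HTCondor ClassAd library: each ad is a dict, each expression is a Python callable, and the helper names `ref` and `ads_match` are hypothetical. It also assumes a bare attribute name is looked up in the local ad first and then in the candidate ad, and it does not model ClassAd "undefined" semantics.

```python
# Toy model: each ad is a dict; expressions are Python callables that
# receive (my, target). This is NOT the HTCondor ClassAd library.

def ref(name, my, target):
    """Bare attribute reference: local ad first, then the candidate ad."""
    return my[name] if name in my else target.get(name)

job_ad = {
    "Type": "Job",
    "Requirements": lambda my, t: ref("HasMatlabLicense", my, t) is True
    and (ref("Memory", my, t) or 0) >= 1024,
    "Rank": lambda my, t: ref("kflops", my, t) + 1000000 * ref("Memory", my, t),
    "Cmd": "/bin/sleep",
    "Args": "3600",
    "Owner": "gthain",
    "NumJobStarts": 8,
    "KindOfJob": "simulation",
    "Department": "Math",
}

machine_ad = {
    "Type": "Machine",
    "Cpus": 40,
    "Memory": 2048,
    "Requirements": lambda my, t: ref("Owner", my, t) == "gthain"
    or ref("KindOfJob", my, t) == "simulation",
    "Rank": lambda my, t: ref("Department", my, t) == "Math",
    "HasMatlabLicense": True,
    "MaxTries": 4,
    "kflops": 41403,
}

def ads_match(a, b):
    """Two ads match only if both Requirements evaluate to True."""
    return a["Requirements"](a, b) is True and b["Requirements"](b, a) is True

print(ads_match(job_ad, machine_ad))                  # do the ads match?
print(job_ad["Rank"](job_ad, machine_ad))             # job's rank of this machine
print(float(machine_ad["Rank"](machine_ad, job_ad)))  # machine's rank of this job
```

Note that neither `HasMatlabLicense`, `Memory`, nor `kflops` appears in the job ad, so the job's expressions resolve them from the machine ad; likewise the machine's `Owner`, `KindOfJob`, and `Department` references resolve from the job ad.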
The Magic of Matchmaking
› Two ClassAds can be matched via special attributes: Requirements and Rank
› Two ads match if both their Requirements expressions evaluate to True
› Rank evaluates to a float where higher is preferred; it specifies which match is desired if several ads meet the Requirements.
› Scoping of attribute references when matching
• MY.name – Value for attribute “name” in local ClassAd
• TARGET.name – Value for attribute “name” in match candidate ClassAd
• name – Looks for “name” in the local ClassAd, then the candidate ClassAd
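The scoping rules and the role of Rank can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not HTCondor code: `resolve`, `best_match`, and the three machines are hypothetical, ads are plain dicts, and expressions are callables. Real negotiation also weighs factors this sketch ignores, such as user priorities.

```python
def resolve(ref, my, target):
    """Resolve MY.name, TARGET.name, or a bare name (local ad, then candidate)."""
    scope, _, name = ref.rpartition(".")
    if scope.upper() == "MY":
        return my.get(name)
    if scope.upper() == "TARGET":
        return target.get(name)
    return my[ref] if ref in my else target.get(ref)

def best_match(ad, candidates):
    """Return the candidate with the highest Rank among mutual matches."""
    matches = [
        c for c in candidates
        if ad["Requirements"](ad, c) is True and c["Requirements"](c, ad) is True
    ]
    if not matches:
        return None
    return max(matches, key=lambda c: float(ad["Rank"](ad, c)))

job = {
    "Owner": "gthain",
    "Requirements": lambda my, t: resolve("TARGET.Memory", my, t) >= 1024,
    "Rank": lambda my, t: resolve("TARGET.kflops", my, t),
}

def machine(name, memory, kflops):
    # Hypothetical machines; each accepts only jobs owned by "gthain".
    return {
        "Name": name,
        "Memory": memory,
        "kflops": kflops,
        "Requirements": lambda my, t: resolve("TARGET.Owner", my, t) == "gthain",
    }

pool = [
    machine("slow", 2048, 10000),
    machine("fast", 2048, 41403),
    machine("small", 512, 99999),
]
chosen = best_match(job, pool)
print(chosen["Name"])
```

The machine named "small" has the highest kflops but fails the job's Requirements, so Rank is only used to choose among the two machines that remain.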
ClassAd Types
› HTCondor has many types of ClassAds
• A "Job Ad" represents a job to Condor
• A "Machine Ad" represents a computing resource
• Other types of ads represent instances of other services (daemons), users, accounting records.
Architecture & Job Startup
Quick Review of Daemons
condor_master: runs on all machines, always (plus a condor_procd and a condor_shared_port)
condor_schedd: runs on submit machine
condor_startd: runs on execute machine
condor_negotiator, condor_collector: run on central manager
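The role-to-daemon mapping above can be written down as a small lookup table. This is only an illustrative sketch in Python: the daemon names come from the slide, while the role keys and the helper `daemons_for` are hypothetical and are not part of any HTCondor tool or configuration.

```python
# Daemon names are from the slide; role keys and helper are hypothetical.
ALWAYS = ["condor_master", "condor_procd", "condor_shared_port"]
BY_ROLE = {
    "submit": ["condor_schedd"],
    "execute": ["condor_startd"],
    "central_manager": ["condor_collector", "condor_negotiator"],
}

def daemons_for(*roles):
    """List the daemons expected on a machine that plays the given roles."""
    daemons = list(ALWAYS)
    for role in roles:
        daemons.extend(BY_ROLE[role])
    return daemons

print(daemons_for("submit"))
print(daemons_for("submit", "execute"))  # one machine can play both roles
```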
Submit Machine Process View
[Diagram: process tree. condor_master (pid: 1740) with condor_procd, condor_shared_port, and condor_schedd; condor_schedd with three condor_shadow processes; edges labeled fork/exec]
Tools: condor_submit, condor_q, condor_rm, condor_hold, …
Execute Machine Process View
[Diagram: process tree. condor_master (pid: 1740) with condor_procd, condor_shared_port, and condor_startd; condor_startd with three condor_starter processes, each with a Job; edges labeled fork/exec]
Central Manager Process View
[Diagram: process tree. condor_master (pid: 1740) with condor_collector, condor_negotiator, condor_procd, and condor_shared_port; edges labeled fork/exec]
Claiming Protocol
[Diagram: Submit Machine (Schedd), Central Manager (Collector, Negotiator), and Execute Machine (Startd) exchanging messages labeled Submit, J, S, Q, and CLAIM]
Claim Activation
[Diagram: Submit Machine (Schedd, Shadow), Execute Machine (Startd, Starter, Job), Central Manager (Collector, Negotiator). Labels: CLAIMED, Activate Claim]
Repeat until Claim released
[Diagram: Submit Machine (Schedd, Shadow), Execute Machine (Startd, Starter, Job), Central Manager (Collector, Negotiator). Labels: CLAIMED, Activate Claim]
When is claim released?
› When relinquished by one of the following
• lease on the claim is not renewed
  Why? Machine powered off, disappeared, etc.
• schedd
  Why? Out of jobs, shutting down, schedd didn’t “like” the machine, etc.
• startd
  Why? Policy re CLAIM_WORKLIFE, prefers a different match (via Rank), non-dedicated desktop, etc.
• negotiator
  Why? User priority inversion policy
• explicitly via a command-line tool
  E.g. condor_vacate
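The claim lifecycle (claim once, activate repeatedly, release for one of the reasons listed) can be sketched as a small state model. This is a toy under loud assumptions: the class and function names, the lease length, and the fixed job duration are all hypothetical, and none of this reflects HTCondor's actual implementation or wire protocol. Only the release reasons are taken from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Released(Enum):
    LEASE_EXPIRED = "lease on the claim was not renewed"
    BY_SCHEDD = "schedd gave up the claim"
    BY_STARTD = "startd gave up the claim"
    BY_NEGOTIATOR = "negotiator preempted the claim"
    BY_TOOL = "released explicitly via a command-line tool"

@dataclass
class Claim:
    lease_seconds: float
    last_renewal: float = 0.0
    released: Optional[Released] = None
    jobs_run: List[str] = field(default_factory=list)

    def renew(self, now):
        if self.released is None:
            self.last_renewal = now

    def release(self, reason):
        if self.released is None:  # the first reason wins
            self.released = reason

    def activate(self, job, now):
        """Start one job under this claim; returns False once the claim is gone."""
        if now - self.last_renewal > self.lease_seconds:
            self.release(Released.LEASE_EXPIRED)
        if self.released is not None:
            return False
        self.jobs_run.append(job)
        return True

def run_queue(claim, queue, job_seconds=10.0):
    """Reuse one claim for job after job, with no new match in between."""
    now = 0.0
    for job in queue:
        claim.renew(now)
        if not claim.activate(job, now):
            break
        now += job_seconds
    claim.release(Released.BY_SCHEDD)  # out of jobs; no-op if already released
    return claim

claim = run_queue(Claim(lease_seconds=60.0), ["job1", "job2", "job3"])
print(claim.jobs_run, claim.released.name)
```

The point of the loop is that the central manager is not involved after the initial match: the same claim serves several jobs until one side lets it go.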
Architecture items to note
› Machines (startds) or submitters (schedds) can dynamically appear and disappear
• Key for expanding a pool into clouds or grids
• Key for backfilling HPC resources
› Scheduling policy can be very flexible (custom attributes) and very distributed
› Central manager just makes a match, then gets out of the way
› Distributed policy enables federation of resources across different organizations (administrative domains)
• Lots of network arrows on previous slides
• Reflects the P2P nature of HTCondor
Layout of a General Condor Pool
Central Manager: master, collector, negotiator
Submit-Only (two shown): master, schedd
Execute-Only (two shown): master, startd
Both!: master, schedd, startd
[Diagram legend: ClassAd Communication Pathway; Process Spawned]
Thank You!