![Page 1: CamGrid Mark Calleja Cambridge eScience Centre. What is it? A number of like minded groups and departments (10), each running their own Condor pool(s),](https://reader035.vdocuments.site/reader035/viewer/2022062417/5515edba55034638038b5188/html5/thumbnails/1.jpg)
CamGrid
Mark Calleja
Cambridge eScience Centre
What is it?
• A number of like-minded groups and departments (10), each running their own Condor pool(s), which federate their resources (12).
• Coordinated by the Cambridge eScience Centre (CeSC), but no overall control.
• Been running now for ~2.5 years, 70+ users.
• Currently have ~950 processors/cores available.
• “All” Linux (various), mostly x86_64, running 24/7.
• Mostly Dell PowerEdge 1950 (like HPCF), four cores with 8GB.
• Around 2M CPU hours to date.
Some details
• Pools run the latest stable version of Condor (currently 6.8.6).
• All machines get an (extra) IP address in a CUDN-only routable range for Condor.
• Each pool sets its own policies, but these must be visible to other users of CamGrid.
• Currently we see vanilla, standard and parallel (MPI) universe jobs.
• Users get accounts on a machine in their local pool; jobs are then distributed around the grid by Condor using its flocking mechanism.
• MPI jobs on single SMP machines have proved very useful.
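The flocking behaviour described above can be pictured with a toy model: a submit host offers its idle jobs to the local pool first, then to each remote pool on its flock list, in order. This is a minimal sketch of that ordering only, not of Condor's matchmaking. The pool names, slot counts and the `place_jobs` helper are hypothetical; in a real deployment the flock list lives in Condor's configuration (the `FLOCK_TO` and `FLOCK_FROM` macros) and each pool's own policy decides what actually runs.

```python
from typing import Dict, List


def place_jobs(n_jobs: int, local_pool: str, flock_to: List[str],
               free_slots: Dict[str, int]) -> Dict[str, int]:
    """Toy model: fill the local pool first, then flock to remote pools in order.

    Returns how many jobs land in each pool; jobs that fit nowhere are
    counted under the key 'idle'.
    """
    placement: Dict[str, int] = {}
    remaining = n_jobs
    for pool in [local_pool] + flock_to:
        if remaining == 0:
            break
        taken = min(remaining, free_slots.get(pool, 0))
        if taken:
            placement[pool] = taken
            remaining -= taken
    if remaining:
        placement["idle"] = remaining
    return placement


if __name__ == "__main__":
    # Hypothetical pools and free slot counts, for illustration only.
    slots = {"chem": 4, "physics": 10, "escience": 3}
    print(place_jobs(12, "chem", ["escience", "physics"], slots))
```

The order of the flock list matters in this model: pools listed earlier are tried first, so a user's jobs only spill further afield when nearer pools are full.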
NTE of Ag3[Co(CN)6] with SMP/MPI sweep
Monitoring Tools
• A number of web based tools provided to monitor the state of the grid and of jobs.
• CamGrid is based on trust, so must make sure that machines are fairly configured.
• The university gave us £450k (~$950k) to buy new hardware; need to ensure that it’s online as promised.
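One way to picture the "online as promised" check is a comparison of pledged cores against cores currently visible per pool. The sketch below is an assumption about how such a check could look, not a description of CamGrid's actual tools. The pool names and numbers are made up, and `find_shortfalls` is a hypothetical helper; in practice the observed counts might be gathered from something like `condor_status` output for each pool.

```python
from typing import Dict, List, Tuple


def find_shortfalls(promised: Dict[str, int],
                    online: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Return (pool, promised_cores, online_cores) for pools below their pledge."""
    shortfalls = []
    for pool, pledged in sorted(promised.items()):
        seen = online.get(pool, 0)  # a pool that reports nothing counts as 0
        if seen < pledged:
            shortfalls.append((pool, pledged, seen))
    return shortfalls


if __name__ == "__main__":
    # Hypothetical pledges and observations, for illustration only.
    pledged = {"chem": 64, "physics": 128, "escience": 32}
    observed = {"chem": 64, "physics": 96}
    for pool, want, have in find_shortfalls(pledged, observed):
        print(f"{pool}: promised {want} cores, {have} online")
```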
CamGrid’s file viewer
• Standard universe uses RPCs to echo I/O operations back to submit host.
• What about other universes? How can I check the health of my long running simulation?
• We’ve provided our own facility, which involves an agent installed on each execute node and accessed via a web interface.
• Works with vanilla and parallel (MPI) jobs.
• Requires local sysadmins to install and run it.
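The agent idea can be sketched as a tiny HTTP service on the execute node that returns the tail of a file from a job's working area. This is only an illustration of the concept under stated assumptions, not CamGrid's implementation: the base directory, the port, the `VIEWER_BASE_DIR` variable and the helper names are all hypothetical, and a real agent would also need authentication and a way to map jobs to their directories.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional
from urllib.parse import unquote, urlparse

# Hypothetical location of job scratch directories on the execute node.
BASE_DIR = os.environ.get("VIEWER_BASE_DIR", "/tmp")
TAIL_BYTES = 4096  # only the end of the file is returned


def resolve_inside(base: str, relative: str) -> Optional[str]:
    """Map a request path to a file under base, refusing anything outside it."""
    base_real = os.path.realpath(base)
    candidate = os.path.realpath(os.path.join(base_real, relative.lstrip("/")))
    if os.path.commonpath([base_real, candidate]) != base_real:
        return None
    return candidate


def tail(path: str, max_bytes: int = TAIL_BYTES) -> bytes:
    """Return at most the last max_bytes bytes of a file."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - max_bytes))
        return handle.read()


class TailHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        target = resolve_inside(BASE_DIR, unquote(urlparse(self.path).path))
        if target is None or not os.path.isfile(target):
            self.send_error(404, "No such file")
            return
        body = tail(target)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice; no authentication is done here.
    HTTPServer(("", 8080), TailHandler).serve_forever()
```

Returning only the tail keeps responses small for long-running simulations, and resolving every request against a fixed base directory stops a request such as `/../etc/passwd` from reaching files outside the job area.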
Checkpointable vanilla universe
• Standard universe is fine, if you can link to Condor’s libraries (Pete Keller – “getting harder”).
• Investigating using BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for Linux.
• Uses kernel resources, and can thus restore resources that user-level libraries cannot.
• Supported by some flavours of MPI (late LAM, OpenMPI).
• The idea was to use Parrot’s user-space FS to wrap a vanilla job and save the job’s state on a Chirp server.
• However, currently Parrot breaks some BLCR functionality.
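The control flow of a checkpointable vanilla job can be sketched as a wrapper that either starts the program fresh under BLCR or resumes it from a saved context file. The sketch below only builds the command lines and leaves out the Parrot/Chirp part entirely. The helper names and file names are hypothetical, and the command names and the `-f` flag are written from memory of BLCR's user tools (`cr_run`, `cr_checkpoint`, `cr_restart`), so they should be checked against the installed version's man pages.

```python
import os
from typing import List


def start_command(executable: str, args: List[str], context_file: str) -> List[str]:
    """Choose between a fresh start under BLCR and a restart from a saved context."""
    if os.path.exists(context_file):
        # A previous run left a checkpoint: resume from it.
        return ["cr_restart", context_file]
    # First run: launch under cr_run so the process can be checkpointed later.
    return ["cr_run", executable] + args


def checkpoint_command(pid: int, context_file: str) -> List[str]:
    """Command asking BLCR to write the state of process pid to context_file."""
    # The -f flag (output file) is an assumption to verify locally.
    return ["cr_checkpoint", "-f", context_file, str(pid)]


if __name__ == "__main__":
    # Hypothetical job and context file names, for illustration only.
    print(start_command("./simulate", ["--steps", "1000"], "job.context"))
    print(checkpoint_command(12345, "job.context"))
```

A wrapper built this way would make a rescheduled job pick up where it left off, provided the context file is stored somewhere that survives eviction from the execute node.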
What doesn’t work so well…
• Each pool is run by local sysadmin(s), but these are of variable quality/commitment.
• We’ve set up mailing lists for users and sysadmins: hardly ever used (don’t want to advertise ignorance?).
• Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty…
• Don’t get me started on merger with UCS’s central resource (~400 nodes).
But generally we’re happy bunnies
• “CamGrid was an invaluable tool allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year's worth of CPU running was collected in a week.”
-- Dr. Ben Allanach
• “CamGrid was essential in order for us to be able to run the different codes in real time.”
-- Prof. Fernando Quevedo
• “I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication.”
-- Dr. Karen Lipkow
Current issues
• Protecting resources on execute nodes; Condor seems lax at this, e.g. memory, disk space.
• Increasingly interested in VMs (i.e. Xen). Some pools run it, but not concerted (effects on SMP MPI jobs?).
• Green issues: will we be forced to buy WoL cards in the near future?
• Altruistic computing: a recent wave of interest for BOINC/backfill jobs for medical, protein folding, etc., but who runs the jobs? Audit trail?
• How do we interact with outsiders? Ideally keep it to Condor (some Globus, toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
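On the first issue above, one common way to protect an execute node is to run each job through a wrapper that sets operating-system resource limits before the job starts. The sketch below assumes a Unix host and uses `RLIMIT_AS` (virtual address space) and `RLIMIT_FSIZE` (size of any single file), which are only coarse proxies for memory and disk use. The limit values and helper names are hypothetical, and hooking such a wrapper into Condor (for example through its `USER_JOB_WRAPPER` setting) is an assumption to check against the local version.

```python
import resource
import subprocess
import sys
from typing import Callable, List


def _cap(kind: int, limit: int) -> None:
    """Lower one resource limit, never asking for more than the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)
    resource.setrlimit(kind, (limit, limit))


def make_limiter(max_memory_mb: int, max_file_mb: int) -> Callable[[], None]:
    """Build a function that caps address space and file size for the calling process."""
    mem_bytes = max_memory_mb * 1024 * 1024
    file_bytes = max_file_mb * 1024 * 1024

    def apply_limits() -> None:
        # Runs in the child just before exec, so only the job is constrained.
        _cap(resource.RLIMIT_AS, mem_bytes)
        _cap(resource.RLIMIT_FSIZE, file_bytes)

    return apply_limits


def run_limited(argv: List[str], max_memory_mb: int, max_file_mb: int) -> int:
    """Run argv with the limits applied and return its exit status."""
    limiter = make_limiter(max_memory_mb, max_file_mb)
    return subprocess.run(argv, preexec_fn=limiter).returncode


if __name__ == "__main__":
    # Hypothetical limits: 1 GB of address space, 100 MB per written file.
    sys.exit(run_limited(sys.argv[1:], 1024, 100))
```

Per-process limits like these do not cap a job's total disk usage or the combined memory of several processes, so they address only part of the problem raised above.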
Finally…
• CamGrid: http://www.escience.cam.ac.uk/projects/camgrid/
• Contact: [email protected]
Questions?