TRANSCRIPT
Grid Canada
CLS eScience Workshop, 21st November, 2005
2
Grid Canada
• Joint venture with CANARIE/C3.ca/NRC.
• Grid Canada
– Set up a testbed some 3 years ago using Globus
– CANARIE hosts the web site, managed by Darcy
– Main aim is to increase Grid awareness across Canada
– Run the GC Certificate Authority (over 2,500 certificates issued)
– Main project is currently GridX1
3
GridX1: A Canadian Computational Grid
A. Agarwal, M. Ahmed, D. Bickle, B. Caron, D. Deatrich, A. Dimopoulos, L. Groer, R. Haria, R. Impey, L. Klektau, G. Mateescu, C. Lindsay, D. Quesnel, B. St. Arnaud, R. Simmons, R. Sobie, D. Vanderster, M. Vetterli, R. Walker, M. Yuen
University of Alberta, University of Calgary, University of Toronto, University of Victoria, Simon Fraser University, National Research Council Canada, CANARIE, TRIUMF
4
Motivation
• GridX1 is driven by the scientific need for a Grid
– the ATLAS particle physics experiment at CERN
– linked to the Large Hadron Collider (LHC) Grid Project
• Particle physics (HEP) simulations are “embarrassingly parallel”: multiple instances of serial (integer) jobs
• We want to exploit the unused cycles at non-HEP sites
– minimal software demands on sites
• Open to other applications (serial, integer)
– Grid-enabling an application is as complicated as building the Grid itself
– BaBar particle physics application (SLAC) under development
5
GridX1 model
A number of facilities are dedicated to particle physics groups but most are shared with researchers in other fields
Each shared facility may have unique configuration requirements
GridX1 model:
– Generic middleware (Virtual Data Toolkit: GT 2.4.3 + fixes)
– No OS requirement: SuSE and Red Hat clusters
– Generic user accounts: gcprod01 ... gcprodmn
– Condor-G Resource Broker for load balancing
6
Overview
GridX1 currently has 9 clusters: Alberta (2), NRC Ottawa (2), WestGrid, Victoria (3), Toronto (1).
Discussions are underway with McGill (HEP), which is just about to be added.
Total resources: well over 1000 CPUs, 10 TB disk, and 400 TB tape.
The maximum number of jobs running on GridX1 has exceeded 250.
Condor-G grid:
– Extension of the Condor batch system
– Scalable to thousands of jobs
– Intuitive commands for running jobs on remote resources
7
Resource Management
ClassAds are used for passing site and job specifications.
Resources periodically publish their state to the collector:
– free/total CPUs, number of running and waiting jobs, estimated queue waiting time
Job ClassAds contain a resource Requirements expression:
– CPU requirements, OS, application software
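To make the matchmaking concrete, here is a minimal Python sketch of the idea. The attribute names (FreeCpus, EstWaitTime, AppSoftware, and so on) are illustrative assumptions, not the actual attributes published by GridX1 sites, and real ClassAds are written in Condor's own ClassAd language rather than Python:

# Hedged sketch of ClassAd-style matchmaking; attribute names are
# illustrative, not the actual GridX1 site attributes.

# A resource ad: the state a site periodically publishes to the collector.
site_ad = {
    "Name": "uvic-cluster",        # hypothetical site name
    "OpSys": "LINUX",
    "FreeCpus": 12,
    "TotalCpus": 96,
    "RunningJobs": 80,
    "WaitingJobs": 4,
    "EstWaitTime": 300,            # estimated queue wait, in seconds
    "AppSoftware": ["atlas-sim"],  # software installed at the site
}

# A job ad: what the job requires of a site.
job_ad = {
    "RequiredCpus": 1,
    "RequiredOpSys": "LINUX",
    "RequiredSoftware": "atlas-sim",
}

def requirements(job, site):
    """Evaluate the job's Requirements expression against a site ad."""
    return (site["FreeCpus"] >= job["RequiredCpus"]
            and site["OpSys"] == job["RequiredOpSys"]
            and job["RequiredSoftware"] in site["AppSoftware"])

print(requirements(job_ad, site_ad))  # True: this site can run the job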
8
Job management
Each site specifies the maximum number of grid jobs, maxJobs (100 at UVic).
Each job is sent to the site with the lowest estimated wait time.
Sites are selected on a round-robin basis.
The RB submits jobs to a site until the number of jobs pending at that site reaches 10% of maxJobs.
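One plausible reading of this dispatch policy is sketched in Python below; the field names, the throttle check, and the use of round-robin only as a tie-breaker are assumptions, not the broker's actual implementation:

from itertools import cycle

# Hedged sketch of the dispatch policy described above: skip sites whose
# pending jobs have reached 10% of maxJobs, prefer the lowest estimated
# wait time, and break ties round-robin.

sites = [
    {"name": "uvic",    "max_jobs": 100, "pending": 4, "est_wait": 300},
    {"name": "alberta", "max_jobs": 50,  "pending": 6, "est_wait": 120},
    {"name": "toronto", "max_jobs": 80,  "pending": 9, "est_wait": 120},
]

rr = cycle(range(len(sites)))  # persistent round-robin cursor

def pick_site():
    # Throttle: a site with pending >= 10% of maxJobs gets no new jobs.
    open_sites = [s for s in sites if s["pending"] < 0.10 * s["max_jobs"]]
    if not open_sites:
        return None  # the job stays "unsubmitted" at the RB
    best = min(s["est_wait"] for s in open_sites)
    tied = [s for s in open_sites if s["est_wait"] == best]
    # Round-robin among the sites tied for the lowest wait time.
    for _ in range(len(sites)):
        s = sites[next(rr)]
        if s in tied:
            return s

site = pick_site()
if site:
    site["pending"] += 1  # the job is now pending at the chosen site
    print("dispatching to", site["name"])

Under this reading, a site that drains its queue slowly stops receiving new work automatically, because its pending count stays at the 10% ceiling.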
9
Monitoring
10
System Status
unsubmitted: waiting on the GridX1 RB (no site identified yet)
pending: sent to a resource but not yet running
running: active
waiting time: estimated time for the next job to run
11
Local resource status
Each site sets its own policy.
Some sites backfill grid jobs, while others limit the number of jobs.
12
Status
GridX1 used by the ATLAS experiment via the LCG-TRIUMF gateway
Over 12,000 ATLAS jobs successfully completed
13
Challenges
• GridX1 is equivalent to a moderate-sized computing facility
– It requires a “grid” system administrator to keep the system operational
• We need a more automated way to install applications
• Monitoring is in good shape but further improvements are needed
– Improve reliability and scalability
• Error recovery has not been an issue with LCG jobs
– We will have to address this with the BaBar simulation application
14
Data management
• No data grid management was required for the ATLAS Data Challenge
– Data analysis jobs will require access to large input data sets
• Prototype data grid elements are in place
– Replica catalog
– Jobs running on GridX1 query the RLI and either copy data from UVic to the grid cache or link to the file if it already exists in the grid cache (see the sketch after this list)
• Install dCache at UVic with a Storage Resource Manager (SRM)
– dCache was developed by Fermilab (Chicago) and DESY (Hamburg)
– SRMs are used to interface storage facilities on the Grid
– Interface GridX1 storage to LCG via GridX1-SRM
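A minimal Python sketch of that cache-or-copy step, assuming hypothetical helpers: replica_catalog and gridftp_copy below stand in for the actual RLS query and GridFTP transfer (e.g. globus-url-copy), and the paths are illustrative:

import os

# Hedged sketch of the cache-or-copy logic described above. The replica
# catalog lookup and the transfer helper are hypothetical stand-ins for
# the RLS query and GridFTP copy used on GridX1.

GRID_CACHE = "grid_cache"  # illustrative cache directory at the site
os.makedirs(GRID_CACHE, exist_ok=True)

# Stand-in replica catalog: logical file name -> source URL at UVic.
replica_catalog = {
    "atlas.input.001": "gsiftp://uvic.example.ca/data/atlas.input.001",
}

def gridftp_copy(src_url, dest_path):
    """Hypothetical transfer helper (in practice, e.g. globus-url-copy)."""
    print("copying", src_url, "->", dest_path)
    open(dest_path, "w").close()  # placeholder for the real transfer

def stage_input(lfn, workdir):
    """Link the file from the grid cache, fetching it from UVic first if needed."""
    cached = os.path.join(GRID_CACHE, lfn)
    if not os.path.exists(cached):
        # Not cached yet: look up a replica and copy it into the cache.
        gridftp_copy(replica_catalog[lfn], cached)
    # Already (or now) cached: link rather than copy again.
    os.symlink(cached, os.path.join(workdir, lfn))

stage_input("atlas.input.001", ".")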
15
Plans
Short-term plans:
– Improving the reliability and scalability of the monitoring
– Getting all sites operational (e.g. NRC Venus)
– Getting the BaBar application running on more GridX1 sites

Long-term plans:
– Add data grid capability
– High-speed network links between sites
– Explore the virtual computing concept (e.g. Xen)
– Web-services-based monitoring
– Investigate grid resource broker algorithms (PhD thesis)
16
Summary
• GridX1 is working very well
• Over 12,000 ATLAS jobs in the past 6 months (5,000 in March)
• In Feb/Mar GridX1 was running 7% to 10% of all LHC jobs worldwide
• BaBar application running on a subset of GridX1
– Typically 200 BaBar jobs run on the UVic clusters and WestGrid
• Talks at international conferences and press coverage (national and international)
• We want to add more sites
– Other applications could be run on the Grid (looking at two more)
– Requirements and instructions are available at www.gridx1.ca