TRANSCRIPT
Grid Canada
CLS eScience Workshop, 21st November, 2005
2
Grid Canada
• Joint venture with CANARIE/C3.ca/NRC.
• Grid Canada
– Set up a testbed some 3 years ago using Globus
– CANARIE hosts the web site, managed by Darcy
– Main aim is to increase Grid awareness across Canada
– Run the GC Certificate Authority (over 2,500 certificates issued)
– Main project is currently GridX1
3
GridX1: A Canadian Computational Grid
A. Agarwal, M. Ahmed, D. Bickle, B. Caron, D. Deatrich, A. Dimopoulos, L. Groer, R. Haria, R. Impey, L. Klektau, G. Mateescu, C. Lindsay, D. Quesnel, B. St. Arnaud, R. Simmons, R. Sobie, D. Vanderster, M. Vetterli, R. Walker, M. Yuen
University of Alberta, University of Calgary, University of Toronto, University of Victoria, Simon Fraser University, National Research Council Canada, CANARIE, TRIUMF
4
Motivation
• GridX1 is driven by the scientific need for a Grid
– the ATLAS particle physics experiment at CERN
– linked to the Large Hadron Collider (LHC) Grid Project
• Particle physics (HEP) simulations are “embarrassingly parallel”: multiple instances of serial (integer) jobs
• We want to exploit the unused cycles at non-HEP sites
– minimal software demands on sites
• Open to other applications (serial, integer)
– Grid-enabling an application is as complicated as building the Grid itself
– BaBar particle physics application (SLAC) under development
5
GridX1 model
A number of facilities are dedicated to particle physics groups but most are shared with researchers in other fields
Each shared facility may have unique configuration requirements
GridX1 model:
– Generic middleware (Virtual Data Toolkit: GT 2.4.3 + fixes)
– No OS requirement: SuSE and Red Hat clusters
– Generic user accounts: gcprod01 ... gcprodmn
– Condor-G Resource Broker for load balancing
6
Overview
GridX1 currently has 9 clusters: Alberta (2), NRC Ottawa (2), WestGrid, Victoria (3), Toronto (1).
Discussions are underway with McGill (HEP), which is just about to be added.
Total resources: well over 1000 CPUs, 10 TB disk, and 400 TB tape.
The maximum number of jobs running on GridX1 has exceeded 250.
Condor-G grid:
– Extension of the Condor batch system
– Scalable to thousands of jobs
– Intuitive commands for running jobs on remote resources
7
Resource Management
ClassAds are used for passing site and job specifications.
Resources periodically publish their state to the collector:
– free/total CPUs, number of running and waiting jobs, estimated queue waiting time
Job ClassAds contain a resource Requirements expression:
– CPU requirements, OS, application software
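To make the matchmaking concrete, here is a minimal Python sketch of the idea. The attribute names (FreeCpus, EstWaitTime, AppSoftware, and so on) are illustrative assumptions, not the actual attributes published by GridX1 sites, and real ClassAds are written in Condor's own ClassAd language rather than Python:

# Hedged sketch of ClassAd-style matchmaking; attribute names are
# illustrative, not the actual GridX1 site attributes.

# A resource ad: the state a site periodically publishes to the collector.
site_ad = {
    "Name": "uvic-cluster",        # hypothetical site name
    "OpSys": "LINUX",
    "FreeCpus": 12,
    "TotalCpus": 96,
    "RunningJobs": 80,
    "WaitingJobs": 4,
    "EstWaitTime": 300,            # estimated queue wait, in seconds
    "AppSoftware": ["atlas-sim"],  # software installed at the site
}

# A job ad: what the job requires of a site.
job_ad = {
    "RequiredCpus": 1,
    "RequiredOpSys": "LINUX",
    "RequiredSoftware": "atlas-sim",
}

def requirements(job, site):
    """Evaluate the job's Requirements expression against a site ad."""
    return (site["FreeCpus"] >= job["RequiredCpus"]
            and site["OpSys"] == job["RequiredOpSys"]
            and job["RequiredSoftware"] in site["AppSoftware"])

print(requirements(job_ad, site_ad))  # True: this site can run the job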
8
Job management
Each site specifies the maximum number of grid jobs, maxJobs (100 at UVic).
Each job is sent to the site with the lowest estimated wait time.
Sites are selected on a round-robin basis.
The RB submits jobs to a site until the number of jobs pending at that site reaches 10% of maxJobs.
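One plausible reading of this dispatch policy is sketched in Python below; the field names, the throttle check, and the use of round-robin only as a tie-breaker are assumptions, not the broker's actual implementation:

from itertools import cycle

# Hedged sketch of the dispatch policy described above: skip sites whose
# pending jobs have reached 10% of maxJobs, prefer the lowest estimated
# wait time, and break ties round-robin.

sites = [
    {"name": "uvic",    "max_jobs": 100, "pending": 4, "est_wait": 300},
    {"name": "alberta", "max_jobs": 50,  "pending": 6, "est_wait": 120},
    {"name": "toronto", "max_jobs": 80,  "pending": 9, "est_wait": 120},
]

rr = cycle(range(len(sites)))  # persistent round-robin cursor

def pick_site():
    # Throttle: a site with pending >= 10% of maxJobs gets no new jobs.
    open_sites = [s for s in sites if s["pending"] < 0.10 * s["max_jobs"]]
    if not open_sites:
        return None  # the job stays "unsubmitted" at the RB
    best = min(s["est_wait"] for s in open_sites)
    tied = [s for s in open_sites if s["est_wait"] == best]
    # Round-robin among the sites tied for the lowest wait time.
    for _ in range(len(sites)):
        s = sites[next(rr)]
        if s in tied:
            return s

site = pick_site()
if site:
    site["pending"] += 1  # the job is now pending at the chosen site
    print("dispatching to", site["name"])

Under this reading, a site that drains its queue slowly stops receiving new work automatically, because its pending count stays at the 10% ceiling.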
9
Monitoring
10
System Status
unsubmitted: waiting on the GridX1 RB (no site identified yet)
pending: sent to a resource but not yet running
running: active
waiting time: estimated time for the next job to run
11
Local resource status
Each site sets its own policy.
Some sites backfill grid jobs, while others limit the number of jobs.
12
Status
GridX1 used by the ATLAS experiment via the LCG-TRIUMF gateway
Over 12,000 ATLAS jobs successfully completed
13
Challenges
• GridX1 is equivalent to a moderate-sized computing facility
– It requires a “grid” system administrator to keep the system operational
• We need a more automated way to install applications
• Monitoring is in good shape but further improvements are needed
– Improve reliability and scalability
• Error recovery has not been an issue with LCG jobs
– We will have to address this with the BaBar simulation application
14
Data management
• No data grid management was required for the ATLAS Data Challenge
– Data analysis jobs will require access to large input data sets
• Prototype data grid elements are in place
– Replica catalog
– Jobs running on GridX1 query the RLI and either copy data from UVic to the grid cache or link to the file if it already exists in the grid cache (see the sketch after this list)
• Install dCache at UVic with a Storage Resource Manager (SRM)
– dCache was developed by Fermilab (Chicago) and DESY (Hamburg)
– SRMs are used to interface storage facilities on the Grid
– Interface GridX1 storage to LCG via GridX1-SRM
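A minimal Python sketch of that cache-or-copy step, assuming hypothetical helpers: replica_catalog and gridftp_copy below stand in for the actual RLS query and GridFTP transfer (e.g. globus-url-copy), and the paths are illustrative:

import os

# Hedged sketch of the cache-or-copy logic described above. The replica
# catalog lookup and the transfer helper are hypothetical stand-ins for
# the RLS query and GridFTP copy used on GridX1.

GRID_CACHE = "grid_cache"  # illustrative cache directory at the site
os.makedirs(GRID_CACHE, exist_ok=True)

# Stand-in replica catalog: logical file name -> source URL at UVic.
replica_catalog = {
    "atlas.input.001": "gsiftp://uvic.example.ca/data/atlas.input.001",
}

def gridftp_copy(src_url, dest_path):
    """Hypothetical transfer helper (in practice, e.g. globus-url-copy)."""
    print("copying", src_url, "->", dest_path)
    open(dest_path, "w").close()  # placeholder for the real transfer

def stage_input(lfn, workdir):
    """Link the file from the grid cache, fetching it from UVic first if needed."""
    cached = os.path.join(GRID_CACHE, lfn)
    if not os.path.exists(cached):
        # Not cached yet: look up a replica and copy it into the cache.
        gridftp_copy(replica_catalog[lfn], cached)
    # Already (or now) cached: link rather than copy again.
    os.symlink(cached, os.path.join(workdir, lfn))

stage_input("atlas.input.001", ".")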
15
Plans
Short-term plans:
– Improving the reliability and scalability of the monitoring
– Getting all sites operational (e.g. NRC Venus)
– Getting the BaBar application running on more GridX1 sites

Long-term plans:
– Add data grid capability
– High-speed network links between sites
– Explore the virtual computing concept (e.g. Xen)
– Web-services-based monitoring
– Investigate grid resource broker algorithms (PhD thesis)
16
Summary
• GridX1 is working very well
• Over 12,000 ATLAS jobs in the past 6 months (5,000 in March)
• In Feb/Mar GridX1 was running 7% to 10% of all LHC jobs worldwide
• BaBar application running on a subset of GridX1
– Typically 200 BaBar jobs run on the UVic clusters and WestGrid
• Talks at international conferences and press coverage (national and international)
• We want to add more sites
– Other applications could be run on the Grid (looking at two more)
– Requirements and instructions are available at www.gridx1.ca