the pilot way to grid resources using...

16
Parag Mhashilkar On Behalf Of Igor Sfiligoi 1 , Daniel C Bradley 2 , Burt Holzman 1 , Parag Mhashilkar 1 , Sanjay Padhi 3 , Frank Würthwein 3 Fermilab, Batavia, IL 1 University Of Wisconsin at Madison, WI 2 , University Of California at San Diego, CA 3 , The Pilot Way To Grid Resources Using GlideinWMS

Upload: others

Post on 11-Oct-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

P a r a g M h a s h i l k a r

O n B e h a l f O f

I g o r S f i l i g o i1, D a n i e l C B r a d l e y

2,

B u r t H o l z m a n1, P a r a g M h a s h i l k a r

1,

S a n j a y P a d h i3

, F r a n k W ü r t h w e i n3

F e r m i l a b , B a t a v i a , I L1

U n i v e r s i t y O f W i s c o n s i n a t M a d i s o n , W I2

,

U n i v e r s i t y O f C a l i f o r n i a a t S a n D i e g o , C A3

,

The Pilot Way To Grid Resources Using GlideinWMS

Page 2: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Overview

March 31, 2009The pilot way to Grid resources using glideinWMS

2

Grid Computing

Pilot Workload Management (WMS) Paradigm

Security Considerations

GlideinWMS implementation of Pilot Paradigm

Pseudo-interactive Monitoring using glideinWMS

Scalability of glideinWMS

Summary

References

Page 3: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Grid Computing

March 31, 2009The pilot way to Grid resources using glideinWMS

3

Distributed computing paradigm spanning many administrative domains.

Widely deployed the scientific communities with high computing demandso High Energy Physics (HEP)

o Astro Physics Communities

o Weather Surveys

o Biology

o […]

General purpose Grids used by the scientific communitieso Open Science Grid (OSG)

o European Grid for E-SciEnce (EGEE)

o […]

Page 4: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Typical Grid Use Case

March 31, 2009The pilot way to Grid resources using glideinWMS

4 Grid Site: An administrative domain

o Administrators deploy Grid middleware with following components- Compute Element (CE) running a

gatekeeper which executes jobs on behalf of users

Client tools on compute resources (or worker nodes) to talk to commonly used Grid services

Local Batch System (BS)

From user’s perspectiveo Pros

Large pool of resources to satisfy their computing needs

o Cons Middleware problems in managing

the job

Progress of the job is hidden from the user making monitoring it complicated

Heterogeneity of the resources over the Grid

Need a meta WMS to manage Grid jobs

Grid Site

Batch System

User

Computing jobs

Compute resource

CE running Gatekeeper

Page 5: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Pilot WMS Paradigm

March 31, 2009The pilot way to Grid resources using glideinWMS

5

Pilot or just-in-time paradigmo Pilot factory submits pilot jobs

to different grid siteso Pilots start running on the

compute resources and fetch user jobs from the user job queue of WMS

Advantages of pilot based WMSo Forms a virtual private pool of

compute resourceso Partially hides heterogeneity of

grid sites from the user.o Pilot jobs

If the environment is bad, pilot exits, preventing the user job to start and thus fail

Act as a wrapper and makes sure that the environment is right for the user job to execute.

Grid Site

Batch System

User

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Pilot WMS

Page 6: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Security Considerations

March 31, 2009The pilot way to Grid resources using glideinWMS

6

Pilots are authenticated and authorized by the site gatekeeper.

Concerns with pilot based WMS User jobs do not traverse

through the site gatekeeper: Does not fit well with the Grid

model of authenticating / authorizing / accounting of user jobs.

Since pilot bootstraps the user job, both pilot and user job run under same OS user This allows a malicious user to

compromise the pilot job infrastructure.

Grid Site

Batch System

User

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Pilot WMS

Page 7: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Security Considerations

March 31, 2009The pilot way to Grid resources using glideinWMS

7

Possible Solution

Deploy mini-gatekeepers on the worker nodes to authenticate/authorize user jobs.

OSG and EGEE sites deploy gLExec, which acts as a mini gate-keepers on worker nodes to authenticate / authorize user jobs.

Grid Site

Batch System

User

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Pilot WMS

Page 8: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Pseudo-interactive Monitoring in Pilot WMS

March 31, 2009The pilot way to Grid resources using glideinWMS

8

Users need more info – When something goes wrong In case of very long running

jobs

Information useful to the user - What processes are running (ps) Peek at the log files (cat/tail) What files have been created (ls) Peek at the process stack (gdb

bt) Is the machine thrashing? (top)

Above information can be obtained through batch jobs Pilot starts another job that

acts as a monitoring job

Grid Site

Batch System

User

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Pilot WMS

Monitoring jobs

Page 9: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

glideinWMS Implementation of the Pilot Paradigm

March 31, 2009The pilot way to Grid resources using glideinWMS

9

glideinWMS is based on Condor with the VO Frontend and the Factory sending pilot jobs (i.e. glideins) to the grid sites.

Condor as a user job WMS

glideinWMS Factoryo Glidein factories creates and submits pilot

jobs to the grid sites using CondorG

o Condor collector acts as a dashboard for message exchanging

o Factory receives orders from the VO frontend via the dashboard

VO Frontendo VO frontend monitors the CondorWMS and

regulates the number of pilot jobs sent by glidein factories via the dashboard

o Frontend acts as a match maker for the glideins

VO Frontend

All network traffic is authenticated and integrity checked

Support pseudo-interactive monitoring out of the box

Grid Site

Batch System

Startd

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Schedd

……

Negotiator Collector

… Dashboard

Frontend

GFactory

Condor-G

Page 10: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

glideinWMS Implementation of the Pilot Paradigm

March 31, 2009The pilot way to Grid resources using glideinWMS

10

glideinWMS is based on Condor with the VO Frontend and the Factory sending pilot jobs (i.e. glideins) to the grid sites.

Condor as a user job WMS

o Condor collector acts as an information database

o Condor startd manages the compute resource

o Condor schedd acts as the job queue for users jobs

o Startd and schedd advertise the resource and jobs respectively to the collector using condor classAds

o Condor negotiator acts as a match maker between compute resources and user jobs

glideinWMS Factory

VO Frontend

All network traffic is authenticated and integrity checked

Support pseudo-interactive monitoring out of the box

Grid Site

Batch System

Startd

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Schedd

……

Negotiator Collector

… Dashboard

Frontend

GFactory

Condor-G

Page 11: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

glideinWMS Implementation of the Pilot Paradigm

March 31, 2009The pilot way to Grid resources using glideinWMS

11 glideinWMS is based on Condor with the VO

Frontend and the Factory sending pilot jobs (i.e. glideins) to the grid sites.

Condor as a user job WMSo Condor collector acts as an information database

o Condor startd manages the compute resource

o Condor schedd acts as the job queue for users jobs

o Startd and schedd advertise the resource and jobs respectively to the collector

o Condor negotiator acts as a match maker between compute resources and user jobs

glideinWMS Factoryo Glidein factories creates and submits pilot jobs to

the grid sites using CondorG

o Condor collector acts as a dashboard for message exchanging

o Factory receives orders from the VO frontend via the dashboard

VO Frontendo VO frontend monitors the CondorWMS and

regulates the number of pilot jobs sent by glideinfactories via the dashboard

o Frontend acts as a match maker for the glideins

All network traffic is authenticated and integrity checked

Support pseudo-interactive monitoring out of the box

Grid Site

Batch System

Startd

Computing jobs

Compute resource

Gatekeeper

Pilot jobs Schedd

……

Negotiator Collector

… Dashboard

Frontend

GFactory

Condor-G

Page 12: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Scalability of glideinWMS

March 31, 2009The pilot way to Grid resources using glideinWMS

12

Centralized WMS are generally less scalable glideinWMS scalability issues found

The centralized user queue keeping track of thousands of running jobs is memory exhaustive.

Security handshake in establishing communication between different components could be expensive in case of high network latency

glideinWMS addresses these scalability issues by Deploying multiple instances of the user queue service to spread the load Increasing the memory of the machine that hosts schedd service Deploying multiple slave collectors to reduce the impact of communication issues because

of high network latency

Table below summarizes the scalability achieved with a deployment running 1 Master collector, 70 slave collectors and using system with 16GB of memory to host the schedd service.

Criteria Design goal Achieved so far

Total number of user jobs in the queue at any given time 100k 200k

Number of glideins in the system at any given time 10k ~26k

Number of running jobs per schedd at any given time 10k ~23k

Grid sites handled ~100 ~100

Page 13: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

glideinWMS in CMS Operations

March 31, 2009The pilot way to Grid resources using glideinWMS

13

Running more than 20k glideins at any given time CMS operations using glideinWMS at it’s seven archival storage sites

CMS operations at Tier1 site at Fermilab

Page 14: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Summary

March 31, 2009The pilot way to Grid resources using glideinWMS

14

Grid computing provides large computing power at the expense of complexity for the users

Pilot based WMS can alleviate some of the complexity by forming virtual private pool of compute resources for the users

glideinWMS is a pilot WMS based on Condor with a thin layer of software on top of Condor to send pilot jobs

The current release of glideinWMS has been tested to handle > 20k batch slots from ~100 Grid sites with ~200k jobs in the queue with average job start up rate of 1Hz

Page 15: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

Acknowledgements

March 31, 2009The pilot way to Grid resources using glideinWMS

15

glideinWMS infrastructure is developed in Fermilab in collaboration with the Condor team from Wisconsin and High Energy Physics experiments like CMS, CDF and MINOS.

Most of the glideinWMS development work is funded by USCMS (part of CMS) experiment.

Fermilab is operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy.

The paper was partial sponsored by the National Science Foundation under Grant No. PHY-0533280 (DISUN) and PHY-0427113 (RACE).

The Open Science Grid (OSG)

Page 16: The Pilot Way To Grid Resources Using GlideinWMSglideinwms.fnal.gov/.../csie2009_033109/glideinWMS... · used Grid services Local Batch ... glideinWMS is a pilot WMS based on Condor

References

March 31, 2009The pilot way to Grid resources using glideinWMS

16

glideinWMS Homepage: http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS

glideinWMS publications and presentations:http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.html

Fermi National Accelerator Laboratory:http://www.fnal.gov

The Open Science Grid: http://www.opensciencegrid.org

EGEE Homepage:http://www.eu-egee.org