
Open Science Grid

Frank Würthwein, OSG Application Coordinator

Experimental Elementary Particle Physics, UCSD


Particle Physics & Computing: Science Driver

Event rate = Luminosity × Cross section (worked example below)

LHC revolution starting in 2008:
— Luminosity × 10
— Cross section × 150 (e.g. top quark)

Computing challenge:
— 20 PB in the first year of running
— ~100 MSpecInt2000, close to 100,000 cores
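
To make the rate relation concrete, a worked example with illustrative numbers of my own (ballpark figures, not taken from the slide): at an instantaneous luminosity of order 10^33 cm^-2 s^-1 and a top-pair cross section of order 800 pb,

    \[
      R \;=\; \mathcal{L}\,\sigma
        \;\approx\; \left(10^{33}\ \mathrm{cm^{-2}\,s^{-1}}\right)
        \times \left(8\times 10^{-34}\ \mathrm{cm^{2}}\right)
        \;\approx\; 1\ \mathrm{event/s}.
    \]

The petabyte-scale yearly data volume quoted above then comes from the full triggered event rate (roughly hundreds of events per second, each of megabyte scale) integrated over a running year.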


Overview

— OSG in a nutshell
— Organization
— “Architecture”
— Using the OSG
— Present Utilization & Expected Growth
— Summary of OSG Status


OSG in a nutshell

High Throughput Computing:
— Opportunistic scavenging on cheap hardware.
— Owner controlled policies.

“Open consortium”:
— Add OSG project to an open consortium to provide cohesion and sustainability.

Heterogeneous middleware stack:
— Minimal site requirements & optional services.
— Production grid allows coexistence of multiple OSG releases.

“Linux rules”: mostly RHEL3 on Intel/AMD.

Grid of clusters:
— Compute & storage (mostly) on private Gb/s LANs.
— Some sites with (multiple) 10 Gb/s WAN “uplink”.

Organization

Started in 2005 as a Consortium with contributed effort only: people coming together to build …

Now adding the OSG project to sustain the production grid: people paid to operate …


Consortium & Project

Consortium Council:
— IT Departments & their hardware resources
— Science Application Communities
— Middleware Providers

Funded Project (starting 9/06):
— Operate services for a distributed facility.
— Improve, Extend, Expand & Interoperate.
— Engagement, Education & Outreach.

Council Members:
— Argonne Nat. Lab., Brookhaven Nat. Lab., CCR SUNY Buffalo, Fermi Nat. Lab, Thomas Jefferson Nat. Lab., Lawrence Berkeley Nat. Lab., Stanford Lin. Acc. Center, Texas Adv. Comp. Center, RENCI, Purdue
— US Atlas Collaboration, BaBar Collaboration, CDF Collaboration, US CMS Collaboration, D0 Collaboration, GRASE, LIGO, SDSS, STAR
— US Atlas S&C Project, US CMS S&C Project, Condor, Globus, SRM, OSG Project



OSG Management

Executive Director: Ruth Pordes

Facility Coordinator: Miron Livny

Application Coordinators: Torre Wenaus & fkw

Resource Managers: P. Avery & A. Lazzarini

Education Coordinator: Mike Wilde

Engagement Coord.: Alan Blatecky

Council Chair: Bill Kramer

Diverse Set of people from Universities & National Labs, including CS, Science Apps, & IT infrastructure people.


OSG Management Structure

“Architecture”

— Grid of sites
— Me - My friends - the anonymous grid
— Grid of Grids


Grid of sites

IT Departments at Universities & National Labs make their hardware resources available via OSG interfaces:
— CE: (modified) pre-WS GRAM (smoke-test sketch below)
— SE: SRM for large volume; gftp & (N)FS for small volume

Today’s scale:
— 20-50 “active” sites (depending on definition of “active”)
— ~5000 batch slots
— ~500 TB storage
— ~10 “active” sites with shared 10 Gbps or better connectivity

Expected scale for end of 2008:
— ~50 “active” sites
— ~30,000-50,000 batch slots
— Few PB of storage
— ~25-50% of sites with shared 10 Gbps or better connectivity
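
As an illustrative aside on the CE interface above, a minimal smoke test of a site's pre-WS GRAM gatekeeper, wrapped in a few lines of Python; the gatekeeper hostname is a made-up placeholder, not a real OSG endpoint.

    import subprocess

    # Hypothetical gatekeeper contact string: host plus jobmanager.
    # "jobmanager-fork" just runs the command on the gatekeeper node;
    # a production job would target e.g. "jobmanager-condor" or "jobmanager-pbs".
    CONTACT = "gatekeeper.example.edu/jobmanager-fork"

    # globus-job-run is the standard pre-WS GRAM test client: submit a
    # trivial job and print whatever the remote side returns.
    result = subprocess.run(
        ["globus-job-run", CONTACT, "/bin/hostname"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())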


Making the Grid attractive

Minimize entry threshold for resource owners:
— Minimize software stack.
— Minimize support load.

Minimize entry threshold for users:
— Feature rich software stack.
— Excellent user support.

Resolve contradiction via “thick” Virtual Organization layer of services between users and the grid.


Me -- My friends -- The grid

(Diagram: three layers, running from domain science specific to common to all sciences)
— “Me”: O(10^4) users, each with a thin client.
— “My friends”: O(10^1-2) VOs, providing thick VO middleware & support.
— “The anonymous Grid”: O(10^2-3) sites, reached through a thin “Grid API”.


Grid of Grids - from local to global

— Science Community Infrastructure (e.g. Atlas, CMS, LIGO, …)
— CS/IT Campus Grids (e.g. GLOW, FermiGrid, …)
— National & International Cyber Infrastructure for Science (e.g. TeraGrid, EGEE, …)

OSG enables its users to operate transparently across Grid boundaries globally.

Using the OSG

— Authentication & Authorization
— Moving & Storing Data
— Submitting jobs & “workloads”


Authentication & Authorization

OSG Responsibilities:
— X509 based middleware.
— Accounts may be dynamic/static, shared/FQAN-specific.

VO Responsibilities:
— Instantiate VOMS.
— Register users & define/manage their roles.

Site Responsibilities:
— Choose security model (what accounts are supported).
— Choose VOs to allow.
— Default accept of all users in a VO, but individuals or groups within a VO can be denied.


User Management

User obtains a DN from a CA that is vetted by the TAGPMA.

User registers with a VO and is added to the VO's VOMS.
— VO is responsible for registration of its VOMS with the OSG GOC.
— VO is responsible for having its users sign the AUP.
— VO is responsible for VOMS operations.
— Some VOs share a VOMS for operations on multiple grids globally.
— A default OSG VO exists for new communities & single PIs.

Sites decide which VOs to support (striving for default admit).
— Site populates GUMS daily from the VOMSes of all VOs.
— Site chooses uid policy for each VO & role: dynamic vs static vs group accounts.

User uses whatever services the VO provides in support of its users.
— VOs generally hide the grid behind a portal.

Any and all support is the responsibility of the VO:
— Helping its users.
— Responding to complaints from grid sites about its users.
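
As a sketch of what this flow looks like from the user's side (the VO name and role are hypothetical, and the standard VOMS client tools are assumed to be installed): before using any OSG service, the user turns the certificate behind that DN into a short-lived VOMS proxy carrying the VO membership and role.

    import subprocess

    # Create a VOMS proxy for a hypothetical VO "myvo" with an "analysis" role.
    # voms-proxy-init reads the user's X509 certificate and attaches a signed
    # attribute certificate obtained from the VO's VOMS server.
    subprocess.run(
        ["voms-proxy-init", "-voms", "myvo:/myvo/Role=analysis", "-valid", "12:00"],
        check=True,
    )

    # Inspect the proxy: remaining lifetime, DN, and the FQANs sites will see.
    subprocess.run(["voms-proxy-info", "-all"], check=True)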


Moving & storing data

OSG Responsibilities:
— Define storage types & their APIs from WAN & LAN.
— Define information schema for “finding” storage.
— All storage is local to a site - no global filesystem!

VO Responsibilities:
— Manage data transfer & catalogues.

Site Responsibilities:
— Choose storage type to support & how much.
— Implement storage type according to OSG rules.
— Truth in advertisement.
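
A minimal sketch of the VO-side transfer step under these rules, assuming the usual SRM and GridFTP command-line clients are available; the storage-element hostname and paths are placeholders, not real OSG endpoints.

    import subprocess

    # Placeholder storage-element endpoints.
    SRM_URL    = "srm://se.example.edu:8443/data/myvo/input.dat"
    GSIFTP_URL = "gsiftp://se.example.edu/data/myvo/config.txt"

    # Large-volume transfer through the site's SRM interface
    # (srmcp is the dCache SRM client; the slides note SRM/dCache in use on OSG).
    subprocess.run(["srmcp", "file:////tmp/input.dat", SRM_URL], check=True)

    # Small-volume transfer straight over GridFTP.
    subprocess.run(["globus-url-copy", "file:///tmp/config.txt", GSIFTP_URL], check=True)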


Disk areas in some detail:

Shared filesystem as applications area at site.
— Read only from compute cluster.
— Role based installation via GRAM.

Batch slot specific local work space.
— No persistency beyond batch slot lease.
— Not shared across batch slots.
— Read & write access (of course).

SRM/gftp controlled data area.
— “Persistent” data store beyond job boundaries.
— Job related stage in/out.
— SRM v1.1 today.
— SRM v2 expected in late 2006 (space reservation).
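
For illustration, a job-side sketch of how these three areas get used; the $OSG_APP, $OSG_DATA and $OSG_WN_TMP environment variable names follow common OSG site conventions but are my assumption here, not something spelled out on the slide, and all file and program names are invented.

    import os, shutil, subprocess

    # Assumed OSG-style environment variables:
    #   OSG_APP    - shared, read-only application area (VO-installed software)
    #   OSG_DATA   - site data area fed via SRM/gftp
    #   OSG_WN_TMP - batch-slot local scratch, gone when the slot lease ends
    app_dir  = os.environ["OSG_APP"]
    data_dir = os.environ["OSG_DATA"]
    scratch  = os.environ["OSG_WN_TMP"]

    # Stage the pre-downloaded input from the data area into local scratch ...
    workdir = os.path.join(scratch, "myjob")
    os.makedirs(workdir, exist_ok=True)
    shutil.copy(os.path.join(data_dir, "myvo", "input.dat"), workdir)

    # ... run the VO-installed application against it, writing output locally ...
    subprocess.run(
        [os.path.join(app_dir, "myvo", "bin", "analyze"), "input.dat", "-o", "output.dat"],
        cwd=workdir, check=True,
    )
    # ... and leave it to the job wrapper/DAG to copy output.dat back via SRM/gftp.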


Securing your data

Archival storage in your trusted archive:
— You control where your data is archived.

Data moved by a party you trust:
— You control who moves your data.
— You control encryption of your data.

You compute at sites you trust:
— E.g. sites that guarantee a specific unix uid for you.
— E.g. sites whose security model satisfies your needs.

You decide how secure your data needs to be!


Submitting jobs/workloads

OSG Responsibilities:
— Define interface to batch system (today: pre-WS GRAM).
— Define information schema.
— Provide middleware that implements the above.

VO Responsibilities:
— Manage submissions & workflows.
— VO controlled workload management system, or WMS from other grids, e.g. EGEE/LCG.

Site Responsibilities:
— Choose batch system.
— Configure interface according to OSG rules.
— Truth in advertisement.
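
A hedged sketch of what a VO-side submission against this interface can look like with Condor-G; the gatekeeper name, executable and file names are invented for illustration, and writing the submit file from Python is only a convenience.

    import subprocess, textwrap

    # A minimal Condor-G submit description targeting a (hypothetical) OSG CE
    # through its pre-WS GRAM gatekeeper; "gt2" selects the pre-WS protocol.
    submit = textwrap.dedent("""\
        universe      = grid
        grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
        executable    = analyze.sh
        arguments     = input.dat
        output        = job.out
        error         = job.err
        log           = job.log
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        queue
    """)

    with open("job.sub", "w") as f:
        f.write(submit)

    # Hand it to the local Condor-G scheduler.
    subprocess.run(["condor_submit", "job.sub"], check=True)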


Simple Workflow

Install application software at site(s):
— VO admin installs via GRAM.
— VO users have read only access from batch slots.

“Download” data to site(s):
— VO admin moves data via SRM/gftp.
— VO users have read only access from batch slots.

Submit job(s) to site(s):
— VO users submit job(s)/DAG via condor-g.
— Jobs run in batch slots, writing output to local disk.
— Jobs copy output from local disk to the SRM/gftp data area.

Collect output from site(s):
— VO users collect output from site(s) via SRM/gftp as part of the DAG.
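
The steps above map naturally onto a small DAG driven by DAGMan; a sketch under the assumption that stage_in.sub, job.sub and stage_out.sub are Condor(-G) submit files like the one sketched earlier (all names hypothetical).

    import subprocess, textwrap

    # Each JOB line points at a Condor(-G) submit file; the stage-in/stage-out
    # nodes would wrap SRM/gftp copies, "run" wraps the grid job itself.
    dag = textwrap.dedent("""\
        JOB stage_in   stage_in.sub
        JOB run        job.sub
        JOB stage_out  stage_out.sub
        PARENT stage_in  CHILD run
        PARENT run       CHILD stage_out
    """)

    with open("workflow.dag", "w") as f:
        f.write(dag)

    # DAGMan enforces the ordering and retries failed nodes.
    subprocess.run(["condor_submit_dag", "workflow.dag"], check=True)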


Some technical details

Job submission, Condor:
— Condor-G.
— “Schedd on the side” (simple multi-site brokering using the Condor schedd).
— Condor glide-in.

Job submission, EGEE workload management system:
— OSG CE compatible with the gLite Classic CE.
— Submissions via either the LCG 2.7 RB or the gLite RB, including bulk submission.

Virtual Data System (VDS) in use on OSG.

Data placement using SRM:
— SRM/dCache in use to virtualize many disks into one storage system.
— Schedule WAN transfers across many gftp servers.
— Typical WAN IO capability today ~10 TB/day ~ 2 Gbps.
— Schedule random access from batch slots to many disks via LAN.
— Typical LAN IO capability today ~0.5-5 GByte/sec.
— Space reservation.


Middleware lifecycle

Domain science requirements.

Joint projects between the OSG applications group & middleware developers to develop & test on community grids.

Integrate into VDT and deploy on the OSG-ITB.

Inclusion into OSG release & deployment on (part of) the production grid.

(The lifecycle diagram also references EGEE et al.)

Status of Utilization

OSG job = job submitted via an OSG CE.

“Accounting” of OSG jobs not (yet) required!


OSG use by numbers

32 Virtual Organizations:
— 3 with >1000 jobs max. (all particle physics)
— 3 with 500-1000 jobs max. (all outside physics)
— 5 with 100-500 jobs max. (particle, nuclear, and astrophysics)


(Chart, 5/05-5/06: running OSG jobs by community - Experimental Particle Physics, Bio/Eng/Med/Math, Campus Grids, non-HEP physics - with annotated levels of roughly 100, 850, and 2250 jobs; notes: “GADU using VDS”, “PI from Campus Grid”.)


Example GADU run in April

Bioinformatics App using VDS across 8 sites on OSG.


Number of running (and monitored) “OSG jobs” in June 2006: up to ~3000.


CMS Xfer on OSG in June 2006

— All CMS sites have exceeded 5 TB per day in June 2006.
— Caltech, Purdue, UCSD, UFL, UW exceeded 10 TB/day.
— Hoping to reach 30-40 TB/day capability by end of 2006.

(Chart peak: ~450 MByte/sec.)

Grid of Grids

OSG enables single PIs and user communities to operate transparently across Grid boundaries globally.

E.g.: CMS, a particle physics experiment.


CMS Experiment - a global community grid

(World map: CMS sites and regions spanning OSG and EGEE - CERN; USA@FNAL, Caltech, Florida, MIT, Purdue, UCSD, UNL, Wisconsin; Germany, France, Italy, UK, Taiwan.)

Data & jobs moving locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to global.


Grid of Grids - Production Interop

Job submission:
— 16,000 jobs per day submitted across EGEE & OSG via the “LCG RB”.
— Jobs brokered transparently onto both grids.

Data transfer:
— Peak IO of 5 Gbps from FNAL to 32 EGEE and 7 OSG sites.
— All 8 CMS sites on OSG have exceeded the 5 TB/day goal.
— Caltech, FNAL, Purdue, UCSD/SDSC, UFL, UW exceed 10 TB/day.


CMS Xfer FNAL to World

The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global transfer challenge. Peak transfer rates of ~5 Gbps are reached.


Summary of OSG Status

OSG facility opened July 22nd, 2005.

OSG facility is under steady use:
— ~2000-3000 jobs at all times.
— Mostly HEP, but large Bio/Eng/Med use occasionally.
— Moderate other physics (Astro/Nuclear).

OSG project:
— 5 year proposal to DOE & NSF funded starting FY07.
— Facility & Improve/Expand/Extend/Interoperate & E&O.

Off to a running start … but lots more to do:
— Routinely exceeding 1 Gbps at 6 sites; scale by x4 by 2008, and many more sites.
— Routinely exceeding 1000 running jobs per client; scale by at least x10 by 2008.
— Have reached 99% success rate for 10,000 jobs per day submission; need to reach this routinely, even under heavy load.