TRANSCRIPT
Introduction to Grid Computing
CI-Days workshop Clemson University, Clemson, SC May 19, 2008
Scaling up Science: Citation Network Analysis in Sociology
[Figure: snapshots of a growing citation network, 1975-2002]
Work of James Evans, University of Chicago, Department of Sociology
Scaling up the analysis
- Query and analysis of 25+ million citations
- Work started on desktop workstations; queries grew to month-long duration
- With data distributed across the U of Chicago TeraPort cluster, 50 (faster) CPUs gave a 100X speedup
- Many more methods and hypotheses can be tested!

Higher throughput and capacity enable deeper analysis and broader community access.
Computing clusters have commoditized supercomputing
[Figure: a typical cluster: a cluster-management "frontend"; a few head nodes, gatekeepers and other service nodes; lots of worker nodes; I/O servers (typically RAID fileservers) with disk arrays; and tape backup robots]
Initial Grid driver: High Energy Physics
[Figure: the tiered LHC data-distribution model. The detector's online system produces ~PBytes/sec; the CERN Computer Centre (Tier 0) feeds an offline processor farm of ~20 TIPS at ~100 MBytes/sec. Tier 1 regional centres (FermiLab at ~4 TIPS; France, Italy, and Germany regional centres) connect at ~622 Mbits/sec, or by air freight (deprecated). Tier 2 centres of ~1 TIPS each (e.g. Caltech) connect onward at ~622 Mbits/sec and hold physics data caches. Institute servers (~0.25 TIPS) deliver data to physicist workstations (Tier 4) at ~1 MBytes/sec. 1 TIPS is approximately 25,000 SpecInt95 equivalents.]

There is a "bunch crossing" every 25 nsecs and 100 "triggers" per second; each triggered event is ~1 MByte in size.

Physicists work on analysis "channels". Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.

Image courtesy Harvey Newman, Caltech
Grids can process vast datasets. Many HEP and Astronomy experiments consist of:
- Large datasets as inputs (find datasets)
- "Transformations" which work on the input datasets (process)
- The output datasets (store and publish)

The emphasis is on the sharing of the large datasets. When the programs are independent, they can be run in parallel.
[Figure: Montage workflow of ~1200 jobs in 7 levels, alternating data-transfer and compute-job nodes (NVO, NASA, ISI/Pegasus; Deelman et al.). The mosaic of M42 was created on TeraGrid.]
PUMA: Analysis of Metabolism

The PUMA Knowledge Base holds information about proteins analyzed against ~2 million gene sequences. The analysis runs on the Grid and involves millions of BLAST, BLOCKS, and other processes.

Natalia Maltsev et al., http://compbio.mcs.anl.gov/puma2
Mining seismic data for hazard analysis (Southern California Earthquake Center)

[Figure: InSAR image of the Hector Mine earthquake]
- A satellite-generated Interferometric Synthetic Aperture Radar (InSAR) image of the 1999 Hector Mine earthquake
- Shows the displacement field in the direction of radar imaging
- Each fringe (e.g., from red to red) corresponds to a few centimeters of displacement

[Figure: inputs to the seismic hazard model: seismicity, paleoseismology, local site effects, geologic structure, faults, stress transfer, crustal motion, crustal deformation, seismic velocity structure, rupture dynamics]
Grids Provide Global Resources To Enable e-Science
Ian Foster's Grid Checklist

A Grid is a system that:
- Coordinates resources that are not subject to centralized control
- Uses standard, open, general-purpose protocols and interfaces
- Delivers non-trivial qualities of service
Virtual Organizations (VOs)

Groups of organizations that use the Grid to share resources for specific purposes:
- Support a single community
- Deploy compatible technology and agree on working policies (security policies are difficult)
- Deploy different network-accessible services: Grid Information, Grid Resource Brokering, Grid Monitoring, Grid Accounting
What is a grid made of? Middleware.
[Figure: a grid client (application user interface, grid middleware, and resource, workflow and data catalogs) speaks grid protocols to grid storage and to three computing clusters, each running grid middleware: grid resources dedicated by UC, IU and Boston; grid resource time purchased from a commercial provider; and grid resources shared by OSG, LCG and NorduGRID]
- Security to control access and protect communication (GSI)
- Directory to locate grid sites and services (VORS, MDS)
- Uniform interface to computing sites (GRAM)
- Facility to maintain and schedule queues of work (Condor-G)
- Fast and secure data set mover (GridFTP, RFT)
- Directory to track where datasets live (RLS)
We will focus on Globus and Condor.

Condor provides both client & server scheduling. In grids, Condor provides an agent to queue, schedule and manage work submission.

The Globus Toolkit (GT) provides the lower-level middleware:
- Client tools which you can use from a command line
- APIs (scripting languages, C, C++, Java, ...) to build your own tools, or to use directly from applications
- Web service interfaces
- Higher-level tools built from these basic components, e.g. Reliable File Transfer (RFT)
Grid software is like a "stack"

From the top down:
- Grid Application (often includes a Portal)
- Workflow system (explicit or ad-hoc)
- Condor Grid Client
- Core Globus Services: Job Management, Data Management, and Information (config & status) Services, resting on the Grid Security Infrastructure
- Standard Network Protocols
Grid architecture is evolving to a Service-Oriented approach.

[Figure: users compose and invoke workflows of application services; a service-oriented grid infrastructure provisions physical resources underneath]

- Service-oriented grid infrastructure: provisions physical resources to support application workloads
- Service-oriented applications: wraps applications as services and composes them into workflows

"The Many Faces of IT as Service", Foster & Tuecke, 2005

...but this is beyond our workshop's scope. See "Service-Oriented Science" by Ian Foster.
Job and resource management
A Local Resource Manager (LRM) is a batch scheduler for running jobs on a computing cluster. Popular LRMs include:
- PBS - Portable Batch System
- LSF - Load Sharing Facility
- SGE - Sun Grid Engine
- Condor - originally for cycle scavenging, Condor has evolved into a comprehensive system for managing computing

LRMs execute on the cluster's head node. The simplest LRM lets you "fork" jobs quickly: it runs on the head node (gatekeeper) for fast utility functions, with no queuing (though queuing is emerging to "throttle" heavy loads). In GRAM, each LRM is handled with a "job manager"; a minimal direct LRM submission is sketched below.
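To make this concrete, here is a minimal sketch of driving one such LRM (PBS) directly from the cluster's head node. The script name, resource requests, and paths are illustrative, not site-specific:

    # myjob.pbs - a minimal PBS batch script (illustrative)
    #PBS -N hostname-test                 # job name
    #PBS -l nodes=1,walltime=00:05:00     # request one node for five minutes
    cd $PBS_O_WORKDIR                     # start in the directory qsub was run from
    /bin/hostname                         # the actual work: print the worker node's name

Submit it and watch it move through the queue with:

    qsub myjob.pbs
    qstat -u $USER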
GRAM provides a uniform interface to diverse cluster schedulers.
[Figure: users in four VOs submit work through the Grid to GRAM, which hands jobs to Condor at Site A, PBS at Site B, LSF at Site C, and UNIX fork() at Site D]
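With pre-WS GRAM from the Globus Toolkit, one client command reaches any of these schedulers; only the job-manager suffix in the contact string changes. A minimal sketch (the gatekeeper host is hypothetical, and the job managers your site deploys may differ):

    grid-proxy-init                                               # obtain a GSI proxy first
    globus-job-run gk.example.edu/jobmanager-fork /bin/hostname   # run via fork on the head node
    globus-job-run gk.example.edu/jobmanager-pbs  /bin/hostname   # run via the PBS queue

Pointing the same command at jobmanager-condor or jobmanager-lsf runs the job under those LRMs instead.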
Condor-G: Grid Job Submission Manager

[Figure: a Condor-G queue at Organization A holds myjob1 through myjob5 and submits them over the Globus GRAM protocol to the gatekeeper & job manager at Organization B, which submits them to the local LRM]
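Condor-G jobs are described in an ordinary Condor submit file whose universe points at a remote GRAM gatekeeper. A minimal sketch with a hypothetical gatekeeper (Condor versions of this era use "universe = grid"; older ones used "universe = globus"):

    # myjob1.sub - submit through Condor-G to a remote GRAM site (illustrative)
    universe      = grid
    grid_resource = gt2 gk.example.edu/jobmanager-pbs
    executable    = /bin/hostname
    output        = myjob1.out
    error         = myjob1.err
    log           = myjob1.log
    queue

    condor_submit myjob1.sub     # queue it locally; Condor-G forwards it via GRAM
    condor_q                     # watch the job in the local Condor-G queue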
Data Management
Data management services provide the mechanisms to find, move and share data:
- GridFTP: fast, flexible, secure, ubiquitous data transport; often embedded in higher-level services
- RFT: a reliable file transfer service built on GridFTP
- Replica Location Service (RLS): tracks multiple copies of data for speed and reliability
- Emerging services integrate replication and transport
- Metadata management services are evolving
GridFTP examples:

    globus-url-copy file:///home/YOURLOGIN/dataex/myfile \
        gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/YOURLOGIN/ex1

    globus-url-copy gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/YOURLOGIN/ex2 \
        gsiftp://tp-osg.ci.uchicago.edu/YOURLOGIN/ex3

The first command copies a local file up to a GridFTP server; the second is a third-party transfer, moved directly between two GridFTP servers.
Grids replicate data files for faster access:
- Effective use of the grid resources through more parallelism
- Each logical file can have multiple physical copies
- Avoids single points of failure
- Replication may be manual or automatic; automatic replication considers the demand for a file, transfer bandwidth, etc.
File Catalog Services: Replica Location Service (RLS), PhEDEx, RefDB/PupDB

Requirements:
- Abstract the logical file name (LFN) for a physical file
- Maintain the mappings between the LFNs and the PFNs (physical file names)
- Maintain the location information of a file

File catalogues tell you where the data is.
RLS - Replica Location Service
- A component of the data grid architecture (a Globus component)
- Provides access to mapping information from logical names to physical names of items
- Its goal is to reduce access latency and to improve data locality, robustness, scalability and performance for distributed applications
- RLS produces local replica catalogs (LRCs), which represent mappings between logical and physical files scattered across the storage system
- For better performance, the LRC can be indexed
RLS maps logical filenames to physical filenames.
- Logical Filenames (LFNs) name a file with interesting data in it; an LFN doesn't refer to location (which host, or where in a host)
- Physical Filenames (PFNs) refer to a file on some filesystem somewhere; gsiftp:// URLs are often used to specify them

There are two RLS catalogs: the Local Replica Catalog (LRC) and the Replica Location Index (RLI).

The Local Replica Catalog (LRC) stores mappings from LFNs to PFNs. Interaction:
- Q: Where can I get filename 'experiment_result_1'?
- A: You can get it from gsiftp://gridlab1.ci.uchicago.edu/home/benc/r.txt

It is undesirable to have a single LRC for the whole grid: it would hold lots of data and be a single point of failure.

The Replica Location Index (RLI) stores mappings from LFNs to LRCs. Interaction:
- Q: Who can tell me about filename 'experiment_result_1'?
- A: You can get more info from the LRC at gridlab1 (then go ask that LRC for more info)

The failure of one RLI or LRC doesn't break everything, and an RLI stores a reduced set of information, so it can cope with many more mappings.
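The RLS ships with a command-line client, globus-rls-cli, for registering and querying these mappings. A minimal sketch, assuming a hypothetical server at rls://rls.example.edu:

    # register an LFN -> PFN mapping in the LRC
    globus-rls-cli create experiment_result_1 gsiftp://gridlab1.ci.uchicago.edu/home/benc/r.txt rls://rls.example.edu

    # ask the LRC which PFNs hold this LFN
    globus-rls-cli query lrc lfn experiment_result_1 rls://rls.example.edu

    # ask the RLI which LRCs know about this LFN
    globus-rls-cli query rli lfn experiment_result_1 rls://rls.example.edu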
[Figure: a Globus RLS deployment across two sites. Site A (rls://serverA:39281) stores file1 and file2; its LRC maps file1 -> gsiftp://serverA/file1 and file2 -> gsiftp://serverA/file2, while its RLI records that file3 and file4 are catalogued at rls://serverB. Site B (rls://serverB:39281) stores file3 and file4; its LRC maps file3 -> gsiftp://serverB/file3 and file4 -> gsiftp://serverB/file4, and its RLI points file1 and file2 back at rls://serverA.]
What is SRM?

Storage Resource Managers (SRMs) are middleware components whose function is to provide:
- dynamic space allocation
- file management

on shared storage resources on the Grid. Different implementations for underlying storage systems are based on the same SRM specification.
SRM's role in the data grid architecture:
- Shared storage space allocation & reservation, important for data-intensive applications
- Get/put files from/into spaces; archive files on mass storage systems
- File transfers from/to remote sites, file replication
- Negotiate transfer protocols
- File and space management with lifetime
- Support for non-blocking (asynchronous) requests
- Directory management
- Interoperate with other SRMs
SRM: main concepts
- Space reservations
- Dynamic space management
- Pinning files in spaces
- Support for an abstract file name: the Site URL
- Temporary assignment of file names for transfer: the Transfer URL
- Directory management and authorization
- Transfer protocol negotiation
- Support for peer-to-peer requests
- Support for asynchronous multi-file requests
- Support for abort, suspend, and resume operations
- Non-interference with local policies
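In practice a client asks the SRM for the file rather than talking to the storage system directly. A minimal sketch using srmcp, one widely deployed SRM client; the endpoint, port, and paths are hypothetical, and the SRM negotiates an underlying transfer protocol (typically GridFTP):

    # stage a file out of SRM-managed storage to local disk
    srmcp srm://se.example.edu:8443/data/myfile file:////tmp/myfile

    # write a local file back into SRM-managed space
    srmcp file:////tmp/myfile srm://se.example.edu:8443/data/myfile.copy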
The Globus-Based LIGO Data Grid

[Figure: LIGO Gravitational Wave Observatory data grid sites, including Birmingham, Cardiff, and AEI/Golm]

Replicating >1 Terabyte/day to 8 sites; >40 million replicas so far; MTBF = 1 month.
Grid Security
Grid security is a crucial component:
- Problems being solved might be sensitive
- Resources are typically valuable
- Resources are located in distinct administrative domains; each resource has its own policies, procedures, security mechanisms, etc.
- The implementation must be broadly available & applicable: standard, well-tested, well-understood protocols, integrated with a wide variety of tools
Grid Security Infrastructure - GSI

GSI provides secure communications for all the higher-level grid services: secure authentication and authorization.
- Authentication ensures you are whom you claim to be (ID card, fingerprint, passport, username/password)
- Authorization controls what you are permitted to do (run a job, read or write a file)

GSI provides uniform credentials and single sign-on: the user authenticates once, then can perform many tasks.
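Single sign-on in GSI works through short-lived proxy certificates derived from your long-lived credential. A minimal sketch using the standard Globus commands (the gatekeeper name is hypothetical):

    grid-proxy-init                 # type your passphrase once; creates a short-lived proxy
    grid-proxy-info                 # show the proxy's subject and remaining lifetime
    globus-job-run gk.example.edu/jobmanager-fork /bin/date   # no further password prompts
    grid-proxy-destroy              # discard the proxy when you are done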
National Grid Cyberinfrastructure
Open Science Grid (OSG) provides shared computing resources, benefiting a broad set of disciplines.
- OSG incorporates advanced networking and focuses on general services, operations, and end-to-end performance
- Composed of a large number (>50 and growing) of shared computing facilities, or "sites"

http://www.opensciencegrid.org/

A consortium of universities and national laboratories, building a sustainable grid infrastructure for science.
[Figure: www.opensciencegrid.org, showing a diverse job mix. Open Science Grid: 70 sites (20,000 CPUs) & growing; 400 to >1000 concurrent jobs; many applications + CS experiments, including long-running production operations]
TeraGrid provides vast resources via a number of huge computing facilities.
To efficiently use a Grid, you must locate and monitor its resources:
- Check the availability of different grid sites
- Discover different grid services
- Check the status of "jobs"
- Make better scheduling decisions with information maintained on the "health" of sites
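With the client tools introduced earlier, both the queue and individual jobs can be polled; the job contact below is a placeholder of the form GRAM hands back at submission time:

    condor_q                          # jobs in your local Condor(-G) queue
    condor_status                     # resources visible to your Condor pool
    globus-job-status <job-contact>   # state of one GRAM job (e.g. PENDING, ACTIVE, DONE)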
OSG Resource Selection Service: VORS
Grid Workflow
A typical workflow pattern in image analysis runs many filtering apps.

[Figure: an fMRI image-analysis workflow. Four anatomical volumes (3a, 4a, 5a, 6a, each a header/image pair such as 3a.h/3a.i) are each aligned against the reference pair ref.h/ref.i by an align_warp job (jobs 1, 3, 5, 7), producing warp files (e.g. 3a.w); a reslice job (2, 4, 6, 8) turns each into a resliced pair (e.g. 3a.s.h/3a.s.i); softmean (9) averages them into atlas.h/atlas.i; slicer jobs (10, 12, 14) cut x, y and z slices (atlas_x.ppm, atlas_y.ppm, atlas_z.ppm); and convert jobs (11, 13, 15) turn each into a .jpg.]

Workflow courtesy James Dobson, Dartmouth Brain Imaging Center
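Workflows like this are typically expressed as a DAG of jobs and run by an engine such as Condor DAGMan. A minimal sketch covering one branch of the figure above (the job names and submit files are hypothetical; the published runs used higher-level Pegasus-style tooling on top of this):

    # imaging.dag - one branch: align_warp, then reslice, then the shared softmean
    JOB  warp1    align_warp_1.sub
    JOB  reslice1 reslice_1.sub
    JOB  softmean softmean.sub
    PARENT warp1    CHILD reslice1
    PARENT reslice1 CHILD softmean

    condor_submit_dag imaging.dag    # DAGMan releases each job only when its parents finish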
Conclusion: Why Grids?
New approaches to inquiry based on:
- Deep analysis of huge quantities of data
- Interdisciplinary collaboration
- Large-scale simulation and analysis
- Smart instrumentation
- Dynamically assembled resources to tackle a new scale of problem

All enabled by access to resources & services without regard for location & other barriers.
Grids: Because Science needs community ...
- Teams organized around common goals: people, resources, software, data, instruments...
- With diverse membership & capabilities: expertise in multiple areas is required
- And geographic and political distribution: no single location or organization possesses all the required skills and resources
- Teams must adapt as a function of the situation: adjust membership, reallocate responsibilities, renegotiate resources
What's next?

- Take the self-paced course: opensciencegrid.org/OnlineGridCourse
- Periodically check the OSG site: opensciencegrid.org
- See the next-steps page: https://twiki.grid.iu.edu/twiki/bin/view/Education/NextSteps
- Contact us:
Acknowledgements (OSG, Globus, Condor teams and associates)
Gabrielle Allen, LSU
Rachana Ananthakrishnan, ANL
Ben Clifford, UChicago
Alex Sim, BNL
Alain Roy, UW - Madison
Mike Wilde, ANL