GT4 GRAM: A Functionality and Performance Study

Stuart Martin, Martin Feller
Computational Institute, University of Chicago & Argonne National Lab

TeraGrid 2007
Madison, WI

Contributors / Collaborators

UC/ANL
– Ian Foster
– Peter Lane
– Jarek Gawor
– Ravi Madurri
– Rachana Ananthakrishnan

GRAM - Basic Job Submission and Control Service

A uniform service interface for remote job submission and control
– Includes file staging and I/O management
– Includes reliability features
– Supports basic Grid security mechanisms
– Asynchronous monitoring
– Interfaces with local resource managers, simplifies the job of metaschedulers/brokers

GRAM is not a scheduler.
– No scheduling
– No metascheduling/brokering

Comparison

Functionality
– Security
– File Staging
– General

Performance
– Concurrent jobs
– Sequential jobs

Security Functional Comparisons

Privilege Limiting Model

GRAM must be able to start jobs submitted by remote users under different user ids. It must execute some code as “root”.
– GRAM2: Entire gatekeeper runs as root
– GRAM4: Service with sudo privileges
  > Non-root container account requires sudo to invoke operations as other users

Authentication

A client can authenticate to GRAM using a variety of protocols
– GRAM2: TLS (only)
– GRAM4: TLS, Message Level Security
  > Message-level WS-Security
  > Channel-level WS-SecureConversation
  > Each deployment chooses which of these to support

Credential Delegation

Needed by GRAM or the user’s applications to do file staging or other grid operations
– GRAM2: Yes, Required
  > Clients must delegate from client to service on every request
– GRAM4: Yes, Optional
  > Clients can choose and delegate when necessary

Credential Refresh

Credentials have a lifetime and may expire before a job has completed execution
– GRAM2: Yes
– GRAM4: Yes
  > A client can query for information about the WS Resource of the delegated credential, such as its remaining lifetime

Share credential delegation among jobs

When repeatedly interacting with the same GRAM service, a client may want to delegate once and share the delegation among multiple jobs
– GRAM2: No
– GRAM4: Yes
  > Refreshing a credential in the delegation service that is shared among multiple job submissions results in a refresh for each job
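
A minimal sketch of why one refresh reaches every job when a delegation is shared: each job holds a reference to the same delegated credential rather than its own copy. The class and field names are hypothetical and are not the GT4 delegation service API.

```python
from dataclasses import dataclass


@dataclass
class DelegatedCredential:
    """Hypothetical stand-in for a credential held by a delegation service."""
    expires_at: float  # seconds since some epoch

    def remaining_lifetime(self, now: float) -> float:
        # Never report a negative lifetime for an expired credential.
        return max(0.0, self.expires_at - now)

    def refresh(self, new_expires_at: float) -> None:
        self.expires_at = new_expires_at


@dataclass
class Job:
    name: str
    credential: DelegatedCredential  # a shared reference, not a copy


# Delegate once, then share the same credential among three jobs.
shared = DelegatedCredential(expires_at=1_000.0)
jobs = [Job(f"job-{i}", shared) for i in range(3)]

# A single refresh is visible to every job that shares the delegation.
shared.refresh(new_expires_at=5_000.0)
lifetimes = [job.credential.remaining_lifetime(now=900.0) for job in jobs]
```

With per-job delegation, as GRAM2 requires, the same refresh would have to be repeated once per job.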

Authorization Callouts

Following authentication, GRAM checks to see if the request should be authorized. For example, a gridmap file acting as an access control list
– GRAM2: Yes - single PDP callout
– GRAM4: Yes - Multiple PDP callout chain
  > Allows for richer policies
    > Parse VOMS attributes
    > Use attributes in policy evaluations
    > Site level black lists
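
The difference between a single PDP callout and a callout chain can be sketched as follows. This is an illustration only: the request shape, the three example PDPs, and the rule that every PDP in the chain must permit are assumptions, not the GT4 authorization framework.

```python
from typing import Callable, Dict, List

Request = Dict[str, object]
PDP = Callable[[Request], bool]  # a policy decision point: permit or deny

GRIDMAP = {"/O=Grid/CN=Alice": "alice"}  # hypothetical subject-to-account map
BLACKLIST = {"/O=Grid/CN=Mallory"}       # hypothetical site-level black list


def gridmap_pdp(request: Request) -> bool:
    """Permit only subjects listed in the gridmap (an access control list)."""
    return request["subject"] in GRIDMAP


def blacklist_pdp(request: Request) -> bool:
    """Deny subjects that the site has black-listed."""
    return request["subject"] not in BLACKLIST


def voms_role_pdp(request: Request) -> bool:
    """Permit only requests carrying a (hypothetical) submitter role attribute."""
    return "Role=submitter" in request.get("voms_attributes", [])


def authorize(request: Request, chain: List[PDP]) -> bool:
    """Assumed combining rule: every PDP in the chain must permit."""
    return all(pdp(request) for pdp in chain)


alice = {"subject": "/O=Grid/CN=Alice", "voms_attributes": ["Role=submitter"]}
single_callout = authorize(alice, [gridmap_pdp])
callout_chain = authorize(alice, [gridmap_pdp, blacklist_pdp, voms_role_pdp])
```

A single-callout design corresponds to a chain of length one; the richer policies come from appending further PDPs.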

File Management Functional Comparisons

File Staging

Job staging before and after the user’s job is executed
– GRAM2: Yes
– GRAM4: Yes

File staging retry policy

If a file staging operation fails, the failure may be non-fatal and a retry may be desired
– GRAM2: None
– GRAM4: RFT Supported
  > Server defaults for all transfers can be configured
  > Defaults can be overridden for a specific transfer
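
The "server default, per-transfer override" idea can be sketched like this. The `RetryPolicy` fields and the `transfer_with_retry` helper are hypothetical names chosen for illustration; they are not RFT options.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3       # hypothetical server-wide default
    delay_seconds: float = 0.0  # pause between attempts


SERVER_DEFAULT = RetryPolicy()


def transfer_with_retry(
    do_transfer: Callable[[], None],
    policy: Optional[RetryPolicy] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run do_transfer until it succeeds; return the number of attempts used.

    A transfer-specific policy overrides the server default when given.
    """
    policy = policy or SERVER_DEFAULT
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            do_transfer()
            return attempt
        except IOError:
            # Treat the failure as non-fatal until the attempts run out.
            if attempt == policy.max_attempts:
                raise
            sleep(policy.delay_seconds)
    raise AssertionError("unreachable")
```

Passing `RetryPolicy(max_attempts=1)` for one transfer disables retries for that transfer only, while other transfers keep the default.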

Incremental output staging (“streaming”)

It can be useful to obtain access to data produced by a program as it executes.
– GRAM2: stdout/stderr only
– GRAM4: stdout/stderr and any file
  > A client can stream files via the service-side GridFTP server. This is what globusrun-ws does for stdout and stderr streaming.

Standard input access

The contents of a file can be passed to the job’s standard input
– GRAM2: Yes
– GRAM4: Yes

Throttle staging work

A GRAM submission that specifies file staging imposes load on the service node executing the GRAM service.
– GRAM2: No
– GRAM4: Yes
  > GRAM is configured for a maximum number of “worker” threads and thus a maximum number of concurrent staging operations.
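
A bounded worker pool is one way to see how a thread limit caps concurrent staging. The sketch below uses Python's standard thread pool, not the GRAM4 container; `MAX_WORKERS`, `stage_file`, and the sleep that stands in for a transfer are assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

MAX_WORKERS = 4  # hypothetical configured maximum number of worker threads


class ConcurrencyMeter:
    """Records the peak number of staging operations running at once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self) -> None:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc_info) -> None:
        with self._lock:
            self.active -= 1


def stage_file(meter: ConcurrencyMeter, name: str) -> str:
    with meter:
        time.sleep(0.01)  # stands in for the actual transfer
    return name


def run_staging(names: List[str], max_workers: int = MAX_WORKERS) -> Tuple[List[str], int]:
    """Stage every file, never running more than max_workers at a time."""
    meter = ConcurrencyMeter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        staged = list(pool.map(lambda name: stage_file(meter, name), names))
    return staged, meter.peak


staged, peak = run_staging([f"file-{i}" for i in range(20)])
```

All twenty requests complete, but the observed peak never exceeds the pool size, so the extra requests wait instead of adding load.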

Load balance staging work

Allow staging work to be load balanced among a set of service hosts
– GRAM2: No
– GRAM4: Yes
  > Staging work can be distributed over several “service nodes”. For example, a separate GridFTP server can be configured for each LRM type or file system path.

General Functional Comparisons

Access protocol

Protocol used to interact with the service
– GRAM2: proprietary HTTP
– GRAM4: Web Service SOAP
  > Standards-based WSDL client tooling

Job Description Language

The mechanism for specifying job directives.
– GRAM2: RSL
  > Custom string-based language
– GRAM4: JDD
  > Job description document (JDD), XML-based version
  > Initial prototype of OGF’s JSDL specification
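
To make the contrast between a string-based and an XML-based description concrete, the sketch below renders the same small job both ways. The attribute and element names (`executable`, `arguments`, `job`, `argument`) follow common RSL and GT4 job description usage as far as we know, but treat them as assumptions; quoting of arguments that contain spaces or special characters is deliberately not handled.

```python
import xml.etree.ElementTree as ET
from typing import List


def to_rsl(executable: str, arguments: List[str]) -> str:
    """Render a job as an RSL-style string of (attribute=value) relations."""
    parts = [f"(executable={executable})"]
    if arguments:
        parts.append("(arguments=" + " ".join(arguments) + ")")
    return "&" + "".join(parts)


def to_job_description_xml(executable: str, arguments: List[str]) -> str:
    """Render the same job as an XML document with one element per directive."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    for argument in arguments:
        ET.SubElement(job, "argument").text = argument
    return ET.tostring(job, encoding="unicode")


rsl_text = to_rsl("/bin/echo", ["hello"])
xml_text = to_job_description_xml("/bin/echo", ["hello"])
```

The XML form can be validated and processed with generic XML tooling, whereas the string form needs a custom parser.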

Extensible Job Description Language

A mechanism for passing “extensions” through GRAM to underlying local resource managers
– GRAM2: Yes
– GRAM4: Yes

Local Resource Manager Interface

The GRAM interface to the LRM to submit, monitor, and cancel jobs.
– GRAM2: Perl scripts
– GRAM4: Perl scripts + SEG
  > Scheduler Event Generator (SEG) provides efficient monitoring between the GRAM service and the LRM for all jobs for all users

Local Resource Managers

Supports a range of LRMs - PBS, LSF, Condor, Fork, …
– GRAM2: Yes
– GRAM4: Yes

Fault Tolerance

GRAM can recover from a container or host crash. Upon restart, GRAM will resume processing of the user’s job submissions
– GRAM2: Yes - Client initiated
  > Processing resumes for a single job after the client has restarted the job manager service process
– GRAM4: Yes - Service initiated
  > Processing resumes for all jobs once the service container has been restarted
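
Service-initiated recovery can be pictured as reloading persisted job state at startup and resuming everything that had not finished. The slides do not describe how GRAM4 persists state, so the JSON file, the state names, and the helper names below are assumptions.

```python
import json
import os
import tempfile
from typing import Dict, List

TERMINAL_STATES = {"Done", "Failed"}  # hypothetical state names


class JobStore:
    """Persists a job-id to state map so it survives a process restart."""

    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, jobs: Dict[str, str]) -> None:
        # Write to a temporary file and rename it, so a crash in the middle
        # of a save never leaves a half-written state file behind.
        temporary_path = self.path + ".tmp"
        with open(temporary_path, "w") as handle:
            json.dump(jobs, handle)
        os.replace(temporary_path, self.path)

    def load(self) -> Dict[str, str]:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)


def resume_all(store: JobStore) -> List[str]:
    """Return the ids of all jobs that still need processing after a restart."""
    return sorted(
        job_id for job_id, state in store.load().items()
        if state not in TERMINAL_STATES
    )


with tempfile.TemporaryDirectory() as directory:
    store = JobStore(os.path.join(directory, "jobs.json"))
    store.save({"a": "Active", "b": "Done", "c": "Pending"})
    # A fresh JobStore on the same path stands in for the restarted container.
    resumed = resume_all(JobStore(store.path))
```

Because recovery starts from the service's own records, no client has to reconnect before processing continues.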

State Access: Push (subscription)

Allow clients to request notifications for state changes
– GRAM2: Yes - callbacks
– GRAM4: Yes - WS Notifications
  > Clients can subscribe for notifications to the “job status” resource property

State Access: Pull

Allow clients to get the state for a previously submitted job
– GRAM2: Yes
  > The service defines a proprietary operation to get the job state.
– GRAM4: Yes
  > The service defines a WSRF resource property that contains the value of the job state. A client can then use the standard WSRF getResourceProperty operation.
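
Pull and push access to job state can both be built on one resource property, which the following in-process sketch illustrates. `JobResource` and its methods are hypothetical stand-ins; real WSRF and WS-Notification calls are SOAP operations, not Python method calls.

```python
from typing import Callable, Dict, List


class JobResource:
    """Hypothetical job resource exposing its state as a named property."""

    def __init__(self, state: str = "Pending") -> None:
        self._properties: Dict[str, str] = {"state": state}
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}

    def get_resource_property(self, name: str) -> str:
        """Pull: the client asks for the current value."""
        return self._properties[name]

    def subscribe(self, name: str, callback: Callable[[str], None]) -> None:
        """Push: the client registers to be told about later changes."""
        self._subscribers.setdefault(name, []).append(callback)

    def set_state(self, new_state: str) -> None:
        self._properties["state"] = new_state
        for callback in self._subscribers.get("state", []):
            callback(new_state)


job = JobResource()
seen: List[str] = []
job.subscribe("state", seen.append)
job.set_state("Active")
job.set_state("Done")
current = job.get_resource_property("state")
```

The subscriber sees every transition, while a polling client sees only whatever value is current when it asks.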

Audit Logging

Allow audit records to be inserted into an audit DB when a job completes
– GRAM2: Yes
– GRAM4: Yes
  > An enhancement was contributed by Gerson Galang (APAC) to insert the record at the beginning of the job and to update the audit record after submission and again at job end.

At Most Once Job Submission

A simple request-reply job submission protocol has the problem that if the reply message is lost, a client cannot know whether a job has been started. Measures need to be taken to ensure that the same job is not submitted twice.
– GRAM2: Yes - 2-phase commit
  > Requires an extra round trip, plus a delay on the service to begin processing
– GRAM4: Yes - UUID on create
  > The client supplies a client-created unique ID (UUID) and the GRAM4 service guarantees not to start a job with a duplicate ID
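
The UUID-on-create approach amounts to an idempotent create operation: resending a request with the same ID returns the existing job instead of starting a second one. A minimal sketch, with a hypothetical in-memory `JobService` in place of the real GRAM4 service:

```python
import uuid
from typing import Dict


class JobService:
    """Hypothetical service that starts at most one job per submission ID."""

    def __init__(self) -> None:
        self._handles: Dict[str, str] = {}
        self.started = 0

    def create_job(self, submission_id: str, description: str) -> str:
        # A repeated ID means the client is retrying: return the existing job.
        if submission_id in self._handles:
            return self._handles[submission_id]
        self.started += 1
        handle = f"job-{self.started}"
        self._handles[submission_id] = handle
        return handle


def submit(service: JobService, description: str, reply_lost: bool) -> str:
    """Submit a job, resending with the same ID if the first reply was lost."""
    submission_id = str(uuid.uuid4())  # created by the client, before sending
    handle = service.create_job(submission_id, description)
    if reply_lost:
        # The client cannot tell whether the job started, so it simply
        # resends; the duplicate ID makes the retry safe.
        handle = service.create_job(submission_id, description)
    return handle


service = JobService()
first = submit(service, "/bin/date", reply_lost=True)
second = submit(service, "/bin/date", reply_lost=False)
```

Unlike a two-phase commit, the common case needs only one round trip, and the retry path costs one extra request.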

Job Cancellation

Allow a job to be cancelled
– GRAM2: Yes
  > Proprietary operation
– GRAM4: Yes
  > WSRF standard “Destroy” operation

Job Lifetime Management

Allow a client to control when a job’s state is cleaned up
– GRAM2: Yes
  > Implements a set of job directives and operations
– GRAM4: Yes
  > Standard WS-ResourceLifetime operations

Maximum Active Jobs

The maximum number of jobs that the service can manage
– GRAM2: ~250
  > Because each job’s Job Manager process queries the LRM separately
– GRAM4: 32,000
  > Limited by the number of directories that can be created in a directory

Parallel Job Support

Support for MPI jobs “jobtype = MPI”
– GRAM2: Yes
– GRAM4: Yes

MPICH-G Support

Support for multi-site MPI
– GRAM2: Yes
  > Client-side DUROC and service-side DUCT service
– GRAM4: Yes
  > Multi-job and rendezvous Web Services
  > MPIg support coming soon

Basic Execution Service (BES) Interface

Support for OGSA BES for job submission
– GRAM2: No
– GRAM4: Prototyped
  > Working on plans to initially support JSDL with the current GRAM4 port type, then add support for BES too

Performance Comparisons

Concurrent Jobs (as in paper)

Stage In   Stage Out   File Clean Up   Unique Job Dir   GRAM2   GRAM4
None       None        No              No               2552    2100
1X10KB     1X10KB      No              No               2608    3779
1X10KB     1X10KB      Yes             Yes              2698    5695

Average seconds per 1000 jobs, Condor-G to GRAM to Condor LRM

Concurrent Jobs (as will be in GT 4.0.5)

Stage In   Stage Out   File Clean Up   Unique Job Dir   GRAM2   GRAM4
None       None        No              No               2552    2176
1X10KB     1X10KB      No              No               2608    2147
1X10KB     1X10KB      Yes             Yes              2698    2254

Average seconds per 1000 jobs, Condor-G to GRAM to Condor LRM
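
The two Concurrent Jobs tables are easier to compare as GRAM4-to-GRAM2 ratios, where a value below 1 means GRAM4 took less time. The short calculation below uses only the figures from the tables; the row labels and function name are ours.

```python
from typing import Dict, Tuple

# (GRAM2, GRAM4) average seconds per 1000 jobs, copied from the two tables.
AS_IN_PAPER: Dict[str, Tuple[int, int]] = {
    "no staging": (2552, 2100),
    "stage in and out": (2608, 3779),
    "stage in and out, clean up, unique dir": (2698, 5695),
}
GT_4_0_5: Dict[str, Tuple[int, int]] = {
    "no staging": (2552, 2176),
    "stage in and out": (2608, 2147),
    "stage in and out, clean up, unique dir": (2698, 2254),
}


def gram4_to_gram2_ratio(table: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Ratio of GRAM4 time to GRAM2 time per row, rounded to two decimals."""
    return {row: round(gram4 / gram2, 2) for row, (gram2, gram4) in table.items()}


paper_ratios = gram4_to_gram2_ratio(AS_IN_PAPER)
new_ratios = gram4_to_gram2_ratio(GT_4_0_5)

for row in AS_IN_PAPER:
    print(f"{row}: paper {paper_ratios[row]}, GT 4.0.5 {new_ratios[row]}")
```

In the paper's measurements the staging rows put GRAM4 at roughly 1.4 to 2.1 times the GRAM2 time; with the GT 4.0.5 figures every row falls below 1.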

Improving performance for staging jobs

Adding local method call mechanism for general use in Java WS Core (4.0.5)
– GRAM is doing this with RFT
– Any service which calls another in-process service could make similar modifications for local calls and likely benefit from improved performance

Adding caching of the GridFTP server connections in RFT (4.0.6)
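
Connection caching saves the setup cost of a new connection for every transfer to the same server. The sketch below shows the general pattern only; `ConnectionCache` and the `connect` callable are hypothetical, and nothing here reflects how RFT implements its cache.

```python
from typing import Callable, Dict


class ConnectionCache:
    """Reuses one open connection per server instead of reconnecting."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        self._connect = connect
        self._open: Dict[str, object] = {}
        self.connects = 0  # how many real connections were opened

    def get(self, host: str) -> object:
        if host not in self._open:
            # Only the first request for a host pays the connection cost.
            self._open[host] = self._connect(host)
            self.connects += 1
        return self._open[host]


def fake_connect(host: str) -> object:
    """Stand-in for an expensive connection setup (for example a handshake)."""
    return {"host": host}


cache = ConnectionCache(fake_connect)
for _ in range(100):
    cache.get("gridftp.example.org")  # hypothetical host name
cache.get("other.example.org")
```

A production cache would also need to evict idle or broken connections, which this sketch leaves out.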

Sequential Jobs

Delegation   Stage In   Stage Out   GRAM2   GRAM4
None         None       None        N/A     1.70
Per Job      None       None        1.07    3.53
Per Job      1X10KB     None        1.78    5.57
Shared       1X10KB     None        N/A     5.41
Per Job      1X10KB     1X10KB      2.44    9.08
Shared       1X10KB     1X10KB      N/A     7.91

Average seconds per job (Fork)

Sequential Jobs

Delegation   Stage In   Stage Out   GRAM2   GRAM4
None         None       None        N/A     1.46
Per Job      None       None        1.07    3.42
Per Job      1X10KB     None        1.78    3.46
Shared       1X10KB     None        N/A     3.51
Per Job      1X10KB     1X10KB      2.44    5.25
Shared       1X10KB     1X10KB      N/A     3.67

Average seconds per job (Fork)
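
A short calculation over the GRAM4 columns of the two Sequential Jobs tables shows how much each configuration changed between the first table and the second. The numbers are copied from the tables; the row order and variable names are ours.

```python
# GRAM4 average seconds per job, in table row order:
# (none), (per job), (per job, stage in), (shared, stage in),
# (per job, stage in and out), (shared, stage in and out)
FIRST_TABLE = [1.70, 3.53, 5.57, 5.41, 9.08, 7.91]
SECOND_TABLE = [1.46, 3.42, 3.46, 3.51, 5.25, 3.67]

# Percentage reduction in seconds per job from the first table to the second.
reduction_percent = [
    round(100 * (before - after) / before)
    for before, after in zip(FIRST_TABLE, SECOND_TABLE)
]

# Saving from sharing one delegation, for jobs that stage in and out,
# using the second table (seconds per job).
shared_saving = round(SECOND_TABLE[4] - SECOND_TABLE[5], 2)

print(reduction_percent, shared_saving)
```

The largest reductions are in the rows that stage files, and sharing a delegation saves about a second and a half per job in the stage-in-and-out case of the second table.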

For More Information

Stuart Martin - [email protected]
Martin Feller - [email protected]