
Post on 20-Dec-2015


Page 1

Globus activities within INFN

Massimo Sgaravatto
INFN Padova

for the INFN Globus group
globus@infn.it

Page 2

Globus activities within INFN
  WP “Installation and Evaluation of the Globus Toolkit” of the INFN-GRID Project
  Goal: evaluate the Globus toolkit as a GRID framework providing basic services
    Which services can be useful?
    What is necessary to integrate/modify?
    What is missing?
  Duration: 6 months
  Results of this first evaluation used to plan future activities

Page 3

Tasks
  Security
  Information Service
  Resource Manager
  Globus deployment
  Data Access and Migration
  Fault Monitoring
  Execution Environment Management

Page 4

Status
  Globus installed on ~35 machines in 11 sites

[Figure: map with site labels: TORINO, PADOVA, BARI, PALERMO, FIRENZE, PAVIA, MILANO, GENOVA, NAPOLI, CAGLIARI, TRIESTE, ROMA, PISA, L’AQUILA, CATANIA, BOLOGNA, UDINE, TRENTO, PERUGIA, LNF, LNGS, SASSARI, LECCE, LNS, LNL, SALERNO, COSENZA, S.Piero, FERRARA, PARMA, CNAF, ROMA2]

Page 5

Security (GSI)
  Already done:
    Evaluation of the Globus security architecture
      We like the general architecture, but:
        Granting local "identities" based only on certificate subjects allows the existence of multiple valid certificates for the same subject
        Authentication library not in sync with OpenSSL development
        Cryptic diagnostics (e.g. "certificate chain too long" when the CA policy check fails)
    Globus certificates (for hosts and users) signed by INFN certification authority

Page 6

Security (GSI)
  To do:
    Definition and implementation of architecture of CAs
      Up to task force of the DataGrid project
    Make certificate requests easier
    Periodic update of CRL
    “Management” of grid-mapfile updates
      E.g.: a certain Globus resource must be available to all members of a specific physics group
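
A minimal sketch of how grid-mapfile updates for a physics group could be automated, assuming the group membership is available as a list of certificate subjects. The function name, the example subjects and the local account `cmsgrid` are hypothetical; only the line layout (quoted subject followed by a local user name) follows the usual grid-mapfile format.

```python
def grid_mapfile_lines(group_members, local_account):
    """Return grid-mapfile lines mapping every subject to one local account."""
    lines = []
    # Deduplicate and sort so that regenerated files are stable.
    for subject in sorted(set(group_members)):
        # The subject is quoted because distinguished names contain spaces.
        lines.append(f'"{subject}" {local_account}')
    return lines


# Hypothetical members of a physics group (not real certificate subjects).
cms_members = [
    "/C=IT/O=INFN/L=Padova/CN=Example User One",
    "/C=IT/O=INFN/L=Bologna/CN=Example User Two",
]

for line in grid_mapfile_lines(cms_members, "cmsgrid"):
    print(line)
```

Regenerating the whole block for a group from one membership list avoids hand-editing the file on every Globus resource.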

Page 7

Information Service (GIS)
  Already done:
    INFN MDS server serving Globus 1.1.1 and 1.1.2 installations
      Lots of problems using the “default” American MDS server
    Definition and implementation of test architecture of GIS (for Globus 1.1.3)
    Web interface for browsing

Page 8

GIS Architecture (test phase)

[Figure: GIIS hierarchy
  Top Level INFN GIIS: dc=infn,dc=it,o=grid
  Bologna GIIS: dc=bo,dc=infn,dc=it,o=grid
  Milano GIIS: dc=mi,dc=infn,dc=it,o=grid
  INFN ATLAS GIIS: exp=atlas,o=grid
  GRIS
  Legend: Implemented / Implemented using INFNGRID distribution / To be implemented]
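
The naming hierarchy in the figure can be checked mechanically. The following is a small sketch, assuming plain comma-separated distinguished names without escaped commas; the helper names `parse_dn` and `is_below` are illustrative and not part of any Globus or LDAP library.

```python
def parse_dn(dn):
    """Split a DN into normalized (attribute, value) pairs, most specific first."""
    pairs = []
    for component in dn.split(","):
        attr, value = component.split("=", 1)
        # Attribute names are compared case-insensitively; values are lowered
        # here too, which is a simplification for this sketch.
        pairs.append((attr.strip().lower(), value.strip().lower()))
    return pairs


def is_below(child_dn, parent_dn):
    """True if child_dn lies strictly below parent_dn in the naming tree."""
    child, parent = parse_dn(child_dn), parse_dn(parent_dn)
    return len(child) > len(parent) and child[-len(parent):] == parent


top = "dc=infn,dc=it,o=grid"
print(is_below("Dc=bo, Dc=infn,dc=it,o=grid", top))  # site GIIS under the top level
print(is_below("exp=atlas,o=grid", top))             # experiment GIIS: separate branch
```

Under these assumptions the site-level names fall under the top-level INFN name, while the experiment name sits in a separate branch under `o=grid`.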

Page 9

Information Service (GIS)
  To do:
    Netscape LDAP server as Top level INFN GIIS
    Tests on performance and scalability
      Results used to define and implement the GIS architecture
    Review the information gathered from the various machines and published in the GIS
    Other tools and interfaces for Grid users and administrators

Page 10

Resource Management (GRAM)
  Already done:
    Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)
    GRAM as uniform interface to different underlying resource management systems (LSF, Condor, PBS)
    Some bugs found and fixed
      Standard output and error for vanilla Condor jobs
      globus-job-status
      …
    Some bugs can be solved without major re-design and/or re-implementation:
      For LSF the RSL parameter (count=x) is translated into: bsub -n x …
        Should be: bsub … x times
      …
    Two major problems:
      Scalability
      Fault tolerance
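
The difference between the two LSF translations of (count=x) can be sketched as follows. This only builds command lines and does not submit anything; the function names are hypothetical, and the job command is a placeholder. `bsub -n x` asks LSF for x slots for a single job, whereas the behaviour asked for on the slide is x independent submissions.

```python
def bsub_commands_observed(count, job_command):
    """Translation reported on the slide: one submission asking for `count` slots."""
    return [["bsub", "-n", str(count)] + list(job_command)]


def bsub_commands_expected(count, job_command):
    """Translation the slide asks for: `count` independent submissions."""
    return [["bsub"] + list(job_command) for _ in range(count)]


job = ["/diskCms/startcmsim.sh"]
print(bsub_commands_observed(3, job))
print(bsub_commands_expected(3, job))
```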

Page 11

Globus GRAM Architecture

[Figure: Client on pc1; Globus front-end machine pc2 running the Jobmanager; LSF/Condor/PBS/… behind it; Job]

pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \
       -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/Cmsim/filename)(count=1)

Page 12

Scalability
  One jobmanager for each globusrun
  If I want to submit 1000 jobs ???
    1000 globusrun
    1000 jobmanagers running on the front-end machine !!!

% globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)(count=1000)

  It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
    $(Process) in Condor
  Problems with job monitoring (globus-job-status)
  Therefore (count=x) with x>1 not very useful !
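
Since a single RSL with (count=1000) cannot name per-job files, one workaround is to generate one RSL string per job, with an index playing the role that $(Process) plays in Condor. The sketch below assumes a hypothetical file naming scheme (`input_<n>`, `output_<n>`); each generated RSL would still need its own globusrun, so it does not remove the one-jobmanager-per-submission cost described above.

```python
def rsl_for_job(index, executable, in_dir, out_dir):
    """Build the RSL string for one job; file names are a hypothetical scheme."""
    attributes = [
        ("executable", executable),
        ("stdin", f"{in_dir}/input_{index}"),
        ("stdout", f"{out_dir}/output_{index}"),
        ("count", "1"),
    ]
    # An RSL conjunction is "&" followed by parenthesized attribute=value pairs.
    return "&" + "".join(f"({name}={value})" for name, value in attributes)


rsl_strings = [
    rsl_for_job(i, "/diskCms/startcmsim.sh", "/diskCms/PythiaOut", "/diskCms/CmsimOut")
    for i in range(1000)
]
print(rsl_strings[0])
```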

Page 13

Fault tolerance
  The jobmanager is not persistent
  If the jobmanager can’t be contacted, Globus assumes that the job(s) have been completed
  Example of problem
    Submission of n jobs on a cluster managed by a local resource management system
    Reboot of the front-end machine
    The jobmanager(s) don’t restart
      Orphan jobs
      Globus assumes that the jobs have been successfully completed
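
The failure mode can be made concrete with a toy model. This is not Globus code: `status_as_described` only mirrors the behaviour reported on the slide, and `status_with_fallback` is one hypothetical alternative in which the local resource management system is consulted before concluding anything. A status of `None` stands for "could not be contacted".

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    UNKNOWN = "unknown"


def status_as_described(jobmanager_status):
    """Behaviour reported on the slide: an unreachable jobmanager means DONE."""
    if jobmanager_status is None:
        return Status.DONE
    return jobmanager_status


def status_with_fallback(jobmanager_status, local_status):
    """Hypothetical alternative: ask the local resource manager before deciding."""
    if jobmanager_status is not None:
        return jobmanager_status
    if local_status is not None:
        return local_status
    return Status.UNKNOWN


# Front-end rebooted: the jobmanager is gone, but the job still runs in the cluster.
print(status_as_described(None))
print(status_with_fallback(None, Status.RUNNING))
```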

Page 14

Resource Management (GRAM)
  Already done:
    Submission of Condor jobs to Globus resources (Condor-G and GlideIn mechanisms)
    Evaluation of RSL as uniform language to specify resources
      The RSL syntax model seems suitable to define even complicated resource specification expressions
      The common set of RSL attributes is often not sufficient
      The attributes not belonging to the common set are ignored
      More flexibility is required
        Resource administrators should be allowed to define new attributes and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
      Same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach
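
The idea of describing offers and requests in the same language can be illustrated with a toy matchmaker. This is a simplified sketch in the spirit of the Class-Ads model, not the Condor ClassAd language or its API; all attribute names, including the site-defined `cmsim_installed`, are hypothetical.

```python
def matches(request, offer):
    """Symmetric match: each side's requirements must accept the other side."""
    return (request["requirements"](offer["attributes"])
            and offer["requirements"](request["attributes"]))


# Offer published by a resource administrator. "cmsim_installed" stands for an
# attribute defined by the site, outside any common attribute set.
farm_offer = {
    "attributes": {"arch": "INTEL", "memory_mb": 512, "cmsim_installed": True},
    "requirements": lambda job: job.get("group") == "cms",
}

# Request written by a user, using the same structure as the offer.
job_request = {
    "attributes": {"group": "cms", "owner": "example_user"},
    "requirements": lambda res: (res.get("cmsim_installed", False)
                                 and res.get("memory_mb", 0) >= 256),
}

print(matches(job_request, farm_offer))
```

Because both sides use one structure, an administrator can add an attribute and users can refer to it at once, which is the flexibility the common RSL attribute set lacks.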

Page 15

Resource Management (GRAM)
  Already done:
    “Cooperation” between GRAM and GIS
      The information on characteristics and status of local resources and on jobs is not enough
      As local resources we must consider farms and not the single workstations
      Other information (e.g. total and available CPU power) needed
      The default schema must be integrated with other info provided by the underlying resource management systems or by specific agents
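
A sketch of the farm-level view argued for above, assuming the local resource management system can report a power figure and an idle flag per node. The record layout, the field names and the power units are hypothetical.

```python
def farm_summary(nodes):
    """Aggregate per-node records into one farm-level record."""
    total = sum(node["cpu_power"] for node in nodes)
    available = sum(node["cpu_power"] for node in nodes if node["idle"])
    return {
        "nodes": len(nodes),
        "total_cpu_power": total,
        "available_cpu_power": available,
    }


# Hypothetical node records, as an underlying batch system might report them.
nodes = [
    {"name": "node01", "cpu_power": 20, "idle": True},
    {"name": "node02", "cpu_power": 20, "idle": False},
    {"name": "node03", "cpu_power": 35, "idle": True},
]
print(farm_summary(nodes))
```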

Page 16

GRAM & Condor & GIS

Page 17

GRAM & LSF & GIS

Must be fixed

Page 18

Jobs & GIS
  Info on Globus jobs published in the GIS:
    User
      Subject of certificate
      Local user name
    RSL string
    Globus job id
    LSF/Condor/… job id
    Status: Run/Pending/…
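
The per-job information listed above maps naturally onto a small record type. The sketch below is illustrative only: the field names and the `name: value` rendering are assumptions, not the actual GIS schema, and the example values are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass
class GlobusJobEntry:
    certificate_subject: str
    local_user: str
    rsl: str
    globus_job_id: str
    local_job_id: str  # id assigned by LSF, Condor, ...
    status: str        # e.g. "Run" or "Pending"


def render(entry):
    """Render the record as name: value lines, in field declaration order."""
    return "\n".join(f"{name}: {value}" for name, value in asdict(entry).items())


entry = GlobusJobEntry(
    certificate_subject="/C=IT/O=INFN/CN=Example User",
    local_user="example_user",
    rsl="&(executable=/diskCms/startcmsim.sh)(count=1)",
    globus_job_id="example-globus-id",
    local_job_id="example-local-id",
    status="Pending",
)
print(render(entry))
```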

Page 19

Resource Management (GRAM)
  To do:
    Tests with GRAM API
    Tests with real applications and real environments (CMS fall production)
      Already started
      Memory leak in the job manager ?!?!?!?
    Solve the problems
    Identify a set of useful attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and integrate the default schema
      Let’s start with information provided by the underlying resource management system
      Second step: specific agents

Page 20

GRID Globus deployment
  Tools to enable local administrators to deploy the GRID software (now Globus 1.1.3 and related packages: OpenLDAP, …)
    Reduce complexity and manpower necessary for installation
    Decrease errors during installations
    Collect bug fixes
    Include INFN customizations
      Certificates (for hosts and users) signed by INFN CA
        … but user certificates signed by Globus CA are accepted as well
      Preliminary architecture for GIS

Page 21

GRID First step (July 2000)
  Software distribution available on AFS
    Fixes for bugs found during first Globus evaluations included
  INFNGRID installation guide
    Instructions for INFN customizations included
  Scripts to make certain steps (e.g. post-install operations) automatic

Page 22

GRID Second step (now)
  Pre-compiled distribution (available now for Linux Red Hat 6.1): INFNGRID 1.1
  Script for installation and deployment: infngrid-install
    Users decide to use INFN customizations or “standard” setup

Would you like the INFN setup (Y/N) ?

(1) Copy INFNGRID tar files from /afs/infn.it/project/infngrid/1.1/Linux to download dir
(2) Decompress and untar INFNGRID distribution files in install dir
(3) Configure INFNGRID software
(4) Globus Setup
(5) Configure GRAM services
(6) Globus local deploy
(7) GIIS Configuration
====================================================

Condor and LSF

Page 23

GRID Second step
  Script for post-install operations: globus-root-setup
  Installation instructions for special environments (configuration of client machines, shared install-directory) included
  List of included bug fixes
  Status
    Tests performed in different environments (INFN, CERN, FNAL)
    “Officially” released
    Available to DATAGRID partners

(1) Modify system files and reactivate the inetd daemon
(2) Change owner to root of certain files for tighter security
(3) Modify system wide login files
(4) Start/restart Globus now
(5) Configure gsi-wuftpd and restart the inetd daemon

Page 24

GRID Next steps
  Configuration of PBS as local resource management system: 1.2
  Support for Solaris 2.6: 1.2
    We don’t plan (at least now) to support other platforms
  Improvement of current non-pre-compiled distribution
    Possible use of infngrid-install script for both pre-compiled and non-pre-compiled distribution
  “Unattended” installation
  Management of updates
  Inclusion of GDMP: 1.2
  Inclusion of other GRID software packages ??
  Other work will be “triggered” by local administrators and users

Page 25

Data Management
  Already done:
    Preliminary tests with GASS and gsiftp
  To do:
    Tests with GlobusFTP and Replica Catalog Software (Globus Data Grid Alpha Release 2)

Page 26

GARA
  Preliminary tests considering both network and CPU advance reservation

[Figure: test setup with labels: Client, GARA API, GARA Network Resource Manager, Server; hosts sunlab3 and sunlab2; routers CISCO 7500 and CISCO 7200; links VC 100 Mbps, FE, FE]

Page 27

Other tasks
  Fault Monitoring (HBM)
    Evaluation of HBM for fault detection (for “system” and “user” processes)
    Data collectors (implementing automatic recovery mechanisms)
    … but the HBM package is not seeing active development
  Execution Environment Management (GEM)
    Evaluation of GEM as service for code migration
    … but the GEM service now provides only limited capabilities (executable staging)

Page 28

Other info http://www.pd.infn.it/~sgaravat/ INFN-GRID/Globus