Enabling Grids for E-sciencE
www.eu-egee.org
The gLite Workload Management System
Alessandro Maraschini
OGF20, Manchester, UK
2007, May 10
Contents
• WMS
  – System overview, partners, task, components
• JDL
  – Language overview
  – JobTypes: single / compound
• News
  – New Functionalities
  – Latest Activities
• Future Plans
  – Future Implementations & Activities
  – WMS and WfMS
• Tests
  – Middleware Testing Activities & Results
Introduction: gLite WMS
• Workload Management System (WMS)
  – Italian and Czech clusters
    – Part of Joint Research Activity 1 (JRA1)
• Partners involved
  – INFN
  – Datamat
  – CESNET
• Provides distribution and management of tasks across resources available on a Grid
  – Accepts a job execution request from a client
  – Finds appropriate resources to satisfy the job
  – Follows the job until completion
WMS Architecture: core components
• WMProxy
  – Accepts requests from the user
  – Performs authentication/authorization checks
  – Sets up the local file system
  – Forwards the request to the WM
• Workload Manager (WM)
  – Looks for the appropriate Computing Element (CE)
    – Matchmaking operation
  – Forwards the request to CondorC
• Logging & Bookkeeping (LB)
  – Tracks jobs in terms of events gathered from various gLite components
  – Processes the incoming events to give a higher-level view of the job states
  – Recent introduction of LBProxy
    – Lightweight local LB service “dedicated” to WMS components
    – Asynchronously logs info to the actual (usually remote) LB
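The matchmaking step performed by the WM can be pictured as a filter followed by a ranking. The sketch below is only an illustration under simplifying assumptions: the `ComputingElement` fields, the `matchmake` function and the ranking rule are hypothetical and are not the actual WM implementation.

```python
from dataclasses import dataclass


@dataclass
class ComputingElement:
    # Hypothetical, simplified view of a CE; real CEs publish many more attributes.
    name: str
    free_slots: int
    max_wall_minutes: int
    estimated_wait_s: float


def matchmake(ces, min_wall_minutes):
    """Return the best CE for a job, or None if no CE satisfies it."""
    # Step 1: keep only the CEs that satisfy the job requirements.
    candidates = [
        ce for ce in ces
        if ce.free_slots > 0 and ce.max_wall_minutes >= min_wall_minutes
    ]
    if not candidates:
        return None
    # Step 2: rank the survivors; here a shorter estimated wait is better.
    return min(candidates, key=lambda ce: ce.estimated_wait_s)


ces = [
    ComputingElement("ce-a", free_slots=0, max_wall_minutes=720, estimated_wait_s=10.0),
    ComputingElement("ce-b", free_slots=4, max_wall_minutes=60, estimated_wait_s=5.0),
    ComputingElement("ce-c", free_slots=2, max_wall_minutes=720, estimated_wait_s=30.0),
]
best = matchmake(ces, min_wall_minutes=120)
print(best.name)  # ce-c: ce-a has no free slots, ce-b has too short a wall time
```

Separating the requirement filter from the ranking keeps the two user-facing notions (what a job needs, what it prefers) independent of each other.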
WMS Architecture overview
[Figure: gLite WMS architecture diagram. Components shown: UserInterface, WMProxy, Workload Manager, Job Controller, CondorC, CondorG, Log Monitor, LB Proxy, LB Server, gLite CE, LCG CE]
JDL: overview
• Job Description Language (JDL)
  – gLite approach to request description
  – Allows the user to provide the information needed for job execution
    – Characteristics of the application
    – Requirements/preferences about resources
    – Customized hints for gLite WMS on how to handle the application
• Supported Job Types
  – Single Jobs
  – Compound Jobs
    – Workflows (DAGs)
    – Collections
    – Parametric Jobs
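To give an idea of what a job description carries, the sketch below renders a Python dictionary as a bracketed list of `Attribute = value;` pairs. The slides show no JDL text, so everything here is an assumption for illustration: the `to_jdl` helper and `Expr` class are hypothetical, and the attribute names and values are examples that should be checked against the JDL specification linked at the end.

```python
class Expr(str):
    """Marks a value that must be emitted unquoted (an expression, not a string)."""


def to_jdl(attrs):
    """Render a dict as a JDL-style description (hypothetical helper)."""
    def fmt(value):
        if isinstance(value, Expr):          # checked first: Expr is also a str
            return str(value)
        if isinstance(value, str):
            return '"' + value + '"'
        if isinstance(value, (list, tuple)):
            return "{" + ", ".join(fmt(v) for v in value) + "}"
        return str(value)

    lines = [f"  {key} = {fmt(val)};" for key, val in attrs.items()]
    return "[\n" + "\n".join(lines) + "\n]"


job = {
    # Characteristics of the application
    "JobType": "Normal",
    "Executable": "/bin/hostname",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "OutputSandbox": ["std.out", "std.err"],
    # Requirement on the resources (example expression, assumed syntax)
    "Requirements": Expr('other.GlueCEStateStatus == "Production"'),
}
print(to_jdl(job))
```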
JDL: Single Types
• Single Jobs
  – Normal: a single, simple batch job with no particular requirements
  – MPICH: a parallel application to be run on the nodes of a cluster using the MPICH implementation of the Message Passing Interface
    – Support for new MPI flavours is planned
  – Interactive: a job whose standard streams are forwarded to the submitting client, which can interact with and steer the job execution by providing real-time input
• Previously Supported Job Types
  – No longer supported:
    – Checkpointable Jobs
    – Partitionable Jobs
  – Deprecated because of:
    – Lack of feedback from users
    – They do not seem to be used at all
  – Focus on improving support for “really used” job types
JDL: Compound Jobs
• Definition
  – Aggregation of Single/Normal Jobs
• Benefits
  – One-shot submission for (up to thousands of) jobs
    – Single call to WMProxy server
    – Single AuthN and AuthZ process
    – Submission time reduction
  – Single identifier to manage all jobs (father Job)
    – Not an actual Job, used to monitor the whole bunch
  – Sharing of files between jobs
JDL: Compound Types
• Compound Jobs: Workflows
  – Implemented as Directed Acyclic Graphs (DAGs)
  – Set of jobs where the input, output or execution of one or more jobs may depend on one or more other jobs
  – Dependencies represent time constraints: a child cannot start before all parents have successfully completed
[Figure: example DAG with a Father job and nodes nodeA to nodeI]
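The rule that a child cannot start before all its parents have completed amounts to a topological ordering of the DAG. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to compute successive "waves" of runnable nodes. The edges are invented for illustration, since the figure's actual dependencies are not recoverable from the transcript, and this is not how the WMS itself schedules DAGs.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each node maps to the set of parents
# that must complete successfully before it may start.
parents = {
    "nodeA": set(),
    "nodeB": set(),
    "nodeC": {"nodeA"},
    "nodeD": {"nodeA", "nodeB"},
    "nodeE": {"nodeC", "nodeD"},
}

sorter = TopologicalSorter(parents)
sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # nodes whose parents are all done
    waves.append(ready)
    sorter.done(*ready)                 # pretend every node in this wave succeeded

print(waves)  # [['nodeA', 'nodeB'], ['nodeC', 'nodeD'], ['nodeE']]
```

Nodes in the same wave have no dependencies on each other, so they could run concurrently.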
JDL: Compound Types
• Compound Jobs: Parametric Jobs
  – Parameterized description of a Job
    – Parameter sweep usage
  – Automatically converted on the WMS side
  – Generates a (possibly) huge number of (similar) jobs
  – No dependencies between nodes
[Figure: parametric job with a Father job and independent nodes nodeA to nodeE]
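The conversion of one parameterized description into many similar, independent jobs can be sketched as a simple placeholder substitution. This is a minimal illustration only: the `expand_parametric` function, the `_PARAM_` placeholder and the start/stop/step arguments are assumptions made here, as the slides do not show the real syntax or the server-side conversion.

```python
def expand_parametric(template, start, stop, step=1, placeholder="_PARAM_"):
    """Expand one parameterized description into independent jobs."""
    jobs = []
    for value in range(start, stop, step):
        # Substitute the current parameter value in every attribute.
        job = {key: val.replace(placeholder, str(value))
               for key, val in template.items()}
        jobs.append(job)
    return jobs


template = {
    "Executable": "simulate.sh",          # hypothetical application
    "Arguments": "--seed _PARAM_",
    "StdOutput": "out_PARAM_.txt",
}
jobs = expand_parametric(template, start=0, stop=6, step=2)
for job in jobs:
    print(job["Arguments"], job["StdOutput"])
# --seed 0 out0.txt
# --seed 2 out2.txt
# --seed 4 out4.txt
```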
JDL: Compound Types
• Compound Jobs: Collections
  – A set of possibly heterogeneous jobs that can be specified within a single JDL description
  – No dependencies among specified jobs
[Figure: collection with a Father job and independent nodes nodeA to nodeE]
New Functionalities: WMProxy
• WMProxy server
  – Replaces the old C++ based socket connection service
  – Implements an interoperable interface
    – Web Service based
    – WS-I compliant
• WMProxy client
  – Provides a C++ based WMS command-line User Interface (UI), which executes all the needed operations automatically
  – Provides APIs in multiple languages (C++, Java and Python)
New Functionalities: ICE-CREAM
• CREAM service: Computing Resource Execution And Management service
  – CE with Web service interface
  – WMS requests are directly forwarded to CREAM-based CEs through ICE
• ICE: Interface to CREAM Environment
  – Basically reproduces the Job Controller / CondorC / Log Monitor layer needed by gLite/LCG CEs
New Functionalities: ICE-CREAM
[Figure: gLite WMS architecture diagram with ICE. Components shown: UserInterface, WMProxy, Workload Manager, ICE, Job Controller, CondorC, CondorG, LogMonitor, LB Proxy, LB Server, gLite CE, LCG CE, CREAM]
New Functionalities: Sandbox Files
• Sandbox Archiving and Sharing
  – Job sandbox files can be automatically compressed
  – Different jobs can share the same sandbox
    – Dramatically reduces network traffic
    – Allows the user to save time and bandwidth
• Sandbox Remote Specification
  – User can store files directly on a remote machine
  – No intermediate copies: JobWrapper will download directly from WorkerNode
  – Reduces server load
• Supported File Transfer
  – Full support (input & output files) for protocols:
    – gridftp
    – https
New Functionalities: Bulk-MM
• Bulk-MatchMaking
  – Natural completion of Bulk Submission
  – Allows a single matchmaking of similar jobs in one shot
    – Job equivalence is based on specified “significant attributes”
    – Jobs whose significant attributes are literally equal are equivalent
  – Target jobs: bunches of independent jobs
    – Mainly Collections and Parametric Jobs
    – Originally managed with Condor DAGMan
  – Allows submission to CREAM-based CEs
    – Provides additional boost in WMS performance
  – Saves time & resources
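The idea behind Bulk-MatchMaking can be sketched as grouping jobs by their significant attributes and running the matchmaking once per group. The `bulk_matchmake` and `toy_match` functions, the attribute names and the CE names below are all hypothetical and stand in for the real WMS logic.

```python
from collections import defaultdict


def bulk_matchmake(jobs, significant, match):
    """Run `match` once per class of equivalent jobs and share the result."""
    groups = defaultdict(list)
    for job in jobs:
        # Jobs are equivalent when their significant attributes are literally equal.
        key = tuple(job.get(attr) for attr in significant)
        groups[key].append(job)

    assignments = {}
    calls = 0
    for members in groups.values():
        target = match(members[0])  # one matchmaking call for the whole group
        calls += 1
        for job in members:
            assignments[job["id"]] = target
    return assignments, calls


def toy_match(job):
    # Stand-in for real matchmaking: pick a CE name from the requirements.
    return "ce-big" if job["Requirements"] == "mem>=4GB" else "ce-small"


jobs = [
    {"id": "j1", "Requirements": "mem>=4GB", "Rank": "fast"},
    {"id": "j2", "Requirements": "mem>=4GB", "Rank": "fast"},
    {"id": "j3", "Requirements": "mem>=1GB", "Rank": "fast"},
]
assignments, calls = bulk_matchmake(jobs, ["Requirements", "Rank"], toy_match)
print(assignments)  # {'j1': 'ce-big', 'j2': 'ce-big', 'j3': 'ce-small'}
print(calls)        # 2 matchmaking calls for 3 jobs
```

With thousands of near-identical jobs in a collection or parametric job, the number of matchmaking calls drops from one per job to one per equivalence class.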
Other New Functionalities
• Service Discovery
  – Provides service information by querying external databases of different kinds (RGMA, BDII)
  – Client side
    – Queries for available WMProxy endpoints on the net
    – Does not need manual reconfiguration of user commands
  – Server side
    – Queries for available LB servers to which job information can be logged
• Job Files Perusal
  – Monitors the actual output files produced by a job during its lifecycle
  – Adds important pieces of information that are not available through simple status monitoring and that were previously available only at job completion
gLite New Activities
• New platforms widely deployed on the infrastructure
  – In particular Scientific Linux 4 and 64-bit architectures
• Migration to ETICS build system
  – High flexibility
  – Addresses multiple platform support
    – Almost impossible using the old gLite build system
  – Build achieved for all WMS components
  – WMS ongoing activity: integration/deployment
    – Software not yet fully deployed
    – Client-side manual installation fully working
    – Server-side installation not yet available (almost achieved)
gLite WMS Ongoing Restructuring
• gLite Restructuring:
  – All new feature development stopped for 6 months
  – Improving usability & portability
    – Multi-platform (structural changes needed)
  – Cleaning up sections that cause build and porting difficulties
  – Removing/reducing dependencies on external software to ease installation and deployment
• Goals:
  – Easier service maintenance and usage
    – Will increase stability and throughput
  – Toward a “gLighter” User Interface
    – Identify and remove all unnecessary dependencies
gLite WMS Future
• Improving Logging and Error Reporting
  – Common syslog-like logging format
• Windows working prototype
  – gLite porting on Microsoft Windows platforms
• Improving interoperability
  – Supercomputing 06
    – Working prototype for demo
    – Basic Execution Service (BES)
    – Job Submission Description Language (JSDL)
WfMS and gLite WMS
• Possible “external integration” with existing external Workflow frameworks
  – Still to be discussed and planned
• A proposal for a Workflow Management System integrated within WMS is under discussion
  – Running on top of gLite Middleware
  – Abstract and generic representation of workflows
    – Internal use of the Petri Net model
    – External translation mechanisms from different language front ends
Test & Result
• Intense testing and constant bug fixing activities have been performed over the last months
  – Improved job submission rate
  – Improved service stability
  – New functionalities tested and adopted by experiments
• Production quality test results:
  – 16K jobs/day over one week of submissions
  – No manual intervention on server
    – Stable memory usage
  – 0.3% of jobs in non-final state
    – Aborted jobs mostly due to expired user credentials
    – Was about 5% before Bulk-MM support
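To put the percentages in absolute terms, the short calculation below applies them to the submission volume quoted above. It assumes that "16K" means 16,000 jobs and that the earlier rate of about 5% would have applied to the same volume; both are assumptions made here for illustration, not measurements from the test.

```python
jobs_per_day = 16_000   # "16K jobs/day", read here as 16,000
days = 7                # "one week of submissions"
total_jobs = jobs_per_day * days

non_final_now = round(total_jobs * 0.003)     # 0.3% with Bulk-MM support
non_final_before = round(total_jobs * 0.05)   # about 5% before Bulk-MM support

print(total_jobs)        # 112000
print(non_final_now)     # 336
print(non_final_before)  # 5600
```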
Some Links
• WMS
  – http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/
• WMProxy
  – http://trinity.datamat.it/projects/EGEE/wiki/wiki.php
• LB
  – http://egee.cesnet.cz/en/JRA1/index.html
• CREAM
  – http://grid.pd.infn.it/cream
• JDL
  – http://edms.cern.ch/document/590869/1