SAGA-based Frameworks: Supporting Application Usage Modes
Shantenu Jha
Director, Cyber-Infrastructure Development, CCT
Asst Research Professor, CS
e-Science Institute, Edinburgh
http://www.cct.lsu.edu/~sjha
http://saga.cct.lsu.edu
Outline (1)
• Understanding Distributed Applications (DA)
  – How DA differ from HPC or parallel applications; challenges of DA
  – DA development objectives (IDEAS)
• Understanding SAGA (and the SAGA landscape)
  – Rough taxonomy of distributed applications
  – Using SAGA to develop distributed applications
  – Examples: applications and application frameworks
• Discuss how IDEAS are met
  – Some SAGA-based tools and projects
  – Advantages of standards
• Derive (initial) user requirements for FutureGrid
Understanding Distributed Applications: Critical Perspectives
• The number of applications that utilize multiple sites sequentially, concurrently or asynchronously is low (~5%):
  – Not referring to tightly-coupled applications across multiple sites
  – Distributed CI: Is the whole greater than the sum of the parts?
• Managing data and applications across multiple resources is (increasingly) hard:
  – Distributed Data/Jobs vs Bring it to the Computing
  – Compute where data is, or move data to where computing is
• Challenges qualitatively and quantitatively set to get worse:
  – Increasing complexity, heterogeneity and scale
Understanding Distributed Applications
Distributed Applications Require:
Coordination over Multiple & Distributed sites: Scale-up and Scale-out
Peta/Exa/Atta - Scientific Applications requiring multiple-runs, ensembles, workflows etc.
Core characteristics of logically and physically distributed applications are the SAME
Application Usage Mode:
• Composed using the Application as the UNIT of execution
• Not a workflow (i.e., composed using control and data flow)
• Usage Mode: Closer to an Abstract Workflow (template)
• Examples: Run once; or a set of copies of an application with varied input data (Ensemble); loosely-coupled ensembles..
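The ensemble usage mode above (copies of one application run over varied input data) can be sketched as follows. This is a minimal illustration using only the Python standard library, not SAGA code; `application` and `run_ensemble` are hypothetical names, and a local thread pool stands in for distributed resources.

```python
from concurrent.futures import ThreadPoolExecutor


def application(x):
    """Stand-in for one complete application run (hypothetical)."""
    return x * x


def run_ensemble(app, inputs, max_workers=4):
    """Run one copy of `app` per input.

    The whole application is the unit of execution; there is no
    control or data flow between the ensemble members.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with inputs.
        return list(pool.map(app, inputs))


if __name__ == "__main__":
    print(run_ensemble(application, [1, 2, 3, 4]))
```

Swapping the thread pool for remote job submission would change where the members run, but not the shape of the usage mode.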
Understanding Distributed Applications: Development Challenges
• Fundamentally a hard problem:
  – Dynamic resources, heterogeneous resources
  – Add to it: Complex underlying infrastructure
• Programming Systems for Distributed Applications:
  – Incomplete? Customization? Extensibility?
  – What should the end-user control? Must control?
• Computational Models of Distributed Computing
  – Range of DA, no clear taxonomy
  – More than (peak) performance
  – Application Usage Mode
• Inter-play of Application, Infrastructure, Usage Mode
Understanding Distributed Applications: Implicit vs Explicit?
• Which approach (implicit vs explicit) is used depends on:
  – How is the application used?
  – Is there a need to control/marshal more than one resource?
  – Why are distributed resources being used?
  – How much can be kept out of the application?
  – Can’t predict in advance? Not obvious what to do; application-specific metric
• If possible, applications should not be explicitly distributed
• GATEWAYS approach:
  – Implicit for the end-users
  – Supporting Applications? Or Application Usage Modes?
Understanding Distributed Applications: Development Objectives
Interoperability: Ability to work across multiple distributed resources
Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
Simplicity: Accommodate above distributed concerns at different levels easily…
Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?
SAGA: Basic Philosophy
• There is a lack of programmatic approaches that:
  – Provide general-purpose, common grid functionality for applications and thus hide underlying complexity, varying semantics..
  – Hide “bad” heterogeneity and provide means to address “good” heterogeneity
  – Provide building blocks upon which to construct higher levels of functionality and abstractions
• Meets the need for a Broad Spectrum of Applications:
  – Simple Distributed Scripts, Gateways, Smart Applications and Production Grade Tooling, Workflow…
• Simple, integrated, stable, uniform and high-level interface
  – Simple and Stable: 80:20 restricted scope and Standard
  – Integrated: Similar semantics & style across commonly used distributed functional requirements
  – Uniform: Same interface for different distributed systems
SAGA: Provides Application* developers with basic units required to compose high-functionality across different distributed systems
(*) One person’s Application is another person’s Tool
SAGA: In a Thousand Words
SAGA: Job Submission: Role of Adaptors (middleware binding)
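The role of adaptors can be illustrated with a small, self-contained sketch: one uniform front end, with the middleware binding chosen at run time from the URL scheme. This is not the SAGA API and not the diagram from the original slide; `UniformJobService`, `ForkAdaptor`, `SSHAdaptor` and `cluster.example.org` are hypothetical, and the adaptors only return a description string instead of launching anything.

```python
from urllib.parse import urlparse


class ForkAdaptor:
    """Hypothetical adaptor for local execution."""
    scheme = "fork"

    def submit(self, host, command):
        return f"[fork@{host or 'localhost'}] {command}"


class SSHAdaptor:
    """Hypothetical adaptor for remote execution over SSH."""
    scheme = "ssh"

    def submit(self, host, command):
        return f"[ssh@{host}] {command}"


# Registry mapping URL scheme -> adaptor instance (the "middleware binding").
ADAPTORS = {a.scheme: a for a in (ForkAdaptor(), SSHAdaptor())}


class UniformJobService:
    """Same interface for every backend; the adaptor hides the differences."""

    def __init__(self, url):
        parsed = urlparse(url)
        if parsed.scheme not in ADAPTORS:
            raise ValueError(f"no adaptor for scheme {parsed.scheme!r}")
        self._adaptor = ADAPTORS[parsed.scheme]
        self._host = parsed.hostname

    def run(self, command):
        return self._adaptor.submit(self._host, command)


if __name__ == "__main__":
    for url in ("fork://localhost", "ssh://cluster.example.org"):
        print(UniformJobService(url).run("/bin/date"))
```

The application code in the final loop is identical for both backends; only the URL changes.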
SAGA Job API: Example
SAGA Job Package
SAGA File Package
File API: Example
SAGA Advert
SAGA Advert API: Example
SAGA: Other Packages
SAGA: Implementations
Currently there are several implementations under active development:
• C++ Reference Implementation (LSU) -- OMII-UK: http://saga.cct.lsu.edu/cpp/
• Java Implementation (VU Amsterdam), part of the OMII-UK project: http://saga.cct.lsu.edu/java/
• JSAGA (IN2P3/CNRS): http://grid.in2p3.fr/jsaga/
• DEISA (partial): job, file package
C++: Currently at v1.3.3 (October 2009)
• Python bindings to the C++ available
• Good faith effort to keep things working
SAGA: Available Adaptors
• Job Adaptors: Fork (localhost), SSH, Condor, Globus GRAM2, OMII GridSAM, Amazon EC2, Platform LSF
• File Adaptors: Local FS, Globus GridFTP, Hadoop Distributed Filesystem (HDFS), CloudStore KFS, OpenCloud Sector-Sphere
• Replica Adaptors: PostgreSQL/SQLite3, Globus RLS
• Advert Adaptors: PostgreSQL/SQLite3, Hadoop H-Base, Hypertable
SAGA: Available Adaptors
• Other Adaptors: Default RPC / Stream / SD
• Planned Adaptors: CURL file adaptor, gLite job adaptor
• Open issues:
  – Consolidating the adaptor code base and adding rigorous tests in order to improve adaptor quality
  – The Capability Provider Interface (CPI, the ‘Adaptor API’) is not documented or standardized (yet), but looking at existing adaptor code should get you started if you want to develop your own adaptor
Proof by example..
SAGA and Distributed Applications
Taxonomy of Distributed Applications
• Example of Distributed Execution Mode:
  – Implicitly Distributed
  – 1000 job submissions on the TG
  – SAGA shell example/tutorial
• Example of Explicit Coordination and Distribution:
  – Explicitly Distributed
  – DAG-based Workflows
  – EnKF-HM application
• Example of SAGA-based Frameworks:
  – MapReduce, Pilot-Jobs
Developing Distributed Application Frameworks
• Frameworks: Logical structure for capturing application requirements, characteristics & patterns
• Pattern: Commonly recurring modes of computation
  – Programming, Deployment, Execution, Data-access..
• Abstraction: Mechanism to support patterns and application characteristics
• Frameworks are designed to either:
  – Support Patterns: Map-Reduce, Master-Worker, Hierarchical Job-Submission
  – Provide the abstractions and/or support the requirements & characteristics of applications
  – i.e. Encode a Usage-Mode using a Framework
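To make "a framework that supports a pattern" concrete, here is a minimal, single-process sketch of the Map-Reduce pattern named above. It is an illustration only, not the SAGA-based MapReduce implementation; `map_reduce`, `word_mapper` and `count_reducer` are hypothetical names. The framework owns the grouping logic; the application supplies only the mapper and reducer.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Apply `mapper` to each input, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["a b a", "B c"], word_mapper, count_reducer))
```

In a distributed setting the map and reduce calls would be farmed out to workers, while the application-facing interface could stay the same.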
Abstractions for Distributed Computing (1): BigJob: Container Task
Adaptive:
• Type A: Fix the number of replicas; vary the cores assigned to each replica.
• Type B: Fix the size of each replica; vary the number of replicas (Cool Walking)
  – Same temperature range (adaptive sampling)
  – Greater temperature range (enhanced dynamics)
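The two adaptive modes above differ only in which quantity is held fixed when the available core count changes. A small sketch under stated assumptions: the function names are hypothetical, cores are split as evenly as possible in Type A, and leftover cores in Type B are simply left unused. Neither choice is taken from the BigJob implementation.

```python
def type_a_allocation(total_cores, n_replicas):
    """Type A: replica count is fixed; cores per replica follow the resource size."""
    if n_replicas <= 0:
        raise ValueError("n_replicas must be positive")
    base, extra = divmod(total_cores, n_replicas)
    # The first `extra` replicas get one additional core each.
    return [base + (1 if i < extra else 0) for i in range(n_replicas)]


def type_b_replica_count(total_cores, cores_per_replica):
    """Type B: replica size is fixed; the number of replicas follows the resource size."""
    if cores_per_replica <= 0:
        raise ValueError("cores_per_replica must be positive")
    return total_cores // cores_per_replica


if __name__ == "__main__":
    for cores in (64, 128):
        print(cores, type_a_allocation(cores, 8), type_b_replica_count(cores, 16))
```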
Abstractions for Distributed Computing (2): SAGA Pilot-Job (Glide-In)
Coordinate Deployment & Scheduling of Multiple Pilot-Jobs
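The pilot-job idea can be sketched in a few lines: placeholder jobs are started first, and each one pulls real tasks from a shared queue once it is running, so a faster or earlier-starting pilot naturally takes more work. This is a local illustration with threads, not the SAGA Pilot-Job API; `run_on_pilots` and the pilot names are hypothetical.

```python
import queue
import threading


def run_on_pilots(pilot_names, tasks):
    """Run `tasks` (pairs of task_id and callable) on a set of simulated pilots.

    Returns {task_id: (pilot_name, result)}.
    """
    work = queue.Queue()
    for task_id, func in tasks:
        work.put((task_id, func))

    results = {}
    lock = threading.Lock()

    def pilot(name):
        # A pilot keeps pulling tasks until the shared queue is empty.
        while True:
            try:
                task_id, func = work.get_nowait()
            except queue.Empty:
                return
            value = func()
            with lock:
                results[task_id] = (name, value)

    threads = [threading.Thread(target=pilot, args=(n,)) for n in pilot_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    tasks = [(i, (lambda i=i: i * i)) for i in range(6)]
    for task_id, (name, value) in sorted(run_on_pilots(["pilot-a", "pilot-b"], tasks).items()):
        print(task_id, name, value)
```

Which pilot executes which task is decided at run time, so the assignment shown in the output can vary between runs.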
Distributed Adaptive Replica Exchange (DARE): Scale-Out, Dynamic Resource Allocation and Aggregation
Multi-Physics Runtime Frameworks: Extensibility
Coupled multi-physics requires two distinct but concurrent simulations
Can co-scheduling be avoided?
Adaptive execution model: Yes
Load-balancing required. Capability comes for free!
First demonstrated multi-platform Pilot-Job: TG (MD) – Condor (CFD)
Dynamic Execution: Reduced Time to Solution
Ensemble Kalman Filters: Heterogeneous Sub-Tasks
Ensemble Kalman filters (EnKF) are recursive filters for handling large, noisy data; here the EnKF is used for history matching and reservoir characterization
EnKF is a particularly interesting case of irregular, hard-to-predict run-time characteristics.
Results: Scale-Out Performance
Using more machines decreases the TTC and variation between experiments
Using BQP decreases the TTC & variation between experiments further
Lowest time to completion achieved when using BQP and all available resources
Performance Advantage from Scale-Out
But Why does BQP Help?
Understanding Distributed Applications: Development Objectives Redux
• Interoperability: Ability to work across multiple distributed resources
  – SAGA: Middleware Agnostic
• Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
  – Support Multiple Pilot-Jobs: Ranger, Abe, QB
• Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
  – Pilot-Job also Coupled CFD-MD, Integrated BQP
• Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
• Simplicity: Accommodate above distributed concerns at different levels easily…
SAGA: Bridging the Gap between Infrastructure and Applications
Focus on Application Development and Characteristics, not infrastructure details
SAGA-based Tools and Projects
• JSAGA from IN2P3 (Lyon): http://grid.in2p3.fr/jsaga/index.html (Slides Ack: Sylvain Renaud)
• GANGA-DIANE (EGEE): http://faust.cct.lsu.edu/trac/saga/wiki/Applications/GangaSAGA (Slides Ack: Jackub Mosciki, Massimo L, O. Weidner)
• NAREGI/KEK (Active)
• DESHL: DEISA-based Shell and Workflow library
• XtreemOS
• SD Specification: With gLite adaptors
Advantage of Standards
JSAGA: Implementer and user of SAGA
JSAGA uses SAGA in a module, which hides heterogeneity of grid infrastructures
JSAGA implements SAGA to hide heterogeneity of middlewares
[Diagram layers: Applications; jobs collection (JSAGA); SAGA; core engine + plug-ins (JSAGA); Legacy APIs]
Projects using JSAGA
Elis@
– a web portal for submitting jobs to industrial and research grid infrastructures
SimExplorer
– a set of tools for managing simulation experiments
– includes a workflow engine that submits jobs to heterogeneous distributed computing resources
JJS
– a tool for efficiently running short-lived jobs on EGEE
JUX
– a multi-protocol file browser
DIANE Integration (cont.)
[Diagram: Diane without SAGA vs Diane with SAGA. Labels in both panels: Master, Agents scheduling, Payload distribution. Additional label: Heterogeneous resources allocation (Ganga + Ganga/SAGA).]
Applications on heterogeneous resources:
– Ganga/gLite
– Ganga/SAGA (to TeraGrid)
– Ganga/SAGA (to *)
Application-aware (and resource-aware) scheduling
Federating resources!
(Not in this demo: cloud resources, additional Grid infrastructures…)
Acknowledgements
SAGA Team and DPA Team and the UK-EPSRC (UK EPSRC: DPA, OMII-UK, OMII-UK PAL)
People:
SAGA D&D: Hartmut Kaiser, Ole Weidner, Andre Merzky, Joohyun Kim, Lukasz Lacinski, João Abecasis, Chris Miceli, Bety Rodriguez-Milla
SAGA Users: Andre Luckow, Yaakoub el-Khamra, Kate Stamou, Cybertools (Abhinav Thota, Jeff, N. Kim), Owain Kenway
Google SoC: Michael Miceli, Saurabh Sehgal, Miklos Erdelyi
Collaborators and Contributors: Steve Fisher & Group, Sylvain Renaud (JSAGA), Go Iwai & Yoshiyuki Watase (KEK)
DPA: Dan Katz, Murray Cole, Manish Parashar, Omer Rana, Jon Weissman