Service Oriented Science
Ian Foster
Argonne National Laboratory
University of Chicago
Univa Corporation
2
Two Exciting Things That I Won’t Talk About
Globus Toolkit v4 (release: April 30, 2005)
  • Robustness, performance, usability, testing, documentation, standards compliance
  • E.g., GRAM supports 30,000 active jobs
  • 180+ people on alpha tester list
  • New functionality: data management, security, registry, OGSA-DAI, C hosting, etc.
Our work with DAGMan, Condor-G, Condor
  • > 1 million jobs (we estimate) run over the last year from many application domains
  • Mike Wilde’s talk (yesterday) gave details
3
Instead: Scaling eScience
  • Dimensions of scaling
  • Service-oriented science
  • Separating concerns: hosting eScience communities
eScience [n]: Large-scale science carried out through distributed collaborations—often leveraging access to large-scale data & computing
4
Dimensions of Scaling: For Example, U.S. Dept of Energy
[Map of DOE Office of Science (SC) sites. Legend: SC Laboratories; SC User Facilities; Institutions that Use SC Facilities. Facility types: Physics Accelerators, Synchrotron Light Sources, Neutron Sources, Special Purpose Facilities, Large Fusion Experiments.]
Lawrence Berkeley National Lab
  • Advanced Light Source
  • National Center for Electron Microscopy
  • National Energy Research Scientific Computing Facility
Los Alamos Neutron Science Center
Univ. of IL
  • Electron Microscopy Center for Materials Research
  • Center for Microanalysis of Materials
MIT
  • Bates Accelerator Center
  • Plasma Science & Fusion Center
Fermi National Accelerator Lab
  • Tevatron
Stanford Linear Accelerator Center
  • B-Factory
  • Stanford Synchrotron Radiation Laboratory
Princeton Plasma Physics Lab
General Atomics
  • DIII-D Tokamak
Pacific Northwest National Lab
  • Environmental Molecular Sciences Lab
Argonne National Lab
  • Intense Pulsed Neutron Source
  • Advanced Photon Source
  • Argonne Tandem Linac Accelerator System
Brookhaven National Lab
  • Relativistic Heavy Ion Collider
  • National Synchrotron Light Source
Oak Ridge National Lab
  • High-Flux Isotope Reactor
  • Surface Modification & Characterization Center
  • Spallation Neutron Source (under construction)
Thomas Jefferson National Accelerator Facility
  • Continuous Electron Beam Accelerator Facility
Sandia Combustion Research Facility
James R. MacDonald Laboratory
5
Dimensions of Scaling: E.g., U.S. Dept of Energy
Goal: Any DOE scientist can access any DOE computer, software, data, instrument
  • ~25,000 scientists* (vs. ~1000 DOE certs)
  • ~1000 instruments** (vs. maybe 10 online?)
  • ~1000 scientific applns** (vs. 2 Fusion services)
  • ~10 PB of interesting data** (vs. 100 TB on ESG)
  • ~100,000 computers* (vs. ~3000 on OSG)
Not to mention many external partners
I.e., we need to scale by 2-3 orders of magnitude to have DOE-wide impact!
* Rough estimate; ** WAG
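The "2-3 orders of magnitude" figure can be checked against the slide's own numbers. The sketch below is illustrative only: it takes the target and current values as listed above (themselves rough estimates), assumes 10 PB = 10,000 TB, and uses hypothetical names (`GAPS`, `orders_of_magnitude`, `summarize`) that do not come from the talk.

```python
import math

# (target, current) pairs copied from the slide; both values in a pair share a unit.
GAPS = {
    "scientists": (25_000, 1_000),
    "instruments": (1_000, 10),
    "applications": (1_000, 2),
    "data (TB)": (10_000, 100),  # assumption: 10 PB = 10,000 TB
    "computers": (100_000, 3_000),
}


def orders_of_magnitude(target: float, current: float) -> float:
    """Return log10 of the ratio between the target and the current scale."""
    return math.log10(target / current)


def summarize(gaps=GAPS):
    """Map each dimension to its scaling gap in orders of magnitude."""
    return {name: orders_of_magnitude(t, c) for name, (t, c) in gaps.items()}


if __name__ == "__main__":
    for name, gap in summarize().items():
        print(f"{name:13s} ~{gap:.1f} orders of magnitude")
```

With these inputs the gaps range from roughly 1.4 (scientists) to 2.7 (applications) orders of magnitude, which the slide rounds to "2-3".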
6
Scaling eScience
  • Dimensions of scaling
  • Service-oriented science
  • Separating concerns: hosting eScience communities
7
Scaling eScience: A Services Approach
Take the “Grid” moniker seriously
  • Not “discover, deploy, debug, monitor, resubmit, …” but “plug in and tune out”
For example
  • GriPhyN virtual data service dispatches analysis tasks to campus or national Grid
  • Campus CHARMM service dispatches large jobs to national resources
  • Online biology service serves thousands, uses national resources to preprocess data
I.e., eScience as “service”
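The "plug in and tune out" idea, where a service decides on the user's behalf whether work runs on campus or national resources, might be sketched as follows. This is a minimal illustration, not how GriPhyN or the CHARMM service actually route work: the `Job` type, the `dispatch` and `run` functions, and the 100 CPU-hour threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cpu_hours: float


# Hypothetical capacity limit; a real service would consult its schedulers instead.
CAMPUS_LIMIT_CPU_HOURS = 100.0


def dispatch(job: Job, campus_limit: float = CAMPUS_LIMIT_CPU_HOURS) -> str:
    """Choose a target for the job based on its estimated size."""
    return "campus" if job.cpu_hours <= campus_limit else "national"


def run(job: Job) -> str:
    """Stand-in for the service entry point: the user submits, the service routes."""
    target = dispatch(job)
    return f"{job.name} -> {target}"


if __name__ == "__main__":
    print(run(Job("small-md-run", 10)))
    print(run(Job("large-md-run", 5000)))
```

The point of the design is that the routing decision lives inside the service, so the user's interface stays the same whichever resources end up doing the work.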
8
For Example: BLASTing for Protein Knowledge
BLASTing the complete NR DB for sequence similarity and function characterization
Knowledge Base
PUMA lets researchers find information about a specific protein after it has been analyzed against the complete set of sequenced genomes (NR file: ~2 million sequences).
Analysis on the Grid
The analysis of protein sequences runs in the background in the Grid environment. Millions of processes are started because several tools are run on each sequence, for example to find protein similarities (BLAST), search protein family domains (BLOCKS), and determine structural characteristics of the protein.
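The "millions of processes" follow from running every tool on every sequence. The sketch below only counts and enumerates such tasks; it does not invoke any real tool. It assumes one process per (sequence, tool) pair and three tools, and `STRUCTURE` is a placeholder name for the structural-characterization step, which the slide does not name.

```python
from itertools import product

# BLAST and BLOCKS are named on the slide; "STRUCTURE" is a placeholder
# for the unnamed structural-characterization tool.
TOOLS = ("BLAST", "BLOCKS", "STRUCTURE")


def task_count(n_sequences: int, n_tools: int = len(TOOLS)) -> int:
    """Number of independent tasks when every tool runs on every sequence."""
    return n_sequences * n_tools


def tasks(sequence_ids, tools=TOOLS):
    """Lazily yield one (sequence_id, tool) task per pair."""
    return product(sequence_ids, tools)


if __name__ == "__main__":
    print(task_count(2_000_000))          # task count for ~2 million sequences
    for task in tasks(["seq1", "seq2"]):  # tiny example of the fan-out
        print(task)
```

Under these assumptions, ~2 million sequences and three tools give about 6 million independent tasks, which is why the work is farmed out to Grid resources in the background.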
9
Service-Oriented Science Requires Grid Technology
Service-oriented applications
  • Wrap applications as (Web) services
  • Compose applications into workflows
Service-oriented infrastructure
  • Provision physical resources to support application workloads
[Figure labels: Users, Workflows, Appln Service (x2), Composition, Invocation, Provisioning]
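Wrapping applications as services and composing them into workflows can be illustrated in a few lines. This is a toy, in-process sketch rather than a Web services implementation: the registry, the `as_service` decorator, the `align` and `annotate` services, and `run_workflow` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Service = Callable[[dict], dict]

# In-process stand-in for a service registry (hypothetical).
REGISTRY: Dict[str, Service] = {}


def as_service(name: str):
    """Register a plain function under a service name (stand-in for a Web service wrapper)."""
    def register(func: Service) -> Service:
        REGISTRY[name] = func
        return func
    return register


@as_service("align")
def align(msg: dict) -> dict:
    # Placeholder application: marks the message as aligned.
    return {**msg, "aligned": True}


@as_service("annotate")
def annotate(msg: dict) -> dict:
    # Placeholder application: attaches a dummy annotation.
    return {**msg, "annotation": f"annotated:{msg['sequence']}"}


def run_workflow(steps: List[str], msg: dict) -> dict:
    """Invoke the named services in order, feeding each output to the next."""
    for step in steps:
        msg = REGISTRY[step](msg)
    return msg


if __name__ == "__main__":
    print(run_workflow(["align", "annotate"], {"sequence": "MKV"}))
```

Because the workflow refers to services only by name, the code behind each name can be re-hosted on different resources without changing the composition.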
10
Grid Technology as Service-Oriented Infrastructure
[Figure labels: User Application (x3); Uniform interfaces, security mechanisms, Web service transport, monitoring; Computers, Storage, Specialized resource; GRAM; GridFTP; Host Env / User Svc (x2); DAIS; Database; Tool (x2); Reliable File Transfer; MyProxy; MDS-Index]
11
Scaling eScience
  • Dimensions of scaling
  • Service-oriented science
  • Separating concerns: hosting eScience communities
12
Scaling eScience: A Range of Approaches
Cookie cutter
  • Standard h/w + s/w
  • E.g., BIRN, PlanetLab, NEES
  • Simple deployment, limited scalability
Service ecology
  • Standard interfaces, many service providers
  • E.g., NVO, bioinformatics
  • Powerful model, limited service capacity
General-purpose infrastructure
  • Standard resource provider interfaces
  • E.g., TeraGrid, OSG
  • Need to work out how to host services
13
Scaling eScience: Separating Concerns
Content
  • Stuff that a community cares about: data, metadata, software, analyses, instruments
  • Community responsibility
Middleware/function
  • Plumbing needed for community to function: membership, data mgmt, registry, workflow
  • Can often be provided by others
Resources
  • The physical devices required to support community content, function, computation
  • Need not be the concern of individual users!
14
Scaling eScience: Separating Concerns
[Figure. Axis: Domain-independent to Domain-dependent. Layers: Content, Function, Resources. Items shown: Experimental apparatus; Servers, storage, networks; Metadata catalog; Data archive; Simulation server; Certificate authority; Simulation code (x2); Expt design; Expt output; Telepresence monitor; Electronic notebook; Portal server]
15
Virtualizing Resources (K. Keahey et al.)
“Virtual workspace” as a core abstraction
  • Computer(s), network(s), configuration(s)
Multiple implementation technologies
  • Dynamic accounts (e.g., gLite deployment)
  • Virtual machines (current prototyping)
E.g., “OSG virtual cluster”
  • A collection of virtual machines running standard OSG software (Virtual Data Toolkit)
  • Instantiation by a resource provider makes it immediately accessible as an OSG cluster
  • Load (3 nodes): 1.3 sec; start: 0.7 sec
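As a rough illustration of the abstraction, a virtual workspace can be modeled as a description (computers, networks, configuration) plus a small lifecycle with the load and start steps the slide reports timings for. This is a sketch under assumptions, not the interface of the virtual workspace service by Keahey et al.: the `Workspace` class, the `Backend` enum, and the state names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Backend(Enum):
    """Implementation technologies listed on the slide."""
    DYNAMIC_ACCOUNT = "dynamic account"
    VIRTUAL_MACHINE = "virtual machine"


@dataclass
class Workspace:
    nodes: int             # computer(s)
    networks: List[str]    # network(s)
    configuration: str     # software configuration
    backend: Backend
    state: str = "defined"  # hypothetical lifecycle: defined -> loaded -> running

    def load(self) -> None:
        """Stage the workspace onto the provider's resources."""
        self.state = "loaded"

    def start(self) -> None:
        """Start a previously loaded workspace."""
        if self.state != "loaded":
            raise RuntimeError("workspace must be loaded before it is started")
        self.state = "running"


if __name__ == "__main__":
    cluster = Workspace(
        nodes=3,
        networks=["private"],
        configuration="standard OSG software (Virtual Data Toolkit)",
        backend=Backend.VIRTUAL_MACHINE,
    )
    cluster.load()
    cluster.start()
    print(cluster.state)
```

Keeping the description separate from the backend is what would let the same workspace be realized either as dynamic accounts or as virtual machines.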
16
Summary
Q: How to scale eScience?
A1: Virtualization: eScience as service
  • AKA “science gateways”
  • Service-oriented infrastructure for management & provisioning
A2: Separation of concerns
  • Allow providers to host communities by providing resources & function
  • Virtual workspaces as an enabling technology
17
For More Information
Globus Alliance www.globus.org
Globus Consortium www.globusconsortium.com
Global Grid Forum www.ggf.org
Open Science Grid www.opensciencegrid.org
Background information www.mcs.anl.gov/~foster
2nd Edition www.mkp.com/grid2