TRANSCRIPT
December 2009
Data Integration in Grid Environments Alex Poulovassilis, Birkbeck, U. of London
Lecture Overview
Part I: Grid Computing; Grid Architectures; Grid Standards
Part II: The ISPIDER project – a grid application in Bioinformatics; The AutoMed project; OGSA-DAI and OGSA-DQP middleware; DAI/DQP/AutoMed interoperability in ISPIDER
Part I
Grid Computing Grid Architectures Grid Standards
What is Grid Computing?
the term first arose in the mid 1990s and it is also known as utility computing
the world is full of computing resources connected by networks, but their distribution, heterogeneity and autonomy make it hard for such resources to be shared
the development of grid computing has been motivated by the need for flexible, secure and coordinated resource sharing to solve large-scale computing problems
this resource sharing is between dynamic collections of individuals, institutions and resources, which collectively form a Virtual Organisation (VO)
What is Grid Computing?
resource sharing includes shared usage of hardware, software, data resources, sensor networks, etc.
this sharing is necessary in order to solve in a collaborative fashion large scale computing problems arising in science, engineering and business
the sharing is controlled, with providers of resources defining what may be shared and under what conditions
How did grid computing come about?
arose in academia (e-science), with Ian Foster and Carl Kesselman leading the development of the Globus toolkit
was then picked up by industry, e.g. Sun, IBM, HP, Oracle
driving forces were:
(a) computationally-intensive scientific problems, e.g. simulations
(b) scientific problems involving huge quantities of data, e.g. analysis of large data sets
leading to so-called Computational Grids and Data Grids
grid computing may be viewed as an extension of the WWW (information sharing) to the sharing of general computing resources
The international grid community
the Global Grid Forum (GGF) was formed in 2000 as a community of researchers, users and vendors aiming to exchange ideas on grid development and deployment, and to develop specifications for grid standards
the Enterprise Grid Alliance (EGA) was formed in 2004 as a non-profit, vendor organisation formed to develop grid computing in industry
GGF and EGA merged in 2006 into the new Open Grid Forum
there has been much funding of grid research and development in the EU, regionally, and nationally over the past decade
How is grid different from distributed computing?
in grid computing, the focus is on large-scale resource sharing, requiring authentication, authorisation, resource discovery, resource scheduling and costing
in grid applications, presentation services access the functionality provided by service-oriented grid middleware, which virtualises the dynamic deployment of the actual computing resources
by contrast, in client-server or multi-tier applications, the presentation, application and back-end services and resources are separated, but are fixed and their deployment is known
What is a Grid Architecture?
was originally envisaged as being organised into layers:
• Application
• Collective: see next slide
• Resource: protocols for initiation, monitoring and control of the computing and data resources
• Connectivity: communication and authentication protocols
• Fabric: the physical computing and data resources
the components in each layer share common characteristics and build on the services provided by lower layers
the Resource and Connectivity services can be implemented over lower-level resources at the Fabric layer
and can in turn support a range of higher-level services at the Collective and Application layers
Collective Layer
comprises global protocols and services that capture interactions across collections of resources e.g.
• directory services to search for resources by name or attributes such as type, availability, load
• allocation, scheduling and brokering services to request allocation of one or more resources for a specific task, and scheduling of tasks on resources
• monitoring and diagnostic services to monitor the execution of tasks
• workload management systems for specifying and executing workflows consisting of multiple tasks
• accounting and payment services gathering resource usage information
Subsequent developments in Grid Architectures
no longer a layered architecture but a service oriented architecture (SOA) e.g. the Open Grid Services Architecture
services are loosely coupled peers that can interact with each other to achieve a given capability e.g.
• a service may extend the capabilities of another service in order to provide its own functionality
• a service may compose the capabilities of other services to provide higher-level functionality
This leads to a 3-tier Architecture
The three tiers: Applications, Service pool, Resources
• the service pool is the grid middleware
• below it are the actual physical resources
• above it are the applications which access the service pool
• the location and nature of the actual resources is transparent to the applications
• thus the grid middleware allows resource virtualisation
• applications use services as and when needed
• and pay only for this usage, hence also the term utility computing
Different types of grids
departmental grids:
• built on clusters or groups of clusters owned by one department of an enterprise
enterprise grids:
• sharing of common resources by many departments of an enterprise
partner grids:
• involving several partner institutions, known to each other and with common goals
open grids:
• anyone can join and become a resource provider and/or resource user
Status of grid computing today
numerous commercial departmental and enterprise grids are in operation today
• e.g. for drug discovery, stock market trading, integrated circuit design, enterprise resource planning
also many international partner grids
• e.g. in high energy physics, life sciences, design and engineering, computational chemistry, astrophysics, earth sciences
software to support open grids is also emerging
Grid Standards
Numerous Grid products are available from various organisations and vendors
these need to be able to interoperate, and this is made possible by the development of standards e.g. the Open Grid Services Architecture (OGSA)
OGSA is based on Web Services
web standards of relevance to the grid include: HTTP (transport), XML (data format), SOAP (message syntax), WSDL (web service definition), UDDI (web service registry), WS-Security, BPEL (workflow definition)
Grid Standards
a key area in which Grid requirements have motivated new WS standards is in the representation and manipulation of state
standard WSs are stateless from the point of view of the requester of the service
OGSA assumes that service interfaces and service behaviours are defined as in WSRF (Web Service Resource Framework)
WSRF defines how state should be modelled, accessed and managed; how services should be grouped; and how faults should be modelled
Also, WS-Notification defines notification mechanisms that support subscription to, and notification of, changes to services and to their state
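To make the stateful-service idea concrete, here is a minimal Python sketch. All names in it (StatefulResource, get_property, subscribe, and so on) are invented for this example and are not the real WSRF or WS-Notification APIs; the sketch only mimics how WSRF exposes state as named resource properties and how WS-Notification lets clients subscribe to changes in that state.

```python
# Illustrative sketch only: class and method names are invented and do not
# correspond to the actual WSRF / WS-Notification specifications.

class StatefulResource:
    """Models a WS-Resource: a service endpoint paired with named,
    queryable state (its 'resource properties')."""

    def __init__(self, resource_id, properties):
        self.resource_id = resource_id
        self.properties = dict(properties)   # named state, as in WSRF
        self.subscribers = []                # WS-Notification-style listeners

    def get_property(self, name):
        # Analogous in spirit to WSRF's GetResourceProperty operation
        return self.properties[name]

    def set_property(self, name, value):
        # State changes trigger notifications to all subscribers
        self.properties[name] = value
        for callback in self.subscribers:
            callback(self.resource_id, name, value)

    def subscribe(self, callback):
        # Analogous in spirit to a WS-Notification Subscribe request
        self.subscribers.append(callback)


events = []
job = StatefulResource("job-42", {"status": "pending", "load": 0.0})
job.subscribe(lambda rid, name, value: events.append((rid, name, value)))
job.set_property("status", "running")
print(events)  # [('job-42', 'status', 'running')]
```

A stateless web service, by contrast, would retain nothing between the two calls: the requester could not ask it later what the status of "job-42" is.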
OGSA vs other Web Service environments
apart from state, the other major difference of OGSA compared with other WS environments is that grid environments are not static:
in contrast to standard WSs, grid services can be created and deployed dynamically
the set of available resources and their load at any time may be highly variable, while still requiring application requirements and SLAs (service-level agreements) to be met
failures to meet SLAs or occurrences of faults may require dynamic restart of executions on other alternative resources
there is thus a need for monitoring of grid applications and for responding dynamically to their needs until completion
Implementing Grid services
associated with a Grid Service are a set of Service Data Elements (SDEs)
these are XML documents and represent information about grid service instances, allowing their discovery and management
each Grid Service “port type” has an associated set of SDEs
different types of Grid Service are realised by providing different sets of port types
Background Reading for Part I
Grid Cafe, http://www.gridcafe.org/
The EGEE project, www.eu-egee.org/
Worldwide LHC (Large Hadron Collider) Computing Grid, http://lcg.web.cern.ch/LCG/public/
Open Grid Forum, www.ogf.org
Globus toolkit, www.globus.org/alliance/publications/
Part II
The ISPIDER project – a grid application in Bioinformatics
The AutoMed project OGSA-DAI and OGSA-DQP middleware DAI/DQP/AutoMed interoperability in ISPIDER
The ISPIDER Project
Partners: Birkbeck, EBI, Manchester, UCL
Requirements:
• There are vast amounts of heterogeneous proteomics data being produced via a variety of new techniques
• Proteomics is the study of the protein complement of the genome
• It is targeted at the elucidation of biological function from genomic data
• There is a need for interoperability between autonomous proteomics data resources
• And for complex analyses over integrated virtual resources
[Diagram: biological data, from genes to function. The Genome, DNA sequences of 4 bases (A, C, G, T), is the permanent copy; a gene is copied temporarily into RNA; each triple of RNA bases encodes an amino acid, producing a Protein, a sequence over 20 amino acids, whose job is to carry out biological processes. In short: Genes, Proteins, Biological Function.]
This slide is adapted from Nigel Martin’s Lecture Notes on Bioinformatics
Aims of ISPIDER
Hence, the development of a Proteomics Grid Infrastructure, using existing proteomics resources and developing new ones; also developing new proteomics clients for querying, visualisation, workflow etc.
The development of such a system is beneficial for a number of reasons:
• Access to more data sources yields more reliable analyses
• Integrating resources increases the breadth of information available for the biologist
• Enables new analyses to be undertaken which would have been prohibitively difficult or impossible with just the individual resources
ISPIDER Architecture
Some ISPIDER data resources
gpmDB, see http://gpmdb.thegpm.org
• a publicly available database with more than 2 million proteins and almost 470,000 unique peptide identifications
• provides access to a wealth of peptide identifications from a range of different laboratories and instruments
PEDRo, http://pedrodb.man.ac.uk:8080/pedrodb
• provides access to a collection of descriptions of experimental data sets in proteomics
PepSeeker, http://nwsr.smith.man.ac.uk/pepseeker
• developed as part of the ISPIDER project and targeted at the identification stage of the proteomics pipeline
• currently holds over 50,000 proteins and 50,000 unique peptide identifications
myGrid / DQP / AutoMed Middleware
myGrid: provides a workflow environment over web/grid services, allowing high-level integration of data and applications for in-silico experiments in biology
OGSA-DQP: provides distributed query processing over Grid enabled data resources
AutoMed: provides heterogeneous data integration functionality over distributed data sources (the AutoMed project partners are Birkbeck and Imperial College)
ISPIDER research: integration of AutoMed and DAI/DQP (topic of this lecture); also integration of AutoMed and myGrid workflows
Motivation for AutoMed
Data Integration (DI) is the process of creating an integrated resource which
• combines data from a variety of autonomous data sources
• in order to support new queries and analyses
the data sources may be heterogeneous in terms of their:
• data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, nomenclature adopted
this poses several challenges, leading to several methodologies, architectures and systems being developed to support DI
these aim to abstract out data transformation and aggregation logic from application programs into generic data integration software
AutoMed
Supports a metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages can be defined – so extensible with new modelling languages
After a modelling language has been specified in terms of the HDM, a set of primitive schema transformations become available for schemas expressed in that language
Schemas can be incrementally transformed and integrated by applying to them a sequence of primitive transformations
Schemas may or may not have data associated with them: so virtual, materialised (data warehousing) or hybrid integration can be supported
Transformations are accompanied by queries, allowing data and query translation between source and target schemas
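The transformation-pathway idea can be sketched as follows. The Schema and Transformation classes, the operation names and the example constructs are all invented for illustration and do not reflect AutoMed's actual API; the sketch only shows how a sequence of primitive add/delete steps, each carrying a defining query, incrementally rewrites a source schema towards a global one.

```python
# Illustrative sketch of AutoMed-style transformation pathways; all names
# are invented for this example and are not AutoMed's actual API.

class Schema:
    def __init__(self, name, constructs):
        self.name = name
        self.constructs = set(constructs)

class Transformation:
    """A primitive transformation: add or delete a schema construct,
    accompanied by a query defining it in terms of the other constructs
    (this query is what enables data and query translation)."""
    def __init__(self, kind, construct, query=None):
        self.kind, self.construct, self.query = kind, construct, query

    def apply(self, schema):
        constructs = set(schema.constructs)
        if self.kind == "add":
            constructs.add(self.construct)
        elif self.kind == "delete":
            constructs.discard(self.construct)
        return Schema(schema.name + "'", constructs)

# A pathway: a sequence of primitive transformations from a source
# schema towards an integrated global schema (constructs are invented).
source = Schema("pepseeker", {"<<proteinhit,ProteinID>>"})
pathway = [
    Transformation("add", "<<Protein,accession_number>>",
                   query="[{id,an} | {id,an} <- <<proteinhit,ProteinID>>]"),
    Transformation("delete", "<<proteinhit,ProteinID>>",
                   query="[{id,an} | {id,an} <- <<Protein,accession_number>>]"),
]
target = source
for t in pathway:
    target = t.apply(target)
print(target.constructs)  # {'<<Protein,accession_number>>'}
```

Because every step records a defining query in both directions (add and delete), the pathway can be traversed either way, which is what makes virtual, materialised or hybrid integration possible over the same machinery.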
AutoMed Architecture
[Diagram: AutoMed components comprise the Global Query Processor, Global Query Optimiser, Schema Evolution Tool, Schema Transformation and Integration Tools, and Model Definition Tool, built over the Schema and Transformation Repository and the Model Definitions Repository, with Wrappers connecting to the Distributed Data Sources.]
Global Query Processing in AutoMed
We handle query language heterogeneity by translation into/from an intermediate query language, IQL
A query Q expressed in a high-level query language such as SQL on a global schema S would first be translated into IQL
For example, the following IQL query on a global schema retrieves all identifications for the protein with accession number ENSP00000339074:
[id | {id,an} <- <<Protein,accession_number>>; an=`ENSP00000339074']
View definitions are then derived from the transformation pathways between S and the data source schemas (in this case gpmDB, PEDRo and PepSeeker)
These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
Global Query Processing in AutoMed (cont’d)
E.g. for Q as above the reformulated query is:
[id | {id,an} <- [{id2lsid [`pepseeker.proteinhit:', toString d], x} | {d,x} <- distinct [{k,x} | {k,x} <- <<proteinhit,ProteinID>>]]
              ++ [{id2lsid [`pedro.protein:', toString d], x} | {d,x} <- <<protein,accession_num>>]
              ++ [{id2lsid [`gpmdb.proseq:', toString d], x} | {d,x} <- <<proseq,label>>];
      an=`ENSP00000339074']
Global Query Processing (cont’d)
Query optimisation then occurs
One goal of this is to generate the largest possible sub-queries that can be submitted to data source Wrappers for translation into the data source query languages and evaluation by the data sources
Query evaluation then follows, during which the AutoMed Evaluator submits to Wrappers sub-queries that they are able to translate into the data source query language (currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data resources)
The Wrappers submit sub-queries to data sources, and translate sub-query results back into the IQL type system
The Evaluator then undertakes any further necessary query evaluation to combine sub-query results
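The evaluation step can be sketched as follows. The wrapper function, the data layout and every row value apart from the accession number ENSP00000339074 (which appears in the example query above) are invented for illustration; the sketch only shows the pattern of pushing sub-queries to wrapped sources, appending their results, and applying the remaining filter centrally.

```python
# Hedged sketch of AutoMed-style evaluation: wrappers translate the
# sub-queries they can handle, and the Evaluator combines the results.
# All function names and row values (except the accession number taken
# from the lecture's example) are invented for illustration.

def sql_wrapper(subquery_rows):
    # Stands in for a Wrapper that pushes a sub-query down to an SQL
    # source and converts the rows into the IQL type system (here, tuples)
    return [tuple(row) for row in subquery_rows]

# Sub-query results coming back from three wrapped sources
gpmdb_rows     = [["gpmdb.proseq:1", "ENSP00000339074"]]
pedro_rows     = [["pedro.protein:7", "ENSP00000339074"]]
pepseeker_rows = [["pepseeker.proteinhit:3", "P00001"]]

# The Evaluator appends (++) the translated sub-query results and then
# applies the remaining filter from the reformulated IQL query
combined = (sql_wrapper(gpmdb_rows)
            + sql_wrapper(pedro_rows)
            + sql_wrapper(pepseeker_rows))
result = [ident for ident, an in combined if an == "ENSP00000339074"]
print(result)  # ['gpmdb.proseq:1', 'pedro.protein:7']
```

The point of the optimisation goal above is visible here: the larger the sub-query each wrapper can accept, the less of this combining and filtering work is left for the central Evaluator.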
OGSA-DAI and OGSA-DQP
OGSA-DAI (Data Access and Integration)
• delivers data access, transport and metadata services for the grid
• there are other OGSA services that focus on data derivation, consistency and replication services
OGSA-DQP (Distributed Query Processing)
• provides services for the compilation, optimisation and distributed evaluation of queries over grid data resources accessed via OGSA-DAI
OGSA-DAI functionality
provides a consistent interface to data resources regardless of the underlying technology e.g. relational (Oracle, DB2, MySQL) or XML (Xindice; eXist)
OGSA-DAI extends standard Grid Services with several new port types, including • Grid Data Service (GDS)
OGSA-DAI functionality
Grid Data Service (GDS):
• accepts requests, in the form of XML documents, instructing the Grid Service instance to interact with a database in order to create, retrieve, update or delete data
• its primary operation is perform, through which such requests are passed to the Grid Service
• a request may consist of a collection of linked activities, e.g. a data access, followed by a data translation, followed by a data delivery
• all of these can be bundled into one request in order to reduce the number of round trips required between the client and the service
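The idea of bundling linked activities into a single perform document can be illustrated with the sketch below. The element names only approximate the shape of an OGSA-DAI request and are not the exact OGSA-DAI schema; what matters is the pattern of one activity's named output stream feeding the next activity's input, all within one document.

```python
# Illustrative only: element names approximate the shape of an OGSA-DAI
# perform document but do not follow the exact OGSA-DAI schema.
import xml.etree.ElementTree as ET

request = ET.Element("gridDataServicePerform")

# Activity 1: a data access (an SQL query against the wrapped database)
query = ET.SubElement(request, "sqlQueryStatement", name="stmt")
ET.SubElement(query, "expression").text = "SELECT * FROM proteinhit"
ET.SubElement(query, "resultStream", name="stmtOutput")

# Activity 2: a data translation, linked to the output of activity 1
xform = ET.SubElement(request, "xslTransform", name="toCSV",
                      input="stmtOutput")
ET.SubElement(xform, "resultStream", name="csvOutput")

# Activity 3: a data delivery, linked to the output of activity 2
ET.SubElement(request, "deliverToURL", input="csvOutput",
              url="http://example.org/results")

# A single round trip carries all three linked activities
document = ET.tostring(request, encoding="unicode")
print(document)
```

Without the bundling, each of the three activities would cost the client a separate round trip to the service.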
OGSA-DQP functionality
OGSA-DQP implements the GDS and GDT port types from OGSA-DAI and also adds two new port types: GDQS and GQES.
Grid Distributed Query Service (GDQS):
• can interact with known registries to obtain the schemas of data resources and also information about computational resources
• this set-up phase occurs once in the lifetime of a GDQS instance
• clients can then submit a query to the GDQS via the GDS port-type, using a perform call
• this is compiled, optimised and partitioned into a distributed query execution plan, each of whose partitions will be scheduled for execution at a different GQES (see below)
• the GDQS uses this information to create the necessary GQES instances on their designated execution nodes, and hands over to each GQES the partition assigned to it
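The final hand-off can be sketched as follows, with every class and function name invented for illustration: the GDQS pairs each plan partition with its designated execution node, creates a GQES instance there, and assigns it the partition.

```python
# Hedged sketch of the GDQS hand-off; all names are invented for
# illustration and are not the OGSA-DQP API.

class GQES:
    """Stands in for a Grid Query Evaluation Service instance."""
    def __init__(self, node):
        self.node = node
        self.partition = None

    def assign(self, partition):
        self.partition = partition

def schedule(plan_partitions, nodes):
    # The GDQS creates one GQES per partition on its designated node
    # and hands over to each GQES the partition assigned to it
    evaluators = []
    for partition, node in zip(plan_partitions, nodes):
        gqes = GQES(node)
        gqes.assign(partition)
        evaluators.append(gqes)
    return evaluators

# An invented three-partition plan over two scans and a join
partitions = ["scan(gpmdb)", "scan(pedro)", "join+filter"]
nodes = ["node-a", "node-b", "node-c"]
evaluators = schedule(partitions, nodes)
print([(e.node, e.partition) for e in evaluators])
```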
OGSA-DQP functionality
Grid Query Evaluation Service (GQES):
• each GQES instance is an execution node in a distributed query execution plan
• it is responsible for that part of the execution plan allocated to it by the GDQS
• it implements a physical algebra over other Grid Data Services
• encapsulated within these other GDSs are the data resources whose schemas were imported during the GDQS set-up phase
DAI/DQP/AutoMed Interoperability
Data sources wrapped with OGSA-DAI
AutoMed-DAI wrappers extract data sources’ metadata
Semantic integration of data sources using transformation pathways
IQL queries submitted to an integrated schema are reformulated into IQL queries on the data sources, using the transformation pathways
These reformulated queries are then submitted to DQP (rather than AutoMed) for evaluation
AutoMed Wrappers
[Diagram: DAI/DQP/AutoMed architecture. Each database is fronted by an OGSA-DAI Service; AutoMed wrappers, via OGSA-DAI Activities, extract the sources' metadata into AutoMed Schemas held in the AutoMed Repository, and these are integrated into an Integrated AutoMed Schema. An IQL query posed on the integrated schema is reformulated by the AutoMed Query Processor and passed to the AutoMed DQP wrapper, which translates it into an OQL query for the Distributed Query Processor; the OQL result is translated back into an IQL result for the client.]
The AutoMed-DAI Wrapper
The AutoMed-DAI wrapper requests the schema of the data source using an OGSA-DAI service
The service replies with the source schema encoded in an XML response document
The AutoMed-DAI wrapper creates the corresponding schema in the AutoMed repository
[Diagram: the AutoMed wrapper sends a schema request to the OGSA-DAI Service fronting the database, receives an XML response, and creates the corresponding AutoMed Schema.]
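These three steps can be sketched as follows. The XML layout of the response and the function names are invented for illustration and do not reflect the actual OGSA-DAI response format; the sketch only shows the pattern of requesting a schema, parsing the XML reply, and registering the corresponding constructs.

```python
# Hedged sketch of the AutoMed-DAI wrapper's three steps; the XML layout
# and all names are invented and do not follow the real OGSA-DAI format.
import xml.etree.ElementTree as ET

# The XML response document, as it might arrive from the OGSA-DAI service
xml_response = """
<schema name="pepseeker">
  <table name="proteinhit">
    <column name="ProteinID"/>
    <column name="Score"/>
  </table>
</schema>
"""

def create_automed_schema(response_text):
    # Parse the response and register the corresponding schema
    # constructs in a (here simulated) AutoMed repository
    root = ET.fromstring(response_text)
    repository = {}
    for table in root.findall("table"):
        columns = [c.get("name") for c in table.findall("column")]
        repository[table.get("name")] = columns
    return root.get("name"), repository

name, schema = create_automed_schema(xml_response)
print(name, schema)  # pepseeker {'proteinhit': ['ProteinID', 'Score']}
```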
The AutoMed-DQP Wrapper
The AutoMed-DQP wrapper undertakes two tasks:
• it needs to inform AutoMed of the subset of IQL that it is capable of translating into OQL
• it is responsible for making interactions with OGSA-DQP transparent to the remainder of the AutoMed infrastructure
On receiving an IQL query, the AutoMed-DQP wrapper first translates it into the equivalent OQL query
The OQL query is then sent to OGSA-DQP for evaluation
The reply from OGSA-DQP is in the form of an XML response document containing the query results
The AutoMed-DQP wrapper translates these results into the IQL type system, and returns the result to AutoMed's evaluator for any further necessary evaluation
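The wrapper's round trip can be sketched as follows. The canned IQL-to-OQL translation, the stub standing in for OGSA-DQP, and the result values are all invented for illustration; only the accession number comes from the earlier example query.

```python
# Hedged sketch of the AutoMed-DQP wrapper's round trip; the OQL string,
# the stub DQP, the result values and all names are invented.
import xml.etree.ElementTree as ET

def stub_dqp_evaluate(oql_query):
    # Stands in for OGSA-DQP: evaluates the OQL query and returns the
    # results as an XML response document (values invented here)
    return ("<results><row>gpmdb.proseq:1</row>"
            "<row>pedro.protein:7</row></results>")

def dqp_wrapper(iql_query):
    # 1. Translate the IQL query into the equivalent OQL (canned here;
    #    the real translation handles the wrapper's declared IQL subset)
    oql = ("select p.id from Protein p "
           "where p.accession_number = 'ENSP00000339074'")
    # 2. Send the OQL to OGSA-DQP and parse the XML response
    response = ET.fromstring(stub_dqp_evaluate(oql))
    # 3. Translate the results back into the IQL type system (a list),
    #    ready for any further evaluation by AutoMed's Evaluator
    return [row.text for row in response.findall("row")]

result = dqp_wrapper("[id | {id,an} <- <<Protein,accession_number>>; "
                     "an=`ENSP00000339074']")
print(result)  # ['gpmdb.proseq:1', 'pedro.protein:7']
```

This is what makes OGSA-DQP transparent to the rest of AutoMed: the Evaluator sees only IQL in and IQL-typed results out.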
Background Reading for Part II
• OGSA Version 1.0 document, January 2005
• “Service-Based Distributed Querying on the Grid” by Alpdemir et al., Proc. of the 1st International Conference on Service Oriented Computing, 2003, pp 467-482
• “The design and implementation of grid database services in OGSA-DAI” by Antonioletti et al., Concurrency - Practice and Experience, Vol 17, No 2-4, 2005, pp 357-376