July 27, 2005, High Performance Distributed Computing 05
Recording and Using Provenance in a Protein Compressibility Experiment
Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong,
Klaus-Peter Zauner and Luc Moreau
University of Southampton
Outline
- Biology
- The Workflow
- Use Cases
- Provenance
- Implementation
- Evaluation
- Conclusion
Biology
- How do protein sequences (chains of amino acids) fold into a 3D structure?
- Which part of DNA translates into one protein sequence?
- The structure of protein sequences may help to answer these questions.
- Structure can be quantified by textual compressibility.
- Which amino acid groupings maximize compressibility?
The Workflow
- Get Sequences
- Make a Sample
- Recode Sample
- Compress and Measure
- Shuffle the sample
- Compress and Measure each permutation
- Collate all measures
- Produce the average compressibility
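The workflow steps above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual code: the grouping in `GROUPS`, the use of `zlib` as the compressor, and every function name here are assumptions made for the example.

```python
import random
import zlib
from statistics import mean

# Hypothetical grouping of the 20 amino acids into classes; the groupings
# explored in the real experiment are not given in the slides.
GROUPS = {"hydrophobic": "AVLIMFWC", "polar": "STNQYGPH", "charged": "DEKR"}


def recode(sample, groups):
    """Recode a sequence by replacing each amino acid with its group symbol."""
    symbol_of = {}
    for index, members in enumerate(groups.values()):
        for amino_acid in members:
            symbol_of[amino_acid] = chr(ord("a") + index)
    return "".join(symbol_of[amino_acid] for amino_acid in sample)


def compressed_size(text):
    """Compress and measure: size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("ascii"), 9))


def shuffled(text, rng):
    """Return one random permutation of the symbols in text."""
    symbols = list(text)
    rng.shuffle(symbols)
    return "".join(symbols)


def run_workflow(sample, groups, permutations=20, seed=0):
    """Recode the sample, measure it and its permutations, collate the measures."""
    rng = random.Random(seed)
    recoded = recode(sample, groups)
    sample_size = compressed_size(recoded)
    shuffled_sizes = [
        compressed_size(shuffled(recoded, rng)) for _ in range(permutations)
    ]
    return {"sample": sample_size, "shuffled_mean": mean(shuffled_sizes)}
```

Shuffling keeps the symbol frequencies but destroys the ordering, so comparing the sample's compressed size with the mean over its permutations isolates the contribution of ordering to compressibility.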
Use Case (1)
- A bioinformatician, A, downloads sequence data of microbial proteins from the RefSeq database and runs the compressibility experiment.
- A later performs the same experiment on the same sequence data, again downloaded from RefSeq.
- A compares the two experiment results and notices a difference.
- A determines whether the difference was caused by the algorithms changing.
Use Case (2)
- A bioinformatician performs an experiment on a FASTA sequence encoding a protein.
- A reviewer later determines whether or not the sequence was in fact processed by a service that meaningfully processes protein sequences only.
Provenance
- The use cases relate to process.
- Provenance definition: the provenance of a result is the process that led to that result.
  - This is a conceptual definition.
Documentation of Process
- Conceive a computer-based representation of provenance.
- We represent the provenance of some data by documenting the process that led to the data. Documentation:
  - can be complete or partial;
  - can be accurate or inaccurate;
  - can present conflicting or consensual views of the actors involved;
  - can provide operational details of execution or can be abstract.
Heterogeneity
- This is a heterogeneous application
  - Has shell scripts, Java programs, web services
- Heterogeneity is common in Grid-based apps
  - LCG Atlas: Athena & VDT coexist
- Support for plugging in different execution environments
Provenance “Lifecycle”
[Diagram: two Application boxes, their Results, and a Provenance Store. Arrow labels: "Record Documentation of Process" and "Query to retrieve the provenance of a result".]
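The record and query phases of the lifecycle can be illustrated with a small in-memory store. This is a sketch under stated assumptions, not the PReServ API: the `ProvenanceStore` class, the `Interaction` record, and its fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One documented step: a service consumed inputs and produced an output."""
    service: str
    inputs: tuple
    output: str


class ProvenanceStore:
    """Hypothetical in-memory store of process documentation."""

    def __init__(self):
        self._by_output = {}

    def record(self, interaction):
        """Record phase: file the documentation under the data item it produced."""
        self._by_output[interaction.output] = interaction

    def provenance_of(self, result):
        """Query phase: return the documented steps that led to result, earliest first."""
        steps, seen = [], set()

        def visit(item):
            interaction = self._by_output.get(item)
            if interaction is None or item in seen:
                return
            seen.add(item)
            # Visit the inputs first so that earlier steps are listed first.
            for source in interaction.inputs:
                visit(source)
            steps.append(interaction)

        visit(result)
        return steps


# Hypothetical run: two services record their steps, then the result is queried.
store = ProvenanceStore()
store.record(Interaction("recode", ("sample",), "recoded"))
store.record(Interaction("compress", ("recoded",), "size"))
print([step.service for step in store.provenance_of("size")])  # ['recode', 'compress']
```

Keying the store by output keeps the query a simple backward walk from a result through the inputs of each documented step.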
Use Case 1: Do services differ between experiments?
[Diagram: a Provenance Store and two listings for Service A shown side by side, the second with one more entry than the first.]
- Retrieve documentation of experiments
- Highlight differences in services between experiments
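Once the documentation of both runs has been retrieved, highlighting the differences amounts to comparing per-service descriptions. A minimal sketch, assuming each run's documentation has been reduced to a dictionary from service name to a description such as a build label or script hash; that representation and the name `diff_services` are assumptions for illustration.

```python
def diff_services(first_run, second_run):
    """Return {service: (first, second)} for services whose description differs.

    A service missing from one run is reported with None on that side.
    """
    differences = {}
    for service in sorted(set(first_run) | set(second_run)):
        before = first_run.get(service)
        after = second_run.get(service)
        if before != after:
            differences[service] = (before, after)
    return differences


# Hypothetical documentation retrieved for two runs of the experiment.
run_1 = {"recode": "build-1", "compress": "build-1"}
run_2 = {"recode": "build-1", "compress": "build-2", "collate": "build-1"}
print(diff_services(run_1, run_2))
```

An empty result means the documented services were the same in both runs, which points the investigation away from a change in the algorithms.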
Implementation
- Implemented as a VDT workflow
- Scheduled by Condor
- Each service, script, and command records process documentation into a provenance store
- Uses PReServ, a web services implementation of a provenance store
PReServ Implementation Diagram
[Diagram labels: Web Service, WS Client, Query Actor WS; PS Client Side Library (three instances); AxisHandler (two instances); Provenance Service; Backend Store Interface; Backend Stores (DatabaseStore, In-MemoryStore, …). Legend: WS Calls, Java Calls.]
Evaluation Deployment
- Runs on VMWare
  - deployment consistency
  - ease of development
- Workflow is executed on one machine
- PReServ runs on another machine
Recording Performance
Query Performance
Conclusion
- Both recording and query times are linear
- 10% overhead for asynchronous recording
- Our provenance concept and system are grounded in a number of use cases
- The experiment is ready to be moved to a cluster or a grid
  - Southampton Cluster
  - A Grid
  - Will allow us to test scalability
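Asynchronous recording means the workflow hands its documentation to a background worker instead of waiting for the store to accept each record. The sketch below shows that pattern with a queue and a thread. It is an illustration only: `AsyncRecorder` and the `submit` callable are hypothetical names, not part of PReServ.

```python
import queue
import threading


class AsyncRecorder:
    """Buffer process documentation and send it to the store in the background."""

    def __init__(self, submit):
        # submit is any callable that delivers one record to a provenance store.
        self._submit = submit
        self._pending = queue.Queue()
        self.failures = []
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, documentation):
        """Return immediately; the worker thread performs the actual submission."""
        self._pending.put(documentation)

    def _drain(self):
        while True:
            documentation = self._pending.get()
            if documentation is None:  # sentinel queued by close()
                return
            try:
                self._submit(documentation)
            except Exception as error:
                # Keep draining; a real recorder would retry or log the failure.
                self.failures.append((documentation, error))

    def close(self):
        """Submit everything still queued, then stop the worker."""
        self._pending.put(None)
        self._worker.join()


# Hypothetical usage: a list stands in for the remote provenance store.
received = []
recorder = AsyncRecorder(received.append)
recorder.record("compress invoked on sample 1")
recorder.record("compress returned a size for sample 1")
recorder.close()
```

The caller pays only for the enqueue, so the cost of talking to the store moves off the workflow's critical path, at the price of needing `close()` before the documentation is guaranteed to be stored.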
Contact Info
Paul Groth
www.pasoa.org
- use case descriptions
- papers
- PReServ software
Configuration
- Redhat Linux 9.1 on VMWare on Windows XP
- Pentium P4 2.8 GHz, 1.5 GB RAM
- PReServ on another machine
- Database backend: Berkeley JDB
- 100 Mb/s local Ethernet