July 27, 2005, High Performance Distributed Computing 05: Recording and Using Provenance in a Protein Compressibility Experiment. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner and Luc Moreau, University of Southampton.


Page 1:

July 27, 2005, High Performance Distributed Computing 05

Recording and Using Provenance in a Protein Compressibility Experiment

Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner and Luc Moreau

University of Southampton

Page 2:

Outline

- Biology
- The Workflow
- Use Cases
- Provenance
- Implementation
- Evaluation
- Conclusion

Page 3:

Biology

- How do protein sequences (chains of amino acids) fold into a 3D structure?
- Which part of the DNA translates into one protein sequence?
- The structure of protein sequences may help to answer these questions.
- Structure can be quantified by textual compressibility.
- Which amino acid groupings maximize compressibility?
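
A minimal sketch of how textual compressibility under an amino acid grouping might be measured. The grouping table, the function names and the use of zlib are assumptions made for illustration; the slides do not name the compressor or the groupings that were studied.

```python
import zlib

# Hypothetical grouping: every amino acid letter maps to a group symbol.
# The groupings actually evaluated in the experiment are not given on the slides.
EXAMPLE_GROUPING = {
    "A": "h", "V": "h", "L": "h", "I": "h", "M": "h", "F": "h", "W": "h",
    "S": "p", "T": "p", "N": "p", "Q": "p", "Y": "p", "C": "p", "G": "p", "P": "p",
    "K": "c", "R": "c", "H": "c", "D": "c", "E": "c",
}


def recode(sequence, grouping):
    """Replace each amino acid letter with the symbol of its group."""
    return "".join(grouping[residue] for residue in sequence)


def compressed_size(text):
    """Size in bytes of the zlib-compressed text (zlib is a stand-in compressor)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def compression_ratio(text):
    """Compressed bytes per input character; lower means more compressible."""
    return compressed_size(text) / len(text)
```

Comparing `compression_ratio(recode(seq, grouping))` across candidate groupings is one way to rank them, under the assumptions stated above.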

Page 4:

The Workflow

- Get sequences
- Make a sample
- Recode the sample
- Compress and measure
- Shuffle the sample
- Compress and measure each permutation
- Collate all measures
- Produce the average compressibility
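
The workflow steps can be sketched in a few lines of Python. This is an illustration only: the function names, the sampling strategy, the use of zlib and the shape of the collated result are all assumptions, since the slides do not specify them.

```python
import random
import statistics
import zlib


def compressed_size(text):
    """Bytes needed for the zlib-compressed text (stand-in compressor)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def make_sample(sequences, sample_length, rng):
    """Concatenate randomly ordered sequences and cut to the requested length."""
    pool = list(sequences)
    rng.shuffle(pool)
    sample = ""
    for sequence in pool:
        if len(sample) >= sample_length:
            break
        sample += sequence
    return sample[:sample_length]


def recode(sample, grouping):
    """Map each symbol to its group; symbols without a group are kept unchanged."""
    return "".join(grouping.get(symbol, symbol) for symbol in sample)


def shuffled(text, rng):
    """Return a random permutation of the symbols in text."""
    symbols = list(text)
    rng.shuffle(symbols)
    return "".join(symbols)


def measure(sample, permutations, rng):
    """Compress the sample and its permutations, then collate the sizes."""
    shuffled_sizes = [compressed_size(shuffled(sample, rng)) for _ in range(permutations)]
    return {
        "sample_size": compressed_size(sample),
        "mean_shuffled_size": statistics.mean(shuffled_sizes),
    }
```

Shuffling keeps the symbol frequencies but destroys their order, so a sample that compresses better than its permutations contains structure beyond its composition.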

Page 5:

Use Case (1)

- A bioinformatician, A, downloads sequence data of microbial proteins from the database RefSeq.
- A runs the compressibility experiment.
- A later performs the same experiment on the same sequence data, again downloaded from RefSeq.
- A compares the two experiment results and notices a difference.
- A determines whether the difference was caused by the algorithms changing.

Page 6:

Use Case (2)

A bioinformatician performs an experiment on a FASTA sequence encoding a protein.

A reviewer later determines whether the sequence was in fact processed by a service that meaningfully processes only protein sequences.
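
One way to picture the reviewer's check is as a query over recorded steps. The record shape below (`Interaction` with `service`, `input_id` and `accepts`) is hypothetical and is not the format used by the authors; it only illustrates the kind of question process documentation can answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One documented step: a service received a data item (hypothetical shape)."""
    service: str
    input_id: str
    accepts: str  # kind of sequence the service declares it can process


def steps_for(records, data_id):
    """All documented steps in which the given data item was an input."""
    return [record for record in records if record.input_id == data_id]


def processed_only_by(records, data_id, kind):
    """True if the item was processed at least once, and only by services accepting `kind`."""
    steps = steps_for(records, data_id)
    return bool(steps) and all(step.accepts == kind for step in steps)
```

For example, `processed_only_by(records, "seq-1", "protein")` would be false if any documented step sent `seq-1` to a service that declares a different sequence kind.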

Page 7:

Provenance

- The use cases relate to process.
- Provenance definition: the provenance of a result is the process that led to that result.
  - This is a conceptual definition.

Page 8:

Documentation of Process

- Conceive a computer-based representation of provenance.
- We represent the provenance of some data by documenting the process that led to the data. This documentation:
  - can be complete or partial;
  - can be accurate or inaccurate;
  - can present conflicting or consensual views of the actors involved;
  - can provide operational details of execution or be abstract.

Page 9:

Heterogeneity

- This is a heterogeneous application
  - It has shell scripts, Java programs and web services
- Heterogeneity is common in Grid-based applications
  - LCG Atlas: Athena and VDT coexist
- Support for plugging in different execution environments

Page 10:

Provenance “Lifecycle”

[Diagram. Labels: Application (shown twice), Results, Provenance Store, "Record Documentation of Process", "Query to retrieve the provenance of a result".]
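
The record-then-query lifecycle can be sketched with an in-memory stand-in. The class and method names here are assumptions for illustration and are not the PReServ API.

```python
class ProvenanceStore:
    """In-memory stand-in for a provenance store (illustrative, not PReServ)."""

    def __init__(self):
        self._records = []

    def record(self, actor, inputs, outputs):
        """Document one step: which actor turned which inputs into which outputs."""
        self._records.append(
            {"actor": actor, "inputs": list(inputs), "outputs": list(outputs)}
        )

    def provenance_of(self, result):
        """Return the documented steps that led to `result`, upstream steps first."""
        trace, seen = [], set()

        def visit(item):
            for rec in self._records:
                if item in rec["outputs"] and id(rec) not in seen:
                    seen.add(id(rec))
                    for parent in rec["inputs"]:
                        visit(parent)
                    trace.append(rec)

        visit(result)
        return trace
```

An application would call `record` as each step runs; a later `provenance_of("average")` walks back from that result through every step that contributed to it.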

Page 11:

Use Case 1: Do services differ between experiments?

[Diagram. Labels: Provenance Store; "Retrieve documentation of experiments"; two listings for Service A, one per experiment, the second with one more entry than the first; "Highlight differences in services between experiments".]
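
A small sketch of the comparison step, assuming each experiment's documentation has already been retrieved and reduced to a mapping from service name to a recorded description (for example a version or a script hash). That mapping is a hypothetical simplification, not the format stored by the authors.

```python
def service_differences(run_a, run_b):
    """Return {service: (description_in_a, description_in_b)} for services that differ.

    A service missing from one run is reported with None on that side.
    """
    differences = {}
    for name in sorted(set(run_a) | set(run_b)):
        before, after = run_a.get(name), run_b.get(name)
        if before != after:
            differences[name] = (before, after)
    return differences
```

An empty result means the documented services were identical in both runs, which would point the bioinformatician away from an algorithm change as the cause of the difference.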

Page 12:

Implementation

- Implemented as a VDT workflow
- Scheduled by Condor
- Each service, script and command records process documentation into a provenance store
- Uses PReServ, a web services implementation of a provenance store

Page 13:

PReServ Implementation Diagram

[Diagram. Components: Axis Handler, Provenance Service, Backend Store Interface, Backend Stores (Database Store, In-Memory Store, …), PS Client Side Library, Web Service, WS Client, Query Actor WS. Connection types: WS calls, Java calls.]
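
The pluggable backend idea in the diagram can be sketched as an abstract interface with interchangeable implementations. PReServ itself is a Java web service; the Python class names, method names and the use of SQLite below are assumptions chosen only to illustrate the pattern.

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class BackendStore(ABC):
    """Interface the service depends on, independent of where documents live."""

    @abstractmethod
    def put(self, key, document): ...

    @abstractmethod
    def get(self, key): ...


class InMemoryStore(BackendStore):
    def __init__(self):
        self._data = {}

    def put(self, key, document):
        self._data.setdefault(key, []).append(document)

    def get(self, key):
        return list(self._data.get(key, []))


class SqliteStore(BackendStore):
    """Database-backed store; SQLite stands in for the real database backend."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS docs (key TEXT, body TEXT)")

    def put(self, key, document):
        self._db.execute("INSERT INTO docs VALUES (?, ?)", (key, json.dumps(document)))
        self._db.commit()

    def get(self, key):
        rows = self._db.execute(
            "SELECT body FROM docs WHERE key = ? ORDER BY rowid", (key,)
        )
        return [json.loads(body) for (body,) in rows]


class ProvenanceService:
    """Delegates recording and querying to whichever backend it was given."""

    def __init__(self, backend):
        self._backend = backend

    def record(self, interaction_id, document):
        self._backend.put(interaction_id, document)

    def query(self, interaction_id):
        return self._backend.get(interaction_id)
```

Because `ProvenanceService` only sees `BackendStore`, swapping the in-memory store for the database store requires no change to the recording or querying code.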

Page 14:

Evaluation Deployment

- Runs on VMware
  - deployment consistency
  - ease of development
- Workflow is executed on one machine
- PReServ runs on another machine

Page 15:

Recording Performance

Page 16:

Query Performance

Page 17:

Conclusion

- Both recording and query times are linear
- 10% overhead for asynchronous recording
- Our provenance concept and system are grounded in a number of use cases
- The experiment is ready to be moved to a cluster or a grid
  - Southampton cluster
  - A Grid
  - Will allow us to test scalability
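
Asynchronous recording generally means the workflow hands documentation to a background worker instead of waiting for the store to acknowledge it. The sketch below shows that general pattern with a queue and a thread; `AsyncRecorder` and its `submit` callable are hypothetical, and the slides do not describe how PReServ implements it.

```python
import queue
import threading


class AsyncRecorder:
    """Queue records and deliver them to the store from a background thread."""

    def __init__(self, submit):
        self._submit = submit  # callable that delivers one record to the store
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, document):
        """Return immediately; the worker thread delivers the document later."""
        self._pending.put(document)

    def _drain(self):
        while True:
            document = self._pending.get()
            if document is None:  # sentinel: stop the worker
                break
            self._submit(document)

    def close(self):
        """Deliver everything still queued, then stop the worker."""
        self._pending.put(None)
        self._worker.join()
```

With this shape, the cost seen by each workflow step is roughly that of enqueueing the record, while delivery to the store overlaps with the computation.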

Page 18:

Contact Info

Paul Groth

www.pasoa.org
- use case descriptions
- papers
- PReServ software

Page 19:

Configuration

- Red Hat Linux 9.1 on VMware on Windows XP
- Pentium 4, 2.8 GHz, 1.5 GB RAM
- PReServ on another machine
- Database backend: Berkeley JDB
- 100 Mb local Ethernet