Invited talk @ DCC09 workshop

Paolo Missier, Information Management Group, School of Computer Science, University of Manchester, UK, with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester. Research objects, myExperiment, and Open Provenance for collaborative E-science. REPRISE workshop, IDCC’09, London.


DESCRIPTION

Presentation at the REPRISE workshop, Digital Curation Conference 2009, London

TRANSCRIPT

Page 1: Invited talk @ DCC09 workshop

IDCC’09, London - P.Missier

Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK

with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester

Research objects, myExperiment, and Open Provenance for collaborative E-science

REPRISE workshop, IDCC’09

Scientific Workflow Management System

Janus Provenance

Page 4: Invited talk @ DCC09 workshop

Momentum on sharing and collaboration

Special issue of Nature on Data Sharing (Sept. 2009): http://www.nature.com/news/specials/datasharing/index.html

Prepublication data sharing: Nature 461, 168-170 (10 September 2009) | doi:10.1038/461168a; Published online 9 September 2009

The Toronto group: Toronto International Data Release Workshop Authors, Nature 461, 168–169 (2009)

• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case

• Ongoing debate in several communities
  – Clinical trials [1]
  – Earth Sciences: ESIP, data preservation / stewardship, 2009
  – Long established in some communities: Atmospheric sciences, 1998 [2]
• Science Commons recommendations for Open Science
  – Open Science recommendations from Science Commons (July 2008) [link]

Page 11: Invited talk @ DCC09 workshop

Reference scenario

[Diagram: workflow + input dataset specification → workflow execution → outcome (data) and outcome (provenance) → packaging → Research Object → browse, query, unbundle, reuse. Overall label: data-mediated implicit collaboration.]

Page 12: Invited talk @ DCC09 workshop

Collaboration through data

What is needed for B to make sense of A’s data?

1. Packaging:
   – standards for self-descriptive data + metadata bundles: Research Objects
2. Content:
   – data format standardization efforts
   – metadata representation
     • process provenance
       – workflow provenance
3. Container:
   – a repository for Research Objects

Page 22: Invited talk @ DCC09 workshop

[Diagram: Paul’s Pack viewed as Paul’s Research Object. Items: Workflow 16, Workflow 13, Results (two sets), Logs, Metadata, Paper, Slides, QTL, Common pathways. Relations between items: feeds into, produces, included in, published in. Layers highlighted: Representation, Domain Relations, Aggregation.]

Page 23: Invited talk @ DCC09 workshop

ORE: representing generic aggregations

Resource Map (descriptor)

Data structure

http://www.openarchives.org/ore/1.0/primer.html section 4

A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
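
The aggregation idea can be sketched in a few lines of Python: a Resource Map is a descriptor that describes an Aggregation, which in turn aggregates resources. The predicates `describes` and `aggregates` are taken from the ORE terms vocabulary; the `example.org` URIs and the helper `resource_map_triples` are hypothetical, and the aggregated items only echo labels from the Research Object slide. This is an illustration under those assumptions, not the serialization used by myExperiment.

```python
ORE = "http://www.openarchives.org/ore/terms/"


def resource_map_triples(rem_uri, aggregation_uri, resource_uris):
    """Return (subject, predicate, object) triples for one aggregation."""
    # The Resource Map is the descriptor; it describes the Aggregation.
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    # The Aggregation lists each resource it bundles together.
    for uri in resource_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


# Hypothetical URIs, for illustration only.
base = "http://example.org/ro/pauls-pack"
triples = resource_map_triples(
    base + "/rem",
    base + "/aggregation",
    [base + "/workflow16", base + "/results", base + "/paper"],
)
for s, p, o in triples:
    print(s, p, o)
```

Keeping the descriptor separate from the aggregation means the bundle can be described without touching the bundled items.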

Page 27: Invited talk @ DCC09 workshop

Content: Workflow provenance

A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced

[Workflow diagram, node labels: lister, gene_id, get pathways by genes1, merge pathways, concat gene pathway ids, pathway_genes, output]

Page 28: Invited talk @ DCC09 workshop

Why provenance matters, if done right

• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design

The W3C Incubator on Provenance has been collecting numerous use cases: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#

Page 29: Invited talk @ DCC09 workshop

What users expect to learn

• Causal relations:
  - which pathways come from which genes?
  - which processes contributed to producing an image?
  - which process(es) caused data to be incorrect?
  - which data caused a process to fail?
• Process and data analytics:
  – analyze variations in output vs an input parameter sweep (multiple process runs)
  – how often has my favourite service been executed? on what inputs?
  – who produced this data?
  – how often does this pathway turn up when the input genes range over a certain set S?

[Workflow diagram, node labels: lister, gene_id, get pathways by genes1, merge pathways, concat gene pathway ids, pathway_genes, output]
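
One of the analytics questions above ("how often has my favourite service been executed? on what inputs?") can be sketched over a flat invocation log. The log layout, the run and input identifiers, and the function `service_usage` are all assumptions made for this example; a real provenance store would expose its own schema and query language.

```python
from collections import defaultdict

# Hypothetical invocation log: (run id, service name, input value).
invocations = [
    ("run1", "get_pathways_by_genes", "gene:1"),
    ("run1", "merge_pathways", "path:1"),
    ("run2", "get_pathways_by_genes", "gene:2"),
    ("run3", "get_pathways_by_genes", "gene:1"),
]


def service_usage(log, service):
    """Count executions of `service` and group the runs by input value."""
    runs_by_input = defaultdict(list)
    for run_id, name, value in log:
        if name == service:
            runs_by_input[value].append(run_id)
    count = sum(len(runs) for runs in runs_by_input.values())
    return count, dict(runs_by_input)


count, by_input = service_usage(invocations, "get_pathways_by_genes")
print(count, by_input)
```

Questions of this kind span several runs, so they need provenance from multiple executions kept in one place.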

Page 30: Invited talk @ DCC09 workshop

Open Provenance Model

• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments

Goal: standardize causal dependencies to enable provenance metadata exchange

[Diagram: the two edge types, artifact A wasGeneratedBy(R) process P, and process P used(R) artifact A; an example graph over artifacts A1 to A4 and processes P1 to P3 with edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6)]
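
A rough sketch of such a graph in Python: each edge points from an effect to its cause and carries a role. The node names, roles, and edge layout below are invented for illustration and do not reproduce the slide's figure. The traversal follows edges transitively to give an approximate lineage, which is a simplification: OPM itself is more careful about which multi-step inferences are valid.

```python
# (effect, relation, role, cause): a made-up example graph.
edges = [
    ("A3", "wasGeneratedBy", "R1", "P2"),
    ("P2", "used", "R2", "A2"),
    ("A2", "wasGeneratedBy", "R3", "P1"),
    ("P1", "used", "R4", "A1"),
    ("A4", "wasGeneratedBy", "R5", "P3"),
    ("P3", "used", "R6", "A1"),
]


def causes_of(node, edges):
    """All nodes reachable from `node` by following edges towards causes."""
    direct = {}
    for effect, _relation, _role, cause in edges:
        direct.setdefault(effect, set()).add(cause)
    seen, stack = set(), [node]
    while stack:
        for cause in direct.get(stack.pop(), ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


print(sorted(causes_of("A3", edges)))
```

Because the edges are plain tuples, the same structure can describe processes that were never run by a workflow engine.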

Page 31: Invited talk @ DCC09 workshop

The 3rd provenance challenge

• Chosen workflow from the Pan-STARRS project
  – Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal:
  – demonstrate “provenance interoperability” at query level

Page 32: Invited talk @ DCC09 workshop

The 3rd provenance challenge workflow

[Workflow diagram, steps shown: read input file, load database, verify]

Page 36: Invited talk @ DCC09 workshop

OPM and query-interoperability

Team A
- encode W as WA
- run WA, obtaining prov(WA)
- execute query Q, obtaining Q(prov(WA))
- export prov(WA) as OPM(prov(WA))

?

Team B
- import: PWA = import(OPM(prov(WA)))
- execute query Q, obtaining Q(PWA)
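
The round trip above can be sketched as a small check: Team A answers Q on its native trace, exports the trace as OPM-style edges, and Team B imports those edges into its own store and answers the same Q. The native trace format, the artifact names, and all function names here are hypothetical; only the step names come from the challenge workflow slide. The sketch shows what "interoperability at query level" would mean, not how any team implemented it.

```python
# Team A's native trace: process -> (inputs, outputs). Hypothetical format.
prov_wa = {
    "read_input_file": (["file.csv"], ["rows"]),
    "load_database": (["rows"], ["db_table"]),
    "verify": (["db_table"], ["report"]),
}


def query_a(trace, artifact):
    """Q on Team A's store: artifacts that `artifact` depends on."""
    found, todo = set(), [artifact]
    while todo:
        target = todo.pop()
        for inputs, outputs in trace.values():
            if target in outputs:
                for item in inputs:
                    if item not in found:
                        found.add(item)
                        todo.append(item)
    return found


def export_opm(trace):
    """Flatten the trace into OPM-style (effect, relation, cause) edges."""
    edges = []
    for process, (inputs, outputs) in trace.items():
        edges += [(process, "used", a) for a in inputs]
        edges += [(a, "wasGeneratedBy", process) for a in outputs]
    return edges


def import_opm(edges):
    """Team B's store: effect -> set of direct causes."""
    store = {}
    for effect, _relation, cause in edges:
        store.setdefault(effect, set()).add(cause)
    return store


def query_b(store, artifact, processes):
    """Q on Team B's store: same question, with process nodes filtered out."""
    found, todo = set(), [artifact]
    while todo:
        for cause in store.get(todo.pop(), ()):
            if cause not in found:
                found.add(cause)
                todo.append(cause)
    return found - processes


pwa = import_opm(export_opm(prov_wa))
answer_a = query_a(prov_wa, "report")
answer_b = query_b(pwa, "report", set(prov_wa))
print(answer_a == answer_b)
```

The comparison only succeeds because both teams use the same artifact names, which is the identifier issue raised under the additional requirements.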

Page 39: Invited talk @ DCC09 workshop

OPM in Taverna

➡ the answer to any TP query can be viewed as an OPM graph

➡ encoded as RDF/XML (using the Tupelo provenance API)

Page 43: Invited talk @ DCC09 workshop

Additional requirements

• Artifact values require uniform common identifier scheme
  – each group used artifacts to refer to its own data results
  – but those results were expressed using proprietary naming conventions
  – Linked Data in OPM?
• OPM accounts for structural causal relationships
  – additional domain-specific knowledge required
  – attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
  – reduce size by exporting only query results
• Taverna approach
  – multiple levels of abstraction
    • through OPM accounts (“points of view”)

Page 48: Invited talk @ DCC09 workshop

Query results as OPM graphs

- encode W as WA
- run WA, obtaining prov(WA)
- execute query Q, obtaining Q(prov(WA))
- export Q(prov(WA)) as OPM(Q(prov(WA)))

- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
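
Exporting only the query answer rather than the whole trace can be sketched as a subgraph extraction. The edge format, the node names, and the function `lineage_subgraph` are assumptions for illustration; Taverna's internal provenance database and query language are not shown in these slides.

```python
def lineage_subgraph(edges, target):
    """Keep only the edges on some path from `target` back to its causes."""
    by_effect = {}
    for edge in edges:
        by_effect.setdefault(edge[0], []).append(edge)
    kept, seen, todo = [], {target}, [target]
    while todo:
        for edge in by_effect.get(todo.pop(), []):
            kept.append(edge)
            cause = edge[2]
            if cause not in seen:
                seen.add(cause)
                todo.append(cause)
    return kept


# Hypothetical full trace as (effect, relation, cause) edges.
full = [
    ("out1", "wasGeneratedBy", "P1"),
    ("P1", "used", "in1"),
    ("out2", "wasGeneratedBy", "P2"),
    ("P2", "used", "in2"),
]
small = lineage_subgraph(full, "out1")
print(len(full), len(small))
```

The exported graph holds only the edges relevant to the question asked, so it stays small even when the full trace is large.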

Page 52: Invited talk @ DCC09 workshop

Full-fledged data-mediated collaborations

[Diagram: exp. A takes workflow A + input A and yields result datasets A and result provenance A, packaged as Research Object A. Through "result A → input B", exp. B takes workflow B + input B and yields result datasets B and result provenance B, packaged as Research Object B.]

Page 55: Invited talk @ DCC09 workshop

Full-fledged data-mediated collaborations

[Diagram: a single Research Object A+B containing workflow A + input A, result datasets A, workflow B + input B, result datasets B, and result provenance A + B, linked by "result A → input B".]

Provenance composition accounts for implicit collaboration

Aligned with focus of upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
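
Provenance composition can be sketched by joining two traces on a shared artifact identifier. The edge lists and the names `compose` and `ancestors` are hypothetical, and the sketch assumes both experiments refer to A's result by the same identifier, which the talk lists as an open requirement.

```python
# Hypothetical traces as (effect, relation, cause) edges.
prov_a = [
    ("result_A", "wasGeneratedBy", "workflow_A"),
    ("workflow_A", "used", "input_A"),
]
prov_b = [
    ("result_B", "wasGeneratedBy", "workflow_B"),
    ("workflow_B", "used", "result_A"),  # result A -> input B
]


def compose(*graphs):
    """Union of edge lists, dropping duplicates but keeping order."""
    merged = []
    for graph in graphs:
        for edge in graph:
            if edge not in merged:
                merged.append(edge)
    return merged


def ancestors(edges, node):
    """All causes reachable from `node`, directly or indirectly."""
    causes = {}
    for effect, _relation, cause in edges:
        causes.setdefault(effect, set()).add(cause)
    seen, todo = set(), [node]
    while todo:
        for cause in causes.get(todo.pop(), ()):
            if cause not in seen:
                seen.add(cause)
                todo.append(cause)
    return seen


print("input_A" in ancestors(prov_b, "result_B"))                   # False
print("input_A" in ancestors(compose(prov_a, prov_b), "result_B"))  # True
```

With B's trace alone the lineage of result B stops at result A; only the composed graph reaches back to A's input.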

Page 56: Invited talk @ DCC09 workshop

Contacts

The myGrid Consortium (Manchester, Southampton)

Janus Provenance

http://www.myexperiment.org

http://mygrid.org.uk

Me: [email protected]