invited talk @ dcc09 workshop

Post on 11-May-2015

DESCRIPTION

Presentation at the REPRISE workshop, Digital Curation Conference 2009, London

TRANSCRIPT

IDCC’09, London - P.Missier

Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK

with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester

Scientific Workflow Management System

Research objects, myExperiment, and Open Provenance for collaborative E-science

REPRISE workshop, IDCC’09

Janus Provenance

Momentum on sharing and collaboration

Special issue of Nature on Data Sharing (Sept. 2009): http://www.nature.com/news/specials/datasharing/index.html

• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case

Prepublication data sharing: Nature 461, 168-170 (10 September 2009) | doi:10.1038/461168a; Published online 9 September 2009

The Toronto group: Toronto International Data Release Workshop Authors, Nature 461, 168–169 (2009)

• Ongoing debate in several communities
– Clinical trials [1]
– Earth Sciences: ESIP, data preservation / stewardship, 2009
– Long established in some communities: Atmospheric sciences, 1998 [2]
• Science Commons recommendations for Open Science
– Open Science recommendations from Science Commons (July 2008) [link]

Reference scenario

[Diagram labels: workflow + input dataset specification; workflow execution; outcome (data); outcome (provenance); Packaging; Research Object; browse, query, unbundle, reuse; Data-mediated implicit collaboration]

Collaboration through data

What is needed for B to make sense of A’s data?

1. Packaging:
– standards for self-descriptive data + metadata bundles: Research Objects
2. Content:
– data format standardization efforts
– metadata representation
• process provenance
– workflow provenance
3. Container:
– a repository for Research Objects

[Figure: Paul’s Pack viewed as Paul’s Research Object. Items: QTL, Common pathways, Workflow 16, Workflow 13, Results, Logs, Paper, Slides, Metadata. Relations between items: feeds into, produces, included in, published in. Layers: Representation, Domain Relations, Aggregation.]

ORE: representing generic aggregations

Resource Map (descriptor)

Data structure

http://www.openarchives.org/ore/1.0/primer.html section 4

A. Pepe, M. Mayernik, C.L. Borgman, and H. Van de Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
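A minimal sketch of how a pack such as Paul's might be described as an ORE aggregation, assuming a plain set of (subject, predicate, object) triples rather than a real RDF library. Only the terms `ore:ResourceMap`, `ore:Aggregation`, `ore:describes` and `ore:aggregates` come from the ORE vocabulary; the `example.org` URIs and the `produces` / `includedIn` relation names are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/ro/"  # hypothetical namespace, for illustration only


def resource_map(rem_uri, aggregation_uri, resources, relations=()):
    """Build the triples of a Resource Map describing one Aggregation."""
    triples = {
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    }
    for resource in resources:
        triples.add((aggregation_uri, ORE + "aggregates", resource))
    # Domain relations between aggregated resources sit alongside the
    # generic aggregation statements.
    triples.update(relations)
    return triples


def aggregated(triples, aggregation_uri):
    """List the resources that the aggregation aggregates."""
    return sorted(o for s, p, o in triples
                  if s == aggregation_uri and p == ORE + "aggregates")


pack = resource_map(
    EX + "pauls-pack.rdf",
    EX + "pauls-pack",
    [EX + "workflow16", EX + "results", EX + "paper"],
    relations=[
        (EX + "workflow16", EX + "produces", EX + "results"),  # hypothetical relation
        (EX + "results", EX + "includedIn", EX + "paper"),     # hypothetical relation
    ],
)
print(aggregated(pack, EX + "pauls-pack"))
```

The aggregation layer stays generic, while the domain relations carry the meaning that is specific to the experiment.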

Content: Workflow provenance

A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced

[Workflow diagram labels: lister, gene_id, get pathways by genes1, merge pathways, concat gene pathway ids, pathway_genes, output]

Why provenance matters, if done right

• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design

The W3C Incubator on Provenance has been collecting numerous use cases: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#

What users expect to learn

• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
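The two kinds of question above can be sketched against a toy execution trace. This is an illustration only: the record layout, the identifiers such as `gene:A`, and the function names are assumptions, not the trace format of any particular workflow system.

```python
from collections import Counter

# Hypothetical trace: one record per processor invocation.
TRACE = [
    {"processor": "get_pathways_by_genes", "inputs": ["gene:A"],
     "outputs": ["path:1", "path:2"]},
    {"processor": "get_pathways_by_genes", "inputs": ["gene:B"],
     "outputs": ["path:2"]},
    {"processor": "merge_pathways", "inputs": ["path:1", "path:2"],
     "outputs": ["merged:1"]},
]


def lineage(trace, value):
    """Causal question: every value that `value` transitively depends on."""
    found, frontier = set(), [value]
    while frontier:
        current = frontier.pop()
        for event in trace:
            if current in event["outputs"]:
                for source in event["inputs"]:
                    if source not in found:
                        found.add(source)
                        frontier.append(source)
    return found


def invocation_counts(trace):
    """Analytics question: how often was each processor executed?"""
    return Counter(event["processor"] for event in trace)


print(sorted(lineage(TRACE, "path:2")))   # genes that pathway path:2 comes from
print(invocation_counts(TRACE))
```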

Open Provenance Model

• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments

Goal: standardize causal dependencies to enable provenance metadata exchange

[Figure: the two edge types, "artifact A wasGeneratedBy(R) process P" and "process P used(R) artifact A", followed by an example graph over artifacts A1 to A4 and processes P1 to P3 with edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6)]
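A minimal sketch of an OPM-style graph, assuming edges are stored as plain tuples that point from effect to cause. The `used` and `wasGeneratedBy` edge names and the role labels follow the slide; the particular graph below and the helper names are hypothetical and do not reproduce the figure.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    kind: str    # "used" or "wasGeneratedBy"
    effect: str  # node the edge starts from
    cause: str   # node the edge points to
    role: str


def used(process, artifact, role):
    return Edge("used", process, artifact, role)


def was_generated_by(artifact, process, role):
    return Edge("wasGeneratedBy", artifact, process, role)


# Hypothetical example: P1 reads A1 and A2 and writes A3; P2 reads A3 and writes A4.
GRAPH = [
    was_generated_by("A3", "P1", "R1"),
    used("P1", "A1", "R3"),
    used("P1", "A2", "R4"),
    was_generated_by("A4", "P2", "R2"),
    used("P2", "A3", "R5"),
]


def upstream(graph, node):
    """All artifacts and processes reachable by following edges toward causes."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for edge in graph:
            if edge.effect == current and edge.cause not in seen:
                seen.add(edge.cause)
                stack.append(edge.cause)
    return seen


print(sorted(upstream(GRAPH, "A4")))
```

Reachability here only says that a node may have contributed to another; it is a coarser statement than a derivation asserted explicitly in the graph.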

The 3rd provenance challenge

• Chosen workflow from the Pan-STARRS project
– Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal:
– demonstrate “provenance interoperability” at query level

The 3rd provenance challenge workflow

[Workflow diagram labels: read input file, load database, verify]

OPM and query-interoperability

Team A:
• encode W as WA
• run WA, obtaining prov(WA)
• execute query Q: Q(prov(WA))
• export prov(WA): OPM(prov(WA))

Team B:
• import: PWA = import(OPM(prov(WA)))
• execute query Q: Q(PWA)

Is Q(PWA) the same as Q(prov(WA))?
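The round trip can be sketched as follows, under loud assumptions: both teams' stores are invented for illustration, JSON stands in for a real OPM serialization, and Q is taken to be "which artifacts lie upstream of a given artifact". The check at the end is the query-level interoperability test: Q(PWA) should equal Q(prov(WA)).

```python
import json

# Team A's native store (hypothetical): output -> (process, inputs).
PROV_A = {
    "db_rows": ("load_database", ["csv_file"]),
    "report": ("verify", ["db_rows"]),
}


def query_a(prov, artifact):
    """Q on Team A's store: artifacts upstream of `artifact`."""
    result = set()
    if artifact in prov:
        _, inputs = prov[artifact]
        for source in inputs:
            result |= {source} | query_a(prov, source)
    return result


def export_opm(prov):
    """OPM(prov(WA)): flatten the store into used / wasGeneratedBy edges."""
    edges = []
    for output, (process, inputs) in prov.items():
        edges.append({"kind": "wasGeneratedBy", "effect": output, "cause": process})
        edges += [{"kind": "used", "effect": process, "cause": i} for i in inputs]
    return json.dumps(edges)


def import_opm(text):
    """Team B's store (hypothetical): two lookup tables built from the edges."""
    generated_by, uses = {}, {}
    for edge in json.loads(text):
        if edge["kind"] == "wasGeneratedBy":
            generated_by[edge["effect"]] = edge["cause"]
        else:
            uses.setdefault(edge["effect"], set()).add(edge["cause"])
    return generated_by, uses


def query_b(store, artifact):
    """Q on Team B's store: same question, different data layout."""
    generated_by, uses = store
    seen, stack = set(), [artifact]
    while stack:
        process = generated_by.get(stack.pop())
        for source in uses.get(process, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


pwa = import_opm(export_opm(PROV_A))
print(query_a(PROV_A, "report") == query_b(pwa, "report"))
```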

OPM in Taverna

➡ the answer to any TP query can be viewed as an OPM graph

➡ encoded as RDF/XML (using the Tupelo provenance API)

Additional requirements

• Artifact values require uniform common identifier scheme
– each group used artifacts to refer to its own data results
– but those results were expressed using proprietary naming conventions
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
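One way to picture accounts as levels of abstraction is to tag every edge with the accounts it belongs to and filter by account. This is only a sketch: the account names `coarse` and `fine`, the node names, and the tuple layout are assumptions, not how Taverna stores them.

```python
# Each edge: (effect, cause, accounts it belongs to). All names are hypothetical.
EDGES = [
    # Coarse point of view: the whole workflow as a single process.
    ("output", "workflow", {"coarse"}),
    ("workflow", "gene_id", {"coarse"}),
    # Fine point of view: individual processors and intermediate data.
    ("output", "merge_pathways", {"fine"}),
    ("merge_pathways", "pathway_ids", {"fine"}),
    ("pathway_ids", "get_pathways", {"fine"}),
    ("get_pathways", "gene_id", {"fine"}),
]


def view(edges, account):
    """Return only the edges asserted in the given account."""
    return [(effect, cause) for effect, cause, accounts in edges
            if account in accounts]


print(view(EDGES, "coarse"))
print(len(view(EDGES, "fine")))
```

Both views explain the same `output` in terms of the same `gene_id`; they differ only in how much of the intermediate process they expose.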

Query results as OPM graphs

• encode W as WA
• run WA, obtaining prov(WA)
• execute query Q: Q(prov(WA))
• export Q(prov(WA)): OPM(Q(prov(WA)))

- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
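Exporting OPM(Q(prov(WA))) instead of OPM(prov(WA)) amounts to keeping only the edges that the query touches. A minimal sketch, assuming edges are (effect, cause) pairs and Q asks for everything upstream of one target; the edge list is hypothetical and this is not Taverna's query language.

```python
# Hypothetical provenance store: (effect, cause) pairs.
PROV = [
    ("report", "verify"),
    ("verify", "db_rows"),
    ("db_rows", "load_database"),
    ("load_database", "csv_file"),
    ("log", "load_database"),        # not upstream of "report"
    ("other_out", "other_proc"),     # unrelated branch
    ("other_proc", "other_in"),
]


def query_subgraph(edges, target):
    """Q(prov): the edges on some path leading back from `target`."""
    keep, seen, stack = [], {target}, [target]
    while stack:
        node = stack.pop()
        for effect, cause in edges:
            if effect == node:
                keep.append((effect, cause))
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
    return keep


answer = query_subgraph(PROV, "report")
print(len(answer), "of", len(PROV), "edges exported")
```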

Full-fledged data-mediated collaborations

[Diagram: exp. A yields Research Object A (result datasets A, result provenance A, workflow A + input A); exp. B yields Research Object B (result datasets B, result provenance B, workflow B + input B); result A → input B]

Full-fledged data-mediated collaborations

[Diagram: Research Object A+B containing result datasets A, workflow A + input A, result datasets B, workflow B + input B, and result provenance A + B; result A → input B]

Provenance composition accounts for implicit collaboration

Aligned with focus of upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
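Provenance composition can be sketched as a union of two edge sets that share an artifact identifier. The sketch assumes both experiments name A's result identically (`result_A`), which is exactly what a uniform identifier scheme would have to guarantee; all names are hypothetical.

```python
# (effect, cause) pairs recorded independently by each experiment.
PROV_A = {("result_A", "workflow_A"), ("workflow_A", "input_A")}
PROV_B = {("result_B", "workflow_B"), ("workflow_B", "result_A")}  # result A is input B


def compose(*graphs):
    """Join provenance graphs; shared identifiers become the connection points."""
    return set().union(*graphs)


def upstream(edges, node):
    """Everything reachable from `node` by following edges toward causes."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for effect, cause in edges:
            if effect == current and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


whole = compose(PROV_A, PROV_B)
print(sorted(upstream(PROV_B, "result_B")))  # stops at result_A
print(sorted(upstream(whole, "result_B")))   # reaches input_A
```

On B's graph alone the trail ends at `result_A`; only the composed graph shows that B's result ultimately depends on A's input.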

Contacts

The myGrid Consortium (Manchester, Southampton)

Janus Provenance

http://www.myexperiment.org

http://mygrid.org.uk

Me: pmissier@acm.org
