invited talk @ dcc09 workshop
Presentation at the REPRISE workshop, Digital Curation Conference 2009, London
IDCC’09, London - P.Missier
Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester
Scientific Workflow Management System
Research objects, myExperiment, and Open Provenance for collaborative e-Science
REPRISE workshop - IDCC’09
JanusProvenance
Momentum on sharing and collaboration
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
Prepublication data sharing: Nature 461, 168-170 (10 September 2009) | doi:10.1038/461168a; Published online 9 September 2009
The Toronto group: Toronto International Data Release Workshop Authors, Nature 461, 168–169 (2009)
Special issue of Nature on Data Sharing (Sept. 2009)
• Ongoing debate in several communities
– Clinical trials [1]
– Earth Sciences -- ESIP - data preservation / stewardship, 2009
– Long established in some communities - Atmospheric sciences, 1998 [2]
• Science Commons recommendations for Open Science
– Open Science recommendations from Science Commons (July 2008) [link]
Reference scenario
[Figure: workflow + input dataset specification; workflow execution; outcome (data); outcome (provenance); packaging into a Research Object; browse, query, unbundle, reuse; data-mediated implicit collaboration]
Collaboration through data
What is needed for B to make sense of A’s data?
1. Packaging:
– standards for self-descriptive data + metadata bundles: Research Objects
2. Content:
– data format standardization efforts
– metadata representation
• process provenance
– workflow provenance
3. Container:
– a repository for Research Objects
Paul’s Pack / Paul’s Research Object
[Figure: pack contents: QTL, Common pathways, Workflow 16, Workflow 13, Results, Logs, Results, Metadata, Paper, Slides. Layers: Representation, Domain Relations, Aggregation. Relation labels between items: feeds into, produces, included in, published in.]
ORE: representing generic aggregations
[Figure: Resource Map (descriptor); Data structure]
http://www.openarchives.org/ore/1.0/primer.html section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
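As a rough illustration of the aggregation idea, the sketch below models an ORE-style Aggregation and the Resource Map that describes it, with domain relations recorded between aggregated resources. The class names, the relation names, and all example.org URIs are assumptions made for this sketch; it is not the ORE vocabulary itself nor the myExperiment implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Aggregation:
    """A named set of aggregated resources, each identified by a URI."""
    uri: str
    aggregates: list[str] = field(default_factory=list)


@dataclass
class ResourceMap:
    """Descriptor that describes exactly one aggregation."""
    uri: str
    describes: Aggregation
    # Domain relations between aggregated resources: (subject, relation, object)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def relate(self, subject: str, relation: str, obj: str) -> None:
        """Record a relation; both ends must belong to the aggregation."""
        members = self.describes.aggregates
        if subject not in members or obj not in members:
            raise ValueError("both ends of a relation must be aggregated resources")
        self.relations.append((subject, relation, obj))


# Hypothetical URIs and relation names, for illustration only.
pack = Aggregation(
    uri="http://example.org/ro/pauls-pack",
    aggregates=[
        "http://example.org/workflow/16",
        "http://example.org/results/1",
        "http://example.org/paper",
    ],
)
rem = ResourceMap(uri="http://example.org/ro/pauls-pack.rdf", describes=pack)
rem.relate("http://example.org/workflow/16", "produces", "http://example.org/results/1")
rem.relate("http://example.org/results/1", "includedIn", "http://example.org/paper")
```

Keeping the relations in the descriptor rather than in the aggregated items means the items themselves stay untouched when they are bundled.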
Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Figure: example workflow with ports and processors: lister, gene_id, output, pathway_genes, get pathways by genes1, merge pathways, concat gene pathway ids]
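A minimal sketch of what such a trace could look like when recorded at execution time: each task run appends the inputs it used and the output it produced. The `run_task` helper, the `TraceEvent` record, and the toy gene-to-pathway table are all invented for illustration and do not reflect how Taverna captures provenance.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TraceEvent:
    task: str
    inputs: dict[str, Any]
    output: Any


trace: list[TraceEvent] = []


def run_task(name: str, func: Callable[..., Any], **inputs: Any) -> Any:
    """Execute one task and append what it consumed and produced to the trace."""
    output = func(**inputs)
    trace.append(TraceEvent(task=name, inputs=dict(inputs), output=output))
    return output


# Toy stand-ins for two processors of the example workflow (assumed data).
PATHWAYS = {"g1": ["p1", "p2"], "g2": ["p2"]}


def get_pathways_by_genes(gene_ids):
    return {g: PATHWAYS.get(g, []) for g in gene_ids}


def merge_pathways(pathways_by_gene):
    return sorted({p for ps in pathways_by_gene.values() for p in ps})


by_gene = run_task("get_pathways_by_genes", get_pathways_by_genes, gene_ids=["g1", "g2"])
merged = run_task("merge_pathways", merge_pathways, pathways_by_gene=by_gene)

for event in trace:
    print(event.task, event.inputs, "->", event.output)
```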
Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
What users expect to learn
• Causal relations:
– which pathways come from which genes?
– which processes contributed to producing an image?
– which process(es) caused data to be incorrect?
– which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
[Figure: example workflow: lister, gene_id, output, pathway_genes, get pathways by genes1, merge pathways, concat gene pathway ids]
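The causal questions above amount to reachability queries over recorded dependencies. The sketch below answers "which genes does this pathway come from?" on a toy dependency list; the identifiers, the `gene:` prefix convention, and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# (derived item, source item) pairs; values are made up for illustration.
derived_from = [
    ("pathway:p1", "gene:g1"),
    ("pathway:p2", "gene:g1"),
    ("pathway:p2", "gene:g2"),
    ("merged:out", "pathway:p1"),
    ("merged:out", "pathway:p2"),
]

parents = defaultdict(set)
for child, parent in derived_from:
    parents[child].add(parent)


def lineage(item: str) -> set[str]:
    """Return every item that `item` transitively depends on."""
    seen: set[str] = set()
    stack = [item]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_behind(item: str) -> set[str]:
    """Causal query: which input genes contributed to `item`?"""
    return {x for x in lineage(item) if x.startswith("gene:")}
```

For example, `genes_behind("merged:out")` walks back through both pathways to the genes they came from.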
Open Provenance Model
Goal: standardize causal dependencies to enable provenance metadata exchange
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments
[Figure: edge types: artifact A wasGeneratedBy(R) process P; process P used(R) artifact A. Example graph over artifacts A1, A2, A3, A4 and processes P1, P2, P3, with edges labelled wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6).]
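A small sketch of the two edge types named on the slide, used and wasGeneratedBy, each carrying a role. The graph topology below is assumed (the slide's exact wiring is not recoverable from the transcript), and `possible_sources` is a hypothetical helper: it only reports artifacts that a generating process used, which suggests but does not prove a derivation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Used:
    process: str
    artifact: str
    role: str


@dataclass(frozen=True)
class WasGeneratedBy:
    artifact: str
    process: str
    role: str


# Assumed topology: P1 used A1 and A2, and generated A3 and A4.
edges = [
    Used("P1", "A1", "R3"),
    Used("P1", "A2", "R4"),
    WasGeneratedBy("A3", "P1", "R1"),
    WasGeneratedBy("A4", "P1", "R2"),
]


def possible_sources(artifact: str) -> set[str]:
    """Artifacts used by the process(es) that generated `artifact`."""
    generators = {e.process for e in edges
                  if isinstance(e, WasGeneratedBy) and e.artifact == artifact}
    return {e.artifact for e in edges
            if isinstance(e, Used) and e.process in generators}
```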
The 3rd provenance challenge
• Chosen workflow from the Pan-STARRS project
– Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal:
– demonstrate “provenance interoperability” at query level
The 3rd provenance challenge workflow
[Figure: workflow stages: read input file, load database, verify]
OPM and query-interoperability
[Figure: Team A: encode W as WA; run WA to obtain prov(WA); execute query Q to obtain Q(prov(WA)); export prov(WA) as OPM(prov(WA)). Team B: import PWA = import(OPM(prov(WA))); execute query Q to obtain Q(PWA). A "?" links the two query results.]
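The round trip in the figure can be sketched as follows: Team A exports its provenance, Team B imports it, and both run the same query and compare answers. JSON is used here purely as a stand-in exchange format, and the edge triples, function names, and the query itself are assumptions for illustration, not the challenge's actual serializations or queries.

```python
import json

# Team A's native provenance: (effect, relation, cause) triples; toy values.
prov_wa = {
    ("A3", "wasGeneratedBy", "P1"),
    ("P1", "used", "A1"),
    ("P1", "used", "A2"),
}


def export_opm(prov):
    """Serialize to an exchange document (JSON stands in for an OPM serialization)."""
    return json.dumps({"edges": sorted(prov)})


def import_opm(document):
    """Team B's side: rebuild a native store from the exchange document."""
    return {tuple(edge) for edge in json.loads(document)["edges"]}


def query_inputs_of(prov, process):
    """Q: which artifacts did `process` use?"""
    return {cause for effect, rel, cause in prov
            if rel == "used" and effect == process}


p_wa = import_opm(export_opm(prov_wa))
# Query-level interoperability: both teams get the same answer to Q.
interoperable = query_inputs_of(prov_wa, "P1") == query_inputs_of(p_wa, "P1")
```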
OPM in Taverna
➡ the answer to any TP query can be viewed as an OPM graph
➡ encoded as RDF/XML (using the Tupelo provenance API)
Additional requirements
• Artifact values require uniform common identifier scheme
– each group used artifacts to refer to its own data results
– but those results were expressed using proprietary naming conventions
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
Query results as OPM graphs
[Figure: encode W as WA; run WA to obtain prov(WA); execute query Q to obtain Q(prov(WA)); export Q(prov(WA)) as OPM(Q(prov(WA)))]
- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
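To illustrate why exporting the query result can be much smaller than exporting the whole graph, the sketch below runs a lineage query first and keeps only the edges it touches. The edge triples and the `upstream_subgraph` function are hypothetical; Taverna's internal provenance database and query language are not shown here.

```python
# Full provenance graph as (effect, relation, cause) triples; toy values.
edges = {
    ("out", "wasGeneratedBy", "merge"),
    ("merge", "used", "p1"),
    ("p1", "wasGeneratedBy", "lookup"),
    ("lookup", "used", "g1"),
    ("other", "wasGeneratedBy", "unrelated"),
    ("unrelated", "used", "g2"),
}


def upstream_subgraph(graph, target):
    """Q: keep only the edges on causal paths leading to `target`."""
    keep, frontier = set(), {target}
    while frontier:
        reached = {e for e in graph if e[0] in frontier and e not in keep}
        keep |= reached
        frontier = {cause for _, _, cause in reached}
    return keep


result = upstream_subgraph(edges, "out")
print(f"exporting {len(result)} of {len(edges)} edges")
```

Only `result` would then be handed to the exporter, so the exchanged graph grows with the size of the answer rather than with the size of the run.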
Full-fledged data-mediated collaborations
[Figure, step 1: exp. A: workflow A + input A, result datasets A, result provenance A, packaged as Research Object A. Step 2: result A → input B; exp. B: workflow B + input B, result datasets B, result provenance B, packaged as Research Object B. Step 3: Research Object A+B: result datasets A, result datasets B, result provenance A + B, workflow A + input A, workflow B + input B.]
Provenance composition accounts for implicit collaboration
Aligned with focus of upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
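Provenance composition can be sketched as a union of two edge sets that join wherever both sides use the same identifier for an artifact, which is why a common identifier scheme matters. The `urn:example:` identifiers and the helper functions below are assumptions for illustration, not part of OPM or of the Provenance Challenge.

```python
# Provenance of experiment A and of experiment B as (effect, relation, cause)
# triples. The shared identifier "urn:example:resultA" is hypothetical.
prov_a = {
    ("urn:example:resultA", "wasGeneratedBy", "workflowA"),
    ("workflowA", "used", "urn:example:inputA"),
}
prov_b = {
    ("urn:example:resultB", "wasGeneratedBy", "workflowB"),
    ("workflowB", "used", "urn:example:resultA"),  # result A -> input B
}


def compose(*graphs):
    """Union of edge sets; graphs connect wherever node identifiers coincide."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged


def ancestors(graph, node):
    """Everything `node` causally depends on within `graph`."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for effect, _, cause in graph:
            if effect == current and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


prov_ab = compose(prov_a, prov_b)
```

In the composed graph, B's result traces back to A's input; in `prov_b` alone that link is invisible.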
Contacts
The myGrid Consortium (Manchester, Southampton)
JanusProvenance
http://www.myexperiment.org
http://mygrid.org.uk