session talk @ agu09
DESCRIPTION
Presentation at the AGU'09 Fall Meeting, San Francisco, CA, Dec. 2009
TRANSCRIPT
AGU Fall meeting, San Francisco, Dec. 2009 - P.Missier
Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester
Scientific Workflow Management System
Towards systematic information exchange and reuse in e-laboratories
AGU Fall meeting, Dec. 2009
JanusProvenance
Momentum on sharing and collaboration
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
Special issue of Nature on Data Sharing (Sept. 2009)
• Debate is much further along in Earth Sciences
  – ESIP - data preservation / stewardship, 2009
  – Long established in some communities - Atmospheric sciences, 1998 [1]
• Science Commons recommendations for Open Science
  – (July 2008) [link]
[1] Strebel DE, Landis DR, Huemmrich KF, Newcomer JA, Meeson BW: The FIFE Data Publication Experiment. Journal of the Atmospheric Sciences 1998, 55:1277-1283
Collaboration in workflow-based science
workflow execution
workflow + input dataset specification
outcome (data)
outcome (provenance)
Research Object
Packaging
browse, query
unbundle, reuse
Data-mediated implicit collaboration
Paul
Collaboration in workflow-based science
outcome (data)
outcome (provenance)
Research Object
Packaging
browse, query
unbundle, reuse
Data-mediated implicit collaboration
① ② ③
What is needed for Paul to make sense of third party data?
Paul
①
[Figure: Paul's Pack as Paul's Research Object. Aggregated items: Workflow 16, Workflow 13, QTL, Common pathways, Results (two sets), Logs, Metadata, Paper, Slides. Relation labels between items: feeds into, produces, included in, published in. Layers shown: Representation, Domain Relations, Aggregation.]
ORE: representing generic aggregations
Resource Map (descriptor)
Data structure
http://www.openarchives.org/ore/1.0/primer.html section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
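A minimal sketch of the aggregation idea, assuming a Research Object is a set of resources plus a descriptor that records typed relations between them. The class names, the `ex:` identifiers, and the two relations below are hypothetical illustrations; this is not the ORE vocabulary or any ORE library API.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Aggregation:
    """A named set of aggregated resources (hypothetical structure)."""
    uri: str
    resources: set[str] = field(default_factory=set)


@dataclass
class ResourceMap:
    """Descriptor for one aggregation: its members plus typed relations."""
    uri: str
    aggregation: Aggregation
    # (subject, relation, object) triples between aggregated resources
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def relate(self, subject: str, relation: str, obj: str) -> None:
        # Only resources that belong to the aggregation may be related.
        for resource in (subject, obj):
            if resource not in self.aggregation.resources:
                raise ValueError(f"{resource} is not part of {self.aggregation.uri}")
        self.relations.append((subject, relation, obj))

    def related(self, subject: str, relation: str) -> list[str]:
        """Objects linked to `subject` by `relation`."""
        return [o for s, r, o in self.relations if s == subject and r == relation]


# Assumed example: which items the relations connect is invented for illustration.
pack = Aggregation("ex:pauls-pack", {"workflow16", "results", "logs", "paper", "slides"})
rmap = ResourceMap("ex:pauls-pack/rem", pack)
rmap.relate("workflow16", "produces", "results")
rmap.relate("results", "published in", "paper")
```

Keeping the relations in the descriptor rather than in the resources themselves means the same results or paper can appear in several packs without being modified.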
②
Content: Workflow provenance
③
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Figure: example workflow. Labels: gene_id, lister, get pathways by genes1, merge pathways, concat gene pathway ids, output, pathway_genes.]
Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
What users expect to learn
• Causal relations:
  - which pathways come from which genes?
  - which processes contributed to producing an image?
  - which process(es) caused data to be incorrect?
  - which data caused a process to fail?
• Process and data analytics:
  – analyze variations in output vs an input parameter sweep (multiple process runs)
  – how often has my favourite service been executed? on what inputs?
  – who produced this data?
  – how often does this pathway turn up when the input genes range over a certain set S?
[Figure: example workflow. Labels: gene_id, lister, get pathways by genes1, merge pathways, concat gene pathway ids, output, pathway_genes.]
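The two kinds of questions above can be sketched over a toy trace. This is an illustration only: the trace layout, the item identifiers, and the function names are assumptions, not the provenance format of any workflow system.

```python
from collections import defaultdict

# Hypothetical trace: each record is (processor, inputs read, output written).
TRACE = [
    ("get_pathways_by_genes", ["gene:1"], "pathways:1"),
    ("get_pathways_by_genes", ["gene:2"], "pathways:2"),
    ("merge_pathways", ["pathways:1", "pathways:2"], "merged"),
]


def lineage(item, trace=TRACE):
    """Causal question: every item that `item` was transitively derived from."""
    produced_from = defaultdict(set)
    for _processor, inputs, output in trace:
        produced_from[output].update(inputs)
    seen, stack = set(), [item]
    while stack:
        for parent in produced_from.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def invocations(processor, trace=TRACE):
    """Analytics question: the inputs of each recorded run of `processor`."""
    return [inputs for name, inputs, _output in trace if name == processor]
```

For example, `lineage("merged")` walks back through both pathway lists to the two genes, while `invocations("get_pathways_by_genes")` lists the inputs of each of its two runs.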
Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.1 out soon
Goal: standardize causal dependencies to enable provenance metadata exchange
[Figure: the two edge types, A wasGeneratedBy(R) P and P used(R) A, for an artifact A, a process P, and a role R. Example graph with artifacts A1 to A4, processes P1 to P3, and edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6).]
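A small sketch of the two edge types as plain data, assuming edges point from effect to cause. The `Edge` type, the helper names, and the graph topology are hypothetical (the node and role labels are borrowed from the figure, but how they connect there is not recoverable); this is not an OPM library or serialization.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    kind: str    # "used" or "wasGeneratedBy"
    source: str  # the effect
    target: str  # the cause
    role: str


def used(process, artifact, role):
    """Process `process` used `artifact` in role `role`."""
    return Edge("used", process, artifact, role)


def was_generated_by(artifact, process, role):
    """Artifact `artifact` was generated by `process` in role `role`."""
    return Edge("wasGeneratedBy", artifact, process, role)


# Assumed topology, for illustration only.
graph = [
    used("P1", "A1", "R3"),
    was_generated_by("A2", "P1", "R1"),
    used("P2", "A2", "R4"),
    was_generated_by("A3", "P2", "R2"),
]


def derived_from(artifact, edges):
    """Artifacts one step upstream: artifact -wgb-> process -used-> artifact."""
    processes = {e.target for e in edges
                 if e.kind == "wasGeneratedBy" and e.source == artifact}
    return {e.target for e in edges if e.kind == "used" and e.source in processes}
```

Nothing in the structure refers to a workflow engine, which matches the point that such a graph need not be generated by a workflow.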
Additional requirements on OPM
• Artifact values require uniform common identifier scheme
  – Linked Data in OPM?
• OPM accounts for structural causal relationships
  – additional domain-specific knowledge required
  – attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
  – reduce size by exporting only query results
• Taverna approach
  – multiple levels of abstraction
    • through OPM accounts (“points of view”)
Query results as OPM graphs
[Figure: run W to obtain prov(W); execute query Q to obtain Q(prov(W)); export Q(prov(W)) to obtain OPM(Q(prov(W))).]
- Approach implemented in the Taverna 2.1 workflow system
- Internal provenance DB with ad hoc query language
Just released!
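The pipeline can be sketched as "query first, export second". Everything here is an assumption made for illustration: the edge list standing in for prov(W), the upstream query standing in for Q, and JSON standing in for an OPM serialization. It does not reproduce Taverna's provenance database or its query language.

```python
import json

# Hypothetical prov(W): (effect, relation, cause) edges recorded for one run.
PROV_W = [
    ("out", "wasGeneratedBy", "merge"),
    ("merge", "used", "p1"),
    ("merge", "used", "p2"),
    ("p1", "wasGeneratedBy", "fetch1"),
    ("p2", "wasGeneratedBy", "fetch2"),
    ("fetch1", "used", "gene1"),
    ("fetch2", "used", "gene2"),
]


def query_upstream(prov, start):
    """Q: the edges reachable from `start` by following effect -> cause."""
    result, frontier, visited = [], [start], {start}
    while frontier:
        node = frontier.pop()
        for edge in prov:
            effect, _relation, cause = edge
            if effect == node:
                result.append(edge)
                if cause not in visited:
                    visited.add(cause)
                    frontier.append(cause)
    return result


def export(edges):
    """Serialize only the given subgraph (JSON as a stand-in format)."""
    nodes = sorted({n for effect, _rel, cause in edges for n in (effect, cause)})
    return json.dumps({"nodes": nodes, "edges": sorted(edges)})


subgraph = query_upstream(PROV_W, "p1")
document = export(subgraph)
```

Here the exported document carries two edges instead of seven, which is the size reduction the approach aims for.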
Full-fledged data-mediated collaborations
[Figure: exp. A runs workflow A + input A and yields result datasets A, packaged with provenance A as Research Object A. Exp. B runs workflow B + input B and yields result datasets B, packaged with provenance B as Research Object B. The two are linked by result A → input B.]
Full-fledged data-mediated collaborations
[Figure: workflow A + input A yields result datasets A; result A → input B; workflow B + input B yields result datasets B. Research Object A+B holds the result and the composed provenance A + B.]
Provenance composition accounts for implicit collaboration
Aligned with focus of upcoming Provenance Challenge 4: "connect my provenance to yours" into a whole OPM provenance graph.
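A sketch of provenance composition under one strong assumption: both experiments refer to the shared dataset by the same identifier (here `resultA`). The edge lists and function names are hypothetical and stand in for two independently recorded provenance graphs.

```python
# (effect, relation, cause) edges, recorded separately by each experiment.
PROV_A = [
    ("resultA", "wasGeneratedBy", "workflowA"),
    ("workflowA", "used", "inputA"),
]
PROV_B = [
    ("resultB", "wasGeneratedBy", "workflowB"),
    ("workflowB", "used", "resultA"),  # result A becomes input B
]


def compose(*graphs):
    """Union of edge lists; graphs connect wherever node identifiers coincide."""
    merged = []
    for graph in graphs:
        for edge in graph:
            if edge not in merged:
                merged.append(edge)
    return merged


def upstream(prov, start):
    """Every node that `start` transitively depends on."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for effect, _relation, cause in prov:
            if effect == node and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen
```

With B's graph alone, the history of `resultB` stops at `resultA`; after `compose(PROV_A, PROV_B)` it continues through `workflowA` to `inputA`. If the two graphs named the shared dataset differently, the union would stay disconnected, which is why a common identifier scheme matters.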
Contacts
The myGrid Consortium (Manchester, Southampton)
JanusProvenance
http://www.myexperiment.org
http://mygrid.org.uk