2013-12-16 Provenance and annotations
DESCRIPTION
Presenting PROV-O, PAV, Open Annotation Model and Research Object (RO).
PowerPoint source: https://skydrive.live.com/view.aspx?cid=37935FEEE4DF1087&resid=37935FEEE4DF1087%21668&app=PowerPoint&wdo=1
See also:
http://practicalprovenance.wordpress.com/
http://www.w3.org/TR/prov-primer/
http://www.w3.org/TR/prov-o/
http://www.researchobject.org/
http://www.openannotation.org/spec/core/
TRANSCRIPT
Provenance and annotations
Stian Soiland-Reyes
myGrid, University of Manchester
HeRC CHIPSET meeting, Manchester, 2013-12-16
This work is licensed under a Creative Commons Attribution 3.0 Unported License
What is provenance?
By Dr Stephen Dann, licensed under Creative Commons Attribution-ShareAlike 2.0 Generic: http://www.flickr.com/photos/stephendann/3375055368/
Derivation: how did it change?
Activity: what happens to it?
Origin: where is it from?
Attribution: who did it?
Licensing: can I use it?
Attributes: what is it?
Annotations: what do others say about it?
Aggregation: what is it part of?
Date and tool: when was it made? using what?
Attribution
Who collected this sample? Who helped?
Which lab performed the sequencing?
Who did the data analysis?
Who curated the results?
Who produced the raw data this analysis is based on?
Who wrote the analysis workflow?
Why do I need this?
i. To be recognized for my work
ii. Who should I give credit to?
iii. Who should I complain to?
iv. Can I trust them?
v. Who should I make friends with?
prov:wasAttributedTo, prov:actedOnBehalfOf, dct:creator, dct:publisher, pav:authoredBy, pav:contributedBy, pav:curatedBy, pav:createdBy, pav:importedBy, pav:providedBy, ...
Roles
Agent types: Person, Organization, SoftwareAgent
Diagram nodes: Data, Alice, The lab
Diagram relations: wasAttributedTo, actedOnBehalfOf
http://practicalprovenance.wordpress.com/
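The attribution relations above can be written down as plain subject, predicate, object triples. The following is only a minimal sketch: it assumes the diagram reads "Data wasAttributedTo Alice" and "Alice actedOnBehalfOf The lab", the `http://example.org/` identifiers are hypothetical, and it uses bare tuples rather than an RDF library.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Edge directions are an assumption about the slide's diagram.
triples = [
    (EX + "data", PROV + "wasAttributedTo", EX + "alice"),
    (EX + "alice", PROV + "actedOnBehalfOf", EX + "the-lab"),
    (EX + "alice", RDF_TYPE, PROV + "Person"),
    (EX + "the-lab", RDF_TYPE, PROV + "Organization"),
]


def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def responsible_agents(entity, triples):
    """List the agents an entity is attributed to, then those they acted for."""
    agents = [o for s, p, o in triples
              if s == entity and p == PROV + "wasAttributedTo"]
    queue = list(agents)
    while queue:
        agent = queue.pop(0)
        for s, p, o in triples:
            if s == agent and p == PROV + "actedOnBehalfOf" and o not in agents:
                agents.append(o)
                queue.append(o)
    return agents


print(to_ntriples(triples))
print(responsible_agents(EX + "data", triples))
```

Walking `actedOnBehalfOf` after `wasAttributedTo` is one way to answer "who should I give credit to?" for both the person and the organisation behind them.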
Derivation
Which sample was this metagenome sequenced from?
Which metagenomes was this sequence extracted from?
Which sequence was the basis for the results?
What is the previous revision of the new results?
Why do I need this?
i. To verify consistency (did I use the correct sequence?)
ii. To find the latest revision
iii. To backtrack where a diversion appeared after a change
iv. To credit work I depend on
v. Auditing and defence for peer review
Diagram nodes: Sample, Metagenome, Sequence, Old results, New results
Diagram relations: wasDerivedFrom, wasQuotedFrom, wasRevisionOf, wasInfluencedBy
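Backtracking through derivations amounts to walking a graph. The sketch below assumes one plausible reading of the diagram (the exact edges are not recoverable from the slide text) and uses short hypothetical node names. It treats `wasQuotedFrom` and `wasRevisionOf` as kinds of derivation, which matches PROV-O, where both are sub-properties of `prov:wasDerivedFrom`.

```python
DERIVATION_KINDS = {"wasDerivedFrom", "wasQuotedFrom", "wasRevisionOf"}

# Assumed edges (subject, relation, object); node names are hypothetical.
edges = [
    ("new-results", "wasDerivedFrom", "sequence"),
    ("sequence", "wasQuotedFrom", "metagenome"),
    ("metagenome", "wasDerivedFrom", "sample"),
    ("new-results", "wasRevisionOf", "old-results"),
]


def lineage(entity, edges):
    """Return everything the entity was directly or indirectly derived from."""
    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        for s, rel, o in edges:
            if s == current and rel in DERIVATION_KINDS and o not in found:
                found.append(o)
                stack.append(o)
    return found


def latest_revision(entity, edges):
    """Follow wasRevisionOf in reverse to find the newest known revision."""
    newer = {o: s for s, rel, o in edges if rel == "wasRevisionOf"}
    seen = {entity}
    while entity in newer and newer[entity] not in seen:
        entity = newer[entity]
        seen.add(entity)
    return entity


print(sorted(lineage("new-results", edges)))
print(latest_revision("old-results", edges))
```

`lineage` covers "which work do I depend on?", while `latest_revision` covers "is there a newer version of these results?".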
Activities
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome used for?
iv. To understand the whole process: “make me a Methods section”
v. To track down inconsistencies
Diagram nodes: Sample, Metagenome, Sequencing, Workflow run, Workflow server, Workflow definition, Results, Alice, Lab technician
Diagram relations: used, wasGeneratedBy, wasStartedAt "2012-06-21", wasAssociatedWith, wasInformedBy, wasStartedBy, hadPlan, hadRole
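As a rough illustration of "make me a Methods section", activity records can be turned into sentences. This is a sketch under assumptions: the `Activity` class and its field names are hypothetical, and the way the diagram's nodes are wired together here (who ran what, which activity carries the start date) is a guess at the original figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    """Hypothetical record loosely mirroring PROV activity relations."""
    name: str
    used: List[str] = field(default_factory=list)
    generated: List[str] = field(default_factory=list)
    agent: Optional[str] = None    # wasAssociatedWith
    role: Optional[str] = None     # hadRole
    plan: Optional[str] = None     # hadPlan
    started: Optional[str] = None  # start date


def methods_sentence(a: Activity) -> str:
    """Render one activity as a sentence for a Methods section."""
    parts = [f"{a.name} used {', '.join(a.used)} "
             f"to generate {', '.join(a.generated)}"]
    if a.agent:
        who = f"{a.agent} ({a.role})" if a.role else a.agent
        parts.append(f"carried out by {who}")
    if a.plan:
        parts.append(f"following {a.plan}")
    if a.started:
        parts.append(f"started {a.started}")
    return ", ".join(parts) + "."


# Assumed wiring of the diagram's nodes.
sequencing = Activity("Sequencing", ["Sample"], ["Metagenome"],
                      agent="Alice", role="Lab technician")
workflow_run = Activity("Workflow run", ["Metagenome"], ["Results"],
                        agent="Workflow server", plan="Workflow definition",
                        started="2012-06-21")

for activity in (sequencing, workflow_run):
    print(methods_sentence(activity))
```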
PROV model
http://www.w3.org/TR/prov-primer/
Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
Provenance Working Group
PROV implementations
AERS-LD
agentSwitch
Amalgame
Annotation Inference Framework
APROVeD: Automatic Provenance Derivation
checker.pl
CollabMap
cProv
csv2rdf4lod-automation
D2R Server
DataFAQs
DBpedia
DeFacto
Dublin Core to PROV mapping
Earth System Science Server
Global Change Information System
Hedgehog
Human Computation ontology
Informed Rural Passenger Information Infrastructure
ISO_19115_Lineage
Music Ontology
OBIAMA
OECD Linked Data
Open Provenance Model for Workflows (OPMW)
OpenUp Prov
Oracle Enterprise Transactions Controls Governor
PAV Provenance, Authoring and Versioning
PML 3.0
Policy Reasoning Framework
PoN
P-plan
PROV Python library
prov-api
prov-check
Provenance Environment (ProvEn) Services
Provenance for Earth Science
Provenance server
Provenance Vocabulary
Prov-gen
PROV-N to Neo4J DB mapping
PROVoKing
ProvToolbox
Prov-Validator
provx2o
Pubby
PubFlow Provenance Archive
Quality Assessment Framework
QuerioCity research prototype
Raw2LD
recoprov
roevo
Semantic Proteomics Dashboard (SemPoD)
SIGNA
StatJR eBook system
SysPro
Taverna
tavernaprov
Tinga Provenance Service
Triplify
TWC Healthdata
University of Southampton Open Data
WebLab-PROV
wfprov
Wings Provenance Export
Yanfeng Shu
http://dx.doi.org/10.6084/m9.figshare.878099
Legend: PROV-N, PROV-O, PROV-XML, PROV-JSON
Source (2013-04-16): http://www.w3.org/TR/prov-implementations/
Open Annotation Data Model
http://www.openannotation.org/spec/core/core.html
Copyright © 2012-2013 the Contributors to the Open Annotation Core Data Model Specification, published by the Open Annotation Community Group under the W3C Community Contributor License Agreement (CLA).
Example: David’s slides are about ClinicalCodes
http://dev.mygrid.org.uk/wiki/download/attachments/16384498/daspringate_clinicalcodes_HeRC.pdf
https://clinicalcodes.rss.mhs.man.ac.uk/
foaf:primaryTopic
Option 1: The FOAF vocabulary
The primaryTopic property relates a document to the main thing that the document is about.
Option 2: Open Annotation Data Model
annotation: oa:hasBody, oa:hasTarget
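The two options can be compared side by side as JSON-LD-style structures. This is a sketch, not the presenter's own serialisation: it assumes the slides are the annotation body and the ClinicalCodes site is the target, it spells out full IRIs as keys rather than relying on a JSON-LD context, and the annotation identifier is hypothetical.

```python
import json

SLIDES = ("http://dev.mygrid.org.uk/wiki/download/attachments/16384498/"
          "daspringate_clinicalcodes_HeRC.pdf")
CLINICAL_CODES = "https://clinicalcodes.rss.mhs.man.ac.uk/"
FOAF = "http://xmlns.com/foaf/0.1/"
OA = "http://www.w3.org/ns/oa#"

# Option 1: a direct link from the document to its main topic.
option1 = {
    "@id": SLIDES,
    FOAF + "primaryTopic": {"@id": CLINICAL_CODES},
}

# Option 2: the "aboutness" becomes a resource of its own (the annotation),
# so it can later carry its own provenance and motivation.
option2 = {
    "@id": "http://example.org/annotations/1",  # hypothetical identifier
    "@type": OA + "Annotation",
    OA + "hasBody": {"@id": SLIDES},             # assumed: slides are the body
    OA + "hasTarget": {"@id": CLINICAL_CODES},   # assumed: site is the target
}

print(json.dumps(option1, indent=2))
print(json.dumps(option2, indent=2))
```

The extra node in Option 2 is what makes it possible to say who made the annotation and why, which a single `foaf:primaryTopic` link cannot express.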
Annotations have provenance
annotation: oa:hasBody, oa:hasTarget
oa:annotatedBy
Stian Soiland-Reyes
foaf:name
pav:authoredBy
David A. Springate
foaf:name
© 2013 David A. Springate
pav:createdBy
David A. Springate
foaf:name
pav:retrievedBy
http://purl.org/pav/html
Who is the “creator” of the slides, is it David or Stian?
With PAV we can differentiate content authoring from upload
Annotations have provenance
Same annotation diagram as before, now with the annotator (oa:annotatedBy, foaf:name Stian Soiland-Reyes) identified by http://orcid.org/0000-0001-9842-9718
Which David…? Need a common identifier: ORCID
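The "Which David…?" problem can be shown in a few lines: two descriptions that only share a name cannot safely be treated as the same agent, while two that share an identifier can. This is an illustrative sketch. The `same_agent` helper and the spelling variant are hypothetical, and it assumes the ORCID URI on the slide identifies the annotator.

```python
ORCID_ANNOTATOR = "http://orcid.org/0000-0001-9842-9718"  # URI shown on the slide


def same_agent(a, b):
    """Treat two agent descriptions as the same only if they share an @id."""
    return a.get("@id") is not None and a.get("@id") == b.get("@id")


# Name only: the pav:authoredBy and pav:createdBy agents from the diagram.
author = {"foaf:name": "David A. Springate"}
creator = {"foaf:name": "David A. Springate"}

# Identifier plus name; the second spelling is a hypothetical variant.
annotator = {"@id": ORCID_ANNOTATOR, "foaf:name": "Stian Soiland-Reyes"}
annotator_elsewhere = {"@id": ORCID_ANNOTATOR, "foaf:name": "S. Soiland-Reyes"}

print(same_agent(author, creator))                 # names match, no identifier
print(same_agent(annotator, annotator_elsewhere))  # names differ, same identifier
```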
Annotations as first-class citizens
annotation: oa:hasBody, oa:hasTarget
oa:motivatedBy
oa:bookmarking
oa:classifying
oa:commenting
oa:describing
oa:editing
oa:highlighting
oa:identifying
oa:linking
oa:moderating
oa:questioning
oa:replying
oa:tagging
…
JSON
Turtle
Provenance of what?
Who made the (content of) this data set? Who maintains it?
Who wrote this document? Who uploaded it?
Which CSV was this Excel file imported from?
Who wrote this description? When? How did we get it?
What is the state of these guidelines? Are they official?
What did the guidelines look like before? (Revisions) – are there newer versions?
What new resources have been derived from this data set?
http://www.researchobject.org/
RESEARCH OBJECT (RO)
Research objects goal: Openly share everything about your experiments, including how those things are related
What is in a research object?
A Research Object bundles and relates digital resources of a scientific experiment or investigation:
Data used and results produced in experimental study
Methods employed to produce and analyse that data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object
Gathering everything
Research Objects (RO) aggregate related resources, their provenance and annotations
Conveys “everything you need to know” about a study/experiment/analysis/dataset/workflow
Shareable, evolvable, contributable, citable
ROs have their own provenance and lifecycles
Research object model at a glance
Diagram nodes: Research Object, Manifest, Resource, Annotation, Annotation graph
Diagram relations: ore:aggregates, oa:hasTarget, oa:hasBody
Why Research Objects?
i. To share your research materials
(RO as a social object)
ii. To facilitate reproducibility and reuse of methods
iii. To be recognized and cited (even for constituent resources)
iv. To preserve results and prevent decay (curation of workflow definition; using provenance for partial rerun)
A Research object
http://alpha.myexperiment.org/packs/387
Annotations in research objects
Types: “This document contains a hypothesis”
Relations: “These datasets are consumed by that tool”
Provenance: “These results came from this workflow run”
Descriptions: “Purpose of this step is to filter out invalid data”
Comments: “This method looks useful, but how do I install it?”
Examples: “This is how you could use it”
Annotation guidelines – which properties?
Descriptions: dct:title, dct:description, rdfs:comment, dct:publisher, dct:license, dct:subject
Provenance: dct:created, dct:creator, dct:modified, pav:providedBy, pav:authoredBy, pav:contributedBy, roevo:wasArchivedBy, pav:createdAt
Provenance relations: prov:wasDerivedFrom, prov:wasRevisionOf, wfprov:usedInput, wfprov:wasOutputFrom
Social networking: oa:Tag, mediaont:hasRating, roterms:technicalContact, cito:isDocumentedBy, cito:isCitedBy
Dependencies: dcterms:requires, roterms:requiresHardware, roterms:requiresSoftware, roterms:requiresDataset
Typing: wfdesc:Workflow, wf4ever:Script, roterms:Hypothesis, roterms:Results, dct:BibliographicResource
Saving a research object: RO bundle
Single, transferable research object
Self-contained snapshot
Which files in ZIP, which are URIs? (Up to user/application)
Regular ZIP file, explored and unpacked with standard tools
JSON manifest is programmatically accessible without RDF understanding
Works offline and in desktop applications – no REST API access required
Basis for RO-enabled file formats, e.g. Taverna run bundle
Exchanged with myExperiment and RO tools
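Because an RO bundle is a regular ZIP file with a JSON manifest, it can be written and read with nothing but a standard library. The sketch below is a minimal illustration, not a reference implementation. The helper names are hypothetical, the manifest keys (`@context`, `id`, `aggregates`, `file`, `uri`) follow my reading of the RO Bundle specification at https://w3id.org/bundle and should be checked against it, and storing `mimetype` first and uncompressed is likewise an assumption taken from that specification.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_bundle(fileobj, files, manifest):
    """Write a minimal RO bundle: mimetype, manifest, then the payload files."""
    with zipfile.ZipFile(fileobj, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Assumed convention: mimetype is the first entry, stored uncompressed.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, data in files.items():
            zf.writestr(path, data)


def read_manifest(fileobj):
    """Read the manifest as plain JSON; no RDF tooling is involved."""
    with zipfile.ZipFile(fileobj) as zf:
        return json.loads(zf.read(".ro/manifest.json"))


manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    "aggregates": [
        {"file": "/outputA.txt"},                  # file inside the ZIP
        {"uri": "http://www.researchobject.org/"},  # external URI
    ],
}

buffer = io.BytesIO()
write_bundle(buffer, {"outputA.txt": "example output\n"}, manifest)

with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
print(names)
print(read_manifest(buffer)["aggregates"])
```

Writing to an in-memory buffer keeps the example offline and side-effect free; passing an open file instead would produce a bundle on disk.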
Workflow Results Bundle
workflowrun.prov.ttl (RDF)
outputA.txt
outputC.jpg
outputB/
https://w3id.org/bundle
intermediates/
1.txt
2.txt
3.txt
de/def2e58b-50e2-4949-9980-fd310166621a.txt
inputA.txt
workflow
URI references
attribution
execution environment
Aggregating in Research Object
ZIP folder structure (RO Bundle)
mimetype
application/vnd.wf4ever.robundle+zip
.ro/manifest.json
RO Bundle
What is aggregated? File in ZIP or external URI
Who made the RO? When?
Who?
External URIs placed in folders
Embedded annotation
External annotation, e.g. blogpost
JSON-LD context RDF
RO provenance
.ro/manifest.json
Format
Note: JSON "quotes" not shown above for brevity
http://json-ld.org/
http://orcid.org/
https://w3id.org/bundle
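The manifest callouts above (what is aggregated, who made the RO and when, embedded versus external annotations) map onto a small JSON structure. The following sketch is an approximation. The key names are recalled from the RO Bundle specification and should be verified against https://w3id.org/bundle, and the person, ORCID, timestamp, paths and blog URL are all hypothetical placeholders.

```python
import json

manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    # Who made the RO? When? (placeholder values)
    "createdOn": "2013-12-16T12:00:00Z",
    "createdBy": {
        "name": "Alice",
        "orcid": "http://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
    },
    # What is aggregated? A file in the ZIP or an external URI.
    "aggregates": [
        {"file": "/outputA.txt", "mediatype": "text/plain"},
        {"uri": "http://www.researchobject.org/"},
    ],
    "annotations": [
        # Embedded annotation: content stored inside the bundle.
        {"about": "/outputA.txt", "content": "/.ro/annotations/outputA.ttl"},
        # External annotation, e.g. a blog post (hypothetical URL).
        {"about": "/outputA.txt", "content": "http://example.org/blog/post"},
    ],
}


def split_aggregates(manifest):
    """Separate bundled files from external URIs."""
    entries = manifest.get("aggregates", [])
    files = [e["file"] for e in entries if "file" in e]
    uris = [e["uri"] for e in entries if "uri" in e]
    return files, uris


def external_annotations(manifest):
    """Annotations whose content lives outside the bundle."""
    return [a for a in manifest.get("annotations", [])
            if a["content"].startswith(("http://", "https://"))]


print(json.dumps(manifest, indent=2))
print(split_aggregates(manifest))
print(len(external_annotations(manifest)))
```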
Research Object as RDFa
http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/
http://mayor2.dia.fi.upm.es/oeg-upm/files/dgarijo/motifAnalysisSite/
<h3 property="dc:title">Common Motifs in Scientific Workflows:<br>An Empirical Analysis</h3>
<body resource="http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/" typeOf="ore:Aggregation ro:ResearchObject">
<li><a property="ore:aggregates" href="t2_workflow_set_eSci2012.v.0.9_FGCS.xls" typeOf="ro:Resource">Analytics for Taverna workflows</a></li>
<li><a property="ore:aggregates" href="WfCatalogue-AdditionalWingsDomains.xlsx" typeOf="ro:Resource">Analytics for Wings workflows</a></li>
<span property="dc:creator prov:wasAttributedTo" resource="http://delicias.dia.fi.upm.es/members/DGarijo/#me"></span>
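To show how such markup can be consumed, the sketch below pulls the `ore:aggregates` links out of a fragment like the one above using only the standard library's HTML parser. It is not a conforming RDFa processor: it ignores prefixes, base URIs and nesting, and the `AggregatesParser` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class AggregatesParser(HTMLParser):
    """Collect href values of elements carrying property="ore:aggregates"."""

    def __init__(self):
        super().__init__()
        self.aggregated = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # html.parser lower-cases attribute names
        properties = (attributes.get("property") or "").split()
        if "ore:aggregates" in properties and attributes.get("href"):
            self.aggregated.append(attributes["href"])


snippet = """
<body resource="http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/" typeOf="ore:Aggregation ro:ResearchObject">
<li><a property="ore:aggregates" href="t2_workflow_set_eSci2012.v.0.9_FGCS.xls" typeOf="ro:Resource">Analytics for Taverna workflows</a></li>
<li><a property="ore:aggregates" href="WfCatalogue-AdditionalWingsDomains.xlsx" typeOf="ro:Resource">Analytics for Wings workflows</a></li>
</body>
"""

parser = AggregatesParser()
parser.feed(snippet)
print(parser.aggregated)
```

For real use, a full RDFa parser would be the safer choice, since it resolves relative links and prefixes according to the specification.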