the role of annotation in reproducibility (empirical 2014)
DESCRIPTION
Invited presentation at ESWC2014 Empirical workshop
TRANSCRIPT
The role of annotation in reproducibility
ESWC2014 Empirical workshop, 26/05/2014
Contributors: my PhD students Olga Giraldo, Daniel Garijo, and Idafen Santana, and the Wf4Ever team
Oscar [email protected]
@ocorcho
https://www.slideshare.com/ocorcho
Setting the context of this presentation
Our main assumption
“We are not so good at describing our experiments, and this has a negative impact on reproducibility (and understandability, and conservation, and reconstruction)”
• Let’s see if this happens in different areas of scientific research
  • In vitro experiments in Plant Biology
  • In silico experiments in several domains
• The challenge
  • Let’s use annotation as a means to increase reproducibility
  • Note: see the last slide on terminology
Ingredients for reproducibility
The role of laboratory protocols in Life Sciences
Laboratory Protocols
http://mibbi.sourceforge.net/about.shtml
Laboratory protocols support the scientific results
Laboratory Protocols
• Written in natural language
• Generally, presented in a “recipe” style
• Description of a sequence of operations that include inputs and outputs
• Step-by-step descriptions of procedures
• A protocol is a type of workflow
• They must be described in sufficient and unambiguous detail
  • To enable another agent (human or machine) to replicate the original experiment
• Specific journals: Biotechniques, CSH protocols, Current protocols, GMR, Jove, Protocol exchange, Plant methods, Plos One, Springer protocols
Detailed instructions in journals’ guides for authors
And other useful elements, including ontologies
It maintains checklists that promote how to report an experiment.
It models the design of an investigation, including protocols, instrumentation, materials and data generated.
EXPO: Aims to formalize knowledge about the organization, execution and analysis of scientific experiments.
EXACT: It provides a model for the description of experiment actions.
Minimal information models, check lists, and even ontologies
However…
• Ambiguity is the norm
• Let’s analyse protocols written for the plant biology community
• Incubate the centrifuge tubes in a water bath.
• Incubate the samples for 5 min with gentle shaking.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20C overnight.
Protocol
Analysis of Laboratory Protocols
Repository Number of Protocols
Biotechniques 8
CSH protocols 11
Current protocols 25
GMR 4
Jove 21
Protocol exchange 12
Plant methods 10
Plos One 3
Springer protocols 5
Total 99
Minimal Information to Report a Laboratory Protocol
Our model                                                   Occurrence in other models
TITLE                                                       100%
AUTHOR                                                      100%
INTRODUCTION
  Purpose                                                   89%
  Provenance of the protocol                                89%
  Applications of the protocol                              89%
  Comparison with other protocols                           89%
  Limitations                                               89%
MATERIALS
  Sample                                                    100%
    · Strain or line genotype
    · Developmental stage
    · Organism part (tissue)
  Laboratory consumables/supplies
    · Laboratory consumable name                            22%
    · Manufacturer name                                     11%
    · Laboratory consumable ID (catalog number)             11%
  Buffer recipes
    · Buffer name                                           67%
    · Chemical compound name                                67%
    · Initial concentration of chemical compound            67%
    · Final concentration or amount of chemical compound    56%
    · Storage conditions                                    56%
    · Cautions                                              56%
    · Hints                                                 67%
  Reagent
    · Reagent name                                          100%
    · Reagent vendor or manufacturer                        100%
    · Reagent ID (catalog number)                           100%
  Kit
    · Kit name                                              100%
    · Kit vendor or manufacturer                            100%
    · Kit ID (catalog number)                               56%
  Primer
    · Primer name                                           67%
    · Primer sequence                                       89%
    · Primer vendor or manufacturer                         33%
  Equipment
    · Equipment name                                        67%
    · Equipment vendor or manufacturer                      67%
    · Equipment ID (catalog number)                         67%
  Software
    · Software name                                         67%
    · Software version                                      67%
METHODS/PROCEDURE
  Protocol                                                  100%
    · Cautions                                              56%
    · Critical steps                                        56%
    · Pause point                                           33%
    · Hints                                                 22%
    · Troubleshooting                                       44%
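A minimal-information model like the one above can be used as a checklist against a protocol record. The sketch below shows one way this might look. The field keys and the example record are hypothetical, chosen only for illustration, and cover a small subset of the elements in the table.

```python
# Hypothetical checklist keys: a small subset of the elements listed in the table.
REQUIRED_FIELDS = ["title", "author", "purpose", "sample", "reagent_name", "protocol"]


def missing_fields(record: dict) -> list:
    """Return the checklist fields that are absent or empty in a protocol record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


def coverage(record: dict) -> float:
    """Fraction of checklist fields that the record reports."""
    return 1 - len(missing_fields(record)) / len(REQUIRED_FIELDS)


# Made-up record, for illustration only.
example = {
    "title": "DNA extraction from leaves",
    "author": "A. Researcher",
    "sample": "Arabidopsis thaliana rosette leaves",
    "protocol": "Incubate at -20C overnight.",
}

print(missing_fields(example))
print(round(coverage(example), 2))
```

A report of this kind points an author at the missing elements before submission, rather than leaving the gaps for whoever tries to replicate the experiment.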
How to Formalize the Protocols?
• Incubate the centrifuge tubes at 65°C in a water bath for 10 min.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20C overnight.
Protocol
“Briefly” may indicate different lengths of time: 2 seconds? 5-10 seconds? ...
Object: centrifuge tubes, water bath
Unit of measure: 65°C, 10 min
Action: incubate
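The decomposition into action, object and unit of measure can be approximated with a few patterns. This is a rough sketch rather than the SMARTProtocols approach: the regular expressions, the object vocabulary and the list of vague terms are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative patterns and vocabularies (assumptions, not SMARTProtocols terms).
TEMPERATURE = re.compile(r"-?\d+(?:\.\d+)?\s*°?\s*C\b")
DURATION = re.compile(r"\d+(?:-\d+)?\s*(?:seconds|sec|s|minutes|min|hours|h)\b")
KNOWN_OBJECTS = ("centrifuge tubes", "water bath", "samples", "DNA")
VAGUE_TERMS = ("briefly", "gentle", "overnight")


@dataclass
class Step:
    action: str
    objects: List[str] = field(default_factory=list)
    temperature: Optional[str] = None
    duration: Optional[str] = None
    vague_terms: List[str] = field(default_factory=list)


def parse_step(text: str) -> Step:
    """Split one instruction into action, objects, measures and vague wording."""
    lowered = text.lower()
    temperature = TEMPERATURE.search(text)
    duration = DURATION.search(text)
    return Step(
        action=text.split()[0].lower(),  # assumes the verb comes first
        objects=[o for o in KNOWN_OBJECTS if o.lower() in lowered],
        temperature=temperature.group(0) if temperature else None,
        duration=duration.group(0) if duration else None,
        vague_terms=[t for t in VAGUE_TERMS if t in lowered],
    )


for line in (
    "Incubate the centrifuge tubes at 65°C in a water bath for 10 min.",
    "Rinse DNA briefly in 1-2 ml of wash.",
    "Incubate at -20C overnight.",
):
    print(parse_step(line))
```

Steps with no duration or with a vague term are the ones a replicator would have to guess at, so they are natural candidates for manual annotation.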
SMARTProtocols ontology
• http://vocab.linkeddata.es/SMARTProtocols/
Currently working on protocol annotation
plant material
instrument name
manufacturer
Buffer recipe
Reagent name
Laboratory consumable name
Source: Biotechniques
Meta-information about content    Content
Plant material                    Arabidopsis thaliana (rosette leaves, flowers, siliques),… and Larix decidua (young needles)
Instrument name                   Leitz DMRB microscope
Manufacturer                      Leica Microsystems
Buffer recipe                     50 mM EDTA, 1.4% SDS
Reagent name                      96% ethanol ~ absolute ethanol
Laboratory consumable name        2-mL tube, zeolite beads
From the wet lab to our computers
Lab book
Digital Log
Laboratory Protocol (recipe)
Workflow
Experiment
Ingredients for reproducibility
Scientific Workflows
“Template defining the set of tasks needed to carry out a computational experiment” [1]
• Inputs
• Steps
• Intermediate results
• Outputs
• Data driven, usually represented as Directed Acyclic Graphs (DAGs)
[1] Ewa Deelman, Dennis Gannon, Matthew Shields, Ian Taylor, Workflows and e-science: an overview of workflow system features and capabilities, Future Generation Computer Systems 25 (5) (2009) 528–540.
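To make the DAG view concrete, here is a small sketch of a data-driven workflow whose steps run in dependency order and whose intermediate results are kept. The step names, tasks and input data are invented for the example. It uses only the standard library (`graphlib`, Python 3.9+), not any of the workflow systems named in this talk.

```python
from graphlib import TopologicalSorter
from statistics import mean

# Hypothetical template: each step maps to the steps (or inputs) it depends on.
template = {
    "clean": {"raw"},
    "mean": {"clean"},
    "count": {"clean"},
}

# Hypothetical tasks: one function per step, taking its predecessors' results.
tasks = {
    "clean": lambda raw: [x for x in raw if x is not None],
    "mean": lambda clean: mean(clean),
    "count": lambda clean: len(clean),
}


def run(template, tasks, inputs):
    """Execute the steps in topological order, keeping intermediate results."""
    results = dict(inputs)
    for step in TopologicalSorter(template).static_order():
        if step in tasks:
            args = [results[dep] for dep in sorted(template[step])]
            results[step] = tasks[step](*args)
    return results


trace = run(template, tasks, {"raw": [3, None, 5, 4]})
print(trace)
```

The returned dictionary plays the role of a very simple provenance trace: inputs, intermediate results and outputs are all recorded under the name of the step that produced them.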
Plenty of workflow tools and platforms: Taverna, Wings, LONI Pipeline
What do I want from these workflows and repositories?
• As a designer: Discovery
• Workflows with similar functionality fragments/methods
• Design based on previous templates.
• As user/reuser/reviewer: Understandability, Exploration
• Search workflows by functionality
• Commonalities between execution runs
• Component categorization
• Reproducibility
Workflow 1
Working on different aspects of workflow preservation
• Workflow representation
  • Plan/template representation
  • Provenance trace representation
  • Link between templates and traces
• Creation of abstractions/motifs in scientific workflows
  • Abstraction catalog
  • Find how different workflows are related
• Understandability and reuse of scientific workflows
  • Relation between the workflows involved in the same experiment (Research Objects)
CH1: Can we export an abstract template of the method being represented?
CH2: How do we interoperate with other workflow results?
CH3: How do we access the workflow results?
CH4: How do we link an abstract method with several implementations?
CH5: How can we detect what are the typical operations in scientific workflows?
CH6: How can we detect them automatically?
CH7: Which workflow parts are related to other workflows?
CH8: How do workflows depend on the other parts of the experiments?
Overview
• Empirical analysis on 260 workflow templates from Taverna, Wings, Galaxy and Vistrails
• Catalog of recurring patterns: scientific workflow motifs.
• Data Oriented Motifs
• Workflow Oriented Motifs
• Understandability and reuse
Catalog
http://sensefinancial.com/wp-content/uploads/2012/02/contribution.jpg
Common motifs in scientific workflows: An empirical analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.
Approach
• Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence
• Identify workflow abstractions that would facilitate understandability and therefore effective re-use
Motif Catalog
Data-Oriented Motifs (What?)
Data Retrieval
Data Preparation
Format Transformation
Input Augmentation and Output Splitting
Data Organisation
Data Analysis
Data Curation/Cleaning
Data Moving
Data Visualisation
Workflow-Oriented Motifs (How?)
Intra-Workflow Motifs
Stateful (Asynchronous) Invocations
Stateless (Synchronous) Invocations
Internal Macros
Human Interactions
Inter-Workflow Motifs
Atomic Workflows
Composite Workflows
Workflow Overloading
Ontology Purl: http://purl.org/net/wf-motifs
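Once workflow steps are annotated with motifs from the catalog, simple statistics over a collection become possible. The sketch below counts in what fraction of workflows each motif appears. The workflow names, step names and annotations are made up for illustration and are not data from the empirical analysis.

```python
from collections import Counter

# Hypothetical annotations: workflow -> {step name: data-oriented motif}.
annotations = {
    "wf_a": {"fetch": "Data Retrieval", "convert": "Format Transformation", "stats": "Data Analysis"},
    "wf_b": {"download": "Data Retrieval", "plot": "Data Visualisation"},
    "wf_c": {"filter": "Data Curation/Cleaning", "stats": "Data Analysis"},
}


def motif_frequency(annotations: dict) -> dict:
    """Fraction of workflows in which each motif occurs at least once."""
    counts = Counter()
    for steps in annotations.values():
        counts.update(set(steps.values()))  # count a motif once per workflow
    return {motif: n / len(annotations) for motif, n in counts.items()}


for motif, share in sorted(motif_frequency(annotations).items()):
    print(f"{motif}: {share:.0%}")
```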
Macro abstraction detection
Problem statement:
Given a repository of workflow templates (either abstract or specific) or workflow execution traces, what are the workflow fragments I can deduce from it?
Useful for:
• Systems like Taverna and Wings (many templates, little annotation to relate them)
  • Finding relationships between workflows and sub-workflows.
  • Most used fragments, most executed, etc.
• Systems like GenePattern, LONI Pipeline and Galaxy (many runs, nearly no templates published)
  • Proposing new templates with the popular fragments.
Common workflow fragment detection
[Holder et al 1994]: Substructure Discovery in the SUBDUE System L. B. Holder, D. J. Cook, and S. Djoko. AAAI Workshop on Knowledge Discovery, pages 169-180, 1994.
• Given a collection of workflows, which are the most common fragments?
  • Common sub-graphs among the collection
  • Sub-graph isomorphism (NP-complete)
• We use subgraph mining algorithms
  • Graph Grammar learning
    • The rules of the grammar are the workflow fragments
  • Graph based hierarchical clustering
    • Each cluster corresponds to a workflow fragment
  • Iterative algorithm with two measures for compressing the graph:
    • Minimum Description Length (MDL)
    • Size
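As a deliberately simplified illustration of the question "which fragments are most common?", the sketch below counts the smallest possible fragments, single edges between step types, across a collection. This is not SUBDUE and involves no grammar learning, clustering or MDL-based compression. The workflows are toy data invented for the example.

```python
from collections import Counter

# Toy collection: each workflow is a set of directed edges between step types.
workflows = [
    {("retrieve", "clean"), ("clean", "analyse"), ("analyse", "plot")},
    {("retrieve", "clean"), ("clean", "analyse")},
    {("retrieve", "convert"), ("convert", "analyse"), ("analyse", "plot")},
]


def frequent_fragments(workflows, min_support=2):
    """Return edges that occur in at least `min_support` workflows."""
    support = Counter()
    for edges in workflows:
        support.update(edges)  # each workflow is a set, so one vote per edge
    return {edge: n for edge, n in support.items() if n >= min_support}


for (source, target), n in sorted(frequent_fragments(workflows).items()):
    print(f"{source} -> {target}: found in {n} workflows")
```

A real subgraph miner would grow these frequent edges into larger candidate fragments and score them. That growth step is where the cost of sub-graph isomorphism shows up.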
Exporting the fragment results: Wf-FD model
http://purl.org/net/wf-fd
Ingredients for reproducibility
Preserving the infrastructure
http://vocab.linkeddata.es/wicus/
What is a Research Object?
• Aggregation of resources that bundles together the contents of a research work:
  • Data
  • Experiments
  • Examples
  • Bibliography
  • Annotations
  • Provenance
  • ROs
  • Etc.
http://www.researchobject.org/
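The idea of an aggregation that can itself contain other ROs can be sketched with two small classes. The class names, fields and example.org URIs are hypothetical and do not reproduce the Research Object model vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    uri: str
    kind: str  # e.g. "Data", "Workflow", "Bibliography"


@dataclass
class ResearchObject:
    uri: str
    resources: list = field(default_factory=list)    # Resource or nested ResearchObject
    annotations: dict = field(default_factory=dict)  # resource URI -> list of notes

    def aggregate(self, item):
        """Add a resource (or another RO) to this bundle."""
        self.resources.append(item)
        return self

    def annotate(self, uri, note):
        """Attach a free-text annotation to an aggregated resource."""
        self.annotations.setdefault(uri, []).append(note)

    def all_uris(self):
        """Yield the URIs of everything aggregated, including nested ROs."""
        for item in self.resources:
            yield item.uri
            if isinstance(item, ResearchObject):
                yield from item.all_uris()


inner = ResearchObject("http://example.org/ro/inner")
inner.aggregate(Resource("http://example.org/data/input.csv", "Data"))

outer = ResearchObject("http://example.org/ro/outer")
outer.aggregate(Resource("http://example.org/wf/analysis", "Workflow")).aggregate(inner)
outer.annotate("http://example.org/wf/analysis", "Template used for the main analysis")

print(list(outer.all_uris()))
```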
Workflow-Centric Research Objects: First Class Citizens in Scholarly Discourse. Belhajjame, K.; Corcho, O.; Garijo, D.; Zhao, J.; Missier, P.; Newman, D.; Palma, R.; Bechhofer, S.; García, E.; Manuel, .G. J.; Klyne, G.; Page, K.; Roos, M.; Ruiz, J. E.; Soiland-Reyes, S.; Verdes-Montenegro, L.; De Roure, D.; and Goble, C. In Proceedings of the Second International Conference on the Future of Scholarly Communication and Scientific Publishing (Sepublica2012), pages 1-12, Hersonissos, 2012.
ROHub and rohub.linkeddata.es
http://www.rohub.org/rodl/ http://rohub.linkeddata.es/
Workflow (and RO) Preservation Checklists
Acknowledgements
[Figure: collaboration graph. Nodes: :oscarCorcho, :danielGarijo, :idafenSantana, :olgaGiraldo, :yolandGil, :khalidBelhajjame, :varunRatnakar, :caroleGoble, :pinarAlper. Edges: :supervises, :collaboratesWith. Labels: OEG, Laboratory Protocols, Wf Infrastructure.]
A final note on terminology
Preservation
Restoration
Conservation
Reconstruction
Source: Idafen Santana; Inspired by [Goble, 2012]