The role of annotation in reproducibility (Empirical 2014)

The role of annotation in reproducibility ESWC2014 Empirical workshop 26/05/2014 Contributors: my PhD students Olga Giraldo, Daniel Garijo, and Idafen Santana, and the Wf4Ever team Oscar Corcho [email protected] @ocorcho https://www.slideshare.com/ocorcho

Upload: oscar-corcho

Post on 19-Jan-2015


DESCRIPTION

Invited presentation at ESWC2014 Empirical workshop

TRANSCRIPT

Page 1: The role of annotation in reproducibility (Empirical 2014)

The role of annotation in reproducibility

ESWC2014 Empirical workshop, 26/05/2014

Contributors: my PhD students Olga Giraldo, Daniel Garijo, and Idafen Santana, and the Wf4Ever team

Oscar Corcho [email protected]

@ocorcho https://www.slideshare.com/ocorcho

Page 2: The role of annotation in reproducibility (Empirical 2014)

Setting the context of this presentation

Our main assumption

“We are not so good at describing our experiments, and this has a negative impact on reproducibility (and understandability, and conservation, and reconstruction)”

• Let’s see if this happens in different areas of scientific research
  • In vitro experiments in Plant Biology
  • In silico experiments in several domains

• The challenge
  • Let’s use annotation as a means to increase reproducibility
  • Note: see the last slide on terminology

Page 3: The role of annotation in reproducibility (Empirical 2014)

Ingredients for reproducibility

Page 4: The role of annotation in reproducibility (Empirical 2014)

Ingredients for reproducibility

Page 5: The role of annotation in reproducibility (Empirical 2014)

The role of laboratory protocols in Life Sciences

Laboratory Protocols

http://mibbi.sourceforge.net/about.shtml

Laboratory protocols support the scientific results

Page 6: The role of annotation in reproducibility (Empirical 2014)

Laboratory Protocols

• Written in natural language
• Generally, presented in a “recipe” style
• Description of a sequence of operations that include inputs and outputs
• Step-by-step descriptions of procedures
• A protocol is a type of workflow
• They must be described in sufficient and unambiguous detail
  • To enable another agent (human or machine) to replicate the original experiment
• Specific journals: Biotechniques, CSH protocols, Current protocols, GMR, Jove, Protocol exchange, Plant methods, Plos One, Springer protocols

Page 7: The role of annotation in reproducibility (Empirical 2014)

Detailed instructions in journals’ guides for authors

Page 8: The role of annotation in reproducibility (Empirical 2014)

And other useful elements, including ontologies

It maintains checklists that promote how to report an experiment.

It models the design of an investigation, including protocols, instrumentation, materials and data generated.

EXPO: Aims to formalize knowledge about the organization, execution and analysis of scientific experiments.

EXACT: It provides a model for the description of experiment actions.

Minimal information models, check lists, and even ontologies

Page 9: The role of annotation in reproducibility (Empirical 2014)

However…

• Ambiguity is the norm

• Let’s make an analysis on protocols written for the plant biology community

• Incubate the centrifuge tubes in a water bath.

• Incubate the samples for 5 min with gentle shaking.

• Rinse DNA briefly in 1-2 ml of wash.

• Incubate at -20°C overnight.

Protocol

Page 10: The role of annotation in reproducibility (Empirical 2014)

Analysis of Laboratory Protocols

Repository           Number of Protocols
Biotechniques        8
CSH protocols        11
Current protocols    25
GMR                  4
Jove                 21
Protocol exchange    12
Plant methods        10
Plos One             3
Springer protocols   5
Total                99

Page 11: The role of annotation in reproducibility (Empirical 2014)

Minimal Information to Report a Laboratory Protocol

Our model                                                  Occurrence in other models

TITLE                                                      100%
AUTHOR                                                     100%
INTRODUCTION
Purpose                                                    89%
Provenance of the protocol                                 89%
Applications of the protocol                               89%
Comparison with other protocols                            89%
Limitations                                                89%
MATERIALS
Sample                                                     100%
· Strain or line genotype
· Developmental stage
· Organism part (tissue)
Laboratory consumables/supplies
· Laboratory consumable name                               22%
· Manufacturer name                                        11%
· Laboratory consumable ID (catalog number)                11%
Buffer recipes
· Buffer name                                              67%
· Chemical compound name                                   67%
· Initial concentration of chemical compound               67%
· Final concentration or amount of chemical compound       56%
· Storage conditions                                       56%
· Cautions                                                 56%
· Hints                                                    67%
Reagent
· Reagent name                                             100%
· Reagent vendor or manufacturer                           100%
· Reagent ID (catalog number)                              100%
Kit
· Kit name                                                 100%
· Kit vendor or manufacturer                               100%
· Kit ID (catalog number)                                  56%
Primer
· Primer name                                              67%
· Primer sequence                                          89%
· Primer vendor or manufacturer                            33%
Equipment
· Equipment name                                           67%
· Equipment vendor or manufacturer                         67%
· Equipment ID (catalog number)                            67%
Software
· Software name                                            67%
· Software version                                         67%
METHODS/PROCEDURE
Protocol                                                   100%
· Cautions                                                 56%
· Critical steps                                           56%
· Pause point                                              33%
· Hints                                                    22%
· Troubleshooting                                          44%
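A checklist like the one above can be applied mechanically to a protocol record. The following is a minimal sketch in Python, assuming a protocol is stored as a plain dictionary. The field names, the `REQUIRED_FIELDS` selection and the example record are hypothetical illustrations, not the authors' model or the SMARTProtocols vocabulary.

```python
# Hypothetical subset of the checklist elements (field names are assumptions).
REQUIRED_FIELDS = [
    "title",
    "author",
    "purpose",
    "sample",
    "reagent_name",
    "reagent_manufacturer",
    "equipment_name",
    "procedure",
]


def missing_fields(protocol: dict) -> list:
    """Return the checklist fields that are absent or empty in a protocol."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(protocol.get(name) or "").strip()
    ]


def coverage(protocol: dict) -> float:
    """Fraction of checklist fields that the protocol actually reports."""
    return 1 - len(missing_fields(protocol)) / len(REQUIRED_FIELDS)


# Made-up example record, for illustration only.
example = {
    "title": "DNA extraction from leaf tissue",
    "author": "A. Researcher",
    "sample": "Arabidopsis thaliana, rosette leaves",
    "reagent_name": "96% ethanol",
    "procedure": "Incubate at -20°C overnight.",
}

print(missing_fields(example))
print(f"coverage: {coverage(example):.0%}")
```

Running the same check over a whole repository would give per-field occurrence figures comparable in spirit to the percentages in the table.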

Page 12: The role of annotation in reproducibility (Empirical 2014)

How to Formalize the Protocols?

• Incubate the centrifuge tubes at 65°C in a water bath for 10 min.

• Rinse DNA briefly in 1-2 ml of wash.

• Incubate at -20°C overnight.

Protocol

Annotation on “briefly”: it may indicate different lengths of time (2 seconds? 5-10 seconds?...)

Annotation on the first step:
• Object: centrifuge tubes, water bath
• Unit of measure: 65°C, 10 min
• Action: incubate
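The decomposition of a step into action, objects and measures can be sketched in code. This is a minimal illustration in Python that uses small hand-written word lists and a regular expression; the vocabularies (`ACTIONS`, `OBJECTS`, `VAGUE_TERMS`) and the function name are assumptions made for the example, whereas a real annotator would draw its terms from an ontology.

```python
import re

# Illustrative vocabularies (assumptions, not taken from any ontology).
ACTIONS = {"incubate", "rinse", "centrifuge", "mix"}
OBJECTS = ["centrifuge tubes", "water bath", "samples", "DNA"]
VAGUE_TERMS = ["briefly", "gentle", "overnight"]

# A number or range followed by a unit, e.g. "65°C", "10 min", "1-2 ml".
MEASURE = re.compile(
    r"(-?\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?)\s*(°C|min|ml|h|s)\b"
)


def annotate_step(sentence: str) -> dict:
    """Split one protocol step into action, objects, measures and vague terms."""
    words = sentence.split()
    first = words[0].lower().strip(".,") if words else ""
    lowered = sentence.lower()
    return {
        "action": first if first in ACTIONS else None,
        "objects": [o for o in OBJECTS if o.lower() in lowered],
        "measures": [f"{value} {unit}" for value, unit in MEASURE.findall(sentence)],
        "vague_terms": [t for t in VAGUE_TERMS if t in lowered],
    }


step_1 = annotate_step(
    "Incubate the centrifuge tubes at 65°C in a water bath for 10 min."
)
step_2 = annotate_step("Rinse DNA briefly in 1-2 ml of wash.")
print(step_1)
print(step_2)
```

Flagging terms such as “briefly” separately makes the ambiguous parts of a step explicit, so that a curator can replace them with a concrete duration.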

Page 13: The role of annotation in reproducibility (Empirical 2014)

SMARTProtocols ontology

• http://vocab.linkeddata.es/SMARTProtocols/

Page 14: The role of annotation in reproducibility (Empirical 2014)

Currently working on protocol annotation

[Figure: annotated protocol excerpt with callouts for plant material, instrument name, manufacturer, buffer recipe, reagent name and laboratory consumable name]

Source: Biotechniques

Meta-information about content   Content
Plant material                   Arabidopsis thaliana (rosette leaves, flowers, siliques),… and Larix decidua (young needles)
Instrument name                  Leitz DMRB microscope
Manufacturer                     Leica Micro-systems
Buffer recipe                    50 mM EDTA, 1.4% SDS
Reagent name                     96% ethanol ~ absolute ethanol
Laboratory consumable name       2-mL tube, zeolite beads

Page 15: The role of annotation in reproducibility (Empirical 2014)


From the wet lab to our computers

Lab book

Digital Log

Laboratory Protocol (recipe)

Workflow

Experiment

Page 16: The role of annotation in reproducibility (Empirical 2014)

Ingredients for reproducibility

Page 17: The role of annotation in reproducibility (Empirical 2014)

Scientific Workflows


“Template defining the set of tasks needed to carry out a computational experiment” [1]

• Inputs

• Steps

• Intermediate results

• Outputs

• Data driven, usually represented as Directed Acyclic Graphs (DAGs)
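To make the DAG representation concrete, here is a minimal sketch in Python using the standard-library `graphlib` module (Python 3.9+). The step names are hypothetical; the dictionary maps each step to the steps whose outputs it consumes, and a topological sort yields one valid execution order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each step maps to the steps it depends on.
workflow = {
    "load_input": set(),
    "clean_data": {"load_input"},
    "analyse": {"clean_data"},
    "plot": {"analyse"},
    "export_table": {"analyse"},
}


def execution_order(dag: dict) -> list:
    """Return one order in which every step runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def is_valid_workflow(dag: dict) -> bool:
    """A workflow template is only a DAG if it contains no cycles."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


print(execution_order(workflow))
print(is_valid_workflow({"a": {"b"}, "b": {"a"}}))
```

Steps with no dependencies correspond to the workflow inputs, and steps that nothing depends on produce the final outputs; everything in between is an intermediate result.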

[1] Ewa Deelman, Dennis Gannon, Matthew Shields, Ian Taylor, Workflows and e-science: an overview of workflow system features and capabilities, Future Generation Computer Systems 25 (5) (2009) 528–540.

Page 18: The role of annotation in reproducibility (Empirical 2014)


Plenty of workflow tools and platforms: Taverna, Wings, LONI Pipeline

Page 19: The role of annotation in reproducibility (Empirical 2014)

What do I want from these workflows and repositories?


• As a designer: Discovery

• Workflows with similar functionality fragments/methods

• Design based on previous templates

• As a user/reuser/reviewer: Understandability, Exploration

• Search workflows by functionality

• Commonalities between execution runs

• Component categorization

• Reproducibility

Workflow 1

Page 20: The role of annotation in reproducibility (Empirical 2014)

Working on different aspects of workflow preservation

• Workflow representation
  • Plan/template representation
  • Provenance trace representation
  • Link between templates and traces

• Creation of abstractions/motifs in scientific workflows
  • Abstraction catalog
  • Find how different workflows are related

• Understandability and reuse of scientific workflows
  • Relation between the workflows involved in the same experiment (Research Objects)

CH1: Can we export an abstract template of the method being represented?
CH2: How do we interoperate with other workflow results?
CH3: How do we access the workflow results?
CH4: How do we link an abstract method with several implementations?

CH5: How can we detect what are the typical operations in scientific workflows?
CH6: How can we detect them automatically?

CH7: Which workflow parts are related to other workflows?
CH8: How do workflows depend on the other parts of the experiments?

Page 21: The role of annotation in reproducibility (Empirical 2014)


Overview

• Empirical analysis on 260 workflow templates from Taverna, Wings, Galaxy and Vistrails

• Catalog of recurring patterns: scientific workflow motifs.

• Data Oriented Motifs

• Workflow Oriented Motifs

• Understandability and reuse

Catalog


Common motifs in scientific workflows: An empirical analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.

Page 22: The role of annotation in reproducibility (Empirical 2014)


Approach

• Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence

• Identify workflow abstractions that would facilitate understandability and therefore effective re-use

Page 23: The role of annotation in reproducibility (Empirical 2014)

Motif Catalog

Data-Oriented Motifs (What?)

Data Retrieval

Data Preparation

Format Transformation

Input Augmentation and Output Splitting

Data Organisation

Data Analysis

Data Curation/Cleaning

Data Moving

Data Visualisation

Workflow-Oriented Motifs (How?)

Intra-Workflow Motifs

Stateful (Asynchronous) Invocations

Stateless (Synchronous) Invocations

Internal Macros

Human Interactions

Inter-Workflow Motifs

Atomic Workflows

Composite Workflows

Workflow Overloading

Ontology Purl: http://purl.org/net/wf-motifs
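One way to use the catalog is to tag each workflow step with motif names and count how often each motif occurs. The sketch below is a minimal illustration in Python: the motif names come from the catalog above, while the annotated workflow, its step names and the `motif_counts` helper are hypothetical and do not reflect the ontology's actual terms or serialization.

```python
from collections import Counter

DATA_ORIENTED = {
    "Data Retrieval", "Data Preparation", "Format Transformation",
    "Input Augmentation and Output Splitting", "Data Organisation",
    "Data Analysis", "Data Curation/Cleaning", "Data Moving",
    "Data Visualisation",
}
WORKFLOW_ORIENTED = {
    "Stateful (Asynchronous) Invocations", "Stateless (Synchronous) Invocations",
    "Internal Macros", "Human Interactions",
    "Atomic Workflows", "Composite Workflows", "Workflow Overloading",
}

# Hypothetical annotated workflow: step name -> motifs assigned by a curator.
annotations = {
    "fetch_sequences": ["Data Retrieval", "Stateless (Synchronous) Invocations"],
    "convert_to_fasta": ["Format Transformation"],
    "align": ["Data Analysis"],
    "merge_runs": ["Data Organisation"],
    "render_tree": ["Data Visualisation"],
}


def motif_counts(step_annotations: dict) -> Counter:
    """Count motif occurrences, rejecting names that are not in the catalog."""
    known = DATA_ORIENTED | WORKFLOW_ORIENTED
    counts = Counter()
    for step, motifs in step_annotations.items():
        for motif in motifs:
            if motif not in known:
                raise ValueError(f"unknown motif {motif!r} on step {step!r}")
            counts[motif] += 1
    return counts


counts = motif_counts(annotations)
for motif, n in counts.most_common():
    print(f"{n}  {motif}")
```

Aggregating such counts over a collection of templates would show which motifs dominate in a given workflow system or domain.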

Page 24: The role of annotation in reproducibility (Empirical 2014)

Macro abstraction detection

Problem statement:

Given a repository of workflow templates (either abstract or specific) or workflow execution traces, what are the workflow fragments I can deduce from it?

Useful for:
• Systems like Taverna and Wings (many templates, little annotation to relate them)
  • Finding relationships between workflows and sub-workflows
  • Most used fragments, most executed, etc.
• Systems like GenePattern, LONI Pipeline and Galaxy (many runs, nearly no templates published)
  • Proposing new templates with the popular fragments

Page 25: The role of annotation in reproducibility (Empirical 2014)


Common workflow fragment detection

[Holder et al 1994]: Substructure Discovery in the SUBDUE System. L. B. Holder, D. J. Cook, and S. Djoko. AAAI Workshop on Knowledge Discovery, pages 169-180, 1994.

• Given a collection of workflows, which are the most common fragments?
  • Common sub-graphs among the collection
  • Sub-graph isomorphism (NP-complete)

• We use subgraph mining algorithms
  • Graph Grammar learning
    • The rules of the grammar are the workflow fragments
  • Graph based hierarchical clustering
    • Each cluster corresponds to a workflow fragment
  • Iterative algorithm with two measures for compressing the graph:
    • Minimum Description Length (MDL)
    • Size
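The idea of counting fragments shared across a collection can be illustrated with a deliberately simplified sketch in Python. It is not the SUBDUE algorithm and does not use MDL: it only counts single edges and two-edge chains whose step labels match exactly, which sidesteps general sub-graph isomorphism. The workflows, step labels and function names are hypothetical.

```python
from collections import Counter

# Hypothetical collection: each workflow is a set of (producer, consumer) edges.
workflows = {
    "wf1": {("retrieve", "transform"), ("transform", "analyse"), ("analyse", "plot")},
    "wf2": {("retrieve", "transform"), ("transform", "analyse"), ("analyse", "export")},
    "wf3": {("retrieve", "clean"), ("clean", "analyse"), ("analyse", "plot")},
}


def frequent_edges(collection: dict, min_support: int) -> dict:
    """Edges that appear in at least `min_support` workflows."""
    support = Counter()
    for edges in collection.values():
        support.update(set(edges))  # count each edge once per workflow
    return {edge: n for edge, n in support.items() if n >= min_support}


def frequent_chains(collection: dict, min_support: int) -> dict:
    """Two-edge chains a -> b -> c found in at least `min_support` workflows."""
    support = Counter()
    for edges in collection.values():
        chains = {(a, b, d) for (a, b) in edges for (c, d) in edges if b == c}
        support.update(chains)
    return {chain: n for chain, n in support.items() if n >= min_support}


common_edges = frequent_edges(workflows, min_support=2)
common_chains = frequent_chains(workflows, min_support=2)
print(common_edges)
print(common_chains)
```

A grammar-learning or clustering approach generalizes this by growing fragments iteratively and keeping those that compress the collection best, rather than fixing the fragment size in advance.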

Page 26: The role of annotation in reproducibility (Empirical 2014)


Exporting the fragment results: Wf-FD model

http://purl.org/net/wf-fd

Page 27: The role of annotation in reproducibility (Empirical 2014)


Exporting the fragment results: Wf-FD model

Page 28: The role of annotation in reproducibility (Empirical 2014)

Ingredients for reproducibility

Page 29: The role of annotation in reproducibility (Empirical 2014)

Preserving the infrastructure

http://vocab.linkeddata.es/wicus/

Page 30: The role of annotation in reproducibility (Empirical 2014)

Working on different aspects of workflow preservation


Page 31: The role of annotation in reproducibility (Empirical 2014)


What is a Research Object?

• Aggregation of resources that bundles together the contents of a research work:
  • Data
  • Experiments
  • Examples
  • Bibliography
  • Annotations
  • Provenance
  • ROs
  • Etc.
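The notion of an aggregation can be sketched as a small data structure. The following Python example is an illustration only: the class names, the `kind` labels, the JSON layout and the example.org URIs are assumptions made for this sketch and are not the Research Object model or its serialization.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Resource:
    uri: str
    kind: str  # e.g. "Data", "Experiment", "Annotation", "Provenance"


@dataclass
class ResearchObject:
    identifier: str
    resources: list = field(default_factory=list)

    def aggregate(self, uri: str, kind: str) -> None:
        """Add one resource to the bundle."""
        self.resources.append(Resource(uri, kind))

    def kinds(self) -> list:
        """Distinct kinds of resources currently aggregated."""
        return sorted({r.kind for r in self.resources})

    def to_json(self) -> str:
        """Serialize the bundle as a simple JSON manifest."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical bundle; all URIs are placeholders.
ro = ResearchObject("http://example.org/ro/1")
ro.aggregate("http://example.org/ro/1/input.csv", "Data")
ro.aggregate("http://example.org/ro/1/workflow.t2flow", "Experiment")
ro.aggregate("http://example.org/ro/1/run-1.prov", "Provenance")
ro.aggregate("http://example.org/ro/2", "ResearchObject")  # ROs can be nested

print(ro.kinds())
print(ro.to_json())
```

Treating a nested RO as just another aggregated resource mirrors the “ROs” entry in the list above.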

http://www.researchobject.org/

Workflow-Centric Research Objects: First Class Citizens in Scholarly Discourse. Belhajjame, K.; Corcho, O.; Garijo, D.; Zhao, J.; Missier, P.; Newman, D.; Palma, R.; Bechhofer, S.; García, E.; Manuel, .G. J.; Klyne, G.; Page, K.; Roos, M.; Ruiz, J. E.; Soiland-Reyes, S.; Verdes-Montenegro, L.; De Roure, D.; and Goble, C. In Proceedings of the Second International Conference on the Future of Scholarly Communication and Scientific Publishing (Sepublica2012), pages 1-12, Hersonissos, 2012.

Page 32: The role of annotation in reproducibility (Empirical 2014)

ROHub and rohub.linkeddata.es

http://www.rohub.org/rodl/ http://rohub.linkeddata.es/

Page 33: The role of annotation in reproducibility (Empirical 2014)

Workflow (and RO) Preservation Checklists

Page 34: The role of annotation in reproducibility (Empirical 2014)

Acknowledgements

[Figure: acknowledgements graph. Nodes: :oscarCorcho, :danielGarijo, :idafenSantana, :olgaGiraldo, :yolandGil, :khalidBelhajjame, :varunRatnakar, :caroleGoble, :pinarAlper. Edge labels: :supervises, :collaboratesWith. Group labels: Laboratory Protocols, Wf Infrastructure, OEG]


Page 36: The role of annotation in reproducibility (Empirical 2014)

A final note on terminology

Preservation

Restoration

Conservation

Reconstruction

Source: Idafen Santana; Inspired by [Goble, 2012]