escience-school-oct2012-campinas-brazil

The buzz around reproducible bioscience data: the policies, the communities and the standards Susanna-Assunta Sansone, PhD Principal Investigator and Team Leader, University of Oxford e-Research Centre, Oxford, UK SPSAS e-SciBioEnergy São Paulo School of Advanced Science on e-Science for Bioenergy Research, 22-26 Oct, 2012, Campinas, Brazil Slides at: http://www.slideshare.net/SusannaSansone

Upload: susanna-assunta-sansone

Post on 27-Jan-2015

TRANSCRIPT

Page 1: eScience-School-Oct2012-Campinas-Brazil

The buzz around reproducible bioscience data:

the policies, the communities and the standards

Susanna-Assunta Sansone, PhD Principal Investigator and Team Leader,

University of Oxford e-Research Centre, Oxford, UK

SPSAS e-SciBioEnergy São Paulo School of Advanced Science on e-Science for Bioenergy Research, 22-26 Oct, 2012, Campinas, Brazil

Slides at: http://www.slideshare.net/SusannaSansone

Page 2: eScience-School-Oct2012-Campinas-Brazil

Lab scientist

Data scientist

Consultant

Team Leader

Page 3: eScience-School-Oct2012-Campinas-Brazil

Oxford e-Research Centre

Page 4: eScience-School-Oct2012-Campinas-Brazil

Oxford e-Research Centre

Page 5: eScience-School-Oct2012-Campinas-Brazil

Providing research computing, high-performance computing

Integrating with national and international infrastructure

Supporting leading edge facilities through education and training

Oxford e-Research Centre

Page 6: eScience-School-Oct2012-Campinas-Brazil

Oxford e-Research Centre

Collaborating with European and wider international groups in, e.g.:

•  energy
•  radio astronomy
•  biological data federation
•  life sciences simulation
•  biodiversity
•  computational chemistry
•  neuroscience
•  digital humanities tools
•  digital music analysis

Research in
•  computation
•  data infrastructure and analysis
•  visualisation

Page 7: eScience-School-Oct2012-Campinas-Brazil

tox/pharma  

env  

health  

agro  

My team’s activities and groups we work with

Data management and biocuration; collaborative development of software, databases, standards and ontologies

•  environmental genomics
•  metabolomics
•  metagenomics
•  nanotechnology
•  proteomics
•  stem cell discovery
•  systems biology
•  transcriptomics
•  toxicogenomics
•  environmental health

Page 8: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/12308429@N03/4957994485/ CC BY

Page 9: eScience-School-Oct2012-Campinas-Brazil

“The buzz around reproducible bioscience data:

the policies, the communities and the standards”

“The reality from the buzz:

how to deliver reproducible bioscience data”

Outline

Page 10: eScience-School-Oct2012-Campinas-Brazil

Harmonize collection across sites

Find matching studies

Data dissemination

Long-term data stewardship

Preserve institutional / corporate memory

Page 11: eScience-School-Oct2012-Campinas-Brazil

Utilize public data

Identify suitable data

Retrieve

Curate and harmonize

Re-analyze

Page 12: eScience-School-Oct2012-Campinas-Brazil

Address reproducibility / reuse of public data

Page 13: eScience-School-Oct2012-Campinas-Brazil

Address reproducibility / reuse of public data

Page 14: eScience-School-Oct2012-Campinas-Brazil

Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295

Address reproducibility / reuse of public data

Page 15: eScience-School-Oct2012-Campinas-Brazil

Address reproducibility / reuse of public data

Page 16: eScience-School-Oct2012-Campinas-Brazil

Address reproducibility / reuse of public data

Page 17: eScience-School-Oct2012-Campinas-Brazil

Address reproducibility / reuse of public data

Page 18: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

Page 19: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

COMPREHENSIBLE, INTEROPERABLE, REPRODUCIBLE, REUSABLE

Page 20: eScience-School-Oct2012-Campinas-Brazil

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

Growing, worldwide movement for reproducible research

“Publicly-funded research data are a public good, produced in the public interest”

“Publicly-funded research data should be openly available to the maximum extent possible”

Shared, annotated research data and methods offer new discovery opportunities and prevent unnecessary repetition of work.

Improved data sharing underpins science of the future

Page 21: eScience-School-Oct2012-Campinas-Brazil

§  Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that community-developed standards are pivotal to structure and enrich the annotation of
•  entities of interest (e.g., genes, metabolites, phenotypes) and
•  experimental steps (e.g., provenance of study materials, technology and measurement types)

esoteric formats

ad hoc or proprietary terminology

lack of sufficient contextual information

comprehensible?

interoperable?

reusable?

reproducible?

Growing, worldwide movement for reproducible research

Page 22: eScience-School-Oct2012-Campinas-Brazil

Seven week old C57BL/6N mice were treated with low-fat diet.

Liver was dissected out, RNA prepared…etc.

Type of protocol - sample treatment Type of protocol - nucleic acid extraction

Age value Unit

Strain name Subject of the experiment

Type of diet and experimental condition Anatomy part

§  Describe and communicate the information in an unambiguous, human and machine readable manner

Structure and enrich description of the experiments
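
As a minimal sketch of what a structured and enriched description could look like, the free-text sentence on this slide can be split into typed fields. The class name and field names below are illustrative assumptions, not part of any particular standard; only the values come from the slide.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SampleAnnotation:
    """Hypothetical structured record for one sample (field names are illustrative)."""
    subject: str         # subject of the experiment
    strain: str          # strain name
    age_value: float     # age value
    age_unit: str        # unit of the age value
    diet: str            # type of diet and experimental condition
    anatomy_part: str    # anatomy part
    protocols: list      # types of protocol applied, in order


# The sentence from the slide, restated as explicit fields.
sample = SampleAnnotation(
    subject="mouse",
    strain="C57BL/6N",
    age_value=7,
    age_unit="week",
    diet="low-fat diet",
    anatomy_part="liver",
    protocols=["sample treatment", "nucleic acid extraction"],
)

# A machine-readable form that can be exchanged, validated or queried.
record = json.dumps(asdict(sample), indent=2)
print(record)
```

Each label on the slide (age value, unit, strain name and so on) becomes a named field, which is what makes the description readable by both people and software.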

Page 23: eScience-School-Oct2012-Campinas-Brazil

§  Describe and communicate the information in an unambiguous, human and machine readable manner

Figure: credit to OBI consortium

Structure and enrich description of the experiments

Page 24: eScience-School-Oct2012-Campinas-Brazil

Reproducible & Reusable

Bioscience Research

Page 25: eScience-School-Oct2012-Campinas-Brazil

Reproducible & Reusable

Bioscience Research

Well-annotated & Structured Data

reasoning

analysis

exchange

integration

visualization

browsing retrieval

Page 26: eScience-School-Oct2012-Campinas-Brazil

Reproducible & Reusable

Bioscience Research

Well-annotated & Structured Data

Community Standards

Software Tools

reasoning

analysis

exchange

integration

visualization

browsing retrieval

Page 27: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/lamerentertainment/1581770980/sizes/m/in/photostream/

Page 28: eScience-School-Oct2012-Campinas-Brazil

Source of the figure: EBI website

§  Is interdisciplinary and integrative in character
•  need to deal with new and existing datasets
•  deal with a variety of data types

§  ‘How the organism works’ is the focus
•  Twenty years ago data was the center

Experimental and computational data

Publications

Today’s bioscience research

Page 29: eScience-School-Oct2012-Campinas-Brazil

Source: http://ebbailey.wordpress.com

Page 30: eScience-School-Oct2012-Campinas-Brazil

Example from the toxicogenomics domain

Study looking at the effect of a compound inducing liver damage by characterizing/measuring

- the metabolic profile by MS and NMR

- protein expression in liver by MS

- gene expression by DNA microarray

-  conducting genetic and phenotypical analysis

Information contributing to the construction and validation of systems biology models

Page 31: eScience-School-Oct2012-Campinas-Brazil

Example of experiments by InnoMed PredTox, an FP6 public-private consortium

Page 32: eScience-School-Oct2012-Campinas-Brazil

§  Capture all salient features of the experimental workflow

§  Make annotation explicit and discoverable

§  Structure the descriptions for consistency, tracking
•  independent variables
•  dependent variables
using cross-references and resolvable identifiers

Structured description of datasets

Page 33: eScience-School-Oct2012-Campinas-Brazil

§  We must strike a balance between
•  depth and breadth of information; and
•  sufficient information required to reuse the data

Not too much, not too little, just ‘right’

Page 34: eScience-School-Oct2012-Campinas-Brazil

Information intensive experiments

Page 35: eScience-School-Oct2012-Campinas-Brazil

To make the experiments comprehensible and reusable, underpinning future investigations, we need common ways to report and share the experimental details and the associated data.

Consistent reporting will have a positive and long-lasting impact on the value of collective scientific outputs.

Information intensive experiments

Page 36: eScience-School-Oct2012-Campinas-Brazil

§ The challenges we face

•  Large in volume: lots of data types and metadata!
•  Lots of free text descriptions: hard to mine, subject to mistakes!
•  Babel of terminologies: lack of definitions, hard to map!
•  Heterogeneous file formats: software lock-in!

§ Need for reporting standards
•  Minimal reporting descriptors
   - Report the same ‘core essentials’
•  Controlled vocabularies or ontology
   - Use the same word and mean the same thing
•  Common exchange formats
   - Make tools interoperable, allow data exchange and integration

Common ways to report and share
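
The first two kinds of reporting standard can be illustrated with a small sketch. The checklist fields, the synonym table and the function names are all hypothetical, chosen only to show a minimal checklist working together with a controlled vocabulary.

```python
# Hypothetical minimal checklist: the 'core essentials' every record must report.
REQUIRED_FIELDS = {"organism", "organism_part", "assay_type"}

# Hypothetical controlled vocabulary: free-text variants mapped to one agreed term.
CONTROLLED_TERMS = {
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "mus musculus": "Mus musculus",
    "liver": "liver",
    "hepatic tissue": "liver",
}


def missing_fields(record: dict) -> set:
    """Return the checklist fields that are absent or empty in the record."""
    return {field for field in REQUIRED_FIELDS if not record.get(field)}


def normalize(value: str) -> str:
    """Map a free-text value to its controlled term; unknown values pass through unchanged."""
    return CONTROLLED_TERMS.get(value.strip().lower(), value)


record = {"organism": "Mice", "organism_part": "Hepatic tissue"}
print(missing_fields(record))          # checklist fields still to be reported
print(normalize(record["organism"]))   # agreed term for the organism
```

Two submitters who write "mice" and "Mus musculus" end up with the same term, and a record that omits a core field is flagged before it is shared.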

Page 37: eScience-School-Oct2012-Campinas-Brazil

§  Describe and communicate the information to others, in an unambiguous manner

§  To unlock the value in the data
•  Compare, query and evaluate data
   - Facilitate scientific validation of the findings
•  Understand variability within/between different technologies and protocols
   - Facilitate technical validation
   - Enable optimization of the experimental designs
   - Identify critical checkpoints and develop quality metrics

§  To define submission and/or publication requirements
•  Journals
•  Databases

§  To ensure data integrity, reproducibility and (re)use

Reporting standards – the benefits

Page 38: eScience-School-Oct2012-Campinas-Brazil

Escalating number of standardization efforts in bioscience, e.g.:

•  Genome annotation: www.geneontology.org
•  Functional Genomics Data Society (FGED): www.fged.org
•  HUPO Proteomics Standards Initiative (PSI): http://www.psidev.info
•  Cheminformatics: www.ebi.ac.uk/chebi
•  Pathways: www.biopax.org
•  Systems modelling standards: www.co.mbine.org
•  Metabolomics Standards Initiative (MSI): http://www.metabolomicssociety.org
•  Genomic Standards Consortium (GSC): gensc.org
•  Enzymology data standards: www.strenda.org

Page 39: eScience-School-Oct2012-Campinas-Brazil

Different community, different norms and standards, e.g.:

•  report the same core, essential information
•  use the same word and refer to the same ‘thing’
•  allow data to flow from one system to another

Challenges: lack of coordination, fragmentation and uneven coverage

Page 40: eScience-School-Oct2012-Campinas-Brazil

•  report the same core, essential information
•  use the same word and refer to the same ‘thing’
•  allow data to flow from one system to another

Is this ‘general mobilization’ good or bad?

§  Difference in structures and processes:
•  organization types (open, closed to members, society, WG…)
•  standards development (how to design, develop, evaluate, maintain…)
•  adoption, uptake, outreach (link to journals, funders, commercial sector…)
•  funds (sponsors, memberships, grants, volunteering…)

Page 41: eScience-School-Oct2012-Campinas-Brazil

•  report the same core, essential information
•  use the same word and refer to the same ‘thing’
•  allow data to flow from one system to another

§  Fragmentation of the standards is a major issue
•  Being focused on particular communities’ interests, be they individual technologies or biological/biomedical disciplines, leads to duplication of effort and, more seriously, the development of (largely arbitrarily) different standards
•  This severely hinders the interoperability of databases and tools and ultimately the integration of datasets

Is this ‘general mobilization’ good or bad?

Page 42: eScience-School-Oct2012-Campinas-Brazil

Three EBI omics systems

Submission

Access

Storage

Fragmentation of the databases and data, e.g.

Page 43: eScience-School-Oct2012-Campinas-Brazil

Three EBI omics systems

Submission

Access

Storage

Fragmentation of the databases and data, e.g.

Page 44: eScience-School-Oct2012-Campinas-Brazil

Three EBI omics systems

Submission

Access

Storage

Fragmentation of the databases and data, e.g.

Page 45: eScience-School-Oct2012-Campinas-Brazil

Three EBI omics systems

Submission: DIFFERENT formats, terminologies and tools

Access: DIFFERENT download formats

Storage: DIFFERENT
- core requirements represented
- representation of the studies and related samples
- curation practices

Fragmentation of the databases and data, e.g.

Page 46: eScience-School-Oct2012-Campinas-Brazil

Technologically-delineated views of the world

Biologically-delineated views of the world

Generic features (‘common core’)
- description of source biomaterial
- experimental design components

Technologies: arrays, scanning, columns, gels, MS, FTIR, NMR

Omics domains: transcriptomics, metabolomics

Biological domains: plant biology, epidemiology, microbiology

To integrate data we need interoperable standards

Page 47: eScience-School-Oct2012-Campinas-Brazil

§  Promote synergies
•  Among basic academic (omics) research but also regulatory- or healthcare-driven initiatives

§  Much could be learned from exchange of ideas and practices
•  Although regulatory- or healthcare-driven initiatives have far stricter guidelines
•  Although SDOs often have ‘closed’ discussions and require membership

§  Create interoperable standards
•  Fit neatly into a jigsaw, resolving inconsistency and filling gaps

§  Overcome several barriers
•  Technical
•  Funding issue
•  Sociological…

Need to address the fragmentation

Page 48: eScience-School-Oct2012-Campinas-Brazil

“Any customer can have a car painted any colour that he wants so long as it is black” Henry Ford, you know who he is…

“Biologists would rather share their toothbrush than their gene name” Michael Ashburner, Professor of Genetics, University of Cambridge, UK

Eloquent quotes

Page 49: eScience-School-Oct2012-Campinas-Brazil

§  Buying nuts and bolts is easy today
•  But in the 19th century it was very complicated!

Standards – an old issue, e.g. engineering in 1850

Page 50: eScience-School-Oct2012-Campinas-Brazil

§  Buying nuts and bolts is easy today
•  But in the 19th century it was very complicated!

§  Nuts and bolts were custom made
•  Products from different shops were incompatible
•  Craftsmen liked the monopoly
   - Customers were ‘locked in’!

§  In 1864 William Sellers initiated the standardization
•  Mass production
•  Get interchangeable parts
•  Standardized way to make nuts and bolts

§  Generally adopted only after WWII, though…!

Standards – an old issue, e.g. engineering in 1850

Page 51: eScience-School-Oct2012-Campinas-Brazil

Social engineering

Page 52: eScience-School-Oct2012-Campinas-Brazil

Ownership of open standards can be problematic in broad, grass-roots collaborations; it requires improved models to encourage maintenance of and contributions to these efforts, supporting their evolution

Page 53: eScience-School-Oct2012-Campinas-Brazil

The extensive community liaison needs to be managed and funded; rewards and incentives need to be identified for all contributors

Page 54: eScience-School-Oct2012-Campinas-Brazil

The cost of implementing a standards-supported data sharing vision is as large as the number of stakeholders that must operate synchronously

Page 55: eScience-School-Oct2012-Campinas-Brazil

§ Several data preservation, management and sharing policies have emerged in response to increased funding for omics domains

§ Even if only in general terms, standards are recognized as necessary ‘tools’ to unambiguously represent, describe and communicate research data

1. Funders actively developing data policies

Page 56: eScience-School-Oct2012-Campinas-Brazil
Page 57: eScience-School-Oct2012-Campinas-Brazil

§  “… lack of standardized data affects CDER’s review processes by curtailing a reviewer’s ability to perform integral tasks such as rapid acquisition, storage, analysis......efficient management of a portfolio of standards projects will require coordinated efforts and clear roles for multiple participants within/outside FDA”

2. Similar trend in the regulatory arena

Page 58: eScience-School-Oct2012-Campinas-Brazil
Page 59: eScience-School-Oct2012-Campinas-Brazil

§ Continue to support the development of open standards and tools
•  to support sharing of sufficiently well annotated datasets
•  to enable comprehensible, reusable, reproducible research

3. Publishers have become strong advocates

Page 60: eScience-School-Oct2012-Campinas-Brazil

….the rise of data-driven journals, e.g.:

partnering with:

Page 61: eScience-School-Oct2012-Campinas-Brazil

The rise of data-driven journals, e.g.:

partnering with:

Page 62: eScience-School-Oct2012-Campinas-Brazil

§ R&D has invested heavily in procedures and tools that integrate external information with their own data to enhance the decision-making process

•  Now joining forces to streamline non-competitive elements of the life science workflow by the specification of common standards, business terms, relationships and processes

4. Similar trend in the commercial sector

Page 63: eScience-School-Oct2012-Campinas-Brazil

Big Life Science Company: Yesterday, Today, Tomorrow

                  | Yesterday                   | Today                                      | Tomorrow
Innovation Model  | Innovation inside           | Searching for innovation                   | Heterogeneity of collaborations; part of the wider ecosystem
IT                | Internal apps & data        | Struggling with change, security and trust | Cloud, services
Data              | Mostly inside               | In and out                                 | Distributed
Portfolio         | Internally driven and owned | Partially shared                           | Shared portfolio

Credit to: Pistoia Alliance

Big Life Science

Company

Proprietary content provider

Public content provider

Academic group

Software vendor

CRO

Service provider

Regulatory authorities

....their information landscape is evolving

Page 64: eScience-School-Oct2012-Campinas-Brazil

CC BY

http://www.flickr.com/photos/idiolector/289490834/

Page 65: eScience-School-Oct2012-Campinas-Brazil
Page 66: eScience-School-Oct2012-Campinas-Brazil
Page 67: eScience-School-Oct2012-Campinas-Brazil

“The buzz around reproducible bioscience data:

the policies, the communities and the standards”

•  Contribute to the reproducible research movement

•  Learn about open community-standards in your area

•  Consider data science as a career path

Take home messages

Page 68: eScience-School-Oct2012-Campinas-Brazil

“The buzz around reproducible bioscience data:

the policies, the communities and the standards”

“The reality from the buzz:

how to deliver reproducible bioscience data”

Outline

Page 69: eScience-School-Oct2012-Campinas-Brazil

“The buzz around reproducible bioscience data:

the policies, the communities and the standards”

“The reality from the buzz:

how to deliver reproducible bioscience data”

How do we achieve this? Is it possible to achieve a common,

structured representation of diverse bioscience experiments

that:

•  follows the appropriate community standards and

•  delivers COMPREHENSIBLE, INTEROPERABLE, REPRODUCIBLE, REUSABLE research?

Page 70: eScience-School-Oct2012-Campinas-Brazil

MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…

MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB…

AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…

Growing number of reporting standards

Page 71: eScience-School-Oct2012-Campinas-Brazil

Growing number of reporting standards

+ 130 (estimated)

+ 150 (source: MIBBI, EQUATOR)

+ 303 (source: BioPortal)

Databases, annotation, curation tools

Page 72: eScience-School-Oct2012-Campinas-Brazil

But how much do we know about these standards

Page 73: eScience-School-Oct2012-Campinas-Brazil

Which ones are mature enough for me to use or recommend?

I work on plants; are these just for biomedical applications?

What are the criteria to evaluate their status and value?

How can I get involved to propose extensions or modifications?

Which tools and databases implement which standards?

I use high-throughput sequencing technologies; which ones are applicable to me?

But how much do we know about these standards

Page 74: eScience-School-Oct2012-Campinas-Brazil

§  A bewildering array of standards is available, but
•  these are hard to find, at different levels of maturity; in some areas duplications or gaps in coverage also exist

§  Standards are just a ‘means to an end’, therefore
•  we want to make them discoverable and accessible, maximizing their use to assist the virtuous data cycle, from generation to standardization through publication to subsequent sharing and reuse

But how much do we know about these standards

Page 75: eScience-School-Oct2012-Campinas-Brazil

(2007) Vol 25 No 11

obofoundry.org

Page 76: eScience-School-Oct2012-Campinas-Brazil

§  Compound terms should be formed out of simpler constituents:

•  Body weight = weight (quality ontology, PATO) that inheres_in (relation ontology, RO) whole_organism (anatomy ontology, CARO)

•  Xylene contaminated soil = soil (environmental ontology, EnvO) that has_contaminated (relation ontology, RO) xylene (chemical ontology, ChEBI)

Towards Lego-like ontologies
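
A minimal sketch of this compositional idea, assuming a simple in-memory representation: the `Term` type and `compose` function are illustrative names, and only the term labels, relations and ontology names are taken from the slide.

```python
from typing import NamedTuple


class Term(NamedTuple):
    """A label together with the ontology it is drawn from."""
    label: str
    ontology: str


def compose(head: Term, relation: Term, filler: Term) -> str:
    """Render a compound term as 'head that relation filler', citing each source ontology."""
    return (f"{head.label} ({head.ontology}) that "
            f"{relation.label} ({relation.ontology}) "
            f"{filler.label} ({filler.ontology})")


# 'Body weight' built from three simpler constituents.
body_weight = compose(
    Term("weight", "PATO"),
    Term("inheres_in", "RO"),
    Term("whole_organism", "CARO"),
)

# 'Xylene contaminated soil' built the same way.
xylene_soil = compose(
    Term("soil", "EnvO"),
    Term("has_contaminated", "RO"),
    Term("xylene", "ChEBI"),
)

print(body_weight)
print(xylene_soil)
```

Because every constituent keeps its source ontology, the compound term can be taken apart again and each piece looked up independently.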

Page 77: eScience-School-Oct2012-Campinas-Brazil

(2008) Vol 26 No 8

mibbi.org

Page 78: eScience-School-Oct2012-Campinas-Brazil

§ Serves researchers, biocurators, journal editors and reviewers, and funders to
•  discover checklists for a particular domain
•  monitor progress of extant efforts
•  facilitate collaborations

Page 79: eScience-School-Oct2012-Campinas-Brazil

Science (2009), Vol 326, 234-236

http://biosharing.org

Page 80: eScience-School-Oct2012-Campinas-Brazil

A catalogue to map the landscape of standards and the systems implementing them: over 400 bio-standards (public and in curation)

Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598

Page 81: eScience-School-Oct2012-Campinas-Brazil

•  A coherent, curated and searchable catalogue of data sharing resources
•  Bioscience standards and associated data-sharing policies, publications, tools and databases
•  Assessment criteria for usability and popularity of standards
•  Relationships among standards
•  Encouragement for communication & interaction among groups
•  Promoting interoperability & informed decisions about standards

Page 82: eScience-School-Oct2012-Campinas-Brazil

Smith et al, 2007

Page 83: eScience-School-Oct2012-Campinas-Brazil

Smith et al, 2007

Taylor, Field, Sansone et al, 2008

Page 84: eScience-School-Oct2012-Campinas-Brazil

List of databases, linked to standards a collaboration with Database Issue

Page 85: eScience-School-Oct2012-Campinas-Brazil

List of databases, linked to standards a collaboration with Database Issue

Page 86: eScience-School-Oct2012-Campinas-Brazil

List of databases, linked to standards a collaboration with Database Issue

Page 87: eScience-School-Oct2012-Campinas-Brazil

The relationship among popular standard formats for pathway information BioPAX and PSI-MI are designed for data exchange to and from databases and pathway and network data integration. SBML and CellML are designed to support mathematical simulations of biological systems and SBGN represents pathway diagrams.

CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.

Major challenge: define ‘relations’ among standards

Page 88: eScience-School-Oct2012-Campinas-Brazil
Page 89: eScience-School-Oct2012-Campinas-Brazil

Example of multi-assays study – how many ‘standards’ are applicable to this?

Page 90: eScience-School-Oct2012-Campinas-Brazil

Example of multi-assays study – how many ‘standards’ are applicable to this?

Page 91: eScience-School-Oct2012-Campinas-Brazil

Example of multi-assays study – how many ‘standards’ are applicable to this?

Page 92: eScience-School-Oct2012-Campinas-Brazil

Example of multi-assays study – how many ‘standards’ are applicable to this?

Page 93: eScience-School-Oct2012-Campinas-Brazil

§  A grass-roots collaborative that works to facilitate collection, curation and sharing of experiments using a common, structured representation of the experiments that
•  transcends individual biological and technological domains and
•  can be ‘configured’ to implement (several of) the community standards

An exemplar approach to the status quo

www.biosharing.org

www.isacommons.org

TOWARDS INTEROPERABLE BIOSCIENCE DATA

Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios J, Hide W.

Feb 2012

www.isacommons.org

doi:10.1038/ng.1054

Page 94: eScience-School-Oct2012-Campinas-Brazil

An exemplar approach to the status quo

§  A grass-roots collaborative that works to facilitate collection, curation and sharing of experiments using a common, structured representation of the experiments that
•  transcends individual biological and technological domains and
•  can be ‘configured’ to implement (several of) the community standards

Page 95: eScience-School-Oct2012-Campinas-Brazil

user community

metadata tracking framework

Page 96: eScience-School-Oct2012-Campinas-Brazil

General-purpose, configurable format, designed to support the use of several standards checklists, terminologies and conversions to (a growing number of) other metadata formats, used by public repositories, e.g.
•  MAGE-Tab
•  SRA-xml
•  SOFT
•  Pride-xml

Page 97: eScience-School-Oct2012-Campinas-Brazil

(Rocca-Serra et al, 2010)

a collaborative effort of international research/service groups: University of Oxford, EBI, Harvard School of Public Health, NERC Environmental Bioinformatics Centre, Genomic Standards Consortium, US FDA Center for Bioinformatics, Leibniz Institute of Plant Biochemistry and more….

ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level

Page 98: eScience-School-Oct2012-Campinas-Brazil
Page 99: eScience-School-Oct2012-Campinas-Brazil

1

Create template(s) to fit the type of experiments to be described

Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be
•  text (with/without regular expression testing),
•  ontology terms,
•  numbers etc.
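
How such a template might constrain field values can be sketched as follows. This is not the ISA configurator's actual file format or API; the field names, the rule structure and the `validate` function are assumptions made purely for illustration.

```python
import re

# Hypothetical template: each field declares the kind of value it accepts.
TEMPLATE = {
    "Sample Name": {"type": "text", "pattern": r"[A-Za-z0-9_-]+"},
    "Organism": {"type": "ontology", "allowed": {"Mus musculus", "Homo sapiens"}},
    "Age": {"type": "number"},
}


def validate(row: dict) -> list:
    """Return a list of human-readable problems; an empty list means the row conforms."""
    problems = []
    for field, rule in TEMPLATE.items():
        value = row.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif rule["type"] == "text" and not re.fullmatch(rule["pattern"], str(value)):
            problems.append(f"{field}: does not match pattern")
        elif rule["type"] == "ontology" and value not in rule["allowed"]:
            problems.append(f"{field}: not an allowed term")
        elif rule["type"] == "number":
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: not a number")
    return problems


good = {"Sample Name": "liver_01", "Organism": "Mus musculus", "Age": "7"}
bad = {"Sample Name": "liver 01", "Organism": "mouse", "Age": "seven"}
print(validate(good))
print(validate(bad))
```

The three rule types mirror the three kinds of allowed value listed above: text with an optional regular expression, ontology terms, and numbers.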

Page 100: eScience-School-Oct2012-Campinas-Brazil

Describe, curate your experiment with geographically- distributed collaborators

Report and edit the description of the investigation using customized Google Spreadsheets (importing the ‘template’ created by the ISA configurator) enabled with ontology search and term-tagging features.

2a

Page 101: eScience-School-Oct2012-Campinas-Brazil

Or describe, curate your experiment using a desktop-based tool

Report and edit the description using this tool (also customized using the templates), with a spreadsheet-like look and feel, packed with functionalities such as
•  ontology search (access via )
•  term-tagging features
•  import from spreadsheets etc…

2b

Page 102: eScience-School-Oct2012-Campinas-Brazil

empowering researchers to use standards

To mint DOIs

Page 103: eScience-School-Oct2012-Campinas-Brazil

Perform data analysis

We are building relevant ISA modules for GenomeSpace, R-based BioConductor and Galaxy tools

3

Page 104: eScience-School-Oct2012-Campinas-Brazil

Share your experiments with the world as Linked Open Data

Through conversion to RDF; work in collaboration with the W3C HCLSIG

4

Page 105: eScience-School-Oct2012-Campinas-Brazil

Share your experiments with the world as Linked Open Data

Through conversion to RDF; work in collaboration with the W3C HCLSIG

4

Tim Berners-Lee’s 5-star deployment scheme for Linked Open Data
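
What conversion to RDF amounts to can be sketched with plain string handling. The URIs below are placeholders under `example.org`, and the function is a simplified illustration rather than the converter used by the ISA tools; it emits N-Triples-style lines with literal objects for one record.

```python
BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def to_triples(subject_id: str, properties: dict) -> list:
    """Turn one record into N-Triples-style lines, one per property."""
    lines = []
    for key, value in properties.items():
        # Escape backslashes and double quotes so the literal stays well-formed.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(
            f'<{BASE}sample/{subject_id}> <{BASE}property/{key}> "{literal}" .'
        )
    return lines


triples = to_triples("s1", {"organism": "Mus musculus", "organism_part": "liver"})
for line in triples:
    print(line)
```

Each line states one fact as subject, predicate and object, so records from different sources can be merged by sharing identifiers.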

Page 106: eScience-School-Oct2012-Campinas-Brazil

5

Submit your experiments to public repositories

Directly in ISA-Tab or reformatting using the ISAconverter

Page 107: eScience-School-Oct2012-Campinas-Brazil

6

Create your own repository

Store the investigations in the database, assign access rights and conduct maintenance tasks. Share, browse, query and view investigations, their descriptions and access associated data files.

Page 108: eScience-School-Oct2012-Campinas-Brazil

Maguire E, Rocca-Serra P, Sansone SA, Davies J and Chen M. Taxonomy-based Glyph Design -- with a Case Study on Visualizing Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, volume 18, 2012

(in press)

Page 109: eScience-School-Oct2012-Campinas-Brazil

A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or format) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

•  environmental health
•  environmental genomics
•  metabolomics
•  metagenomics
•  nanotechnology
•  proteomics
•  stem cell discovery
•  systems biology
•  transcriptomics
•  toxicogenomics

It is also used by communities working to build a library of cellular signatures.

Page 110: eScience-School-Oct2012-Campinas-Brazil

Importance of a local community

Implementations at Harvard

Page 111: eScience-School-Oct2012-Campinas-Brazil

data sharing in ISA-Tab

Page 113: eScience-School-Oct2012-Campinas-Brazil

Implementation at the EBI

Page 114: eScience-School-Oct2012-Campinas-Brazil

Data papers

Page 115: eScience-School-Oct2012-Campinas-Brazil

Nanotechnology Informatics Working Group

Extensions of the

Page 116: eScience-School-Oct2012-Campinas-Brazil

Development timeline

Timeline spanning 2007 to 2012, shown as three tracks:

Core developments
•  Strawman ISA-Tab spec
•  Final ISA-Tab spec
•  ISA software v1
•  Database instance at EBI
•  Conversions to Pride-XML/SRA-XML/MAGE-Tab and more
•  RDF format starts
•  Links to analysis tools start
•  Open source code

Community involvement and uptake
•  1st, 2nd and 3rd ISA-Tab workshops
•  User workshops/visits start
•  1st public instance: Harvard Stem Cell Discovery Engine
•  Other tools implement ISA-Tab
•  Growing number of systems starts to adopt the ISA framework

Publications
•  Workshop reports
•  ‘Omics data sharing (Science)
•  ISA-Tab and ISA software suite (Bioinformatics)
•  Stem Cell Discovery Engine (NAR)
•  ISA Commons (Nature Genetics)

Page 117: eScience-School-Oct2012-Campinas-Brazil

“The buzz around reproducible bioscience data:

the policies, the communities and the standards”

“The reality from the buzz:

how to deliver reproducible bioscience data”

Final remarks

Page 118: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/equinoxefr/2620239993/ CC BY

Your research and all (publicly funded) research should make an … impact

Page 119: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/webhamster/2582189977/ CC BY

… the biggest possible impact!

Page 120: eScience-School-Oct2012-Campinas-Brazil

http://www.flickr.com/photos/andrevanbortel/3745527869/sizes/m/in/photostream/

Page 121: eScience-School-Oct2012-Campinas-Brazil

Notes in Lab Books (information for humans)

Spreadsheets and Tables (the compromise)

Facts as RDF statements (information for machines)

We must increase the level of annotation

•  Invest in curating and managing data at the source by:
•  using a common metadata tracking framework, such as ISA
•  using publicly available, community-developed terminologies
•  recording sufficient contextual information about the experimental steps

•  Datasets will then progressively become more comprehensible, interoperable, reproducible and (re)usable, underpinning future investigations

Page 122: eScience-School-Oct2012-Campinas-Brazil
