Post on 22-Jan-2018

Being FAIR: FAIR data and model management

Professor Carole Goble, carole.goble@manchester.ac.uk

The University of Manchester, UK
The FAIRDOM Association Coordinator
ELIXIR-UK Head of Node
Co-lead ELIXIR Interoperability Platform

SSBSS 2017, July 17 2017, Cambridge, UK
4th International Synthetic & Systems Biology Summer School

Data-driven and predictive biology

Data, Software, Models, SOPs….MATTER

Not a by-product.

It’s the fuel.

The assets.

modellers, experimentalists

Why Data Management
http://fair-dom.org

https://www.youtube.com/watch?v=N2zK3sAtr-4

https://www.youtube.com/watch?v=PWutnWBfUSw

Systems Approach: Context + more than Data
models, data, SOPs, samples, strains, publications…
multiple, interrelated assets. multiple, dispersed repositories

Multiple omics: genomics, transcriptomics, proteomics, metabolomics, fluxomics, reactomics

Images, molecular biology, reaction kinetics…
SOPs, sample and strain metadata…
Models: Metabolic, gene network, kinetic…
Scripts and workflows

The relationships between…

Tracking: versions, provenance, parameters…

Citation and credit…

Standards
fairsharing.org

More than simple supplementary materials

16 datafiles (kinetic, flux inhibition, runout)

19 models (kinetics, validation)

13 SOPs

3 studies (model analysis, construction, validation)

24 assays/analyses (simulations, model characterisations)

Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015), Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237

Synthetic Approach - Automation

• Automate data management
– Spreadsheets, instruments, LIMS…
– Replication, comparison

• Support automation
– Tracking successful products from plasmids
– Informing robots
– Incorporate into pipelines and workflows
– Mediate through samples
– Standards

[Courtesy: Andrew Millar]

Systems Approach… Collaboration
teams, disciplines, partners

What methods are being used to determine enzyme activity?

What SOP was used for this sample?

Where is the validation data for this model?

Is there any group generating kinetic data?

Is this data available?

Track versions of my model

What's the relationship between the data and model?

Which data belong to which publications?

modellers, experimentalists

End to end Management
Project Boot up, Run and Washup

• Capture
• Track
• Organise & Link
• Curate
• Report
• Exchange
• Retain
• Integrate
• Reuse other systems
• Support data-driven processes

CREATING DATA

PROCESSING DATA

ANALYSING DATA

PRESERVING DATA

ACCESS TO DATA

RE-USING DATA

The FAIR Guiding Principles for scientific data management and stewardship
https://www.nature.com/articles/sdata201618 (2016)

The greater good…
Access to publicly funded research, Reproducible results
Value and cite all research outputs

UK Funder Data Policies http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies

Compliance and Policy
Data Management Plans

https://wellcomeopenresearch.org/ Nature Scientific Data

Data (and software) as a first class citizen
Data (and software) Citation

Scholarly Communications Providers

The Personal good….

• reviewers want additional work
• statistician wants more runs
• analysis needs to be repeated
• post-doc leaves, student arrives
• new/revised datasets
• updated/new versions of algorithms/codes
• sample was contaminated
• better kit - longer simulations
• new partners, new projects

Personal & Lab Productivity

Sharing
Reproducibility

Catalogues

Standards: identifiers, metadata

Stores

Policy, Identifiers, Authorised Access & Licensing

Standards are not always used....

Formats, Metadata
Metadata reporting guidelines

Ontologies

*top three most popular

The evolution of standards and data management practices in systems biology (2015). Stanford et al, Molecular Systems Biology, 11(12):851

… model reuse and reproducibility tricky…

Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053

Catalogues, Storage and Publishing
Active Data, Published Data

local/project LIMS, data management, analytics: active data.

global, public, central subject-specific databases: published data.

ACT LOCAL THINK GLOBAL

Cloud services

figshare

zenodo

Amazon Web Services
Google Cloud
Azure, EBI Embassy Cloud
Own cloud

FAIRDOMHub

mendeley data

Cloud Data Services

Cloud hosting services

OpenAIRE

Type specific archives
Fragmented silos

Experimental context

All together

Central Repository Centric
Research Infrastructure for FAIR Data for Life Sciences in Europe

Top down, 21 National Nodes + EMBL-EBI

Project Centric
FAIR Research Data Management for Data, SOPs, Models for Systems and Synthetic Biology Projects
Grass roots Association of Institutions and members funded by 4 EU countries.

http://www.fair-dom.org http://www.elixir-europe.org

modellers, experimentalists

FAIRDOM Consortium
Since 2008…

ERANets and ERACoFunds

National Programmes

National Centres

EU Research Infrastructures Sponsors:

Built by Project PALs
post docs, postgrads and techs

FAIRDOM

FAIRDOM Software Platform + Tools

A Central Public Hub for Projects

Customised Project Installations

Project Stewardship Consultancy Services

Community Activities

80 Projects 30+ Installations

http://fair-dom.org/knowledgehub/data-management-checklist/

https://dmponline.dcc.ac.uk/

http://dmp.fairdata.solutions/ (very early alpha)

FAIR Checklists

Making Data Findable (documentation and metadata management)

• What documentation and metadata will accompany the data (assist its discoverability)? (Details on methodology, definitions, procedures, SOPs, vocabularies, units, dependencies, etc)

• What information is needed for the data to be read and interpreted in the future?

• What naming conventions will be used?

• How will you approach versioning your data?

• How will you capture / create this documentation and metadata?

• How do you ensure the completeness of the captured data?

Making Data Accessible

Specify which data will be made openly available taking into consideration

• What ethics and legal compliance issues do you have if any? Do you need consent for data preservation and sharing? Do you have to protect certain data? Is any data sensitive?

• Do you think you might have Intellectual Property Rights issues? Have you considered ownership of the data, licensing, restrictions on use?

• Do you think you will need to embargo any data?

• How will you make the data available? (consider the platforms you will use: databases, repositories, etc)

• What methods or software tools are needed to access the data? Should you include documentation detailing how to access and use the software that is needed for accessing the data? Is it possible to include this software with the data (e.g. source code, Docker, etc.)?

• If there are any restrictions on accessibility, how will you provide access?

Making Data Interoperable

• What standards (metadata vocabularies, formats, checklists) or methodologies will you use?

• How do you address data and model quality? What validation steps do you foresee?

• Will you use standardised vocabulary for all data types to allow inter-disciplinary interoperability?

• Where you cannot use standardised vocabulary for all types of data, can you map to more commonly used ontologies?

Making Data Re-usable

• How will you licence your data to permit the widest re-use possible?

• When will the data be made available for re-use? Does this include an embargo period? (if so, why?)

• Which data will be available for re-use during/after the project? If not, why?

• What are your data quality assurance processes?

• How long do you expect your data to remain re-usable?

Community Actions
http://www.fair-dom.org

Samples Club, Developers Club

Stewardship Support
500K needed*, a new career needing a career path

*European Open Science Cloud Report

FAIRDOM Platform
Free and Open Source

Front end

Project(s) Hub

Back end

Onsite storage & analytics

On site
Tracking, data analytic pipelines, Extract, Transform and Load direct from the instruments, large data management
LIMS, auto-archiving

Web-based portal
Project controlled spaces
Metadata catalogue & Yellow pages
Results repository, dissemination and collaboration
Tool gateway

Back end Instrument Data Management, LIMS, ELN

Samples

Protocols

Experiment Description

Raw Data

Analysis Scripts

Results

Laboratory Notebook & Inventory Manager

ELN
LIMS-like
linking data to biological materials
• samples + protocols management
• data management
• experimental description
Big Data analytics on distributed compute resources

• Project controlled protected spaces
– Working space, show space for results
– Supp. materials space for publications
– Yellow pages and collaboration
– Upload or link to data

• One place catalogue
– Regardless of physical store
– Organised in ISA with shared metadata
– Standards-compliant

• Linked with other systems
– Project on-site (secure) repositories
– Public deposition archives
– Integrated with JWSOnline modelling tools

Front End Hub: A Commons one place to Find, Access and organise assets

“Using FAIRDOMHub my own lab colleagues saw what I was doing and called to collaborate!”

859 people, 80 projects, 198 institutions

FAIRDOMHub.org Public Commons
self managed workspaces, controlled sharing, shared metadata
yellow pages

Investigation

Study Analysis

Data

Model

SOP(Assay)

https://fairdomhub.org/investigations/56

Catalogue across repositories regardless of location
federated stores retaining context to support decision making and reuse

bridging local and global

In House Stores

External Databases

Publishing services

Secure Stores

Model Resources

Upload or Reference

Protected spaces, sharing sensitivities
Open science applies to you but not me… not available, not citable.

Licenses
Negotiated access
Embargoes
Permission controls
Staged sharing

Act Local Think Global Cloud Service

Local retention
In flight management, Private sharing
Customisation
Centres, large projects
National projects
Local skills for admin support

Post-project retention
One stop showcase
Self-managed sharing
Supplementary materials
Off-the-shelf features
Hosted on behalf of users
Delegated admin support
Long term repository

• Trusted repository

• Guaranteed until 2029

• Long term maintenance

• Sustainability

• 1 TB per project stored centrally.

• Much more catalogued.

Hub common space, one place to organise and report your assets

Nucl. Acids Res. (2016) doi: 10.1093/nar/gkw1032

70+ Projects

30+ Installations

Public & cloud Subject and Datatype archives

Typical Data Flows

HTP data
processing, management, exchange

deposition, publishing, reporting

ORGANISATION, COMMUNICATION

samples, analytics

models, SOPs, processed data

DISSEMINATION

Less data, more metadata, potentially wider access

processed data

Publishing… snapshot and assign DOIs
Credits and Citations

G. Penkler, F. Du Toit, W. Adams, M. Rautenbach, D. C. Palm, D. D. Van Niekerk, & J. L. Snoep. (2014). Glucose metabolism in Plasmodium falciparum trophozoites. FAIRDOMHub.

http://doi.org/10.15490/seek.1.investigation.56

Snapshot to fix state with particular versions
Assign a DOI

Entry has citation metadata

Use in journals and in metrics systems

Active entry continues to evolve
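The snapshot idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not how FAIRDOMHub implements it: the function name `take_snapshot`, the asset names and the use of SHA-256 checksums are all hypothetical, and DOI registration (done through a registration service) is not shown.

```python
import hashlib
import json


def take_snapshot(assets):
    """Freeze the current state of a set of assets.

    `assets` maps an asset name to a (version, content_bytes) pair.
    The result records each version with a content checksum, plus an
    id derived from the whole manifest (hypothetical scheme).
    """
    manifest = {
        name: {
            "version": version,
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        for name, (version, content) in sorted(assets.items())
    }
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return {"id": hashlib.sha256(manifest_bytes).hexdigest(), "assets": manifest}


# Hypothetical active entry holding one model and one data file.
active = {
    "model.xml": (1, b"<sbml/>"),
    "data.csv": (1, b"t,glc\n0,5\n"),
}
snapshot = take_snapshot(active)

# The active entry keeps evolving; the earlier snapshot does not change.
active["model.xml"] = (2, b"<sbml><!-- revised --></sbml>")
later = take_snapshot(active)

print(snapshot["assets"]["model.xml"]["version"])  # 1
print(later["assets"]["model.xml"]["version"])     # 2
```

The point of the design is that a citation refers to the fixed manifest, while the working copy stays free to change.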

Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories, doi: https://doi.org/10.1101/097196

An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.

Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Retention: Moses
from the ERANET SysMO Programme

Project ended in 2010
Publication in 2014/2015
Using data from 2012

[Maxim Zakhartsev]

[Adapted from Ursula Klingmüller, Martin Böhm]

Excemplify

Antibody Database

FAIR collaboration from the ERANet ERASysAPP

Programme
Overarching research theme (The Digital Salmon)

Project
Research grant (DigiSal, GenoSysFat)

Investigation
A particular biological process, phenomenon or thing (typically corresponds to [plans for] one or more closely related papers)

Study
Experiment whose design reflects a specific biological research question

Assay
Standardized measurement or diagnostic experiment using a specific protocol (applied to material from a study)

Jon Olav Vik, Norwegian University of Life Sciences

Integration with Norway’s national e-infrastructure for Life Science (NeLS)

Specialist databases

Local: Biochem4j, ICE

Global: Brenda, wikipathways, Biomodels, ICE

Public Deposition Databases

Public Catalogues

Tracking in Specialist Systems

Institutional Catalogue & Repository

Ubiquitous Spreadsheet

• Unifying processes

• Common spreadsheet models
– Consistency and quality of collaboration
– Common identifier meanings
– Metadata collection

Tracking in Specialist Systems

http://www.fairdomhub.org

https://sandbox1.fairdomhub.org
• empty box for safe playing
• copy the investigation that is there
• add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist

Try out for yourself…

The first steps?

• Metadata design

• Samples
– The link between everything

• The ubiquitous spreadsheet
– Templates and exchange…
– Unifying processes
– Carrying best practice

Image from FAIRSharing.org

Use and reuse standard identifiers

General standards

Site specific

Community standards

e.g. SynBioChem ICE Strain convention
A URL, preferably to identifiers.org, that resolves to the description of the host strain in NCBI Taxonomy
e.g. E. coli DH5α http://identifiers.org/taxonomy/668369

location independent resolvable identifiers (URIs) decoupling the identification of records from their physical locations
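A minimal sketch of what "location independent" means in practice: a record is named by a collection prefix plus an accession, and the resolver URL is derived from that pair rather than stored as a physical location. The class and function names here are hypothetical; only the URL pattern and the taxonomy example come from the slide.

```python
from dataclasses import dataclass

RESOLVER = "http://identifiers.org"  # resolver base as written on the slide


@dataclass(frozen=True)
class CompactId:
    """A record named by collection prefix and accession (hypothetical helper)."""
    prefix: str      # collection namespace, e.g. "taxonomy"
    accession: str   # identifier of the record inside that collection

    def uri(self) -> str:
        # The resolvable URI is built from the pair, not stored separately.
        return f"{RESOLVER}/{self.prefix}/{self.accession}"


def parse_uri(uri: str) -> CompactId:
    """Split a resolver URI back into its prefix and accession."""
    base = RESOLVER + "/"
    if not uri.startswith(base):
        raise ValueError(f"not an identifiers.org URI: {uri}")
    prefix, _, accession = uri[len(base):].partition("/")
    if not prefix or not accession:
        raise ValueError(f"missing prefix or accession: {uri}")
    return CompactId(prefix, accession)


host_strain = CompactId("taxonomy", "668369")
print(host_strain.uri())  # http://identifiers.org/taxonomy/668369
```

Storing the prefix and accession, and generating the URL on demand, means a change of resolver or hosting site does not invalidate recorded identifiers.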

Investigation:
Glucose metabolism in P. falciparum trophozoites

Study:

Model construction

Study:

Model validation

Assay: LDH

Assay: PK

Assay: ENO

Assay: PGM

Assay: PGK

Assay: GAPDH

Assay: TPI

Assay: ALD

Assay: PFK

Assay: PGI

Assay: HK

Assay: GLCtr

Assay: PYRtr

Assay: LACtr

Assay: G3PDH

Assay: GLYtr

Assay: ATPase

Data: GLCtr

Model: GLCtr

Data: HK

Model: HK

Steady state

Incubation

penkler1

Validation data

penkler2

Validation data

...

...

SOP: GLCtr

SOP: HK

...

SOP: Validation

Assay: Culturing

Assay: Lysate prep.

SOP: Culturing

SOP: Lysate prep.

Design an ISA (Investigation, Study, Assay/Analysis) structure.

Devising this makes you think…..
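One way to start thinking about an ISA structure is to write it down as nested records before touching any platform. The sketch below is an assumption-laden illustration: the class names mirror ISA terms, the titles and asset labels are taken from the diagram above, but which assays sit under which study, and the `missing_sops` check, are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    name: str
    data: List[str] = field(default_factory=list)
    models: List[str] = field(default_factory=list)
    sops: List[str] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def missing_sops(self) -> List[str]:
        """Names of assays that have no SOP attached yet."""
        return [a.name for s in self.studies for a in s.assays if not a.sops]


# Assumed grouping of assays under studies, for illustration only.
investigation = Investigation(
    "Glucose metabolism in P. falciparum trophozoites",
    [
        Study("Model construction", [
            Assay("HK", data=["Data: HK"], models=["Model: HK"], sops=["SOP: HK"]),
            Assay("GLCtr", data=["Data: GLCtr"], models=["Model: GLCtr"],
                  sops=["SOP: GLCtr"]),
            Assay("PK"),  # deliberately left without an SOP in this sketch
        ]),
        Study("Model validation", [
            Assay("Steady state", data=["Validation data"],
                  sops=["SOP: Validation"]),
        ]),
    ],
)

print(investigation.missing_sops())  # ['PK']
```

Even a toy structure like this surfaces the questions the slide has in mind: which assay produced which data, and which SOP is still undocumented.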

Use FAIR Data and Metadata Standards

help to improve understanding and exchange….

Credit: Nicolas Le Novère, Babraham Institute, UK, adapted.

represents genetic designs
- standardized vocabulary of schematic glyphs
- standardized digital format

ICE, SBOLStack, iGEM

CIMR Core Information for Metabolomics Reporting
MIABE Minimal Information About a Bioactive Entity
MIACA Minimal Information About a Cellular Assay
MIAME Minimum Information About a Microarray Experiment
MIAME/Nutr MIAME / Nutrigenomics
MIAME/Plant MIAME / Plant transcriptomics
MIAME/Tox MIAME / Toxicogenomics
MIAPA Minimum Information About a Phylogenetic Analysis
MIAPAR Minimum Information About a Protein Affinity Reagent
MIAPE Minimum Information About a Proteomics Experiment
MIARE Minimum Information About a RNAi Experiment
MIASE Minimum Information About a Simulation Experiment

Where do I go for standards information?

Linking models…
• connecting (experimental/simulation) data to models
• connecting the single standards?
• interfacing between the different scales?

https://fairsharing.org/collection/FAIRDOM

How do I design the metadata?
Metadata ramps

Metadata Registration and Use

Metadata ramps: spreadsheet templates
Tooling for annotations and checklist templates for different types of assay data.

Embed ontologies into Excel templates

Excel spreadsheets enriched with ontology annotations

Upload, extract metadata and register

http://www.rightfield.org.uk
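The idea behind constraining spreadsheet cells to ontology terms can be sketched without any spreadsheet tooling. The example below is not RightField and does not use its API: it swaps the Excel template for CSV text and the ontology for a small hand-written term set, and the column names, terms and `check_rows` helper are all hypothetical.

```python
import csv
import io

# Hypothetical controlled vocabulary: column name -> allowed terms.
ALLOWED = {
    "organism": {"Plasmodium falciparum", "Escherichia coli"},
    "assay_type": {"enzyme kinetics", "flux measurement"},
}


def check_rows(csv_text, allowed):
    """Return (row_number, column, value) for every cell outside the vocabulary.

    Row numbers follow spreadsheet convention: the header is row 1.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        for column, terms in allowed.items():
            value = (row.get(column) or "").strip()
            if value not in terms:
                problems.append((row_number, column, value))
    return problems


sheet = (
    "sample_id,organism,assay_type\n"
    "S1,Plasmodium falciparum,enzyme kinetics\n"
    "S2,P. falciparum,enzyme kinetics\n"
)
print(check_rows(sheet, ALLOWED))  # [(3, 'organism', 'P. falciparum')]
```

A template that only offers the allowed terms prevents the abbreviated "P. falciparum" entry from being typed in the first place; a check like this one catches it after the fact.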

Ramping up Samples
Spreadsheets! A new framework for Syn and Sys Bio
Samples are Inputs and Outputs…

compliant

Sounds hard….what can I do?

12 steps to being FAIR
plan to be born FAIR

1. plan data management lifecycle: plan, cost and implement pathways and storage including what you will archive, what you will throw away, how you will collect metadata and how you will curate throughout

2. use standard identifiers and identifier standards

3. use metadata standards with data provenance

4. catalogue / register data with metadata

5. have access and sharing policies with licenses

6. use data (assets) management platforms and tools that work together

7. deposit into public archives

8. have a sustainability / end project plan

9. resource and support, and that also means people too

10. embed data management into work practices and do some training

11. give credit

12. check if you have sensitive data issues

What can you do?

• Make a Data Management Plan (check the checklist).
• Get an account on the FAIRDOMHub or install your own.
• Define and share your SOPs.
• Who is your group’s data steward?
• How are they getting credit?
• Know your local data management policies and resources.
• Get some training.
• Educate your supervisors, institutions and peers.
• Build some metadata ramps

The Data Steward
function, profession, cultural shift

• 500,000 needed in Europe*

• Specialist skills

• Career pathways

• Recognition

Curation and management

• Supported, Resourced

• Recognised, Rewarded

Sharing policy and practice embedded

* Realising the European Open Science Cloud (2016)

Jon Olav Vik, Norwegian University of Life Sciences

Maksim Zakhartsev, University Hohenheim, Stuttgart, Germany

Alexey Kolodkin, Siberian Branch, Russian Academy of Sciences

Tomasz Zieliński, SynthSys Centre, University Edinburgh, UK

Martin Peters, Martin Scharm, Systems Biology Bioinformatics, University of Rostock, Germany

Reading List

• Wolstencroft et al (2016). “FAIRDOMHub: a repository and collaboration environment for sharing systems biology research”. Nucleic Acids Research, 45(D1): D404-D407. DOI: 10.1093/nar/gkw1032

• Rice and Southall, The Data Librarian's Handbook, Wiley Publishing, 2016

• Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053

• Wilkinson et al The FAIR Guiding Principles for scientific data management and stewardship, https://www.nature.com/articles/sdata201618 (2016)

• McMurry, Juty, et al. (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414

• Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi: https://doi.org/10.1101/097196

• Realising the European Open Science Cloud https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf

Website list

• FAIRDOM http://www.fair-dom.org
• FAIRDOMHub http://www.fairdomhub.org
• Rightfield http://www.rightfield.org.uk
• FAIRSharing http://www.fairsharing.org
• ELIXIR http://www.elixir-europe.org
• Software Carpentry https://software-carpentry.org/
• Data Carpentry http://www.datacarpentry.org/
• Sandbox https://sandbox1.fairdomhub.org
– empty box for safe playing
– copy the investigation that is there
– add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist