Being FAIR: FAIR data and model management
Professor Carole Goble
The University of Manchester, UK
The FAIRDOM Association Coordinator
ELIXIR-UK Head of Node
Co-lead ELIXIR Interoperability Platform
SSBSS 2017, July 17 2017, Cambridge, UK
4th International Synthetic & Systems Biology Summer School
Data-driven and predictive biology
Data, Software, Models, SOPs….MATTER
Not a by-product.
It’s the fuel.
The assets.
modellers, experimentalists
Why Data Management
http://fair-dom.org
https://www.youtube.com/watch?v=N2zK3sAtr-4
https://www.youtube.com/watch?v=PWutnWBfUSw
Systems Approach: Context + more than Data
models, data, SOPs, samples, strains, publications….
multiple, interrelated assets. multiple, dispersed repositories
Multiple omics: genomics, transcriptomics, proteomics, metabolomics, fluxomics, reactomics
Images, molecular biology, reaction kinetics…
SOPs, sample and strain metadata…
Models: Metabolic, gene network, kinetic…
Scripts and workflows
The relationships between…
Tracking: versions, provenance, parameters…
Citation and credit…
Standards
fairsharing.org
More than simple supplementary materials
16 datafiles (kinetic, flux inhibition, runout)
19 models (kinetics, validation)
13 SOPs
3 studies (model analysis, construction, validation)
24 assays/analyses (simulations, model characterisations)
Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015), Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237
Synthetic Approach - Automation
• Automate data management
– Spreadsheets, instruments, LIMS…
– Replication, comparison
• Support automation
– Tracking successful products from plasmids
– Informing robots
– Incorporate into pipelines and workflows
– Mediate through samples
– Standards
[Courtesy: Andrew Millar]
Systems Approach…Collaboration
teams, disciplines, partners
What methods are being used to determine enzyme activity?
What SOP was used for this sample?
Where is the validation data for this model?
Is there any group generating kinetic data?
Is this data available?
Track versions of my model
What's the relationship between the data and model?
Which data belong to which publications?
modellers, experimentalists
End to end Management
Project Boot up, Run and Washup
• Capture
• Track
• Organise & Link
• Curate
• Report
• Exchange
• Retain
• Integrate
• Reuse other systems
• Support data-driven processes
CREATING DATA
PROCESSING DATA
ANALYSING DATA
PRESERVING DATA
ACCESS TO DATA
RE-USING DATA
The FAIR Guiding Principles for scientific data management and stewardship
https://www.nature.com/articles/sdata201618 (2016)
The greater good….
Access to public funded research, Reproducible results
Value and cite all research outputs
UK Funder Data Policies http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
Compliance and Policy
Data Management Plans
https://wellcomeopenresearch.org/ Nature Scientific Data
Data (and software) as a first class citizen
Data (and software) Citation
Scholarly Communications Providers
The Personal good….
• reviewers want additional work
• statistician wants more runs
• analysis needs to be repeated
• post-doc leaves
• student arrives
• new/revised datasets
• updated/new versions of algorithms/codes
• sample was contaminated
• better kit - longer simulations
• new partners, new projects
Personal & Lab Productivity
Sharing
Reproducibility
Catalogues
Standards: identifiers, metadata
Stores
Policy, Identifiers, Authorised Access & Licensing
Standards are not always used....
Formats
Metadata
Metadata reporting guidelines
Ontologies
*top three most popular
The evolution of standards and data management practices in systems biology (2015). Stanford et al, Molecular Systems Biology, 11(12):851
… model reuse and reproducibility tricky…
Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
Catalogues, Storage and Publishing
Active Data, Published Data
local/project LIMS, data management, analytics. active data.
global, public, central subject-specific databases. published data.
ACT LOCAL THINK GLOBAL
Cloud services
figshare
zenodo
Amazon Web Services
Google Cloud
Azure, EBI Embassy Cloud
Own cloud
FAIRDOMHub
mendeley data
Cloud Data Services
Cloud hosting services
OpenAIRE
Type specific archives
Fragmented silos
Experimental context
All together
Central Repository Centric
Research Infrastructure for FAIR Data for Life Sciences in Europe
Top down, 21 National Nodes + EMBL-EBI
http://www.elixir-europe.org
Project Centric
FAIR Research Data Management for Data, SOPs, Models for Systems and Synthetic Biology Projects
Grass roots Association of Institutions and members funded by 4 EU countries.
http://www.fair-dom.org
modellers, experimentalists
FAIRDOM Consortium
Since 2008….
ERANets and ERACoFunds
National Programmes
National Centres
EU Research Infrastructures Sponsors:
Built by Project PALs
post docs, postgrads and techs
FAIRDOM
FAIRDOM Software Platform+ Tools
A Central Public Hub for Projects
Customised Project Installations
Project Stewardship Consultancy Services
Community Activities
80 Projects 30+ Installations
http://fair-dom.org/knowledgehub/data-management-checklist/
https://dmponline.dcc.ac.uk/
http://dmp.fairdata.solutions/ (very early alpha)
FAIR Checklists
Making Data Findable (documentation and metadata management)
• What documentation and metadata will accompany the data (assist its discoverability)? (Details on methodology, definitions, procedures, SOPs, vocabularies, units, dependencies, etc)
• What information is needed for the data to be read and interpreted in the future?
• What naming conventions will be used?
• How will you approach versioning your data?
• How will you capture / create this documentation and metadata?
• How do you ensure the completeness of the captured data?
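The naming-convention and versioning questions above can be made concrete with a small sketch. The pattern below (project, assay, date, version) is a hypothetical convention chosen only for illustration; it is not prescribed by FAIRDOM or by the checklist, and `parse_filename` and `next_version` are illustrative names.

```python
import re

# Hypothetical convention (an assumption, not from the checklist):
#   <project>_<assay>_<YYYYMMDD>_v<version>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<assay>[a-z0-9]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d+)\.(?P<ext>[a-z0-9]+)$"
)


def parse_filename(name):
    """Return the fields encoded in a file name, or None if it breaks the convention."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    return fields


def next_version(name):
    """Return the file name to use for the next version of the same dataset."""
    fields = parse_filename(name)
    if fields is None:
        raise ValueError(f"{name!r} does not follow the naming convention")
    return "{project}_{assay}_{date}_v{v}.{ext}".format(
        v=fields["version"] + 1, **fields
    )
```

A file such as `glyco_hk_20170717_v2.csv` parses cleanly, while something like `final_FINAL.xlsx` is rejected, which is the point of agreeing a convention before data starts to accumulate.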
Making Data Accessible
Specify which data will be made openly available taking into consideration
• What ethics and legal compliance issues do you have if any? Do you need consent for data preservation and sharing? Do you have to protect certain data? Is any data sensitive?
• Do you think you might have Intellectual Property Rights issues? Have you considered ownership of the data, licensing, restrictions on use?
• Do you think you will need to embargo any data?
• How will you make the data available? (consider the platforms you will use: databases, repositories, etc)
• What methods or software tools are needed to access the data? Should you include documentation detailing how to access and use the software that is needed for accessing the data? Is it possible to include this software with the data (e.g. source code, Docker, etc.)?
• If there are any restrictions on accessibility, how will you provide access?
Making Data Interoperable
• What standards (metadata vocabularies, formats, checklists) or methodologies will you use?
• How do you address data and model quality? What validation steps do you foresee?
• Will you use standardised vocabulary for all data types to allow inter-disciplinary interoperability?
• Where you cannot use standardised vocabulary for all types of data, can you map to more commonly used ontologies?
Making Data Re-usable
• How will you licence your data to permit the widest re-use possible?
• When will the data be made available for re-use? Does this include an embargo period? (if so, why?)
• Which data will be available for re-use during/after the project? If not, why?
• What are your data quality assurance processes?
• How long do you expect your data to remain re-usable?
Community Actions
http://www.fair-dom.org
Samples Club, Developers Club
Stewardship Support
500K needed*, a new career needing a career path
*European Open Science Cloud Report
FAIRDOM Platform
Free and Open Source
Front end
Project(s) Hub
Back end
Onsite storage & analytics
On site
Tracking, data analytic pipelines, Extract, Transform and Load direct from the instruments, large data management
LIMS, auto-archiving
Web-based portal
Project controlled spaces
Metadata catalogue & Yellow pages
Results repository, dissemination and collaboration
Tool gateway
Back end Instrument Data Management, LIMS, ELN
Samples
Protocols
Experiment Description
Raw Data
Analysis Scripts
Results
Laboratory Notebook & Inventory Manager
ELN
LIMS-like, linking data to biological materials
• samples + protocols management
• data management
• experimental description
Big Data analytics on distributed compute resources
• Project controlled protected spaces
– Working space, show space for results
– Supp. materials space for publications
– Yellow pages and collaboration
– Upload or link to data
• One place catalogue
– Regardless of physical store
– Organised in ISA with shared metadata
– Standards-compliant
• Linked with other systems
– Project on-site (secure) repositories
– Public deposition archives
– Integrated with JWSOnline modelling tools
Front End Hub: A Commons one place to Find, Access and organise assets
“Using FAIRDOMHub my own lab colleagues saw what I was doing and called to collaborate!”
859 people, 80 projects, 198 institutions
FAIRDOMHub.org Public Commons
self managed workspaces, controlled sharing, shared metadata, yellow pages
Investigation
Study Analysis
Data
Model
SOP(Assay)
https://fairdomhub.org/investigations/56
Catalogue across repositories regardless of location
federated stores retaining context to support decision making and reuse
bridging local and global
In House Stores
External Databases
Publishing services
Secure Stores
Model Resources
Upload or Reference
Protected spaces, sharing sensitivities
Open science applies to you but not me…not available, not citable.
• Licenses
• Negotiated access
• Embargos
• Permission controls
• Staged sharing
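How licences, negotiated access and embargoes can combine into one access decision is sketched below. This is only an illustration under stated assumptions: `SharingPolicy` and `can_access` are hypothetical names, and the rules are not a description of how FAIRDOMHub implements its permission controls.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SharingPolicy:
    licence: str                    # e.g. "CC-BY-4.0" (illustrative value)
    embargo_until: Optional[date]   # None means there is no embargo
    allowed_users: frozenset        # people granted negotiated access


def can_access(policy, user, today):
    """Negotiated access always applies; everyone else waits for the embargo to end."""
    if user in policy.allowed_users:
        return True
    return policy.embargo_until is None or today >= policy.embargo_until
```

Staged sharing then amounts to changing the policy over time: first a small `allowed_users` set, later an embargo date, finally no restriction at all.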
Act Local Think Global Cloud Service
Local retention
In flight management, Private sharing
Customisation
Centres, large projects
National projects
Local skills for admin support
Post-project retention
One stop showcase
Self-managed sharing
Supplementary materials
Off-the-shelf features
Hosted on behalf of users
Delegated admin support
Long term repository
• Trusted repository
• Guaranteed until 2029
• Long term maintenance
• Sustainability
• 1 TB per project stored centrally.
• Much more catalogued.
Hub common space, one place to organise and report your assets
Nucl. Acids Res. (2016) doi: 10.1093/nar/gkw1032
70+ Projects
30+ Installations
Typical Data Flows
[Diagram labels: HTP data; processing, management, exchange; samples, analytics; ORGANISATION, COMMUNICATION; models, SOPs, processed data; DISSEMINATION; deposition, publishing, reporting; Public & cloud Subject and Datatype archives]
Less data, more metadata, potentially wider access
Publishing…snapshot and assign DOIs
Credits and Citations
G. Penkler, F. Du Toit, W. Adams, M. Rautenbach, D. C. Palm, D. D. Van Niekerk, & J. L. Snoep. (2014). Glucose metabolism in Plasmodium falciparum trophozoites. FAIRDOMHub.
http://doi.org/10.15490/seek.1.investigation.56
Snapshot to fix state with particular versions
Assign a DOI
Entry has citation metadata
Use in journals and in metrics systems
Active entry continues to evolve
Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories, doi: https://doi.org/10.1101/097196
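The snapshot idea can be sketched as an immutable record that carries its own citation metadata. The `Snapshot` class and its field names are assumptions made for illustration; only the author list, year, title, publisher and DOI come from the slide, and this is not the SEEK/FAIRDOMHub data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the snapshot cannot change once it is created
class Snapshot:
    authors: tuple
    year: int
    title: str
    publisher: str
    doi: str

    def citation(self):
        """Format the citation metadata as a single reference string."""
        names = ", ".join(self.authors)
        return (
            f"{names} ({self.year}). {self.title}. "
            f"{self.publisher}. http://doi.org/{self.doi}"
        )


snapshot = Snapshot(
    authors=("G. Penkler", "F. Du Toit", "W. Adams", "M. Rautenbach",
             "D. C. Palm", "D. D. Van Niekerk", "J. L. Snoep"),
    year=2014,
    title="Glucose metabolism in Plasmodium falciparum trophozoites",
    publisher="FAIRDOMHub",
    doi="10.15490/seek.1.investigation.56",
)
print(snapshot.citation())
```

The active entry can keep evolving separately; each new snapshot would be a new frozen record with its own DOI.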
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Retention: Moses
from the ERANET SysMO Programme
Project ended in 2010
Publication in 2014/2015
Using data from 2012
[Maxim Zakhartsev]
[Adapted from Ursula Klingmüller, Martin Böhm]
Excemplify
Antibody Database
FAIR collaboration from the ERANet ERASysAPP
Programme: Overarching research theme (The Digital Salmon)
Project: Research grant (DigiSal, GenoSysFat)
Investigation: A particular biological process, phenomenon or thing (typically corresponds to [plans for] one or more closely related papers)
Study: Experiment whose design reflects a specific biological research question
Assay: Standardized measurement or diagnostic experiment using a specific protocol (applied to material from a study)
Jon Olav Vik, Norwegian University of Life Sciences
Integration with Norway’s national e-infrastructure for Life Science (NeLS)
Specialist databases
Local: Biochem4j, ICE
Global: BRENDA, WikiPathways, BioModels, ICE
Public Deposition Databases
Public Catalogues
Tracking in Specialist Systems
Institutional Catalogue & Repository
Ubiquitous Spreadsheet
• Unifying processes
• Common spreadsheet models
– Consistency and quality of collaboration
– Common identifier meanings
– Metadata collection
Tracking in Specialist Systems
Try out for yourself…
http://www.fairdomhub.org
https://sandbox1.fairdomhub.org
• empty box for safe playing
• copy the investigation that is there
• add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist
The first steps?
• Metadata design
• Samples
– The link between everything
• The ubiquitous spreadsheet
– Templates and exchange…
– Unifying processes
– Carrying best practice
Image from FAIRSharing.org
Use and reuse standard identifiers
General standards
Site specific
Community standards
e.g. SynBioChem ICE Strain convention
A URL, preferably to identifiers.org, that resolves to the description of the host strain in NCBI taxonomy
e.g. E. coli DH5α http://identifiers.org/taxonomy/668369
location independent resolvable identifiers (URIs) decoupling the identification of records from their physical locations
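A minimal sketch of the identifier pattern shown above: the URL is composed from a namespace and an accession, so a record can be identified without knowing where it is physically stored. The helper names `identifiers_org_url` and `split_identifier` are hypothetical; the URL shape is taken from the example on the slide.

```python
from urllib.parse import urlparse


def identifiers_org_url(namespace, accession):
    """Build a resolvable identifier such as http://identifiers.org/taxonomy/668369."""
    if not namespace or not accession:
        raise ValueError("both namespace and accession are required")
    return f"http://identifiers.org/{namespace}/{accession}"


def split_identifier(url):
    """Recover (namespace, accession) from an identifiers.org style URL."""
    path = urlparse(url).path.strip("/")
    namespace, accession = path.split("/", 1)
    return namespace, accession


strain_url = identifiers_org_url("taxonomy", "668369")
print(strain_url)
print(split_identifier(strain_url))
```

Storing the namespace and accession rather than a site-specific link is what lets the same strain record be recognised across local and public systems.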
[Diagram: ISA structure of the investigation]
Investigation: Glucose metabolism in P. falciparum trophozoites
Studies: Model construction; Model validation
Assays: LDH, PK, ENO, PGM, PGK, GAPDH, TPI, ALD, PFK, PGI, HK, GLCtr, PYRtr, LACtr, G3PDH, GLYtr, ATPase, Culturing, Lysate prep.
Data: GLCtr, HK, ...
Models: GLCtr, HK, ...
SOPs: GLCtr, HK, ..., Validation, Culturing, Lysate prep.
Other labels: Steady state; Incubation; penkler1 (Validation data); penkler2 (Validation data)
Design an ISA (Investigation, Study, Assay/Analysis) structure.
Devising this makes you think…..
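One way to see why designing the ISA structure "makes you think" is to write it down as nested records and then ask questions of it. The classes below are a sketch, not the ISA specification or the FAIRDOM SEEK schema. Attaching the HK and GLCtr assays to the "Model construction" study is an assumption based on the labels in the example, and the PK assay is left without an SOP only to exercise the check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    sops: List[str] = field(default_factory=list)
    data_files: List[str] = field(default_factory=list)
    models: List[str] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def assays_missing_sop(investigation):
    """List assays with no SOP attached, as 'study / assay' strings."""
    return [
        f"{study.title} / {assay.title}"
        for study in investigation.studies
        for assay in study.assays
        if not assay.sops
    ]


investigation = Investigation(
    title="Glucose metabolism in P. falciparum trophozoites",
    studies=[
        Study("Model construction", assays=[
            Assay("HK", sops=["SOP: HK"], data_files=["Data: HK"], models=["Model: HK"]),
            Assay("GLCtr", sops=["SOP: GLCtr"], data_files=["Data: GLCtr"], models=["Model: GLCtr"]),
            Assay("PK"),  # hypothetical gap, only here to exercise the check
        ]),
        Study("Model validation"),
    ],
)
print(assays_missing_sop(investigation))
```

The same structure answers the earlier collaboration questions, such as which SOP was used for a sample or where the validation data for a model lives.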
Use FAIR Data and Metadata Standards
help to improve understanding and exchange….
Credit: Nicolas Le Novère, Babraham Institute, UK, adapted.
Represents genetic designs:
- standardized vocabulary of schematic glyphs
- standardized digital format
ICE, SBOLStack, iGEM
CIMR: Core Information for Metabolomics Reporting
MIABE: Minimal Information About a Bioactive Entity
MIACA: Minimal Information About a Cellular Assay
MIAME: Minimum Information About a Microarray Experiment
MIAME/Nutr: MIAME / Nutrigenomics
MIAME/Plant: MIAME / Plant transcriptomics
MIAME/Tox: MIAME / Toxicogenomics
MIAPA: Minimum Information About a Phylogenetic Analysis
MIAPAR: Minimum Information About a Protein Affinity Reagent
MIAPE: Minimum Information About a Proteomics Experiment
MIARE: Minimum Information About a RNAi Experiment
MIASE: Minimum Information About a Simulation Experiment
Where do I go for standards information?
Linking models….
• connecting (experimental/simulation) data to models
• connecting the single standards?
• interfacing between the different scales?
https://fairsharing.org/collection/FAIRDOM
How do I design the metadata?
Metadata ramps
Metadata Registration and Use
Metadata ramps: spreadsheet templates
Tooling for annotations and checklist templates for different types of assay data.
Embed ontologies into Excel templates
Excel spreadsheets enriched with ontology annotations
Upload, extract metadata and register
http://www.rightfield.org.uk
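The "upload, extract metadata and register" step can be approximated with a small check of spreadsheet cells against a controlled vocabulary. This is a sketch only: it reads CSV text rather than Excel, it does not use RightField or its file format, and the column names and allowed terms are hypothetical stand-ins for real ontology terms.

```python
import csv
import io

# Hypothetical controlled vocabulary; a real template would draw on ontology terms.
ALLOWED_TERMS = {
    "organism": {"Plasmodium falciparum", "Escherichia coli"},
    "assay_type": {"enzyme kinetics", "metabolite flux"},
}


def invalid_cells(csv_text, vocabulary):
    """Return (row_number, column, value) for every cell outside the vocabulary."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, allowed in vocabulary.items():
            value = (row.get(column) or "").strip()
            if value not in allowed:
                problems.append((row_number, column, value))
    return problems


sample_sheet = (
    "organism,assay_type\n"
    "Plasmodium falciparum,enzyme kinetics\n"
    "P. falciparum,enzyme kinetics\n"
)
print(invalid_cells(sample_sheet, ALLOWED_TERMS))
```

Embedding the vocabulary in the template itself, as drop-down lists, prevents the free-text variant in the third row from being entered in the first place.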
Ramping up Samples
Spreadsheets! A new framework for Syn and Sys Bio
Samples are Inputs and Outputs….
compliant
Sounds hard….what can I do?
12 steps to being FAIR
plan to be born FAIR
1. plan data management lifecycle: plan, cost and implement pathways and storage including what you will archive, what you will throw away, how you will collect metadata and how you will curate throughout
2. use standard identifiers and identifier standards
3. use metadata standards with data provenance
4. catalogue / register data with metadata
5. have access and sharing policies with licenses
6. use data (assets) management platforms and tools that work together
7. deposit into public archives
8. have a sustainability / end project plan
9. resource and support, and that also means people too
10. embed data management into work practices and do some training
11. give credit
12. check if you have sensitive data issues
What can you do?
• Make a Data Management Plan (check the checklist).
• Get an account on the FAIRDOMHub or install your own.
• Define and share your SOPs.
• Who is your group’s data steward?
• How are they getting credit?
• Know your local data management policies and resources.
• Get some training.
• Educate your supervisors, institutions and peers.
• Build some metadata ramps
The Data Steward
function, profession, cultural shift
• 500,000 needed in Europe*
• Specialist skills
• Career pathways
• Recognition
Curation and management
• Supported, Resourced
• Recognised, Rewarded
Sharing policy and practice embedded
* Realising the European Open Science Cloud (2016)
Jon Olav Vik, Norwegian University of Life Sciences
Maksim Zakhartsev, University Hohenheim, Stuttgart, Germany
Alexey Kolodkin, Siberian Branch, Russian Academy of Sciences
Tomasz Zieliński, SynthSys Centre, University Edinburgh, UK
Martin Peters, Martin Scharm, Systems Biology Bioinformatics, University of Rostock, Germany
Reading List
• Wolstencroft et al (2016). “FAIRDOMHub: a repository and collaboration environment for sharing systems biology research”. Nucleic Acids Research, 45(D1): D404-D407. DOI: 10.1093/nar/gkw1032
• Rice and Southall, The Data Librarian's Handbook, Wiley Publishing, 2016
• Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
• Wilkinson et al The FAIR Guiding Principles for scientific data management and stewardship, https://www.nature.com/articles/sdata201618 (2016)
• McMurry, Juty, et al. (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414
• Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi: https://doi.org/10.1101/097196
• Realising the European Open Science Cloud https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf
Website list
• FAIRDOM http://www.fair-dom.org
• FAIRDOMHub http://www.fairdomhub.org
• Rightfield http://www.rightfield.org.uk
• FAIRSharing http://www.fairsharing.org
• ELIXIR http://www.elixir-europe.org
• Software Carpentry https://software-carpentry.org/
• Data Carpentry http://www.datacarpentry.org/
• Sandbox https://sandbox1.fairdomhub.org
– empty box for safe playing
– copy the investigation that is there
– add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist