How to share useful data
Peter McQuilton
Biosharing.org
@drosophilic
Outline
• Data sharing
• Reusability and reproducibility
  • How the lack of these affects scientific accountability and progress
• Experimental context
  • What to report – what level of granularity
  • How to report it – what format, structure
• Content standards
  • How to find them
  • Complying with repositories, funders and publishers
Research data life cycle
Image credit: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/ (2014)
Better data = better science
A community mobilization for “openness”
image by Greg Emmerich
http://discovery.urlibraries.org/ https://okfn.org
Open data is a means to do better science more efficiently
http://pantonprinciples.org
https://creativecommons.org
Growing movement for FAIR data and research outputs
But in all fairness, not much data is FAIR!
“Reproducing the method took several months of effort, and required using new versions and new software that posed challenges to reconstructing and validating the results”
Unfairness in both experimental and computational areas
• Not always well cited or stored
  o Software, code and workflows are hard(er) to get hold of
• Poorly described for third-party reuse
  o Different levels of detail and annotation
• Curation activities are perceived as time consuming
  o Collection and harmonization of detailed methods and experimental steps is rushed at the publication stage
Not very FAIR: low findability and understandability
• Effectively document your data so that it can be understood in the future
• Periodically move data to new storage media (drives degrade over time)
• Keep more than one copy of data (local and cloud)
• Migrate data to new software versions
• Use a well documented and supported format
Ideally this should be covered in a data management plan at the start of a project, so that you can factor any associated time and resources into your budget.
What can I do to ensure my data are shareable/usable in the future?
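One way to act on "keep more than one copy" and "move data to new storage media" is to confirm that a copy still matches the original after each move. The sketch below is only an illustration using Python's standard library; the function names `sha256_of` and `copies_match` are hypothetical and do not come from the talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, copy: Path) -> bool:
    """True when both files have byte-identical content."""
    return sha256_of(original) == sha256_of(copy)
```

Recording the digest alongside the data when it is first stored would let a later check detect silent corruption even if only one copy is at hand.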
Outline
• Data sharing
• Reusability and reproducibility
  • How the lack of these affects scientific accountability and progress
• Experimental context - standards
  • What to report – what level of granularity
  • How to report it – what format, structure
• Content standards
  • How to find them
  • Complying with repositories, funders and publishers
Do you know what this is?
LS1_C2_LD_TP2_P1 file1-fastq.gz
…how NOT to report the experimental information!
Sample name (?!) Data file
LS1_C2_LD_TP2_P1 file1-fastq.gz
We need to clearly describe the information
• LS1: liver sample 1
• C2: compound 2
• LD: low dose
• TP2: time point 2
• P1: protocol 1
• file1-fastq.gz: compressed data file for sequence information corresponding to this sample
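The decoding above can be made explicit so that nobody has to guess it from the name. The following is a minimal sketch, assuming the naming scheme is exactly the one listed on the slide; the function `decode_sample_name` and the lookup tables are hypothetical names introduced for illustration.

```python
import re

# Prefix meanings taken from the slide's decoding of the sample name.
PREFIXES = {
    "LS": "liver sample",
    "C": "compound",
    "TP": "time point",
    "P": "protocol",
}
# Only "LD" appears in the source; other dose codes would be assumptions.
DOSES = {"LD": "low dose"}


def decode_sample_name(name: str) -> dict:
    """Turn a name like 'LS1_C2_LD_TP2_P1' into labelled fields."""
    record = {}
    for token in name.split("_"):
        if token in DOSES:
            record["dose"] = DOSES[token]
            continue
        match = re.fullmatch(r"([A-Z]+)(\d+)", token)
        if match is None or match.group(1) not in PREFIXES:
            raise ValueError(f"unrecognised token: {token!r}")
        record[PREFIXES[match.group(1)]] = int(match.group(2))
    return record


print(decode_sample_name("LS1_C2_LD_TP2_P1"))
```

Running this prints `{'liver sample': 1, 'compound': 2, 'dose': 'low dose', 'time point': 2, 'protocol': 1}`. Reporting these fields directly, rather than encoding them in a name, removes the need for a decoder at all.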
Without context data is meaningless
• We need to report sufficient information to reuse the dataset
• We must strike a balance between depth and breadth of information
Information intensive experiments
• Not too much
• Not too little
• …just right
Seven week old C57BL/6N mice were treated with low-fat diet. Liver was dissected out, hepatocytes prepared…
From natural language to ‘computable’ concepts
• Age value
• Unit
• Strain name
• Subject of the experiment
• Type of diet and experimental condition
• Anatomy part
Type of protocol – cell preparation
Type of protocol - sample treatment
Type of protocol – liver preparation
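The annotated sentence can be written down as a structured record in which each concept has its own field. This is a sketch only: the class `SampleDescription` and its field names are hypothetical, and the protocol order follows the order of events in the example sentence.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SampleDescription:
    subject: str          # subject of the experiment
    strain: str           # strain name
    age_value: int        # age value
    age_unit: str         # unit
    diet: str             # type of diet and experimental condition
    anatomy_part: str     # anatomy part
    protocols: List[str]  # types of protocol applied


# Values read directly from the example sentence on the slide.
sample = SampleDescription(
    subject="mouse",
    strain="C57BL/6N",
    age_value=7,
    age_unit="week",
    diet="low-fat diet",
    anatomy_part="liver",
    protocols=["sample treatment", "liver preparation", "cell preparation"],
)

print(json.dumps(asdict(sample), indent=2))
```

Once the age is a number with a unit, and the strain and tissue are separate fields, a query such as "all liver samples from mice older than six weeks" becomes a simple filter instead of a text search.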
How do you know what to report, or how to structure it?
• Data/content standards:
  • Structure, enrich and report the description of the datasets and the experimental context under which they were produced
  • Facilitate the discovery, sharing, understanding and reuse of datasets
Outline
• Data sharing
• Reusability and reproducibility
  • How the lack of these affects scientific accountability and progress
• Experimental context
  • What to report – what level of granularity
  • How to report it – what format, structure
• Content standards
  • How to find them
  • Complying with repositories, funders and publishers
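[Figure: examples of content standards; the counts 193, 85 and 346 appear on the slide]
Reporting guidelines: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
Formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, mzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB, …
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …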
There are over 600 content standards in the life sciences
[Figure: de jure and de facto standards, developed by standard organizations and grass-roots groups; logos include the Nanotechnology Working Group]
Community mobilisation to develop content standards
Databases have their own standards, e.g. at EBI:
Enablers: to better describe, share and query data
• Minimum information reporting requirements, or checklists
  o Report the same core, essential information
• Controlled vocabularies, taxonomies, thesauri, ontologies etc.
  o Use the same word and refer to the same ‘thing’
• Conceptual model, conceptual schema, or exchange formats
  o Allow data to flow from one system to another
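The three enablers can be seen working together on a single record. The sketch below is a toy illustration, not any real standard: the checklist `REQUIRED_FIELDS`, the vocabulary `TISSUE_VOCABULARY` and the tab-delimited layout are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical checklist: the core fields every record must report.
REQUIRED_FIELDS = ("organism", "tissue", "treatment")

# Hypothetical controlled vocabulary: free-text variants -> one agreed term.
TISSUE_VOCABULARY = {"liver": "liver", "Liver": "liver", "hepatic tissue": "liver"}


def missing_fields(record: dict) -> list:
    """Checklist: list required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def normalise_tissue(record: dict) -> dict:
    """Vocabulary: replace the tissue value with the agreed term."""
    term = TISSUE_VOCABULARY.get(record.get("tissue", ""))
    if term is None:
        raise ValueError(f"tissue not in vocabulary: {record.get('tissue')!r}")
    return {**record, "tissue": term}


def to_tsv(records: list) -> str:
    """Exchange format: write records as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=REQUIRED_FIELDS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # drop fields the checklist does not name
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

A record that passes `missing_fields`, is normalised with `normalise_tissue` and is written with `to_tsv` reports the same core information, uses the same word for the same thing, and can be read by another system.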
A web-based, curated and searchable registry ensuring that biological standards and databases are registered, informative and discoverable; also monitoring the development and evolution of standards, their use in databases and the adoption of both in data policies.
Researchers, developers and curators lack support and guidance on how best to navigate and select content standards, understand their maturity, or find databases that implement them.
Funders, journals and librarians do not have enough information to make informed decisions on which content standards or databases to recommend in policies, or to fund or implement.
Our mission: To help people make the right choice
Three interlinked registries
Work out which format your data should be in for submission to a particular database
STANDARD DATABASE
Standards and databases (and policies) cross-linked
From simple and advanced searches
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
Search and filter to find what is relevant to your type of data
Tracking evolution, e.g. deprecations and substitutions
Create your own Collection
User profiles populated from ORCID...
... credit for creating, contributing to, maintaining standards, databases and policies
Ownership of open standards can be problematic in broad, grass-root collaborations
This requires improved models to encourage maintenance of, and contributions to, these efforts. Rewards and incentives need to be identified for all contributors, to support the continued development of standards.
What you can do with BioSharing…
“Which standard should I use for this data, considering I’d like to publish in journal X?”
“Are we using the most up-to-date version of this standard?”
“My data is in X format, which databases take that format?”
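Two of these questions reduce to lookups over cross-linked records of standards and databases. The sketch below uses a tiny in-memory registry to show the idea; it is not the BioSharing data model or API, and every entry (`FormatA-v1`, `DatabaseX`, and so on) is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Standard:
    name: str
    replaced_by: Optional[str] = None  # set when the standard is deprecated


@dataclass(frozen=True)
class Database:
    name: str
    accepted_standards: frozenset


# Placeholder entries for illustration only.
STANDARDS = {
    "FormatA-v1": Standard("FormatA-v1", replaced_by="FormatA-v2"),
    "FormatA-v2": Standard("FormatA-v2"),
    "FormatB": Standard("FormatB"),
}
DATABASES = [
    Database("DatabaseX", frozenset({"FormatA-v2"})),
    Database("DatabaseY", frozenset({"FormatA-v2", "FormatB"})),
]


def current_version(name: str) -> str:
    """Follow deprecation links until a non-deprecated standard is reached."""
    seen = set()
    while STANDARDS[name].replaced_by is not None:
        if name in seen:
            raise ValueError(f"deprecation cycle at {name!r}")
        seen.add(name)
        name = STANDARDS[name].replaced_by
    return name


def databases_accepting(name: str) -> list:
    """Names of databases that accept data in the given standard."""
    return [db.name for db in DATABASES if name in db.accepted_standards]
```

For example, `databases_accepting(current_version("FormatA-v1"))` first resolves the deprecated format to its replacement and then lists the databases that take it.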
How can you use community-standards?
These tools and formats will help you to:
ISA powers data collection, curation resources and repositories, e.g.:
ISA model and related formats
Create template(s) to fit the type of experiments to be described
Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. by configuring the value(s) allowed for each field to be:
• text (with/without regular expressions),
• ontology terms,
• numbers etc.
We have ‘ready to use’ community standards compliant configurations and can create more according to user needs
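The idea of a template that restricts each field to text, ontology terms or numbers can be sketched in a few lines. This is not the ISA tools configuration format; the helper names, the field names and the allowed terms below are assumptions chosen for illustration, and a real configuration would normally reference ontology term identifiers.

```python
import re
from typing import Callable, Optional


def text_field(pattern: Optional[str] = None) -> Callable[[str], bool]:
    """Non-empty text, optionally constrained by a regular expression."""
    def check(value: str) -> bool:
        if not value:
            return False
        return pattern is None or re.fullmatch(pattern, value) is not None
    return check


def ontology_field(allowed_terms: set) -> Callable[[str], bool]:
    """Value must be one of the allowed terms."""
    return lambda value: value in allowed_terms


def number_field() -> Callable[[str], bool]:
    """Value must parse as a number."""
    def check(value: str) -> bool:
        try:
            float(value)
        except ValueError:
            return False
        return True
    return check


# Hypothetical template: one validator per field.
TEMPLATE = {
    "sample_name": text_field(r"[A-Za-z0-9_]+"),
    "anatomy_part": ontology_field({"liver", "kidney"}),
    "age_value": number_field(),
}


def invalid_fields(row: dict) -> list:
    """Return the template fields whose values are missing or not allowed."""
    return [name for name, check in TEMPLATE.items() if not check(row.get(name, ""))]
```

Checking each row at data-entry time, rather than at the publication stage, is what keeps the reported steps consistent across investigations.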
• The ISA model records the data’s provenance, how it was generated and where it is located.
• Published Data Descriptors are indexed in all major bibliographic indexing services (incl. PubMed)
• However, accompanying every Data Descriptor article there are metadata files, specifically created to aid discovery and understanding of the data itself.
• Using the ISA (Investigation, Study, Assay) model, these metadata files provide a machine readable overview of the study that generated the data.
• Filter datasets by data repository or metadata
• Boolean searches
• Future enhancements:
  - Statistics
  - Richer queries based on semantics of the data
ISA-explorer: A demo tool for discovering and exploring Scientific Data’s ISA-tab metadata
Visualise the data associated with a paper
http://tinyurl.com/isaexplorer
• Reusability and reproducibility
  o Are pivotal to drive science and discoveries
  o Do your best to make your digital research outputs FAIR
• Experimental context
  o Report the experimental context of your findings
  o Do to your data what you wish that others would do to theirs
• Content standards
  o Continuously evolving
  o Make use of tools implementing standards, such as ISAtools
  o Use biosharing.org to explore repositories, standards and policies
Summary
Acknowledgements
Find the right database for your data, and which data standard to use – https://www.biosharing.org
Checking your data conforms to a standard, or making your own templates – http://www.isa-tools.org
Where to keep research data: DCC checklist for evaluating data repositories (DCC) - http://tinyurl.com/DCCResearchData
How and why you should manage your research data (JISC) - http://tinyurl.com/JISCDMP
Useful links