
Research Data, or: How I Learned to Stop Worrying and Love the Policy
RDMF14: Research Data (and) Systems
York, 9th November 2015

Dr Torsten Reimer
Scholarly Communications Officer
Imperial College London
[email protected] / @torstenreimer
http://orcid.org/0000-0001-8357-9422


Why are we here?

Why we fight – Compliance! Really?

“Well compliance is really important, yes that's the whole reason we are doing it really. I mean to comply with Research Council guidelines yes. I am not saying the whole reason but that's the main driver, yes.”
DOI: 10.1371/journal.pone.0114734

There are issues with RCUK/EPSRC policy:
• cost-benefit analysis, anyone?
• expensive/issues around funding
• enough support/incentive for culture change?
• fine in theory, but is it workable in practice?

But…


Blame funders, or blame ourselves (hedgehog and hare)?

It seems wherever we go, the funders have already been there: HEFCE open access policy; EPSRC data policy…

Are the funders too fast? Or we too slow?

Imagine the sector had agreed on best practice years ago – and implemented it in a sensible way!

So, why are we here again? No really, why?

Data Science hub and KPMG Data Observatory

Data Science hub and KPMG Data Observatory launch (04 Nov)

"At a research intensive university like Imperial it is hard to do anything that doesn't involve data."

James Stirling, Provost

"Data is at the heart of the human condition."

Joanna Shields, UK Minister for Internet Safety and Security

Considering these statements you’d think everyone, especially Imperial, would have RDM all sorted, wouldn’t you?

… and yet we are losing research data

“In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data.”
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416


Isn’t research meant to be reproducible?

The results of only 6 out of 53 ‘landmark’ studies were found to be reproducible.

Drug development: Raise standards for preclinical cancer research. DOI: 10.1038/483531a

“Several recent publications suggested that the seminal findings from academic laboratories could only be reproduced 11–50% of the time. The lack of data reproducibility likely contributes to the difficulty in rapidly developing new drugs and biomarkers that significantly impact the lives of patients with cancer and other diseases.”

A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic. DOI: 10.1371/journal.pone.0063221

Shouldn’t the public too be allowed to play with data?

(This is in our own interest!)

RDM systems landscape

Case for a national infrastructure?

Currently, ~100 UK institutions spend effort to define and implement an RDM infrastructure (storage, workflows, interfaces, metadata, compliance, monitoring, business model etc.). Some aspects have to be local, but…

…imagine a national research data infrastructure (say for data publishing and preservation), run by RCUK:

• Economies of scale
• No issues with funding
• Just one system to interface with
• Increased visibility/discoverability
• Solution would by default be compliant
• No commercial “ownership” of public data

However, past experience suggests…

One RDM system to rule them all?

• Is community track record actually better than funders’?

• Jisc offers components, but have we found right model for collaboration (supplier? leader? partner?)?

• Commercial solutions exist – trust? Should they define our infrastructure?

=> Funders set policy; 3rd parties infrastructure – we’ve been too slow again!

=> However, is one system actually suitable (redundancy, competition, disciplines etc.)?

Until the one solution emerges (if ever), we should:
• consider defining minimum requirements (metadata, identifiers, embargoes) for 3rd party solutions?
• use a flexible approach that enables us to learn and change

Imperial College London
(From funder policy to) institutional strategy

Imperial College London

• Seven London campuses
• Four Faculties: Engineering, Medicine, Natural Sciences and Business School
• Ranked 3rd in Europe / 8th in the world (THE 2015-16 rankings)
• Net income (2014): £855m, incl. £351m research grants and contracts
• ~15,000 students, ~7,400 staff, incl. ~3,900 academic & research staff
• Staff publish 10-12,000 scholarly articles per year
• Largest data traffic into Janet network of all UK universities

Process of policy development

• 2014: Draft policy: “Statement of Strategic Aims”
• Lack of reliable data (on data storage needs (scale) in particular)
• Concerns about cost of maintaining infrastructure
• Concerns about uncertainties and changing market / policy landscape
• Decision: re-think approach – more cost-effective, based on better data

• Approach: RDM Green Shoots and RDM Investigation
• Funded by Vice-Provost (Research)
• Green Shoots: 6 bottom-up, academic projects (2nd half of 2014)
• RDM investigation (Oct 2014-Jan 2015)
  • Online survey (academics; 390 responses)
  • ~40 interviews (academics)
  • Workshops (academics & data managers)

http://www.imperial.ac.uk/research-and-innovation/research-office/research-outcomes-outputs-and-impact/research-data-management/

RDM Green Shoots

• Haystack – a computational molecular data notebook (Dr Mike Bearpark, Chemistry)

• Imperial College Healthcare Tissue Bank (Prof. Gerry Thomas, Surgery & Cancer)

• Integrated Rule-based Data Management System for Genome Sequencing Data (Dr Michael Mueller, Medicine)

• RDM in Computational and Experimental Molecular Sciences (Prof. Henry Rzepa, Chemistry)

• RDM: Where software meets data (Dr Gerard Gorman & Dr Matthew Piggott, Earth Science & Engineering)

• Time Series (Dr Nick Jones, Mathematics)

Example project: Time Series

Idea
• Provide a platform and technology which automatically connects researchers through their time-series data, models and analysis methods

Achievements
• Online interdisciplinary collection of time-series data and time-series analysis code
• Functionality to automatically profile time series
• Functionality to automatically profile time series algorithms
• Functionality to use these profiles to place a user’s work in the context of others

RDM Benefits
• Incentivises data sharing by allowing data comparison – increases discoverability of an academic’s data plus increases likelihood of finding other relevant data
• Resource also available to general public

More Information
• http://www.comp-engine.org/timeseries/
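To illustrate what "profiling" a time series and placing it "in the context of others" could look like, here is a minimal sketch. It is not the method used by the Time Series project, which the slides do not describe; the three features (mean, standard deviation, lag-1 autocorrelation), the function names and the example data are all assumptions chosen for brevity.

```python
import math
from statistics import mean, pstdev


def lag1_autocorrelation(series):
    """Correlation between the series and itself shifted by one step."""
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: define the autocorrelation as 0
    num = sum((a - mu) * (b - mu) for a, b in zip(series, series[1:]))
    return num / denom


def profile(series):
    """Small, illustrative feature vector describing one time series."""
    return (mean(series), pstdev(series), lag1_autocorrelation(series))


def nearest(query, collection):
    """Name of the series in `collection` whose profile is closest to `query`."""
    q = profile(query)
    return min(collection, key=lambda name: math.dist(q, profile(collection[name])))


# Hypothetical example data, not taken from the project
collection = {
    "slow_trend": [1, 2, 3, 4, 5, 6, 7, 8],
    "oscillation": [1, -1, 1, -1, 1, -1, 1, -1],
}
print(nearest([2, 3, 4, 5, 6, 7, 8, 9], collection))
```

The features here are not normalised, so the mean dominates the distance; a real service would presumably use many more features and rescale them before comparing profiles.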


Online survey – where does active data live?

[Bar chart: use of different types of storage in % (axis 0 to 80). Categories: College computer; External/portable storage; Cloud storage; Personal computer; Departmental/group storage; College H drive; ICT central storage. Bar values not recoverable.]

Online survey – growth of data volume

[Bar chart: research group data storage needs in % (axis 0 to 30), now vs. in 2 years. Size bands: < 10 GB; 10 GB – 100 GB; 100 GB – 1 TB; 1 TB – 10 TB; 10 TB – 100 TB; 100 TB – 1 PB; > 1 PB. Bar values not recoverable.]

Findings (best practice)

• RDM principles are considered to be sound but not fully practised
• Sharing publicly-funded data accepted in principle but some question value and cost
• Concerns about (metadata) effort to make shared data discoverable
• Metadata schemas are not yet widely available across disciplines
• Auto-generate metadata where possible
• Consensus that RDM training for PhDs is vital (also to prevent data loss when they leave)

Findings (data)

• 60-100% of grant required to re-generate data used in publications

• % of data that needs retaining to support publications: ~60%
• Data storage capacity will have to grow significantly
• Concerns around back-up and archiving, esp. considering data volume
• Popularity of cloud services (as opposed to College storage)

=> Researchers want self-administered, secure, responsive solution for data sharing, storing and archiving; open APIs preferred

(“Yes [storage] is really important. Basically, whenever we have been out to talk to researchers, that's the thing they have latched on to and want to talk about the most.” 10.1371/journal.pone.0114734)


Conclusions / policy implementation principles

• Provide platform-independent, flexible data storage
• Embed RDM training into PhD progression
• Where available, use existing workflows:
  • Symplectic Elements: metadata management
  • Spiral (DSpace): public (metadata) catalogue
• Additional infrastructure:
  • use external resources
  • no long-term commitment
  • as flexible as possible
  • cost-effective

Result: Imperial College RDM Policy

“Imperial College London is committed to promoting the highest standards of academic research, including excellence in research data management. This includes a robust digital curation infrastructure that supports open data access and protects confidential data. The College acknowledges legal, ethical and commercial constraints on data sharing and the need to preserve the academic entitlement to publication.”

“Principal Investigators have overall responsibility for the effective management of research data generated within or obtained for their research, including by their research groups. The Library and ICT will provide training, guidance and services to support PIs.” http://imperial.ac.uk/research-data-management


Building a flexible RDM infrastructure

[Flowchart; labels recovered from the slide, arrow layout partly inferred]
Research Project -> creates data/software (Data: Box; Software: GitHub) -> project ends
Data/software still needed?
• no: Delete
• yes: can it be published or embargoed externally?
  • yes: External repository
  • no: Internal storage
Metadata (manual or automatic) -> Elements
Can metadata be published? Library reviews; if yes -> Spiral
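The end-of-project decisions in this flowchart can be sketched as a small routing function. This is one reading of the slide labels, not a description of a system the College runs: the `Output` class, its field names, the action strings and the exact order of the steps (in particular where the Library review sits) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """One dataset or software release at project end (hypothetical model)."""
    still_needed: bool
    can_publish_externally: bool  # includes publishing under embargo
    metadata_publishable: bool


def route(output: Output) -> list[str]:
    """Return the ordered actions for one output, following the flowchart labels."""
    if not output.still_needed:
        return ["delete"]
    steps = []
    # Publishable (or embargoable) outputs go to an external repository,
    # everything else stays on internal storage.
    if output.can_publish_externally:
        steps.append("external repository")
    else:
        steps.append("internal storage")
    # Metadata is recorded manually or automatically in Elements.
    steps.append("record metadata in Elements")
    # Assumed order: the Library reviews before metadata appears in Spiral.
    if output.metadata_publishable:
        steps.append("library review")
        steps.append("list in Spiral")
    return steps


print(route(Output(still_needed=True, can_publish_externally=False, metadata_publishable=True)))
```

Keeping each decision as a plain boolean mirrors the slide's yes/no branches and makes it easy to swap a component (for example a different repository) without touching the rest of the flow.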

Summarising RDM in 6 steps

1. Make a data management plan: use DMPOnline

2. Store your data management plan centrally: use InfoEd

3. Store your live data securely and safely: use Box

4. Store your final data (and/or code) for 10+ years, making it publicly available: use Zenodo

5. Tell the College where your data (and/or code) is published or stored: use Symplectic

6. Reference your funding and your data in the publications it underpins: tell your publisher

Box – Data storage, sharing and syncing

Roll-out across College:
• unlimited data storage
• online access, easy sharing, data syncing
• file viewers included
• backup, data remains even when staff leave
• machine learning tools to describe data
• API

Infrastructure summary

• Flexible, can react to market / policy changes
• Components can be exchanged, no additional in-house infrastructure
• Make a start, collect data, learn – change as required
• Preservation infrastructure needs further work (discussions with Arkivum about ‘framework’ for costing into grants) – how much do we need to retain beyond published data?
• It isn’t perfect, but we can make a start

“In, through … and beyond”

RDM policy with research software requirements

“3.6.7 Cost Effectiveness – where computer-generated data may be reliably recreated at a cost less than that of storing raw output data, then the inputs and human-readable outputs of the relevant programme may be stored instead along with a reference to or copy of the software version used.”

“3.7 If software is developed as part of a research project, Principal Investigators must archive the particular version of the software used to generate or analyse the data in a repository and inform the Library of its location, taking account of the points raised in 3.5 above. Principal Investigators are encouraged to follow the Sustainability and Preservation Framework of the Software Sustainability Institute.”
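The cost-effectiveness test in 3.6.7 comes down to a comparison of two costs. The sketch below shows that arithmetic; the function names, the simple size-times-rate-times-years storage model and all figures are hypothetical and are not taken from the policy.

```python
def storage_cost(size_tb: float, cost_per_tb_year: float, years: int) -> float:
    """Cost of keeping raw output for the retention period (assumed linear model)."""
    return size_tb * cost_per_tb_year * years


def keep_raw_output(recreation_cost: float, raw_storage_cost: float,
                    reliably_recreatable: bool) -> bool:
    """True if the raw output should be stored.

    False means the 3.6.7 exception applies: inputs, human-readable outputs
    and a reference to (or copy of) the software version may be stored instead.
    """
    if not reliably_recreatable:
        return True
    return recreation_cost >= raw_storage_cost


# Hypothetical figures: 50 TB of simulation output kept for 10 years
cost = storage_cost(size_tb=50, cost_per_tb_year=100.0, years=10)
print(cost)
print(keep_raw_output(recreation_cost=5_000.0, raw_storage_cost=cost,
                      reliably_recreatable=True))
```

With these made-up numbers, re-running the software is cheaper than storage, so storing the inputs plus the archived software version would satisfy the clause.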

Treat software as valuable research output

PyRDM Green Shoots project
Zenodo integrates with GitHub
College survey on distributed version control
Software Sustainability Institute – I am a fellow

https://github.com/pyrdm

http://www.software.ac.uk/

ORCID – Open Researcher and Contributor ID

• Emerging global standard for identifying authors of academic outputs
• The College created ORCID iDs for academic staff in late 2014 (now 2,088 of 3,200 iDs claimed, ~1,500 linked in Elements)
• Imperial hosted launch of Jisc ORCID consortium with 50 UK universities in September 2015

http://www.imperial.ac.uk/orcid

http://wwwf.imperial.ac.uk/blog/openaccess/2015/10/07/uk-orcid-members-meeting-and-launch-of-jisc-orcid-consortium-at-imperial-college-london-28th-september-2015/




Towards automating RDM reporting with ORCID

Author links ORCID with CRIS
…shares ORCID iD with repository
…publishes dataset
DataCite DOI linked to ORCID iD
CRIS pulls metadata from ORCID / DataCite / Repository

But: is the external metadata likely to be complete “enough”?
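One way to make "complete enough" concrete is an automated check on each record the CRIS pulls in. The sketch below is an assumption-laden illustration: the required field names and the record layout are hypothetical, not the schema of Elements, ORCID or DataCite. The check digit calculation follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its iDs.

```python
import re

# Hypothetical minimum set of fields for a dataset record
REQUIRED_FIELDS = ("doi", "title", "creator_orcid", "publication_year")


def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 0000-0000-0000-0000 layout and the final check character."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def missing_fields(record: dict) -> list[str]:
    """Required fields that are absent, empty or (for the iD) malformed."""
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    orcid = record.get("creator_orcid")
    if orcid and not is_valid_orcid(orcid):
        missing.append("creator_orcid")
    return missing


# Placeholder record; the DOI is a dummy value
record = {
    "doi": "10.1234/example",
    "title": "Example dataset",
    "creator_orcid": "0000-0001-8357-9422",
}
print(missing_fields(record))
```

Records with a non-empty result could be queued for manual completion instead of being reported automatically.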

Useful infrastructure makes compliance a by-product

• One workflow for data generation, publishing, reporting and curation
• Link data generation directly to storage (log into facility, data “at your desk” before you are out of the “lab”)
• (HSS colleagues – “facility” can also be a book scanner)
• Automate reporting and generating / sharing of metadata

Facilities write (meta)data into Box
Data processed / analysed from Box
Machine-learning adds metadata
Publish to repository from Box, with reference
Metadata directly or indirectly (ORCID) to CRIS

Make data useful for us, not just for external re-use

Now that we get data, shouldn’t we analyse it?

Add value by:
• connecting researchers who have similar data interests
• connecting researchers to relevant data
• presenting data in a way that’s suitable for public reuse
• developing data analytics and knowledge transfer service
• collecting impact information on data

• Let’s make a start and learn from doing, from actual data
• Think about where we can coordinate (3rd party requirements)
• It is early stages, take a flexible approach
• Don’t wait for funders, interpret policies in a useful way and lead

=> If we lead instead of following there will be fewer unpleasant surprises to deal with!


Image Credit (note NC licence!)

1. https://en.wikipedia.org/wiki/File:Dr._Strangelove_-_Group_Captain_Lionel_Mandrake.png public domain

2. https://it.wikipedia.org/wiki/Why_We_Fight#/media/File:Why_We_Fight_title.jpg public domain

3. https://commons.wikimedia.org/wiki/File:Hase_und_Igel_%281%29.jpg public domain

4. https://www.flickr.com/photos/jdhancock/4617759902/ C-3PO vs. Data (137/365), by JD Hancock, CC BY 2.0

5. https://en.wikipedia.org/wiki/One_Ring#/media/File:Unico_Anello.png public domain

6. https://www.flickr.com/photos/dinnerseries/14994148089/ OXO tools, by Didriks, CC BY 2.0

7. https://www.flickr.com/photos/albertovo5/3908190631/ How I Learned To Stop Worrying..., by hjhipster, CC BY NC 2.0
