research data, or: how i learned to stop worrying and love the policy

Research Data, or: How I Learned to Stop Worrying and Love the Policy RDMF14: Research Data (and) Systems York, 9th November 2015 Dr Torsten Reimer Scholarly Communications Officer Imperial College London [email protected] / @torstenreimer http://orcid.org/0000-0001-8357-9422

Upload: torsten-reimer

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Research Data, or: How I Learned to Stop Worrying and Love the Policy
RDMF14: Research Data (and) Systems
York, 9th November 2015

Dr Torsten Reimer
Scholarly Communications Officer
Imperial College London
[email protected] / @torstenreimer
http://orcid.org/0000-0001-8357-9422

Page 2: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Why are we here?

Page 3: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Why we fight – Compliance! Really?

“Well compliance is really important, yes that's the whole reason we are doing it really. I mean to comply with Research Council guidelines yes. I am not saying the whole reason but that's the main driver, yes.” 10.1371/journal.pone.0114734

There are issues with RCUK/EPSRC policy:
• cost-benefit analysis, anyone?
• expensive/issues around funding
• enough support/incentive for culture change?
• fine in theory, but is it workable in practice?

But…

Page 4: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Blame funders, or blame ourselves (hedgehog and hare)?

It seems wherever we go, the funders have already been there: HEFCE open access policy; EPSRC data policy…

Are the funders too fast? Or are we too slow?

Imagine the sector had agreed on best practice years ago – and implemented it in a sensible way!

Page 5: Research Data, or: How I Learned to Stop Worrying and Love the Policy

So, why are we here again? No really, why?

Page 6: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Data Science hub and KPMG Data Observatory

Page 7: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Data Science hub and KPMG Data Observatory launch (04 Nov)

"At a research intensive university like Imperial it is hard to do anything that doesn't involve data.“

James Stirling, Provost

"Data is at the heart of the human condition."

Joanna Shields, UK Minister for Internet Safety and Security

Considering these statements you’d think everyone, especially Imperial, would have RDM all sorted, wouldn’t you?

Page 8: Research Data, or: How I Learned to Stop Worrying and Love the Policy

… and yet we are losing research data

“In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data.”http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Page 9: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Isn’t research meant to be reproducible?

The results of only 6 out of 53 ‘landmark’ studies were found reproducible.

Drug development: Raise standards for preclinical cancer research. DOI: 10.1038/483531a

“Several recent publications suggested that the seminal findings from academic laboratories could only be reproduced 11–50% of the time. The lack of data reproducibility likely contributes to the difficulty in rapidly developing new drugs and biomarkers that significantly impact the lives of patients with cancer and other diseases.”

A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic. DOI: 10.1371/journal.pone.0063221

Page 10: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Shouldn’t the public too be allowed to play with data?

Page 11: Research Data, or: How I Learned to Stop Worrying and Love the Policy

(This is in our own interest!)

Page 12: Research Data, or: How I Learned to Stop Worrying and Love the Policy

RDM systems landscape

Page 13: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Case for a national infrastructure?

Currently, ~100 UK institutions spend effort to define and implement an RDM infrastructure (storage, workflows, interfaces, metadata, compliance, monitoring, business model etc.). Some aspects have to be local, but…

…imagine a national research data infrastructure (say for data publishing and preservation), run by RCUK:

• Economies of scale
• No issues with funding
• Just one system to interface with
• Increased visibility/discoverability
• Solution would by default be compliant
• No commercial “ownership” of public data

Page 14: Research Data, or: How I Learned to Stop Worrying and Love the Policy

However, past experience suggests…

Page 15: Research Data, or: How I Learned to Stop Worrying and Love the Policy

However, past experience suggests…

Page 16: Research Data, or: How I Learned to Stop Worrying and Love the Policy

One RDM system to rule them all?

• Is community track record actually better than funders’?

• Jisc offers components, but have we found right model for collaboration (supplier? leader? partner?)?

• Commercial solutions exist – trust? Should they define our infrastructure?

=> Funders set policy; 3rd parties infrastructure – we’ve been too slow again!

=> However, is one system actually suitable (redundancy, competition, disciplines etc.)?

Until the one solution emerges (if ever), we should:
• consider defining minimum requirements (metadata, identifiers, embargoes) for 3rd party solutions?
• use a flexible approach that enables us to learn and change

Page 17: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Imperial College London

(From funder policy to) institutional strategy

Page 18: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Imperial College London

• Seven London campuses
• Four Faculties: Engineering, Medicine, Natural Sciences and Business School
• Ranked 3rd in Europe / 8th in the world (THE 2015-16 rankings)
• Net income (2014): £855m, incl. £351m research grants and contracts
• ~15,000 students, ~7,400 staff, incl. ~3,900 academic & research staff
• Staff publish 10-12,000 scholarly articles per year
• Largest data traffic into Janet network of all UK universities

Page 19: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Process of policy development

• 2014: Draft policy: “Statement of Strategic Aims”
  • Lack of reliable data (on data storage needs (scale) in particular)
  • Concerns about cost of maintaining infrastructure
  • Concerns about uncertainties and changing market / policy landscape
  • Decision: re-think approach – more cost-effective, based on better data

• Approach: RDM Green Shoots and RDM Investigation
  • Funded by Vice-Provost (Research)
  • Green Shoots: 6 bottom-up, academic projects (2nd half of 2014)
  • RDM investigation (Oct 2014-Jan 2015)
    • Online survey (academics; 390 responses)
    • ~40 interviews (academics)
    • Workshops (academics & data managers)

Page 20: Research Data, or: How I Learned to Stop Worrying and Love the Policy

RDM Green Shoots

• Haystack – a computational molecular data notebook (Dr Mike Bearpark, Chemistry)

• Imperial College Healthcare Tissue Bank (Prof. Gerry Thomas, Surgery & Cancer)

• Integrated Rule-based Data Management System for Genome Sequencing Data (Dr Michael Mueller, Medicine)

• RDM in Computational and Experimental Molecular Sciences (Prof. Henry Rzepa, Chemistry)

• RDM: Where software meets data (Dr Gerard Gorman & Dr Matthew Piggott, Earth Science & Engineering)

• Time Series (Dr Nick Jones, Mathematics)

Page 21: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Idea

• Provide a platform and technology which automatically connects researchers through their time-series data, models and analysis methods

Achievements

• Online interdisciplinary collection of time-series data and time-series analysis code
• Functionality to automatically profile time series
• Functionality to automatically profile time series algorithms
• Functionality to use these profiles to place a user’s work in the context of others

RDM Benefits

• Incentivises data sharing by allowing data comparison – increases discoverability of an academic’s data plus increases likelihood of finding other relevant data
• Resource also available to general public

More Information

• http://www.comp-engine.org/timeseries/

Example project: Time Series
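
As a rough illustration of the profiling idea on this slide, the sketch below reduces each time series to a small feature vector and then ranks the other series by distance in that feature space. The three features (mean, standard deviation, lag-1 autocorrelation), the function names and the sample data are assumptions for illustration only; they are not taken from the Time Series project or the comp-engine site.

```python
import math
from statistics import mean, pstdev


def profile(series):
    """Summarise a time series as (mean, std dev, lag-1 autocorrelation)."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return (mu, 0.0, 0.0)  # constant series: autocorrelation undefined, use 0
    # Lag-1 autocorrelation: how strongly each value predicts the next one.
    cov = sum((a - mu) * (b - mu) for a, b in zip(series, series[1:]))
    return (mu, sigma, cov / (len(series) * sigma ** 2))


def nearest(name, collection):
    """Rank the other series in `collection` by profile distance to `name`."""
    target = profile(collection[name])
    distances = [
        (math.dist(target, profile(values)), other)
        for other, values in collection.items()
        if other != name
    ]
    return [other for _, other in sorted(distances)]


# Hypothetical sample data, not from the project.
collection = {
    "rising": [1, 2, 3, 4, 5, 6],
    "rising_too": [1, 2, 3, 4, 5, 7],
    "alternating": [10, -10, 10, -10, 10, -10],
}
ranking = nearest("rising", collection)  # most similar series first
```

The features here are not normalised, so a real profile would need scaling and far richer descriptors; the point is only that a shared profile lets one researcher's data be placed next to similar data from others.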

Page 22: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Online survey – where does active data live?

[Bar chart: use of different types of storage in % (axis 0 to 80). Categories: College computer; External/portable storage; Cloud storage; Personal computer; Departmental/group storage; College H drive; ICT central storage]

Page 23: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Online survey – growth of data volume

[Bar chart: research group data storage needs in %, now and in 2 years (axis 0 to 30). Size bands: < 10 GB; 10 GB – 100 GB; 100 GB – 1 TB; 1 TB – 10 TB; 10 TB – 100 TB; 100 TB – 1 PB; > 1 PB]

Page 24: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Findings (best practice)

• RDM principles are considered to be sound but not fully practised
• Sharing publicly-funded data accepted in principle but some question value and cost
• Concerns about (metadata) effort to make shared data discoverable
• Metadata schemas are not yet widely available across disciplines
• Auto-generate metadata where possible
• Consensus that RDM training for PhDs is vital (also to prevent data loss when they leave)

Page 25: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Findings (data)

• 60-100% of grant required to re-generate data used in publications
• % of data that needs retaining to support publications: ~60%
• Data storage capacity will have to grow significantly
• Concerns around back-up and archiving, esp. considering data volume
• Popularity of cloud services (as opposed to College storage)

=> Researchers want self-administered, secure, responsive solution for data sharing, storing and archiving; open APIs preferred

(“Yes [storage] is really important. Basically, whenever we have been out to talk to researchers, that's the thing they have latched on to and want to talk about the most.” 10.1371/journal.pone.0114734)

Page 26: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Conclusions / policy implementation principles

• Provide platform-independent, flexible data storage
• Embed RDM training into PhD progression
• Where available, use existing workflows:
  • Symplectic Elements: metadata management
  • Spiral (DSpace): public (metadata) catalogue
• Additional infrastructure:
  • use external resources
  • no long-term commitment
  • as flexible as possible
  • cost-effective

Page 27: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Result: Imperial College RDM Policy

“Imperial College London is committed to promoting the highest standards of academic research, including excellence in research data management. This includes a robust digital curation infrastructure that supports open data access and protects confidential data. The College acknowledges legal, ethical and commercial constraints on data sharing and the need to preserve the academic entitlement to publication.”

“Principal Investigators have overall responsibility for the effective management of research data generated within or obtained for their research, including by their research groups. The Library and ICT will provide training, guidance and services to support PIs.” http://imperial.ac.uk/research-data-management

Page 28: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Building a flexible RDM infrastructure

Page 29: Research Data, or: How I Learned to Stop Worrying and Love the Policy

[Flow diagram]

Research Project: creates data/software (Data: Box; Software: GitHub)

Project ends: data/software still needed?
• no: Delete
• yes: Can it be published or embargoed externally?
  • yes: External repository
  • no: Internal Storage

Metadata, manual or automatic: Elements

Can metadata be published?
• yes: Library reviews, then Spiral
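
One way to read the flow diagram on this slide is as a small decision procedure. The sketch below is an interpretation only: the function name, the boolean inputs and the ordering of the checks are assumptions reconstructed from the diagram labels, not a College system or API.

```python
def end_of_project_actions(still_needed, can_publish_externally, metadata_publishable):
    """Return the actions for a dataset or software output when a project ends."""
    if not still_needed:
        return ["delete"]

    actions = []
    # Publishable (or embargoable) outputs go to an external repository;
    # everything else stays on internal storage.
    if can_publish_externally:
        actions.append("deposit in external repository")
    else:
        actions.append("keep on internal storage")

    # Metadata is recorded either way, manually or automatically.
    actions.append("record metadata in Elements")

    # Only publishable metadata reaches the public catalogue, after review.
    if metadata_publishable:
        actions.append("Library reviews metadata")
        actions.append("publish metadata in Spiral")
    return actions
```

For example, `end_of_project_actions(True, False, True)` describes confidential data that stays internal while its metadata record is still reviewed and listed publicly.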

Page 30: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Summarising RDM in 6 steps

1. Make a data management plan: use DMPOnline

2. Store your data management plan centrally: use InfoEd

3. Store your live data securely and safely: use Box

4. Store your final data (and/or code) for 10+ years, making it publicly available: use Zenodo

5. Tell the College where your data (and/or code) is published or stored: use Symplectic

6. Reference your funding and your data in the publications it underpins: tell your publisher

Page 31: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Box – Data storage, sharing and syncing

Roll-out across College:
• unlimited data storage
• online access, easy sharing, data syncing
• file viewers included
• backup, data remains even when staff leave
• machine learning tools to describe data
• API

Page 32: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Infrastructure summary

• Flexible, can react to market / policy changes
• Components can be exchanged, no additional in-house infrastructure
• Make a start, collect data, learn – change as required
• Preservation infrastructure needs further work (discussions with Arkivum about ‘framework’ for costing into grants) – how much do we need to retain beyond published data?
• It isn’t perfect, but we can make a start

Page 33: Research Data, or: How I Learned to Stop Worrying and Love the Policy

“In, through … and beyond”

Page 34: Research Data, or: How I Learned to Stop Worrying and Love the Policy

RDM policy with research software requirements

“3.6.7 Cost Effectiveness – where computer-generated data may be reliably recreated at a cost less than that of storing raw output data, then the inputs and human-readable outputs of the relevant programme may be stored instead along with a reference to or copy of the software version used.”

“3.7 If software is developed as part of a research project, Principal Investigators must archive the particular version of the software used to generate or analyse the data in a repository and inform the Library of its location, taking account of the points raised in 3.5 above. Principal Investigators are encouraged to follow the Sustainability and Preservation Framework of the Software Sustainability Institute.”
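
Clause 3.6.7 amounts to a simple comparison, which can be sketched as follows. This is a minimal illustration: the function name, the `reliably_recreatable` flag and the assumption that both costs are expressed in the same currency over the same retention period are not part of the policy text.

```python
def what_to_store(recreation_cost, storage_cost, reliably_recreatable=True):
    """Apply the comparison described in clause 3.6.7 of the policy."""
    if reliably_recreatable and recreation_cost < storage_cost:
        # Cheaper to rerun: keep inputs, human-readable outputs and the
        # software version (a reference or a copy) instead of raw output.
        return ["inputs", "human-readable outputs", "software version"]
    # Otherwise the raw output data itself is retained.
    return ["raw output data"]
```

Note that the policy says the inputs "may" be stored instead, so the first branch is an option open to the researcher rather than an obligation.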

Page 35: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Treat software as valuable research output

PyRDM Green Shoots project

Zenodo integrates with GitHub

College survey on distributed version control

Software Sustainability Institute – I am a fellow

Page 36: Research Data, or: How I Learned to Stop Worrying and Love the Policy

ORCID – Open Researcher and Contributor ID

• Emerging global standard for identifying authors of academic outputs
• The College created ORCID iDs for academic staff in late 2014 (now 2,088 of 3,200 iDs claimed, ~1,500 linked in Elements)
• Imperial hosted launch of Jisc ORCID consortium with 50 UK universities in September 2015

http://www.imperial.ac.uk/orcid

Page 37: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Towards automating RDM reporting with ORCID

Author links ORCID with CRIS

…shares ORCID iD with repository

…publishes dataset

DataCite DOI linked to ORCID iD

CRIS pulls metadata from ORCID / DataCite / Repository

But: is the external metadata likely to be complete “enough”?
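
The question of whether external metadata is complete “enough” can be made concrete with a check like the one below. It is a sketch under assumptions: the required fields, the record layout, the placeholder DOI and ORCID iD, and the function name are all hypothetical, and a real CRIS would apply its own rules to records pulled from ORCID, DataCite or a repository.

```python
# Hypothetical minimum set of fields needed for institutional reporting.
REQUIRED_FIELDS = ("doi", "title", "creators", "publication_year", "funder")


def missing_fields(record, required=REQUIRED_FIELDS):
    """List required fields that are absent or empty in a metadata record."""
    return [field for field in required if not record.get(field)]


# Hypothetical record as it might arrive from an external source.
record = {
    "doi": "10.1234/example",
    "title": "Example dataset",
    "creators": [{"name": "A. Researcher", "orcid": "0000-0000-0000-0000"}],
    "publication_year": 2015,
    "funder": "",
}
gaps = missing_fields(record)  # fields a person would still need to fill in
```

A record with no gaps could flow into reporting automatically, while anything else would be routed to the author or the Library for completion.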

Page 38: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Useful infrastructure makes compliance a by-product

• One workflow for data generation, publishing, reporting and curation
• Link data generation directly to storage (log into facility, data “at your desk” before you are out of the “lab”)
• (HSS colleagues – “facility” can also be a book scanner)
• Automate reporting and generating / sharing of metadata

Facilities write (meta) data into Box

Data processed / analysed from Box

Machine-learning adds metadata

Publish to repository from Box, with reference

Metadata directly or indirectly (ORCID) to CRIS

Page 39: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Make data useful for us, not just for external re-use

Now that we get data, shouldn’t we analyse it?

Add value by:
• connecting researchers who have similar data interests
• connecting researchers to relevant data
• presenting data in a way that’s suitable for public reuse
• developing a data analytics and knowledge transfer service
• collecting impact information on data

Page 40: Research Data, or: How I Learned to Stop Worrying and Love the Policy

• Let’s make a start and learn from doing, from actual data
• Think about where we can coordinate (3rd party requirements)
• It is early stages, take a flexible approach
• Don’t wait for funders, interpret policies in a useful way and lead

=> If we lead instead of following there will be fewer unpleasant surprises to deal with!

Research Data, or: How I Learned to Stop Worrying and Love the Policy

Page 41: Research Data, or: How I Learned to Stop Worrying and Love the Policy

Image Credit (note NC licence!)

1. https://en.wikipedia.org/wiki/File:Dr._Strangelove_-_Group_Captain_Lionel_Mandrake.png public domain

2. https://it.wikipedia.org/wiki/Why_We_Fight#/media/File:Why_We_Fight_title.jpg public domain

3. https://commons.wikimedia.org/wiki/File:Hase_und_Igel_%281%29.jpg public domain

4. https://www.flickr.com/photos/jdhancock/4617759902/ C-3PO vs. Data (137/365), by JD Hancock, CC BY 2.0

5. https://en.wikipedia.org/wiki/One_Ring#/media/File:Unico_Anello.png public domain

6. https://www.flickr.com/photos/dinnerseries/14994148089/ OXO tools, by Didriks, CC BY 2.0

7. https://www.flickr.com/photos/albertovo5/3908190631/ How I Learned To Stop Worrying..., by hjhipster, CC BY NC 2.0