Jonathan Tedds, Distinguished Lecture at D-Lab, UC Berkeley, 12 Sep 2013: "The Open Research...
DESCRIPTION
http://dlab.berkeley.edu/event/open-research-challenge-peer-review-and-publication-research-data

A talk by Dr. Jonathan Tedds, Senior Research Fellow, D2K Data to Knowledge, Dept of Health Sciences, University of Leicester. PI: #BRISSKit www.brisskit.le.ac.uk PI: #PREPARDE www.le.ac.uk/projects/preparde

The Peer REview for Publication & Accreditation of Research Data in the Earth sciences (PREPARDE) project seeks to capture the processes and procedures required to publish a scientific dataset, ranging from ingestion into a data repository through to formal publication in a data journal. It also addresses key issues arising in the data publication paradigm: how does one peer-review a dataset, what criteria are needed for a repository to be considered objectively trustworthy, and how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community?

I will discuss this and alternative approaches to research data management and publishing through examples in astronomy, biomedical and interdisciplinary research, including the arts and humanities. Who can help in the long tail of research if established data centres, archives or adequate institutional support are lacking? How much can we transfer from the so-called "big data" sciences to other settings, and where does the institution fit in with all this? What about software? Publishing research data brings a wide and differing range of challenges for all involved, whatever the discipline. In PREPARDE we also considered the pre- and post-publication peer review paradigm, as implemented in the F1000Research publishing model for the life sciences. Finally, in an era of truly international research, how might we coordinate the many institutional, regional, national and international initiatives – has the time come for an international Research Data Alliance?

TRANSCRIPT
THE OPEN RESEARCH CHALLENGE: PEER REVIEW AND PUBLICATION OF RESEARCH DATA
Dr Jonathan Tedds [email protected] @jtedds
Senior Research Fellow,
D2K Data to Knowledge Research Group
(University of Leicester)
PI #PREPARDE http://www.le.ac.uk/projects/preparde
http://www.astrogrid.org (April 2008 1st public release)
Science as an Open Enterprise Report: Why open?
• As a first step towards this intelligent openness, data that underpin a journal article should be made concurrently available in an accessible database
• We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online, and for the two to be interoperable. [p.7]
• Royal Society, June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
• Issues linking data to the scientific record:
  – Data persistence
  – Data and metadata quality
  – Attribution and credit for data producers
• Geoffrey Boulton (Edinburgh), lead author:
  – "Science has been sleepwalking into a crisis of replicability... and of the credibility of science"
  – "Publishing articles without making the data available is scientific malpractice"
Data Reuse: asking new questions
Hubble Space Telescope
• Papers based upon reuse of archived observations now exceed those based on the use described in the original proposal.
  – http://archive.stsci.edu/hst/bibliography/pubstat.html
• See also work by Piwowar & Vision on the life sciences: "Data reuse and the open data citation advantage"
  – http://peerj.com/preprints/1/
Oh, and…. says so :P
We are committed to openness in scientific research data to speed up the progress of scientific discovery, create innovation, ensure that the results of scientific research are as widely available as practical, enable transparency in science and engage the public in the scientific process.
• To the greatest extent and with the fewest constraints possible, publicly funded scientific research data should be open, while at the same time respecting concerns in relation to privacy, safety, security and commercial interests, and acknowledging the legitimate concerns of private partners.
• Open scientific research data should be easily discoverable, accessible, assessable, intelligible, useable, and wherever possible interoperable to specific quality standards.
• To ensure successful adoption by scientific communities, open scientific research data principles will need to be underpinned by an appropriate policy environment, including recognition of researchers fulfilling these principles, and appropriate digital infrastructure.
Scale of the problem: who, what, when, where....?
http://blogs.scientificamerican.com/absolutely-maybe/2013/09/10/opening-a-can-of-data-sharing-worms/
• Timothy Vines and colleagues studied the availability of data sets in zoology and how it changes through time
  – gathered 516 papers published between 1991 and 2011
  – then they tried to track the data down...
• Even tracking down the authors was a challenge
  – Over time, a dwindling minority of papers were accompanied by author email addresses that still functioned
• Only 37% of the data sets – even from papers published in 2011 – were still findable and retrievable
  – the proportion dropped for each earlier year
• For papers published in 1991, only 7% of the data could be determined to truly still be in existence and retrievable – few authors could be found, and most of those reported that their data were lost or inaccessible
This isn’t new...
Henry Oldenburg
  – inveterate correspondent
  – whom we would now think of as a scientist
• Had the idea to publish Philosophical Transactions (1665):
  o Should be written in the vernacular, not Latin
  o Underlying evidence must be concurrently published
  o Helped propel Europe at the time
  o Concept of scientific self-correction
    • science able to right its errors
• Wrote: "thought fit to employ the [printing] press........Universal Good of Mankind"
  o How do we achieve these ends in the post-Gutenberg era?
Science ecosystems (Peter Fox, Rensselaer)
• These elements are what enable scientists to explore/confirm/deny their research ideas and collaborate!
• Abduction as well as induction and deduction
Accountability: Proof, Explanation, Justification, Verifiability
‘Transparency’ -> Translucency
Trust: Identity, Citability, Integrateability
Data as a “public good” (2011)
• Public good
• Preservation
• Discovery
• Confidentiality
• First use
• Recognition
• Public funding
http://osc.universityofcalifornia.edu/openaccesspolicy/
So what do we mean by publishing data?
• The familiar:
  – Supplementary tables via the journal, or
  – Archived raw or calibrated facility data
  – Discipline-specific and institutional/national archives
• Data under the graph?
  – In order to reproduce and adapt the article's analysis
• "Research ready" open data
  – In order to reuse and repurpose – for interdisciplinary researchers, community, business
  – Ideally peer reviewed?
Research data example - level 1:
• A typical example from the physical sciences (astronomy) distinguishes between broad categories within the research data spectrum:
• raw/initially auto-processed data produced at a research facility such as an observatory
  • typically made publicly available in this format after an embargo period of e.g. 1 year
  • in some cases available immediately – e.g. Swift Gamma Ray Burst satellite
• "research ready" processed data which has been fully calibrated, combined and cleaned/annotated
  • often produced by individuals or collaborations
  • rarely available to anyone outside the collaboration except upon request/collaboration
  • needed for re-analysis or reuse for science, unless you have detailed sub-domain-specific knowledge and detailed contextual information to reproduce from raw
  • considered to confer a competitive advantage on producers
  • may be produced by dedicated data scientists on behalf of the community for major surveys/missions, e.g. ESA XMM-Newton Survey Science Centre (Leicester), NASA, NCAR...
Research data example – level 2
output dataset – following detailed analysis of research ready datasets
• forms the "data under the graph" in a journal publication
• might be available as a table via the journal, CDS etc.
• may not be available outside the collaboration except upon request/collaboration
• may well generate future additional samples and papers for the owning collaboration on top of the original
• other researchers may request the data for their own research but may not get it!
Research data example – level 3
....and STOP!
• Next project – proposal long since written
  – Probably already underway...
• Feel free to email ME if you would like to work on an idea using this dataset or code
  – As long as I'm a co-author on the paper!
  – You have to go through me to find out what you really need to know to reuse the data/code
• published catalogue type representation of published output dataset
• NOT a “data paper”….but could be
• optional in many cases, mandatory for most major surveys
• usually made available via project specific online resource
• may be provided as table of parameters based on research ready dataset, usually linked from and associated with a journal
• specifically produced in order for the wider community to reuse (cite!) and repurpose if wanted
• The well-known Sloan Digital Sky Survey is a classic example or more recently the 2XMMi X-ray catalogue I have a close involvement with (largest X-ray survey of the sky).
Research data example – level 4a
http://adsabs.harvard.edu/abs/2013arXiv1302.5329E
e.g.
data paper describing and linking to output dataset(s)
Research data example – level 4b
Live Data paper! Dataset citation is first thing in the paper and is also included in reference list (to take advantage of citation count systems) DOI: 10.1002/gdj3.2
Data Publications
• Initiatives for open publication of data, datasets, data papers and open peer review
• RDMF8 ‘Engaging with the publishers’: http://www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf8-engaging-publishers
• Particularly Rebecca Lawrence on peer review policies
• Earth System Science Data: http://www.earth-system-science-data.net/
• Pensoft Data Publishing Policies and Guidelines for Biodiversity Data http://www.pensoft.net/news.php?n=59
Slide: Simon Hodson (Jisc / CODATA)
ODE Data Publication Pyramid (layers, top to bottom): Publications; Supplements; Data Archives; Data on Disks and in Drawers
(1) The top of the pyramid is stable but small
(2) Risk that supplements to articles turn into data dumping places
(3) Too many disciplines lack a community-endorsed data archive
(4) Estimates are that at least 75% of research data is never made openly available
From Mayernik et al. (in prep), PREPARDE project – Most cited Bulletin of the American Meteorological Society (BAMS) articles. Data from Web of Science, gathered on June 11, 2013.

Rank | Data paper? | Citations | Article details
  1  | Yes         | 10,113    | Kalnay, E; et al. The NCEP/NCAR 40-year reanalysis project, 1996.
  2  | No          |  3,201    | Torrence, C; Compo, GP. A practical guide to wavelet analysis, 1998.
  3  | No          |  2,367    | Mantua, NJ; et al. A Pacific interdecadal climate oscillation with impacts on salmon production, 1997.
  4  | Yes         |  1,987    | Kistler, R; et al. The NCEP-NCAR 50-year reanalysis: Monthly means CD-ROM and documentation, 2001.
  5  | Yes         |  1,791    | Xie, PP; Arkin, PA. Global precipitation: A 17-year monthly analysis based on gauge observations, satellite estimates, and numerical model outputs, 1997.
  6  | Yes         |  1,448    | Kanamitsu, M; et al. NCEP-DOE AMIP-II reanalysis (R-2), 2002.
  7  | No          |  1,014    | Baldocchi, D; et al. FLUXNET: A new tool to study the temporal and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux densities, 2001.
  8  | Yes         |    902    | Rossow, WB; Schiffer, RA. Advances in understanding clouds from ISCCP, 1999.
  9  | Yes         |    900    | Rossow, WB; Schiffer, RA. ISCCP cloud data products, 1991.
 10  | No          |    877    | Hess, M; Koepke, P; Schult, I. Optical properties of aerosols and clouds: The software package OPAC, 1998.
 11  | No          |    815    | Willmott, CJ. Some comments on the evaluation of model performance, 1982.
 12  | No          |    815    | Trenberth, KE. The definition of El Nino, 1997.
 13  | Yes         |    785    | Woodruff, SD; Slutz, RJ; et al. A comprehensive ocean-atmosphere data set, 1987.
 14  | Yes         |    776    | Meehl, GA; et al. The WCRP CMIP3 multimodel dataset – A new era in climate change research, 2007.
 15  | Yes         |    742    | Liebmann, B; Smith, CA. Description of a complete (interpolated) outgoing longwave radiation dataset, 1996.
 16  | Yes         |    734    | Huffman, GJ; et al. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset, 1997.
 17  | No          |    697    | Trenberth, KE. Recent observed interdecadal climate changes in the Northern Hemisphere, 1990.
 18  | No          |    672    | Gates, WL. AMIP: The Atmospheric Model Intercomparison Project, 1992.
 19  | No          |    656    | Stephens, GL; et al. The CloudSat mission and the A-Train – A new dimension of space-based observations of clouds and precipitation, 2002.
 20  | Yes         |    647    | Mesinger, F; et al. North American regional reanalysis, 2006.
It’s a long road....
What do researchers need to make this all possible?
– Incentives: citations, promotion, support
  • long way to go
– Institutional and funder policy framework
  • mostly there now?
– Appropriate discipline-specific community centres of expertise
  • rare: mostly limited to big science niches, or very broad but possibly poorly sustained
– Institutional support services for the basics
  • pilots to date
– Software tools that are open and can be adapted
  • on the way
PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences
Jonathan Tedds (Leicester), Sarah Callaghan (BADC), Fiona Murphy (Wiley), Rebecca Lawrence (F1000R), Geraldine Stoneham (MRC), Elizabeth Newbold (BL), Rachel Kotarski (BL), Matthew Mayernik (NCAR), John Kunze, Carly Strasser (CDL), Angus Whyte (DCC), Becca Wilson (Leicester), Simon Hodson (Jisc) and the #PREPARDE project team
+ Geraldine Clement Stoneham (MRC), Elizabeth Newbold and Rachel Kotarski (BL) on data peer review
http://www.le.ac.uk/projects/preparde
• Partnership formed between the Royal Meteorological Society and academic publisher Wiley-Blackwell to develop a mechanism for the formal publication of data in the open access Geoscience Data Journal
• GDJ publishes short data articles cross-linked to, and citing, datasets that have been deposited in approved data centres and awarded DOIs (or other permanent identifiers).
• A data article describes a dataset, giving details of its collection, processing, software, file formats, etc., without the requirement of novel analyses or ground-breaking conclusions.
  • the when, how and why the data was collected, and what the data product is.
http://www.geosciencedata.com/
PREPARDE key use case: Geoscience Data Journal, Wiley-Blackwell and the Royal Meteorological Society
• capture the processes and procedures required to publish a scientific dataset
– ingestion into a data repository
– formal publication in a data journal
• address key issues in data publication
– how to peer-review a dataset?
– what criteria are needed for a repository to be considered objectively trustworthy?
– how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community?
• The PREPARDE team includes key expertise in
  – research
  – academic publishing
  – data management
• Earth sciences focus, but producing general guidelines applicable to a wide range of scientific disciplines and data publication types, incl. life sciences (F1000R)
PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences http://www.le.ac.uk/projects/preparde
The traditional online journal model (any online journal system):
1) Author prepares the paper using word processing software with the journal template.
2) Author submits the paper as a PDF/Word file.
3) Reviewer reviews the PDF file against the journal's acceptance criteria.

Overlay journal model for publishing data (Geoscience Data Journal):
1) Author prepares the data paper using word processing software and the dataset using appropriate tools.
2a) Author submits the data paper to the journal.
2b) Author submits the dataset to a repository (e.g. BADC, BODC).
3) Reviewer reviews the data paper, and the dataset it points to, against the journal's acceptance criteria.
How to publish data in GDJ
Live Data paper! The dataset citation is the first thing in the paper and is also included in the reference list (to take advantage of citation count systems). DOI: 10.1002/gdj3.2
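The citation layout the slide describes – creator(s), year, title, publisher, identifier, placed first in the paper and repeated in the reference list – can be sketched as a small formatter. This is a hypothetical illustration following DataCite's recommended citation style; the example creators, title, publisher and DOI suffix are invented, not the actual gdj3.2 dataset record:

```python
# Sketch: assemble a dataset citation in the style DataCite recommends,
# "Creator (PublicationYear): Title. Publisher. Identifier".
# All field values below are hypothetical.

def dataset_citation(creators, year, title, publisher, doi):
    """Format a dataset citation suitable for a data paper's reference list."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. doi:{doi}"

citation = dataset_citation(
    creators=["Smith, A", "Jones, B"],
    year=2013,
    title="Example surface temperature dataset",
    publisher="British Atmospheric Data Centre",
    doi="10.5285/example-suffix",
)
print(citation)
```

Because the citation carries the DOI, citation-count systems that index reference lists can pick the dataset up just like an article reference.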
Data Centre
Trust: repository accreditation
• The link between data paper and dataset is crucial!
• How do data journal editors know a repository is trustworthy?
• How can repositories prove they're trustworthy?
• What makes a repository trustworthy?
  • Many things: mission, processes, expertise, workflows, history, systems, documentation, ...
  • Assessing trustworthiness requires assessing the entire repository workflow
• PREPARDE / IDCC13 workshop – report out soon!
• Peer review of data is implicitly peer review of the repository
And what does "trustworthy" mean, when you get right down to it?
DataCite Repository List
• a working document
• initiated via a collaboration between the British Library, BioMed Central and the Digital Curation Centre
• aims to capture the growing number of repositories for research data
• provided for information purposes only:
  • DataCite provides no endorsement of the quality or suitability of the repositories
• community participation in developing this resource is encouraged
http://www.datacite.org/repolist/
Dryad Data Repository
JDAP: Joint Data Archiving Policy
Joint Data Archiving Policy: http://datadryad.org/jdap
Joint declarations, Feb 2010, in American Naturalist, Evolution, the Journal of Evolutionary Biology, Molecular Ecology, Heredity, and other key journals in evolution and ecology: http://www.journals.uchicago.edu/doi/full/10.1086/650340
This journal requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, TreeBASE, Dryad, or the Knowledge Network for Biocomplexity.
Allows embargos of up to one year; allows exceptions for, e.g., sensitive information such as human subject data or the location of endangered species.
‘Data that have an established standard repository, such as DNA sequences, should continue to be archived in the appropriate repository, such as GenBank. For more idiosyncratic data, the data can be placed in a more flexible digital data library such as the National Science Foundation-sponsored Dryad archive at http://datadryad.org.'
Slide: Simon Hodson (Dryad / Jisc / CODATA)
PREPARDE and bi-directional data linking
• We already have a link from the GDJ data article to the data repository via DOI
• GDJ can also pull the standard metadata attached to that DOI from the DataCite metadata store
• We need to figure out a way for GDJ to inform the repository that its dataset has been cited/published!
• At this time, we have a manual work-around (i.e. email)
• Workshop on cross-linking between data centres and publishers, 30th April 2013 at RAL, UK
• Report out soon!
[Diagram: BADC and NCAR repositories and GDJ each exchange standardised metadata with the DataCite Metadata Store]
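The "pull the standard metadata attached to that DOI" step can be sketched against DataCite's public REST API. The endpoint shape (`https://api.datacite.org/dois/<doi>`) and the JSON:API record layout below are assumptions based on the current public API rather than anything in the talk, and the sample record is invented; the parsing is done offline here so no network access is needed:

```python
# Sketch: resolve a dataset DOI to its DataCite metadata record and pull
# out the fields a journal would display. Endpoint and JSON layout are
# assumptions modelled on DataCite's public REST API.

def datacite_api_url(doi: str) -> str:
    """Build the (assumed) DataCite REST API URL for a DOI."""
    return f"https://api.datacite.org/dois/{doi}"

def extract_fields(record: dict) -> dict:
    """Pick the display fields out of a DataCite JSON:API record."""
    attrs = record["data"]["attributes"]
    return {
        "doi": attrs["doi"],
        "title": attrs["titles"][0]["title"],
        "creators": [c["name"] for c in attrs["creators"]],
        "year": attrs["publicationYear"],
    }

# Offline example response, shaped like a DataCite record (values invented):
sample = {
    "data": {
        "attributes": {
            "doi": "10.5285/example",
            "creators": [{"name": "Smith, A"}],
            "titles": [{"title": "Example gridded dataset"}],
            "publicationYear": 2013,
        }
    }
}

print(datacite_api_url("10.5285/example"))
print(extract_fields(sample))
```

The reverse link – telling the repository its dataset has been cited – has no such API in this picture, which is why the slide falls back to email.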
Peer review of data: the Perfect Disaster?
• Support for the peer review process:
  – scholars contribute peer reviews with little formal reward
  – an opportunity to polish and refine understanding of the cutting edge of research
• But the peer review system is under stress:
  – exploding numbers of journals, conferences and grant applications
  – self-publication tools (blogs and wikis) allow scholars to disseminate their research results and products
    • faster and more directly
• Now we are adding research data into the publication and peer review queues...
Peer-review of data
• Technical
  – author guidelines for GDJ
  – funder Data Value Checklist
  – implicit peer review of the repository?
• Scientific
  – pre-publication?
  – post-publication? e.g. F1000R
  – guidelines on uncertainty, e.g. IPCC
  – discipline specific?
  – EU INSPIRE spatial formatting
• Societal
  – contribution to human knowledge
  – reliability
http://libguides.luc.edu/content.php?pid=5464&sid=164619
Open Peer Review of Data: ESSD peer review ensures that the datasets are:
• Plausible, with no immediately detectable problems;
• Of sufficiently high quality, with their limitations clearly stated;
• Well annotated by standard metadata and available from a certified data centre/repository;
• Customary with regard to their format(s) and/or access protocol, and expected to be usable for the foreseeable future;
• Openly accessible (toll free)
Earth System Science Data journal: http://www.earth-system-science-data.net/
Rebecca Lawrence, Data Publishing: peer review, shared standards and collaboration, http://www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf8-engaging-publishers
Faculty of 1000 Open Peer Review
Sanity check:
• Format and suitable basic structure adherence
• A standard basic protocol structure is adhered to
• Data stored in the most appropriate and stable location
Open peer review:
• Is the method used appropriate for the scientific question being asked?
• Has enough information been provided to be able to replicate the experiment?
• Have appropriate controls been conducted, and the data presented?
• Is the data in a usable format/structure?
• Are stated data limitations and possible sources of error appropriately described?
• Does the data 'look' OK? (optional; e.g. microarray data)
Draft Recommendations on Peer-review of data
• Summary Recommendations from Workshop at British Library, 11 March 2013
• Workshop attendees included funders, publishers, repository managers, researchers ….
• Draft recommendations put up for discussion and feedback captured
• Feedback from the community still welcome
• 2nd workshop 24 June: put recommendations to peer reviewers!
Document at: http://bit.ly/DataPRforComment Feedback to: https://www.jiscmail.ac.uk/DATA-PUBLICATION
Draft recommendations on data peer review – summary recommendations from the workshop at the British Library, 11 March 2013
• Connecting data review with data management planning
• Connecting scientific review, technical review and curation
• Connecting data review with article review
• 4-5 draft recommendations under each of the above
• Assist researchers, publishers, journal editors, reviewers, data centres and institutional repositories to map requirements for data peer review
• Matrix of stakeholders vs processes
  – Assists in assigning responsibilities for a given context
  – New for most disciplines
  – Learn from disciplines where this already happens
Connecting data review with data management planning
1. All research funders should at least require a "data sharing plan" as part of all funding proposals, and if a submitted data sharing plan is inadequate, appropriate amendments should be proposed.
2. Research organisations should manage research data according to recognised standards, providing relevant assurance to funders so that additional technical requirements do not need to be assessed as part of the funding application peer review. (Additional note: research organisations need to provide adequate technical capacity to support the management of the data that their researchers generate.)
3. Research organisations and funders should ensure that adequate funding is available within an award to encourage good data management practice.
4. Data sharing plans should indicate how the data can and will be shared, and publishers should refuse to publish papers which do not clearly indicate how the underlying data can be accessed, where appropriate.
Connecting scientific review, technical review and curation
1. Articles and their underlying data or metadata (by the same or other authors) should be multi-directionally linked, with appropriate management for data versioning.
2. Journal editors should check data repository ingest policies to avoid duplication of effort, but provide further technical review of important aspects of the data where needed. (Additional note: a map of ingest/curation policies of the different repositories should be generated.)
3. If there is a practical/technical issue with data access (e.g. files don't open or don't exist), then the journal should inform the repository of the issue. If there is a scientific issue with the data, then the journal should inform the author in the first instance; if the author does not respond adequately to serious issues, then the journal should inform the institution, which should take the appropriate action. Repositories should have a clear policy in place to deal with any feedback.
Connecting data review with article review
1. For all articles where the underlying data is being submitted, authors need to provide adequate methods and software/infrastructure information as part of their article. Publishers of these articles should have a clear data peer review process for authors and referees.
2. Publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers.
3. Authors should clearly state the location of the underlying data. Publishers should provide a list of known trusted repositories or, if necessary, provide advice to authors and reviewers on alternative suitable repositories for the storage of their data.
4. For data peer review, the authors (and journal) should ensure that the data underpinning the publication, and any tools required to view it, are fully accessible to the referee. The referees and the journal then need to ensure that appropriate access is in place following publication.
5. Repositories need to provide clear terms and conditions for access, and ensure that datasets have permanent and unique identifiers.
Publishing research data
• Research is heavily context specific => so keep it context specific?
• Publishers and professional & learned societies can help galvanise agreement among researchers
  • to define how they want their data represented and preserved for reuse & citation => your researchers need you
  • along with funder-appointed peer review committees – often a stronger connection than to the institution
• Institutional managers/services cannot cover the wide range of discipline-specific expertise
  • just as publishers don't cover all fields
• These connections stay relatively constant as a career progresses, even as researchers:
  • change host institution
  • change field(s)
  • change technique
• In the UK, RCUK funding for APCs is strictly for articles only....
  – How to fund APCs for depositing data in repositories?
  – Will publishers charge APCs to publish data papers?
2012-02-07, DCC roadshow East Midlands – CC-BY-SA
Research and the long tail (slide: Carole Goble)
• Head of the curve – public data sets, preserved: PDB, GenBank, UniProt, Pfam, CATH, SCOP (protein structure classification), ChemSpider; high-throughput experimental methods; industrial scale; commons-based production
• Long tail – spreadsheets and notebooks: local, lost; cherry-picked results
Enabling Open Data Publishing
• Active data management planning
  – built in at the proposal stage
    • local institutional tweaks of funder and local templates
  – implemented and evolved during the project
    • the Data Management Plan as a live, evolving object
    • annotate data on the fly – a lab notebook approach
  – curated & preserved using permanent identifiers
    • appropriate repository and data collection descriptors
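The "annotate data on the fly" idea can be sketched as a sidecar metadata file written next to each dataset file, so a checksum and a free-text note travel with the data into curation. This is a hypothetical illustration: the file names, record fields and choice of SHA-256 are all assumptions, not part of the talk:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Sketch: write a JSON "sidecar" record next to a dataset file, capturing
# a checksum and an annotation so basic provenance survives alongside the data.
def annotate(data_path: Path, note: str) -> Path:
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    record = {
        "file": data_path.name,
        "sha256": digest,
        "annotated": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    sidecar = data_path.with_suffix(data_path.suffix + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar

# Usage: annotate a toy data file and read the record back.
data = Path("readings.csv")
data.write_text("t,flux\n0,1.2\n")
meta = json.loads(annotate(data, "raw counts, uncalibrated").read_text())
print(meta["sha256"][:8], meta["note"])
```

The checksum also gives a cheap versioning signal: if the data file changes, re-annotating produces a different `sha256`, flagging that downstream analyses were built on an earlier version.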
Active data storage: identifying the Holy Grail?
• "what is needed is a tool to transparently sync local and network storage" (Marieke Guy, JISCMRD2 Nottingham)
• CKAN (Orbital, Lincoln)? Research Data Toolkit (Herts) – hybrid solutions; the UC3 suite at CDL...
• DropBox-like functionality is a must: usability; technical interoperability
• The aim is to help create databases for research data that
  • facilitate collaboration and data sharing
  • enable the subsequent publication of datasets
• The challenge is to ensure that the data are documented and preserved, and that the service is sustainable
http://halogen.le.ac.uk
Halogen as a template for research data management #jiscmrd – combining:
• Portable Antiquities Scheme (British Museum)
• Place-names (Nottingham)
• Surnames
• Genetics
• IT hosting and GIS
• Best practice: #JISCMRD, UKRDS, DCC, international
Requirements analysis – must be iterative!
• Data Management Plan – use DMPonline (UK Digital Curation Centre)
• Scalable research data management infrastructure
  • from pilot phase to nationally available resource
  • LAMP stack IT infrastructure to host the research database – work with JISC/DCC
• A model for the long-term delivery of a data management service within the institution, including support, maintenance, governance & charging policies
  • Include researchers, IT services, the research support office, library services etc.
BENEFITS
• New research opportunities
  • cross-database work – seed new research samples
  • scholarly communication / access to national resources: Key to English Place Names (Nottingham), Portable Antiquities Scheme (British Museum)
• Verification, repurposing and reuse of data
  • cleaning & enhancing private research datasets for reuse & correlation
  • no re-creation of data; increased transparency
  • excellent training for best practice in research data management
• Increasing research productivity
  • build cleaning, annotation and enhancement into normal research workflows
  • research datasets may immediately be reusable and interoperable
• Impact & knowledge transfer
  • reuse of IT infrastructure
  • increasing the skills base of researchers/students/staff
• Reward = Leverhulme Trust funding of £1.3m!
CHALLENGES
• An interdisciplinary research database must ingest each input dataset in a form such that sufficient information is carried forward to enable interoperation
• Cultural differences
• Versioning & provenance for input datasets
• Which software tools, infrastructure and query interface are suitable for multidisciplinary researchers?
• Requirements upon the institution for sustaining the research assets & skills
• Requirements upon the researchers: annotating, refreshing and maintenance of datasets
[Chart: researcher responses to contacts made – 37% responded, 63% no response]
Suggested timeline for implementing institutional research data management
From Whyte & Tedds (2011), DCC Briefing http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm
Challenge for institutions – rise to the scientific and research challenge
• Not just a management challenge: responsibility for the knowledge they create
– Library
  • "Doing the wrong things through the wrong people"?
  • Challenge for the library to enable:
    • curation of data and publications
    • active support from data scientists
    • a move from centralised to dispersed support
  • Expert centres such as D-Lab are essential!
– IT Service
  • Provide research data platforms for researchers:
    – active storage
    – enable collaboration
    – connect to preservation services through the Library
But that’s not all…
What about the software underpinning data driven research?
If we’re going to publish as open data:
How do we help researchers to store, annotate and discover the datasets they create?
How do you sustain and reuse that?
Biomedical Research Infrastructure Software Service Kit
A vision for cloud-based open source research applications
#BRISSKit
http://www.brisskit.le.ac.uk
BRISSKit context: the I4Health goal of applying knowledge engineering to close the 'ICT gap' between research and healthcare (Beck, T. et al. 2012)
www.brisskit.le.ac.uk Email: [email protected]
The semantic bridge
• OBiBa Onyx: records participant consent, questionnaire data and primary specimen IDs
• i2b2: cohort selection and data querying
• The bridge between them? A bio-ontology!
Research Software Sustainability
• open source community engagement
• standards compliance
• a consortium approach
• work with the grain of researchers
• discipline-specific forks?
• GitHub-style versioning as an example for research data?
• An OS Community Engagement Charter
  • defining engagement with existing & new open source communities
  • including adoption & code commitments
See the Rob Baxter blog post, "The research software engineer":
• http://dirkgorissen.com/2012/09/13/the-research-software-engineer/
Lessons for institutions? You can't do it all in house!
• But many disciplines don't have data centres
• Build a coalition of institutional actors
  • Essential to have high-level support
• Take and shape: identify what you do have in-house
  • Access external tools and standards where possible
  • Active storage, collaboration, eprints...
• Propose best of breed for (inter)national reuse
  • Share benefits (and costs) over academic networks
• Sustainability is the key challenge
  • As much cultural as technical – needs networks...
• But institutions alone aren't enough – we need an alliance!
Accepted Research Data Alliance Interest Group: Publishing Data
• http://rd-alliance.org/
• Close coordination with the ICSU-WDS working group, CODATA and other ongoing initiatives in data publication
  – WDS sits under the International Council for Science; RDA is wider
  – Avoid duplication within related RDA and WDS working groups – join up
  – For WDS, partnerships between publishers and data centres are key
• Scope the territory – gap analysis
• Use the RDA forum and the new http://jiscmail.ac.uk/data-publication list (350+ members)
• Take findings from the RDA / WDS group(s) and trial them in other communities / disciplines / institutional repositories
15-9-2013 launch meeting discussion
“Keep reaching for the stars”
• increase the trustworthiness and value of individual data sets
• strengthen the findings based on cited data sets
• increase the transparency and traceability of data and publications
• enable reuse and repurposing
i.e. Problems but extraordinary opportunities – all hands on deck!
Thank you for listening and thanks to CDL, D-Lab and the project partners
Dr Jonathan Tedds [email protected] @jtedds Senior Research Fellow,
D2K Data to Knowledge Research Group
(University of Leicester)
#PREPARDE http://www.le.ac.uk/projects/preparde
Mailing list: http://jiscmail.ac.uk/DATA-PUBLICATION