David Shotton - Research Integrity: Integrity of the published record

Image BioInformatics Research Group Department of Zoology University of Oxford, UK http:/ibrg.zoo.ox.ac.uk JISC Research Integrity Conference The Importance of Good Data Management Wellcome Collection Conference Centre 13 September 2011 Managing data, publishing data, describing data and citing data © David Shotton, 2011 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence e-mail: [email protected] David Shotton

Upload: jisc

Post on 27-Jan-2015

DESCRIPTION

David Shotton, Reader in Image Bioinformatics, University of Oxford

TRANSCRIPT

Page 1: David Shotton - Research Integrity: Integrity of the published record

Image BioInformatics Research Group

Department of Zoology

University of Oxford, UK

http://ibrg.zoo.ox.ac.uk

JISC Research Integrity Conference: The Importance of Good Data Management

Wellcome Collection Conference Centre, 13 September 2011

Managing data, publishing data, describing data and citing data

© David Shotton, 2011 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

e-mail: [email protected]

David Shotton

Page 2: David Shotton - Research Integrity: Integrity of the published record

Why don’t researchers publish data?

Three pressures presently prevent researchers from publishing their data

Information overload and pressure of work

With twenty new papers each week, a researcher can never catch up: there is just too much new scientific information being produced now

Have to run to stand still, with no time for ‘fringe’ activities like data curation

Departmental pressure for financial viability, determined by the REF

pressure to win grants and to publish in high impact journals

negligible incentives and academic reward, in terms of peer esteem, tenure or promotion, for data publication activities

Cognitive overhead and skill barriers to best-practice data management

metadata concepts foreign to most biomedical researchers

large amount of effort involved in preparing data for publication

[From evidence submitted 5 August 2011 to the Royal Society’s Science as a Public Enterprise policy study]

Page 3: David Shotton - Research Integrity: Integrity of the published record

1 Managing data

In the JISC ADMIRAL Project (A Data Management Infrastructure for Research Across the Life Sciences), we developed a two-tier data management system

Locally, researchers save files to a secure private DataStage file store

This is for their own benefit (file management, regular backup, controlled access, Web interface, etc.)

and does not pose a cognitive overhead – “sheer curation”

We then provide a Web interface that permits researchers to select and package datasets for publication and long-term repository archiving

Easy to do, when the researcher is ready, with minimal metadata

Finally, data can be published to the Oxford DataBank institutional repository

Run by the Bodleian Library, with a track record in preservation

Easy for researcher to update a revised dataset if required

Optional embargo period to permit prior journal article publication

Data packages assigned DOIs, making them citable (for academic credit)

Page 4: David Shotton - Research Integrity: Integrity of the published record

2 Publishing data on the cloud

In the new JISC UMF DataFlow Project, we are now adapting DataStage and DataBank for third-party research groups and institutions to deploy and use

to run on the Eduserv Academic Cloud (or on another cloud, or locally)

hardened as professional VMware virtual machine software appliances

installation designed to be easy and customizable (e.g. your name & logo)

enabling institutions to provide their members with zero cost data management solutions (apart from cloud hosting costs)

cloud provision can expand and shrink with requirements

no need to build and staff your own local data centre

We have just set up alpha versions on a local cloud in Oxford, for test use

We welcome interest from potential test users

(University of Leeds has just installed and is testing its own DataBank)

We will have beta versions by Christmas and production versions by Q2 2012

http://www.dataflow.ox.ac.uk/

Page 5: David Shotton - Research Integrity: Integrity of the published record

The JISC DRYAD-UK Project http://datadryad.org/

An alternative to an institutional repository is a subject-specific repository

Dryad is a repository for datasets underlying biomedical scientific research articles

Its initial focus was in evolution and ecology

Dryad was originally developed at the University of North Carolina, with funding from the National Science Foundation

The JISC Dryad-UK Project has been working over the past year

to mirror the Dryad data repository at the British Library

to add new journals, expanding into other areas of biology and medicine, particularly infectious disease (25 added to date, with more coming)

to plan and facilitate Dryad’s financial sustainability from journal fees

Once an article has been published, Dryad publishes the related datasets, using metadata provided by the journal publisher

Data packages are assigned DOIs and published with Creative Commons CC-Zero open data licenses, to enable free re-use of the datasets

Page 6: David Shotton - Research Integrity: Integrity of the published record

3 Describing data

At present, DataBank and Dryad impose minimal metadata requirements

The DataCite mandatory metadata properties required for DOI assignment:

Identifier (the DOI)

Creator (i.e. authors)

Title

Publisher (i.e. repository name, e.g. “Dryad Data Repository”)

Publication Year
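
As a minimal sketch of what these five mandatory properties amount to in practice, the record below models them as a Python dataclass and reports any that are empty. The class name, field names and the `missing_properties` helper are illustrative assumptions, not part of DataCite, DataBank or Dryad software; the example values are taken from the Dryad dataset cited later in this talk.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class DataCiteRecord:
    """Hypothetical container for the five mandatory properties listed above."""
    identifier: str        # the DOI
    creator: str           # i.e. authors
    title: str
    publisher: str         # i.e. repository name
    publication_year: int


def missing_properties(record: DataCiteRecord) -> list[str]:
    """Return the names of mandatory properties that are empty or zero."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DataCiteRecord(
    identifier="10.5061/dryad.8684",
    creator="Vijendravarma RK, Narasimha S, Kawecki TJ",
    title="Data from: Plastic and evolutionary responses of cell size "
          "and number to larval malnutrition in Drosophila melanogaster",
    publisher="Dryad Digital Repository",
    publication_year=2011,
)
print(missing_properties(record))  # an empty list means nothing is missing
```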

As part of the JISC Dryad-UK Project, we set out to investigate whether we could enable the creation of richer metadata without too much effort, providing better data descriptions that would assist discovery and reuse

We also wanted to enable the publication of such DataBank and Dryad metadata as Open Linked Data, encoded in RDF, the machine-readable data description language used on the Web

The particular focus for our enhanced metadata is infectious disease data
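
To make the idea of publishing such metadata as RDF more concrete, here is a rough sketch that serialises a few descriptive fields as N-Triples using only the Python standard library. The choice of the Dublin Core Elements vocabulary, the use of the DOI resolver URL as the subject, and the helper names are all assumptions made for illustration; the project's actual ontology mappings are not shown in this talk.

```python
# Dublin Core Elements 1.1 namespace (chosen here for illustration only)
DC = "http://purl.org/dc/elements/1.1/"


def literal(value) -> str:
    """Quote a value as a plain N-Triples literal.

    Simplified: escapes backslashes and double quotes only.
    """
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(doi: str, metadata) -> str:
    """Serialise (term, value) pairs as N-Triples about one dataset."""
    subject = f"<http://dx.doi.org/{doi}>"
    return "\n".join(
        f"{subject} <{DC}{term}> {literal(value)} ."
        for term, value in metadata
    )


metadata = [
    ("identifier", "doi:10.5061/dryad.8684"),
    ("creator", "Vijendravarma RK"),
    ("creator", "Narasimha S"),
    ("creator", "Kawecki TJ"),
    ("publisher", "Dryad Digital Repository"),
    ("date", 2011),
]
print(to_ntriples("10.5061/dryad.8684", metadata))
```

Each output line is one subject, predicate, object statement, which is what lets a Linked Data client merge descriptions from DataBank, Dryad and other sources without a shared database schema.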

Page 7: David Shotton - Research Integrity: Integrity of the published record

Enhancing metadata – the Reis et al. (2008) exemplar

http://dx.doi.org/10.1371/journal.pntd.0000228.x001

Page 8: David Shotton - Research Integrity: Integrity of the published record

Factual metadata in the Study Summary

Page 9: David Shotton - Research Integrity: Integrity of the published record

Rhetorical metadata in the Study Summary

The problem with this summary is that

it is hand-crafted by a single individual

it is not backed by any recognised metadata standard

it is only human-readable, lacking an ontology-based machine-readable RDF representation

Page 10: David Shotton - Research Integrity: Integrity of the published record

MIIDI http://www.miidi.org/

MIIDI is a Minimal Information standard for an Infectious Disease Investigation

An international MIIDI workshop in September 2009 led to an initial draft

In January 2011, Tanya Gray started work with me to develop MIIDI properly

She has now developed MIIDI into a validated XML data model, and has created MIIDI Forms that permit easy metadata entry conforming to the MIIDI standard

To permit encoding of MIIDI metadata terms in RDF, we have mapped them to appropriate ontologies, including IDO, the Infectious Disease Ontology

The MIIDI standard can be used to create rich metadata both for journal articles and also for data sets, such as those held in DataBank or Dryad repositories

The methodology is generic, and we hope to see it adopted for use in combination with other metadata standards, e.g. those under the umbrella of MIBBI - Minimum Information for Biological and Biomedical Investigations

Page 11: David Shotton - Research Integrity: Integrity of the published record

The MIIDI XML data model

Page 12: David Shotton - Research Integrity: Integrity of the published record

‘Disease’ section of the MIIDI Report for Reis et al. 2008

Page 13: David Shotton - Research Integrity: Integrity of the published record

4 Citing data

At present, published datasets are poorly cited in the scientific literature

A survey of PLoS journal articles related to Dryad datasets showed that

most papers lacked any reference to Dryad, and

the others only have unstructured citations within the body text, e.g.

“A selection of the 30,000 structures is represented in Fig. 1 and a repository, with their all-atom configuration, is available at http://dx.doi.org/10.5061/dryad.1922.”

“Raw microsatellite data generated in this study have been deposited in the Dryad database (http://www.datadryad.org) under accession number 1540.”

“Initiatives such as Dryad (http://datadryad.org/repo) (where the data in this study are published) should mean that literature data become easier to gather and maintain in the future.”

None of the papers had a proper data reference in the reference list
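
The difficulty with unstructured in-text mentions can be illustrated with a short sketch that tries to pull Dryad DOIs out of body text. The regular expression is an assumption based only on the DOI form seen in the quotations above (`10.5061/dryad.` followed by letters or digits), and the function name is hypothetical.

```python
from __future__ import annotations

import re

# Assumed pattern, based on the DOI form quoted on this slide
DRYAD_DOI = re.compile(r"10\.5061/dryad\.[A-Za-z0-9]+")


def find_dryad_dois(text: str) -> list[str]:
    """Return every Dryad-style DOI mentioned in a passage of text."""
    return DRYAD_DOI.findall(text)


quotes = [
    "A selection of the 30,000 structures is represented in Fig. 1 and a "
    "repository, with their all-atom configuration, is available at "
    "http://dx.doi.org/10.5061/dryad.1922.",
    "Raw microsatellite data generated in this study have been deposited in "
    "the Dryad database (http://www.datadryad.org) under accession number 1540.",
]
for quote in quotes:
    print(find_dryad_dois(quote))
```

The first quotation yields a DOI, but the second, which gives only the repository home page and an accession number, yields nothing: a machine cannot reliably link that paper to its dataset.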

Page 14: David Shotton - Research Integrity: Integrity of the published record

Best practice for the citation of Dryad datasets

I have proposed best practice for citing datasets, available in a discussion paper at http://bit.ly/lt7VsM, recommending:

1. That the citation style for referencing on-line data should be as similar as possible to that used for referencing scholarly articles

Creator (PublicationYear) Title. Publisher. Identifier.

2. That the preferred data identifier to be used is a Digital Object Identifier or, if that is not available, the unique accession number or identifier used by the data repository or database in which the data resides

3. That this reference be included in the paper’s reference list

4. That this data reference in the reference list should be denoted by an appropriate in-text citation, including an in-text reference pointer
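
The recommended pattern, Creator (PublicationYear) Title. Publisher. Identifier., can be sketched as a small formatting function. The function and parameter names are illustrative assumptions; only the ordering and punctuation come from the recommendation above.

```python
def format_data_citation(creator: str, year: int, title: str,
                         publisher: str, identifier: str) -> str:
    """Build a data reference in the form
    Creator (PublicationYear) Title. Publisher. Identifier.
    """
    # Drop a trailing full stop from the title to avoid a doubled one
    title = title.rstrip(".")
    return f"{creator} ({year}) {title}. {publisher}. {identifier}."


reference = format_data_citation(
    creator="Vijendravarma RK, Narasimha S, Kawecki TJ",
    year=2011,
    title="Data from: Plastic and evolutionary responses of cell size and "
          "number to larval malnutrition in Drosophila melanogaster",
    publisher="Dryad Digital Repository",
    identifier="doi:10.5061/dryad.8684",
)
print(reference)
```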

Page 15: David Shotton - Research Integrity: Integrity of the published record

Example of best practice for the citation of a Dryad dataset

Example in-text citation and in-text reference pointer:

"The raw data underpinning this analysis are deposited in the Dryad Data Repository at http://dx.doi.org/10.5061/dryad.8684 (Vijendravarma et al., 2011)."

Example data reference in the article’s reference list:

Vijendravarma RK, Narasimha S, Kawecki TJ (2011). Data from: Plastic and evolutionary responses of cell size and number to larval malnutrition in Drosophila melanogaster. Dryad Digital Repository. doi:10.5061/dryad.8684.

These recommendations have been adopted in the Data Publishing Policies and Guidelines for Biodiversity Data of the publisher Pensoft, available at

http://www.pensoft.net/J_FILES/Pensoft_Data_Publishing_Policies_and_Guidelines.pdf

Page 16: David Shotton - Research Integrity: Integrity of the published record

The JISC Open Citations Project

- publishing bibliographic and data citations as Linked Open Data

The problem

Citation data are hard to find, locked in the reference lists of copyright articles

Scope, vision and aim of the Open Citations Project

The Open Citations Project is global in scope, designed to change the face of scientific publishing and scholarly communication

Its vision is to publish citation data openly as Linked Open Data

It aims to make citation links as easy to traverse as Web links

Potential benefits of Open Citations

Cited works are more easily discovered

Citation networks can be explored to study the growth of knowledge

The most cited papers, i.e. nodes with high degree (Barabási), are clearly exposed

Distortions in knowledge caused by mis-citation can be identified
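
Once citations are available as open data, finding the most cited papers reduces to counting incoming links in the citation network. The sketch below does this for a toy edge list; the paper labels are made up, and the function name is an assumption rather than anything provided by the Open Citations Project.

```python
from collections import Counter


def most_cited(citations, n=3):
    """Rank cited works by in-degree.

    citations: iterable of (citing, cited) pairs.
    Returns the n most cited works with their citation counts.
    """
    in_degree = Counter(cited for _citing, cited in citations)
    return in_degree.most_common(n)


# Toy citation network with hypothetical paper labels
edges = [
    ("A", "C"), ("B", "C"), ("D", "C"),
    ("A", "B"), ("D", "B"),
    ("C", "E"),
]
print(most_cited(edges, 2))  # [('C', 3), ('B', 2)]
```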

Page 17: David Shotton - Research Integrity: Integrity of the published record

The Open Citations Corpus

The reference lists extracted from all 204,637 articles in the Open Access Subset of PMC (as of 24 January 2011), each encoded as a Named Graph

These reference lists contain 6,325,178 individual references, some unique, but many from different citing articles to the same highly cited papers

These refer to 3,373,961 unique papers outside the Open Access Subset

~ 20% of all PubMed Central papers published between 1950 and 2010

includes ALL the highly cited papers in every biomedical field

Data freely available under a CC0 waiver from http://opencitations.net/data/

We would now like to expand the corpus to include data citations, e.g.

references to journal articles from Dryad data packages

the inferable reciprocal references from these articles to Dryad
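
One way to picture the inferable reciprocal references is as pairs of statements: each data package points to its article, and from that link an article-to-data citation can be inferred. The sketch below is an assumption about how this might be expressed with properties from the CiTO ontology (`cito:cites` and `cito:citesAsDataSource`); it is not the project's implementation, and the article URI is a hypothetical placeholder.

```python
CITO = "http://purl.org/spar/cito/"


def reciprocal_triples(package_to_article):
    """For each (data package, article) link, emit the stated reference
    and the inferred reverse citation as (subject, predicate, object)."""
    triples = []
    for package, article in package_to_article.items():
        # Stated: the data package refers to the article it underlies
        triples.append((package, CITO + "cites", article))
        # Inferred: the article uses the package as its data source
        triples.append((article, CITO + "citesAsDataSource", package))
    return triples


links = {
    "http://dx.doi.org/10.5061/dryad.8684":
        "http://example.org/article/hypothetical-1",  # placeholder URI
}
for s, p, o in reciprocal_triples(links):
    print(f"<{s}> <{p}> <{o}> .")
```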

Page 18: David Shotton - Research Integrity: Integrity of the published record

Viewing citation networks at http://opencitations.net

Page 19: David Shotton - Research Integrity: Integrity of the published record

Using the citation data - Open Research Reports

Top Papers for Open Research Reports

Columns: disease name; number of papers cited; PubMed IDs of 20 most highly cited papers, with number of times cited in parentheses (ranks 1 to 4 shown)

Disease name                    Papers cited   1                2                3                4
Cholera                         1,993          10952301 (47)    15242645 (44)    2836362 (25)     16432199 (24)
Dengue fever                    3,858          17510324 (44)    9665979 (42)     1372617 (34)     15577938 (32)
HIV/AIDS                        54,432         9516219 (122)    12167863 (101)   9539414 (86)     12742798 (83)
Leprosy                         1,147          11234002 (70)    17604718 (18)    15894530 (13)    12901893 (12)
Leptospirosis                   940            11292640 (47)    14652202 (37)    12712204 (27)    15028702 (26)
Malaria                         25,290         12368864 (230)   12364791 (146)   781840 (134)     12893887 (101)
Measles                         1,719          11742391 (22)    16262740 (19)    15798843 (18)    8974392 (13)
Pneumonia                       6,901          8995086 (60)     15699079 (53)    11463916 (49)    10524952 (47)
Schistosomiasis                 3,036          15866310 (49)    12973350 (46)    16790382 (43)    4675644 (40)
Trypanosomiasis                 5,864          16020726 (108)   16020725 (75)    10215027 (57)    43092 (35)
Tuberculosis                    16,091         9634230 (117)    9157152 (83)     12742798 (83)    8381814 (80)
Amyotrophic lateral sclerosis   2,380          8446170 (46)     17023659 (32)    11386269 (22)    15217349 (22)
Spinal muscular atrophy         555            7813012 (28)     10339583 (20)    11925564 (20)    9074884 (15)

Total excluding ALS and SMA     121,271
Total                           124,206
Average                         9,554
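
The summary rows can be reproduced directly from the per-disease counts. The short check below recomputes them from the "papers cited" column; the variable names are illustrative, and the figures are copied from the table.

```python
# Number of papers cited per disease, copied from the table
papers_cited = {
    "Cholera": 1993,
    "Dengue fever": 3858,
    "HIV/AIDS": 54432,
    "Leprosy": 1147,
    "Leptospirosis": 940,
    "Malaria": 25290,
    "Measles": 1719,
    "Pneumonia": 6901,
    "Schistosomiasis": 3036,
    "Trypanosomiasis": 5864,
    "Tuberculosis": 16091,
    "Amyotrophic lateral sclerosis": 2380,
    "Spinal muscular atrophy": 555,
}
non_infectious = {"Amyotrophic lateral sclerosis", "Spinal muscular atrophy"}

total = sum(papers_cited.values())
total_excluding = sum(
    count for disease, count in papers_cited.items()
    if disease not in non_infectious
)
average = round(total / len(papers_cited))

print(total_excluding, total, average)  # 121271 124206 9554
```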

Page 20: David Shotton - Research Integrity: Integrity of the published record

end

. . . with thanks to the JISC for funding over recent years

and acknowledgement of the excellent work of my colleagues who have contributed to the following JISC projects:

ADMIRAL / DataFlow: Graham Klyne, Diana Galletly, Bhavana Ananda, Anusha Ranganathan, Sally Rumsey, Neil Jeffreys (Bodleian Library)

Open Citations: Ben O’Steen and Alex Dutton

Dryad-UK: Tanya Gray (MIIDI), Silvio Peroni (SPAR ontologies), Brian Hole (British Library)

e-mail: [email protected]

Page 21: David Shotton - Research Integrity: Integrity of the published record

Why publish research datasets in central repositories?

It is widely recognised that the research results from publicly funded research projects should be made publicly available

Publishing research data should simply be seen as an extension of the publication process for research papers

Centralized subject-specific repositories like Dryad, with streamlined curation processes, are highly cost-effective, in comparison with each journal taking on a massive expansion of its own Supplementary Materials capabilities

In addition, the data will be less fragmented, openly accessible, easily searched and interoperable

Imagine what a mess we would be in now if all our DNA sequence data were scattered among the Supplementary Materials holdings of different journals!

Page 22: David Shotton - Research Integrity: Integrity of the published record

How should such data publishing be funded?

Research funding agencies should pay for startup and R&D costs

The primary beneficiaries (i.e. the scientific community) should sustain the ongoing operating costs of preserving their own research data

using the same economic model (author charges, society funds, subscriptions, etc.) that funds the associated journal articles

Centralized research data repositories benefit from the same economies of scale that publishers enjoy in operating multiple journals

The average total publishing and distribution costs per article amount to about £4,000

(RIN 2008 report: Activities, Costs and Funding Flows in the Scholarly Communications System)

The Dryad business model is for each participating journal to pay a fee of ~$50 per paper, ~1% of the total cost of publishing the article
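
The "~1%" figure can be checked with a line of arithmetic, bearing in mind that the fee is quoted in dollars and the article cost in pounds. The exchange rate used below is an assumption chosen only for illustration (roughly the level around the time of the talk), not a figure from the slides; the function name is likewise hypothetical.

```python
def fee_share(fee_usd: float, article_cost_gbp: float,
              usd_per_gbp: float) -> float:
    """Return the repository fee as a fraction of the article cost."""
    fee_gbp = fee_usd / usd_per_gbp
    return fee_gbp / article_cost_gbp


# ASSUMPTION: illustrative exchange rate, not taken from the talk
ASSUMED_USD_PER_GBP = 1.6

share = fee_share(fee_usd=50, article_cost_gbp=4000,
                  usd_per_gbp=ASSUMED_USD_PER_GBP)
print(f"{share:.2%}")  # 0.78%
```

Under that assumed rate the fee is a little under 1% of the per-article cost, consistent with the approximate figure quoted.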

Page 23: David Shotton - Research Integrity: Integrity of the published record

Conversion of hypothesis to ‘fact’ by citation alone

Citation:

Steven Greenberg (2009).

How citation distortions create unfounded authority: analysis of a citation network.

British Medical Journal 339: b2680.

Page 24: David Shotton - Research Integrity: Integrity of the published record

Clustering of CiTO relationships by similarity

Positive

Agrees with, Confirms

Credits, Supports

Neutral

Cites, Cites as related

Discusses, Reviews

Extends

Negative

Corrects, Qualifies

Disagrees with, Disputes, Refutes

Critiques, Parodies, Ridicules

Cites as authority, Cites as evidence

Obtains background from, Obtains support from

Contains assertion from

Uses data from, Uses method in

Cites as data source, Cites for information

Documents, Updates

Includes excerpt from, Includes quotation from

Plagiarizes

Cites as metadata document, Cites as source document

Shares authors with

Rhetorical

Factual

Page 25: David Shotton - Research Integrity: Integrity of the published record

http://purl.org/spar/

SPAR – Semantic Publishing and Referencing Ontologies