Jonathan Tedds, Distinguished Lecture at D-Lab, UC Berkeley, 12 Sep 2013: "The Open Research...
DESCRIPTION
http://dlab.berkeley.edu/event/open-research-challenge-peer-review-and-publication-research-data

A talk by Dr. Jonathan Tedds, Senior Research Fellow, D2K Data to Knowledge, Dept of Health Sciences, University of Leicester. PI: #BRISSKit www.brisskit.le.ac.uk PI: #PREPARDE www.le.ac.uk/projects/preparde

The Peer REview for Publication & Accreditation of Research Data in the Earth sciences (PREPARDE) project seeks to capture the processes and procedures required to publish a scientific dataset, ranging from ingestion into a data repository through to formal publication in a data journal. It also addresses key issues arising in the data publication paradigm: how does one peer-review a dataset, what criteria are needed for a repository to be considered objectively trustworthy, and how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community?

I will discuss this and alternative approaches to research data management and publishing through examples in astronomy, biomedical and interdisciplinary research, including the arts and humanities. Who can help in the long tail of research if established data centres, archives or adequate institutional support are lacking? How much can we transfer from the so-called "big data" sciences to other settings, and where does the institution fit in with all this? What about software? Publishing research data brings a wide and differing range of challenges for all involved, whatever the discipline. In PREPARDE we also considered the pre- and post-publication peer review paradigm, as implemented in the F1000Research publishing model for the life sciences. Finally, in an era of truly international research, how might we coordinate the many institutional, regional, national and international initiatives – has the time come for an international Research Data Alliance?

TRANSCRIPT
THE OPEN RESEARCH CHALLENGE: PEER REVIEW AND PUBLICATION OF RESEARCH DATA
Dr Jonathan Tedds [email protected] @jtedds
Senior Research Fellow,
D2K Data to Knowledge Research Group
(University of Leicester)
PI #PREPARDE http://www.le.ac.uk/projects/preparde
http://www.astrogrid.org (April 2008 1st public release)
Science as an Open Enterprise Report: Why open?
• As a first step towards this intelligent openness, data that underpin a journal article should be made concurrently available in an accessible database
• We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online, and for the two to be interoperable. [p.7]
• Royal Society, June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
• Issues linking data to the scientific record:
  – Data persistence
  – Data and metadata quality
  – Attribution and credit for data producers
• Geoffrey Boulton (Edinburgh), lead author:
  – "Science has been sleepwalking into a crisis of replicability... and of the credibility of science"
  – "Publishing articles without making the data available is scientific malpractice"
Data Reuse: asking new questions
Hubble Space Telescope
• Papers based upon reuse of archived observations now exceed those based on the use described in the original proposal.
  – http://archive.stsci.edu/hst/bibliography/pubstat.html
• See also work by Piwowar & Vision on the life sciences: "Data reuse and the open data citation advantage"
  – http://peerj.com/preprints/1/
Oh, and…. says so :P
We are committed to openness in scientific research data to speed up the progress of scientific discovery, create innovation, ensure that the results of scientific research are as widely available as practical, enable transparency in science and engage the public in the scientific process.
• To the greatest extent and with the fewest constraints possible, publicly funded scientific research data should be open, while at the same time respecting concerns in relation to privacy, safety, security and commercial interests, and acknowledging the legitimate concerns of private partners.
• Open scientific research data should be easily discoverable, accessible, assessable, intelligible, useable, and wherever possible interoperable to specific quality standards.
• To ensure successful adoption by scientific communities, open scientific research data principles will need to be underpinned by an appropriate policy environment, including recognition of researchers fulfilling these principles, and appropriate digital infrastructure.
Scale of the problem: who, what, when, where....?
http://blogs.scientificamerican.com/absolutely-maybe/2013/09/10/opening-a-can-of-data-sharing-worms/
• Timothy Vines and colleagues studied the availability of data sets in zoology and how it changes through time
  – gathered 516 papers published between 1991 and 2011
  – then they tried to track the data down...
• Even tracking down the authors was a challenge
  – Over time, a dwindling minority of papers were accompanied by author email addresses that still functioned
• Only 37% of the data sets – even from papers published in 2011 – were still findable and retrievable
  – the proportion dropped for each earlier year
• For papers published in 1991, only 7% of the data could be determined to truly still be in existence and retrievable – few authors could be found, and most of those reported that their data were lost or inaccessible
This isn’t new...
Henry Oldenburg
  – inveterate correspondent
  – whom we would now think of as a scientist
• Had the idea to publish Philosophical Transactions (1665):
  o Should be written in the vernacular, not Latin
  o Underlying evidence must be concurrently published
  o Helped propel Europe at the time
  o Concept of scientific self-correction
    • science able to right its errors
• Wrote: "thought fit to employ the [printing] press........Universal Good of Mankind"
  o How do we achieve these ends in the post-Gutenberg era?
Science ecosystems (Peter Fox, Rensselaer)
• These elements are what enable scientists to explore/confirm/deny their research ideas and collaborate!
• Abduction as well as induction and deduction
Accountability: Proof, Explanation, Justification, Verifiability
‘Transparency’ -> Translucency
Trust: Identity, Citability, Integrateability
Data as a “public good” (2011)
• Public good
• Preservation
• Discovery
• Confidentiality
• First use
• Recognition
• Public funding
http://osc.universityofcalifornia.edu/openaccesspolicy/
So what do we mean by publishing data?
• The familiar:
  – Supplementary tables via the journal, or
  – Archived raw or calibrated facility data
  – Discipline-specific and institutional/national archives
• Data under the graph?
  – In order to reproduce and adapt the article's analysis
• "Research ready" open data
  – In order to reuse and repurpose – for interdisciplinary researchers, community, business
  – Ideally peer reviewed?
Research data example - level 1:
• A typical example from the physical sciences (astronomy) distinguishes between broad categories within the research data spectrum:
• raw/initially auto-processed data produced at a research facility such as an observatory
  • typically made publicly available in this format after an embargo period of e.g. 1 year
  • in some cases available immediately – e.g. Swift Gamma Ray Burst satellite
• "research ready" processed data which has been fully calibrated, combined and cleaned/annotated
  • often produced by individuals or collaborations
  • rarely available to anyone outside the collaboration except upon request/collaboration
  • needed for re-analysis or reuse for science, unless you have detailed sub-domain-specific knowledge and detailed contextual information to reproduce from raw
  • considered to confer a competitive advantage on producers
  • may be produced by dedicated data scientists on behalf of the community for major surveys/missions, e.g. ESA XMM-Newton Survey Science Centre (Leicester), NASA, NCAR...
Research data example – level 2
output dataset – following detailed analysis of research ready datasets
• forms the "data under the graph" in a journal publication
• might be available as a table via the journal, CDS etc.
• may not be available outside the collaboration except upon request/collaboration
• may well generate future additional samples and papers for the owning collaboration on top of the original
• other researchers may request the data for their own research but may not get it!
Research data example – level 3
....and STOP!
• Next project – proposal long since written
  – Probably already underway...
• Feel free to email ME if you would like to work on an idea using this dataset or code
  – As long as I'm a co-author on the paper!
  – You have to go through me to find out what you really need to know to reuse the data/code
• published catalogue type representation of published output dataset
• NOT a “data paper”….but could be
• optional in many cases, mandatory for most major surveys
• usually made available via project specific online resource
• may be provided as table of parameters based on research ready dataset, usually linked from and associated with a journal
• specifically produced in order for the wider community to reuse (cite!) and repurpose if wanted
• The well-known Sloan Digital Sky Survey is a classic example or more recently the 2XMMi X-ray catalogue I have a close involvement with (largest X-ray survey of the sky).
Research data example – level 4a
http://adsabs.harvard.edu/abs/2013arXiv1302.5329E
e.g.
data paper describing and linking to output dataset(s)
Research data example – level 4b
Live Data paper! Dataset citation is first thing in the paper and is also included in reference list (to take advantage of citation count systems) DOI: 10.1002/gdj3.2
Data Publications
• Initiatives for open publication of data, datasets, data papers and open peer review
• RDMF8 ‘Engaging with the publishers’: http://www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf8-engaging-publishers
• Particularly Rebecca Lawrence on peer review policies
• Earth System Science Data: http://www.earth-system-science-data.net/
• Pensoft Data Publishing Policies and Guidelines for Biodiversity Data http://www.pensoft.net/news.php?n=59
Slide: Simon Hodson (Jisc / CODATA)
ODE Data Publication Pyramid (layers, top to bottom): Publications; Supplements; Data Archives; Data on Disks and in Drawers
(1) The top of the pyramid is stable but small
(2) Risk that supplements to articles turn into data dumping places
(3) Too many disciplines lack a community-endorsed data archive
(4) Estimates are that at least 75% of research data is never made openly available
From Mayernik et al. (in prep), PREPARDE project – Most cited Bulletin of the American Meteorological Society (BAMS) articles. Data from Web of Science, gathered on June 11, 2013.

Rank | Data paper? | Citations | Article details
  1  | Yes         | 10,113    | Kalnay, E; et al. The NCEP/NCAR 40-year reanalysis project, 1996.
  2  | No          |  3,201    | Torrence, C; Compo, GP. A practical guide to wavelet analysis, 1998.
  3  | No          |  2,367    | Mantua, NJ; et al. A Pacific interdecadal climate oscillation with impacts on salmon production, 1997.
  4  | Yes         |  1,987    | Kistler, R; et al. The NCEP-NCAR 50-year reanalysis: Monthly means CD-ROM and documentation, 2001.
  5  | Yes         |  1,791    | Xie, PP; Arkin, PA. Global precipitation: A 17-year monthly analysis based on gauge observations, satellite estimates, and numerical model outputs, 1997.
  6  | Yes         |  1,448    | Kanamitsu, M; et al. NCEP-DOE AMIP-II reanalysis (R-2), 2002.
  7  | No          |  1,014    | Baldocchi, D; et al. FLUXNET: A new tool to study the temporal and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux densities, 2001.
  8  | Yes         |    902    | Rossow, WB; Schiffer, RA. Advances in understanding clouds from ISCCP, 1999.
  9  | Yes         |    900    | Rossow, WB; Schiffer, RA. ISCCP cloud data products, 1991.
 10  | No          |    877    | Hess, M; Koepke, P; Schult, I. Optical properties of aerosols and clouds: The software package OPAC, 1998.
 11  | No          |    815    | Willmott, CJ. Some comments on the evaluation of model performance, 1982.
 12  | No          |    815    | Trenberth, KE. The definition of El Nino, 1997.
 13  | Yes         |    785    | Woodruff, SD; Slutz, RJ; et al. A comprehensive ocean-atmosphere data set, 1987.
 14  | Yes         |    776    | Meehl, GA; et al. The WCRP CMIP3 multimodel dataset – A new era in climate change research, 2007.
 15  | Yes         |    742    | Liebmann, B; Smith, CA. Description of a complete (interpolated) outgoing longwave radiation dataset, 1996.
 16  | Yes         |    734    | Huffman, GJ; et al. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset, 1997.
 17  | No          |    697    | Trenberth, KE. Recent observed interdecadal climate changes in the Northern Hemisphere, 1990.
 18  | No          |    672    | Gates, WL. AMIP: The Atmospheric Model Intercomparison Project, 1992.
 19  | No          |    656    | Stephens, GL; et al. The CloudSat mission and the A-Train – A new dimension of space-based observations of clouds and precipitation, 2002.
 20  | Yes         |    647    | Mesinger, F; et al. North American regional reanalysis, 2006.
It’s a long road....
What do researchers need to make this all possible?
– Incentives: citations, promotion, support
  • long way to go
– Institutional and funder policy framework
  • mostly there now?
– Appropriate discipline-specific community centres of expertise
  • rare: mostly limited to big science niches, or very broad but possibly poorly sustained
– Institutional support services for the basics
  • pilots to date
– Software tools that are open and can be adapted
  • on the way
PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences
Jonathan Tedds (Leicester), Sarah Callaghan (BADC), Fiona Murphy (Wiley), Rebecca Lawrence (F1000R), Geraldine Stoneham (MRC), Elizabeth Newbold (BL), Rachel Kotarski (BL), Matthew Mayernik (NCAR), John Kunze, Carly Strasser (CDL), Angus Whyte (DCC), Becca Wilson (Leicester), Simon Hodson (Jisc) and the #PREPARDE project team
+ Geraldine Clement Stoneham (MRC), Elizabeth Newbold and Rachel Kotarski (BL) on data peer review
http://www.le.ac.uk/projects/preparde
• Partnership formed between the Royal Meteorological Society and academic publisher Wiley-Blackwell to develop a mechanism for the formal publication of data in the open access Geoscience Data Journal
• GDJ publishes short data articles cross-linked to, and citing, datasets that have been deposited in approved data centres and awarded DOIs (or other permanent identifiers).
• A data article describes a dataset, giving details of its collection, processing, software, file formats, etc., without the requirement of novel analyses or ground-breaking conclusions.
  • the when, how and why the data was collected, and what the data product is.
http://www.geosciencedata.com/
PREPARDE key use case: Geoscience Data Journal, Wiley-Blackwell and the Royal Meteorological Society
• capture the processes and procedures required to publish a scientific dataset
– ingestion into a data repository
– formal publication in a data journal
• address key issues in data publication
– how to peer-review a dataset?
– what criteria are needed for a repository to be considered objectively trustworthy?
– how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community?
• The PREPARDE team includes key expertise in
  – research
  – academic publishing
  – data management
• Earth sciences focus, but producing general guidelines applicable to a wide range of scientific disciplines and data publication types, incl. life sciences (F1000R)
PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences http://www.le.ac.uk/projects/preparde
The traditional online journal model (any online journal system):
1) Author prepares the paper using word processing software with the journal template.
2) Author submits the paper as a PDF/Word file.
3) Reviewer reviews the PDF file against the journal's acceptance criteria.

Overlay journal model for publishing data (Geoscience Data Journal):
1) Author prepares the data paper using word processing software and the dataset using appropriate tools.
2a) Author submits the data paper to the journal.
2b) Author submits the dataset to a repository (e.g. BADC, BODC).
3) Reviewer reviews the data paper, and the dataset it points to, against the journal's acceptance criteria.
How to publish data in GDJ
Live Data paper! The dataset citation is the first thing in the paper and is also included in the reference list (to take advantage of citation count systems). DOI: 10.1002/gdj3.2
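The citation layout the slide describes – creator(s), year, title, publisher, identifier, placed first in the paper and repeated in the reference list – can be sketched as a small formatter. This is a hypothetical illustration following DataCite's recommended citation style; the example creators, title, publisher and DOI suffix are invented, not the actual gdj3.2 dataset record:

```python
# Sketch: assemble a dataset citation in the style DataCite recommends,
# "Creator (PublicationYear): Title. Publisher. Identifier".
# All field values below are hypothetical.

def dataset_citation(creators, year, title, publisher, doi):
    """Format a dataset citation suitable for a data paper's reference list."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. doi:{doi}"

citation = dataset_citation(
    creators=["Smith, A", "Jones, B"],
    year=2013,
    title="Example surface temperature dataset",
    publisher="British Atmospheric Data Centre",
    doi="10.5285/example-suffix",
)
print(citation)
```

Because the citation carries the DOI, citation-count systems that index reference lists can pick the dataset up just like an article reference.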
Data Centre
Trust: repository accreditation
• The link between data paper and dataset is crucial!
• How do data journal editors know a repository is trustworthy?
• How can repositories prove they're trustworthy?
• What makes a repository trustworthy?
  • Many things: mission, processes, expertise, workflows, history, systems, documentation, ...
  • Assessing trustworthiness requires assessing the entire repository workflow
• PREPARDE / IDCC13 workshop – report out soon!
• Peer review of data is implicitly peer review of the repository
And what does "trustworthy" mean, when you get right down to it?
DataCite Repository List
• a working document
• initiated via a collaboration between the British Library, BioMed Central and the Digital Curation Centre
• aims to capture the growing number of repositories for research data
• provided for information purposes only:
  • DataCite provides no endorsement of the quality or suitability of the repositories
• community participation in developing this resource is encouraged
http://www.datacite.org/repolist/
Dryad Data Repository
JDAP: Joint Data Archiving Policy
Joint Data Archiving Policy: http://datadryad.org/jdap
Joint declarations, Feb 2010, in American Naturalist, Evolution, the Journal of Evolutionary Biology, Molecular Ecology, Heredity, and other key journals in evolution and ecology: http://www.journals.uchicago.edu/doi/full/10.1086/650340
This journal requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, TreeBASE, Dryad, or the Knowledge Network for Biocomplexity.
Allows embargos of up to one year; allows exceptions for, e.g., sensitive information such as human subject data or the location of endangered species.
‘Data that have an established standard repository, such as DNA sequences, should continue to be archived in the appropriate repository, such as GenBank. For more idiosyncratic data, the data can be placed in a more flexible digital data library such as the National Science Foundation-sponsored Dryad archive at http://datadryad.org.'
Slide: Simon Hodson (Dryad / Jisc / CODATA)
PREPARDE and bi-directional data linking
• We already have a link from the GDJ data article to the data repository via DOI
• GDJ can also pull the standard metadata attached to that DOI from the DataCite metadata store
• We need to figure out a way for GDJ to inform the repository that its dataset has been cited/published!
• At this time, we have a manual work-around (i.e. email)
• Workshop on cross-linking between data centres and publishers, 30th April 2013 at RAL, UK
• Report out soon!
[Diagram: BADC and NCAR repositories and GDJ each exchange standardised metadata with the DataCite Metadata Store]
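The "pull the standard metadata attached to that DOI" step can be sketched against DataCite's public REST API. The endpoint shape (`https://api.datacite.org/dois/<doi>`) and the JSON:API record layout below are assumptions based on the current public API rather than anything in the talk, and the sample record is invented; the parsing is done offline here so no network access is needed:

```python
# Sketch: resolve a dataset DOI to its DataCite metadata record and pull
# out the fields a journal would display. Endpoint and JSON layout are
# assumptions modelled on DataCite's public REST API.

def datacite_api_url(doi: str) -> str:
    """Build the (assumed) DataCite REST API URL for a DOI."""
    return f"https://api.datacite.org/dois/{doi}"

def extract_fields(record: dict) -> dict:
    """Pick the display fields out of a DataCite JSON:API record."""
    attrs = record["data"]["attributes"]
    return {
        "doi": attrs["doi"],
        "title": attrs["titles"][0]["title"],
        "creators": [c["name"] for c in attrs["creators"]],
        "year": attrs["publicationYear"],
    }

# Offline example response, shaped like a DataCite record (values invented):
sample = {
    "data": {
        "attributes": {
            "doi": "10.5285/example",
            "creators": [{"name": "Smith, A"}],
            "titles": [{"title": "Example gridded dataset"}],
            "publicationYear": 2013,
        }
    }
}

print(datacite_api_url("10.5285/example"))
print(extract_fields(sample))
```

The reverse link – telling the repository its dataset has been cited – has no such API in this picture, which is why the slide falls back to email.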
Peer review of data: the Perfect Disaster?
• Support for the peer review process:
  – scholars contribute peer reviews with little formal reward
  – an opportunity to polish and refine understanding of the cutting edge of research
• But the peer review system is under stress:
  – exploding numbers of journals, conferences and grant applications
  – self-publication tools (blogs and wikis) allow scholars to disseminate their research results and products
    • faster and more directly
• Now we are adding research data into the publication and peer review queues...
Peer-review of data
• Technical
  – author guidelines for GDJ
  – funder Data Value Checklist
  – implicit peer review of the repository?
• Scientific
  – pre-publication?
  – post-publication? e.g. F1000R
  – guidelines on uncertainty, e.g. IPCC
  – discipline specific?
  – EU INSPIRE spatial formatting
• Societal
  – contribution to human knowledge
  – reliability
http://libguides.luc.edu/content.php?pid=5464&sid=164619
Open Peer Review of Data: ESSD peer review ensures that the datasets are:
• Plausible, with no immediately detectable problems;
• Of sufficiently high quality, with their limitations clearly stated;
• Well annotated by standard metadata and available from a certified data centre/repository;
• Customary with regard to their format(s) and/or access protocol, and expected to be usable for the foreseeable future;
• Openly accessible (toll free)
Earth System Science Data journal: http://www.earth-system-science-data.net/
Rebecca Lawrence, Data Publishing: peer review, shared standards and collaboration, http://www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf8-engaging-publishers
Faculty of 1000 Open Peer Review
Sanity check:
• Format and suitable basic structure adherence
• A standard basic protocol structure is adhered to
• Data stored in the most appropriate and stable location
Open peer review:
• Is the method used appropriate for the scientific question being asked?
• Has enough information been provided to be able to replicate the experiment?
• Have appropriate controls been conducted, and the data presented?
• Is the data in a usable format/structure?
• Are stated data limitations and possible sources of error appropriately described?
• Does the data 'look' OK? (optional; e.g. microarray data)
Draft Recommendations on Peer-review of data
• Summary Recommendations from Workshop at British Library, 11 March 2013
• Workshop attendees included funders, publishers, repository managers, researchers ….
• Draft recommendations put up for discussion and feedback captured
• Feedback from the community still welcome
• 2nd workshop 24 June: put recommendations to peer reviewers!
Document at: http://bit.ly/DataPRforComment Feedback to: https://www.jiscmail.ac.uk/DATA-PUBLICATION
Draft recommendations on data peer review – summary recommendations from the workshop at the British Library, 11 March 2013
• Connecting data review with data management planning
• Connecting scientific review, technical review and curation
• Connecting data review with article review
• 4-5 draft recommendations under each of the above
• Assist researchers, publishers, journal editors, reviewers, data centres and institutional repositories to map requirements for data peer review
• Matrix of stakeholders vs processes
  – Assists in assigning responsibilities for a given context
  – New for most disciplines
  – Learn from disciplines where this already happens
Connecting data review with data management planning
1. All research funders should at least require a "data sharing plan" as part of all funding proposals, and if a submitted data sharing plan is inadequate, appropriate amendments should be proposed.
2. Research organisations should manage research data according to recognised standards, providing relevant assurance to funders so that additional technical requirements do not need to be assessed as part of the funding application peer review. (Additional note: research organisations need to provide adequate technical capacity to support the management of the data that their researchers generate.)
3. Research organisations and funders should ensure that adequate funding is available within an award to encourage good data management practice.
4. Data sharing plans should indicate how the data can and will be shared, and publishers should refuse to publish papers which do not clearly indicate how the underlying data can be accessed, where appropriate.
Connecting scientific review, technical review and curation
1. Articles and their underlying data or metadata (by the same or other authors) should be multi-directionally linked, with appropriate management for data versioning.
2. Journal editors should check data repository ingest policies to avoid duplication of effort, but provide further technical review of important aspects of the data where needed. (Additional note: a map of ingest/curation policies of the different repositories should be generated.)
3. If there is a practical/technical issue with data access (e.g. files don't open or don't exist), then the journal should inform the repository of the issue. If there is a scientific issue with the data, then the journal should inform the author in the first instance; if the author does not respond adequately to serious issues, then the journal should inform the institution, which should take the appropriate action. Repositories should have a clear policy in place to deal with any feedback.
Connecting data review with article review
1. For all articles where the underlying data is being submitted, authors need to provide adequate methods and software/infrastructure information as part of their article. Publishers of these articles should have a clear data peer review process for authors and referees.
2. Publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers.
3. Authors should clearly state the location of the underlying data. Publishers should provide a list of known trusted repositories or, if necessary, provide advice to authors and reviewers on alternative suitable repositories for the storage of their data.
4. For data peer review, the authors (and journal) should ensure that the data underpinning the publication, and any tools required to view it, are fully accessible to the referee. The referees and the journal then need to ensure that appropriate access is in place following publication.
5. Repositories need to provide clear terms and conditions for access, and ensure that datasets have permanent and unique identifiers.
Publishing research data
• Research is heavily context specific => so keep it context specific?
• Publishers and professional & learned societies can help galvanise agreement among researchers
  • to define how they want their data represented and preserved for reuse & citation => your researchers need you
  • along with funder-appointed peer review committees – often a stronger connection than to the institution
• Institutional managers/services cannot cover the wide range of discipline-specific expertise
  • just as publishers don't cover all fields
• These connections stay relatively constant as a career progresses, even as researchers:
  • change host institution
  • change field(s)
  • change technique
• In the UK, RCUK funding for APCs is strictly for articles only....
  – How to fund APCs for depositing data in repositories?
  – Will publishers charge APCs to publish data papers?
2012-02-07, DCC roadshow East Midlands – CC-BY-SA
Research and the long tail (slide: Carole Goble)
• Head of the curve – public data sets, preserved: PDB, GenBank, UniProt, Pfam, CATH, SCOP (protein structure classification), ChemSpider; high-throughput experimental methods; industrial scale; commons-based production
• Long tail – spreadsheets and notebooks: local, lost; cherry-picked results
Enabling Open Data Publishing
• Active data management planning
  – built in at the proposal stage
    • local institutional tweaks of funder and local templates
  – implemented and evolved during the project
    • the Data Management Plan as a live, evolving object
    • annotate data on the fly – a lab notebook approach
  – curated & preserved using permanent identifiers
    • appropriate repository and data collection descriptors
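The "annotate data on the fly" idea can be sketched as a sidecar metadata file written next to each dataset file, so a checksum and a free-text note travel with the data into curation. This is a hypothetical illustration: the file names, record fields and choice of SHA-256 are all assumptions, not part of the talk:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Sketch: write a JSON "sidecar" record next to a dataset file, capturing
# a checksum and an annotation so basic provenance survives alongside the data.
def annotate(data_path: Path, note: str) -> Path:
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    record = {
        "file": data_path.name,
        "sha256": digest,
        "annotated": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    sidecar = data_path.with_suffix(data_path.suffix + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar

# Usage: annotate a toy data file and read the record back.
data = Path("readings.csv")
data.write_text("t,flux\n0,1.2\n")
meta = json.loads(annotate(data, "raw counts, uncalibrated").read_text())
print(meta["sha256"][:8], meta["note"])
```

The checksum also gives a cheap versioning signal: if the data file changes, re-annotating produces a different `sha256`, flagging that downstream analyses were built on an earlier version.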
Active data storage: identifying the Holy Grail?
• "what is needed is a tool to transparently sync local and network storage" (Marieke Guy, JISCMRD2 Nottingham)
• CKAN (Orbital, Lincoln)? Research Data Toolkit (Herts) – hybrid solutions; the UC3 suite at CDL...
• DropBox-like functionality is a must: usability; technical interoperability
• The aim is to help create databases for research data that
  • facilitate collaboration and data sharing
  • enable the subsequent publication of datasets
• The challenge is to ensure that the data are documented and preserved, and that the service is sustainable
http://halogen.le.ac.uk
Halogen as a template for research data management #jiscmrd – combining:
• Portable Antiquities Scheme (British Museum)
• Place-names (Nottingham)
• Surnames
• Genetics
• IT hosting and GIS
• Best practice: #JISCMRD, UKRDS, DCC, international
Requirements analysis – must be iterative!
• Data Management Plan – use DMPonline (UK Digital Curation Centre)
• Scalable research data management infrastructure
  • from pilot phase to nationally available resource
  • LAMP stack IT infrastructure to host the research database – work with JISC/DCC
• A model for the long-term delivery of a data management service within the institution, including support, maintenance, governance & charging policies
  • Include researchers, IT services, the research support office, library services etc.
BENEFITS
• New research opportunities
  • cross-database work – seed new research samples
  • scholarly communication / access to national resources: Key to English Place Names (Nottingham), Portable Antiquities Scheme (British Museum)
• Verification, repurposing and reuse of data
  • cleaning & enhancing private research datasets for reuse & correlation
  • no re-creation of data; increased transparency
  • excellent training for best practice in research data management
• Increasing research productivity
  • build cleaning, annotation and enhancement into normal research workflows
  • research datasets may immediately be reusable and interoperable
• Impact & knowledge transfer
  • reuse of IT infrastructure
  • increasing the skills base of researchers/students/staff
• Reward = Leverhulme Trust funding of £1.3m!
CHALLENGES
• An interdisciplinary research database must ingest each input dataset in a form such that sufficient information is carried forward to enable interoperation
• Cultural differences
• Versioning & provenance for input datasets
• Which software tools, infrastructure and query interface are suitable for multidisciplinary researchers?
• Requirements upon the institution for sustaining the research assets & skills
• Requirements upon the researchers: annotating, refreshing and maintenance of datasets
[Chart: researcher responses to contacts made – 37% responded, 63% no response]
Suggested timeline for implementing institutional research data management
From Whyte & Tedds (2011), DCC Briefing http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm
Challenge for institutions – rise to the scientific and research challenge
• Not just a management challenge: responsibility for the knowledge they create
– Library
  • "Doing the wrong things through the wrong people"?
  • Challenge for the library to enable:
    • curation of data and publications
    • active support from data scientists
    • a move from centralised to dispersed support
  • Expert centres such as D-Lab are essential!
– IT Service
  • Provide research data platforms for researchers:
    – active storage
    – enable collaboration
    – connect to preservation services through the Library
But that’s not all…
What about the software underpinning data driven research?
If we’re going to publish as open data:
How do we help researchers to store, annotate and discover the datasets they create?
How do you sustain and reuse that?
Biomedical Research Infrastructure Software Service Kit
A vision for cloud-based open source research applications
#BRISSKit
http://www.brisskit.le.ac.uk
BRISSKit context: the I4Health goal of applying knowledge engineering to close the 'ICT gap' between research and healthcare (Beck, T. et al. 2012)
www.brisskit.le.ac.uk Email: [email protected]
The semantic bridge
• OBiBa Onyx: records participant consent, questionnaire data and primary specimen IDs
• i2b2: cohort selection and data querying
• The bridge between them? A bio-ontology!
Research Software Sustainability
• open source community engagement
• standards compliance
• a consortium approach
• work with the grain of researchers
• discipline-specific forks?
• GitHub-style versioning as an example for research data?
• An OS Community Engagement Charter
  • defining engagement with existing & new open source communities
  • including adoption & code commitments
See the Rob Baxter blog post, "The research software engineer":
• http://dirkgorissen.com/2012/09/13/the-research-software-engineer/
Lessons for institutions? You can't do it all in house!
• But many disciplines don't have data centres
• Build a coalition of institutional actors
  • Essential to have high-level support
• Take and shape: identify what you do have in-house
  • Access external tools and standards where possible
  • Active storage, collaboration, eprints...
• Propose best of breed for (inter)national reuse
  • Share benefits (and costs) over academic networks
• Sustainability is the key challenge
  • As much cultural as technical – needs networks...
• But institutions alone aren't enough – we need an alliance!
Accepted Research Data Alliance Interest Group: Publishing Data
• http://rd-alliance.org/
• Close coordination with the ICSU-WDS working group, CODATA and other ongoing initiatives in data publication
  – WDS sits under the International Council for Science; RDA is wider
  – Avoid duplication within related RDA and WDS working groups – join up
  – For WDS, partnerships between publishers and data centres are key
• Scope the territory – gap analysis
• Use the RDA forum and the new http://jiscmail.ac.uk/data-publication list (350+ members)
• Take findings from the RDA / WDS group(s) and trial them in other communities / disciplines / institutional repositories
15-9-2013 launch meeting discussion
“Keep reaching for the stars”
• increase the trustworthiness and value of individual data sets
• strengthen the findings based on cited data sets
• increase the transparency and traceability of data and publications
• enable reuse and repurposing
i.e. Problems but extraordinary opportunities – all hands on deck!
Thank you for listening and thanks to CDL, D-Lab and the project partners
Dr Jonathan Tedds [email protected] @jtedds Senior Research Fellow,
D2K Data to Knowledge Research Group
(University of Leicester)
#PREPARDE http://www.le.ac.uk/projects/preparde
Mailing list: http://jiscmail.ac.uk/DATA-PUBLICATION