managing and sharing data - university of oxford...efficiency of the data management during...
TRANSCRIPT
MANAGING AND SHARING DATA
a best practice guide for researchers
Second Edition
Research Laboratory
© 2009 University of Essex
First published 2009 Second edition with minor revisions 2009
Authors: Veerle Van den Eynden, Louise Corti, Matthew Woollard and Libby Bishop.
All references available 18 September 2009.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in anyform or by any means, electronic, mechanical, photocopying, recording or otherwise, without written permission fromthe publisher.
Published by: UK Data Archive University of Essex Wivenhoe Park Colchester Essex CO4 3SQ
ISBN: 1-904059-66-X
Printed by University of Essex Printing Services
The scientific process is enhanced by managing and sharing research data. Good data management practice
allows reliable verification of results and permits new and innovative research built on existing information.
This is important if the full value of public investment in research is to be realised.
JISC considers it a priority to promote and support good data management and sharing for the benefit of
UK Higher Education and Research. Through the Managing Research Data Programme, JISC is targeting a
number of key areas: helping higher education institutions plan their data management practice; piloting the
development of essential data management infrastructure; and promoting the acquisition of appropriate skills,
among academics and research support staff in Universities. JISC also funds the Digital Curation Centre,
which provides internationally recognised expertise in this area, as well as support and guidance for UK HE.
JISC recognises the UK Data Archive as a key partner and stakeholder in these strategic objectives.
Managing and Sharing Data: a best practice guide for researchers, is a timely and important publication,
and one which JISC wholeheartedly endorses.
Simon Hodson, JISC
September 2009
Endorsed by:
FOREWORD 2
DATA LIFECYCLE 3
SHARING DATA – WHY AND HOW 4
Why share research data? 4
How to share data 6
DATA DOCUMENTATION AND METADATA 6
Data documentation 6
Metadata 7
DATA FORMATS AND SOFTWARE 8
Choice of formats and conversions 8
Transcription 10
Data quality control 10
Version control 11
Authenticity 12
Adding value 12
DATA STORAGE, BACK-UP AND SECURITY 13
Making back-ups 13
Data storage 13
Data security 14
Encrypting data for transmission 15
ETHICS, CONSENT AND CONFIDENTIALITY 16
Personal, confidential and sensitive data 16
Informed consent and data sharing 18
Written or verbal consent? 18
One-off or process consent? 19
Anonymising data 20
Access control 21
Open access 21
COPYRIGHT 23
CONTENTS
2
Good data management is the foundation for good
research. If data are properly organised, preserved and
well documented, and their accuracy, validity and integrity
is controlled at all times, the result is high quality data,
efficient research, outputs based on solid evidence and
the saving of time and resources. Researchers benefit
greatly from properly managing their research data.
Data management should be planned from the start
of research. If it becomes part of standard research
practice, then it need not necessarily incur much
additional time or costs.
When it comes to sharing research data, good management
is essential to ensure that data can be preserved and
remain accessible in the long-term, so they can be re-used
and understood by other researchers. When managed
and preserved properly, research data can be successfully
used for future scientific and educational purposes, thus
maximising the investment made in generating the data and
increasing the visibility of the research.
Data management primarily occurs within the lifecycle
of a research project and is ideally carried out by all
members of the research team. Digital preservation,
which enables long-term data sharing, is often carried
out by a specialised data archive or centre. The value
of the data to be preserved depends on the quality and
efficiency of the data management during research.
The data management information provided in this
booklet is designed to help researchers and data
managers across a wide range of research disciplines
and research environments make sure that research
data are of the highest quality and have the greatest
potential for long-term use.
It is recognised that different types of data created and
managed across the research discipline spectrum may
require certain discipline-specific approaches to data
managing and sharing; and that data centres may differ
in their approach to specific data management and
preservation issues. This guidance has been written for
researchers spanning the natural and social sciences
and humanities.
The UK Data Archive thanks data management
experts from the National Environmental Research
Council (NERC), the NERC Environmental Bioinformatics
Centre (NEBC), the Environmental Information Data Centre
(EIDC) of the Centre for Ecology and Hydrology (CEH),
the British Library (BL), the Research Information Network
(RIN), the Archaeology Data Service (ADS), the History
Data Service (HDS), the Economic and Social Data Service
(ESDS), the London School of Economics Research
Laboratory, the Wellcome Trust and the Commission
for Rural Communities for reviewing the guidance and
providing valuable comments and case studies.
Expertise and support for producing this guidance
has been provided by the Data Support Service of the
interdisciplinary Rural Economy and Land Use (RELU)
programme, funded by the Economic and Social Research
Council (ESRC), Natural Environment Research Council
(NERC) and Biotechnology and Biological Sciences
Research Council (BBSRC); as well as by the long-term
expertise of the UK Data Archive. This second edition has
been co-funded by the Joint Information Systems
Committee (JISC).
This printed guide is complemented by detailed and
practical online information, available from the UK Data
Archive web site.
The UK Data Archive also provides training and
workshops on data management and sharing, including
bespoke advice via [email protected] or
+44 (0)1206 872974.
FOREWORD
www.data-archive.ac.uk/sharing
DATA LIFECYCLE
Data creation
• research design
• data management planning
• data collection
(surveying, experimentation, measuring etc.)
• data entry or digitisation
• data checking and cleaning
Data analysis
• analysis
• derived data creation
• creation of data documentation
End of research
• research outputs
• preparing data for preservation
Preservation of data
• storage of data
• migration to suitable format/medium
• metadata creation
Re-use of data
• by same researcher
• by other researchers
3
Distribution/publication
of data
Why share research data?
Research data are a valuable resource, usually requiring
much time and money to be produced. Many datasets
have a significant value beyond the original research.
Sharing research data:
• encourages scientific enquiry and debate
• enables scrutiny of research outcomes
• facilitates research beyond the scope of the
original research
• leads to new collaborations between data users and
data creators
• increases the impact and visibility of research
• reduces the cost of duplicating data collection
• provides important resources for education and training
• encourages the improvement and validation of
research methods
• promotes the research that created the data and
its outcomes
• can provide a direct credit to the researcher as a
research output in its own right
The ease with which digital data can be stored,
disseminated and made accessible to secondary users
via the internet means that many institutions embrace
the sharing of research data to increase the impact and
visibility of their research.
Research funders increasingly follow guidance from the
Organisation for Economic Co-operation and Development
(OECD) that publicly funded research data should be
openly available to the scientific community to the
maximum extent possible.1 They have adopted data
sharing policies and encourage or oblige researchers to
share research datasets and outputs. Data sharing policies
allow researchers exclusive data use for a reasonable
time period in which to publish the results of the data.
In the UK, the Economic and Social Research Council
(ESRC), the Natural Environment Research Council
(NERC) and the British Academy contractually require
researchers to offer all research data resulting from their
grants to designated data centres – UK Data Archive
and NERC data centres. Also the Biotechnology and
Biological Sciences Research Council (BBSRC), the
Medical Research Council (MRC) and the Wellcome
Trust have data policies which encourage researchers
to share their research data in a timely manner, with as
few restrictions as possible.
Journals increasingly require data that form the basis for
publications to be shared or deposited within an accessible
database or repository.
SHARING DATA – WHY AND HOW
4
1 Organisation for Economic Co-operation and Development (2007). OECD Principles and Guidelines for Access to Research Data from Public Funding.
Available at www.oecd.org/dataoecd/9/61/38500813.pdf
2 Nature Publishing Group (2009). Nature journals’ policy on availability of materials and data. Available at www.nature.com/authors/editorial_policies/availability.html
Nature journals have a policy that requires authors to
make data and materials available to readers, as a
condition of publication, preferably via public repositories.2
Appropriate discipline-specific repositories are suggested.
Specifications regarding data standards, compliance or
formats may also be provided.
For example, for research reporting on small molecule
crystal structures, authors should submit such structures
to the Cambridge Structural Database (CSD) as a
Crystallographic Information File, a standard file structure
for the archiving and distribution of crystallographic
information. After publication of the manuscript, deposited
structures are included in the CSD, from where bona fide
researchers can retrieve them for free. CSD has similar
deposition agreements with many other journals.
Similarly the Publishing Network for Geoscientific and
Environmental Data (PANGAEA) is an open access
repository for various journals. By giving each deposited
dataset a Digital Object Identifier (DOI), a deposited
dataset acquires a unique and persistent identifier, and
the underlying data can be directly connected to the
corresponding article.
Case study – journals’ data
sharing policies
5
3 RELU Data Support Service (2005). Project Communication and Data Management Plan. Available at www.data-archive.ac.uk/relu/DMP.doc
4 Wellcome Trust (2009). Policy on data management and sharing. Available at www.wellcome.ac.uk/About-us/Policy/Policy-and-position-
statements/WTX035043.htm
5 Digital Curation Centre (2009). Data Management Plan Content Checklist. Available at www.dcc.ac.uk/
The Rural Economy and Land Use (RELU) programme,
drawing on best practice in data management and sharing
across three research councils (ESRC, NERC and
BBSRC), requires all funded projects to develop and
implement a data management plan to ensure that data
are well managed throughout the duration of a research
project.3 In a data management plan researchers describe:
• the need for access to existing data sources
• data planned to be produced by the research project
• planned quality assurance and back-up procedures
• plans for management and archiving of collected data
• expected difficulties in making data available for
secondary research and measures to overcome
such difficulties
• who holds copyright and intellectual property rights of
the data
• data management and storage roles and responsibilities
within the research team
As part of its policy on data management and sharing,
the Wellcome Trust expects applicants to supply a data
management and sharing plan if the goal of the proposal
is to create or develop a research resource for the benefit
of the research community; or if the proposal involves
the generation of a significant quantity of data that could
potentially be shared for added benefit.4 The plan
should outline:
• how the data will be made available to the wider
community
• the proposed timeframe of data sharing
• data quality and standards
• the use of public data repositories (which the Trust
expects)
• intellectual property of the data
• protection of research participants and possible
limitations on data sharing
• long-term preservation and sustainability strategy
The Digital Curation Centre is in the process of
developing a general data management plan checklist
and template for use by researchers to fulfil research
funders’ requirements for data management planning.5
How to share data
Research data can be shared:
• by depositing data with a specialist data centre, data
archive or data bank
• by submitting data to a journal
• by depositing data in an institutional repository
• online via a project or institutional web site
• informally between researchers on a peer-to-peer basis
Sharing data by any of these means have their advantages
and disadvantages: data centres may not be able to accept
all data submitted to them; institutional repositories may not
afford adequate provision for the long-term maintenance of,
and support for, research data; and web sites may be
more ephemeral.
Exact approaches to data sharing may vary according to
different research environments and disciplines, due to the
varying nature of data types and their characteristics.
Depositing data with a specialist data centre has
additional advantages:
• helping to ensure that data meet set quality thresholds
• long-term preservation in standardised accessible data
formats, converting formats when needed due to software
upgrades or changes
• regular data back-ups
• online resource discovery of data through data catalogues
• licensing arrangements to acknowledge data rights
• standardised citation mechanism to acknowledge
data ownership
• promotion and dissemination of data to many users
• monitoring of the secondary usage of data
• safe-keeping of research data in a secure environment,
with ability to control access when needed
• management of access and user queries on behalf of the
data owner
Data centres may apply certain criteria to evaluate and
select datasets for preservation.
Data management plans
DATA DOCUMENTATION AND METADATA
A crucial part of making data user-friendly, shareable
and with long-lasting usability is ensuring that they can
be understood, interpreted and used at all times. This
requires clear and detailed data description and data
documentation.
Data documentation
Comprehensive data documentation is easiest when begun
at the onset of a project and continued throughout the
research process. It should be considered as part of best
practice in terms of creating, organising and managing data.
Data documentation explains how data were created or
digitised, what data mean, what their content and structure
is, and any manipulations that may have taken place. It
ensures that data can be understood during research
projects, that researchers continue to understand data in
the longer term and that re-users of data are able to
interpret the data. Good documentation is also vital for
successful data preservation.
Good data documentation includes information on:
• the context of data collection: project history, aims,
objectives and hypotheses
• data collection methods: data collection protocol,
sampling design, instruments, hardware and software
used, data scale and resolution, temporal coverage and
geographic coverage
• dataset structure of data files, cases, relationships
between files
• data sources used
• data validation, checking, proofing, cleaning and other
quality assurance procedures carried out
• modifications made to data over time since their original
creation and identification of different versions of datasets
• information on data confidentiality, access and use
conditions, where applicable
At data-level, datasets should also be documented with:
• names, labels and descriptions for variables, records and
their values
• explanation of codes and classification schemes used
• codes of, and reasons for, missing values
• derived data created after collection, with code, algorithm
or command file used to create them
• weighting and grossing variables created
• data listing with descriptions for cases, individuals or
items studied
Variable-level descriptions may be embedded within a
dataset itself as metadata. Other documentation may be
contained in user guides, reports, publications, working
papers and laboratory books.
6
The Stockholm Environmental Institute has created an
integrated spatial dataset, Social and Environmental
Conditions in Rural Areas (SECRA), containing socio-
economic and environmental characteristics of all rural
Census 2001 Super Output Areas.6 This dataset is
available online as an MS Access database and can be
downloaded together with accompanying metadata and
documentation files that clearly describe the data.
The dataset is organised in four data themes: natural
and constructed features; qualities of people and place;
living and working; and political and economic context.
The dataset is documented in detail by:
• four data list files providing descriptions of all variables
within each data theme
• four metadata files describing how each variable was
constructed and calculated, with references to the data
sources used
• a detailed report describing the rationale for the
dataset, the relevance of variables to rural conditions,
the methodology used, an overview of variables
included per data theme and examples of how the
dataset may be used
All data files and metadata files are clearly labelled and
coded according to the four data themes.
Case study – documenting data
6 M.Huby, S.Cinderby and A.Owen (2005). Social and Environmental Conditions in Rural Areas (SECRA). Available at www.sei.se/relu/secra/
7
Example of online study-level documentation available for a dataset in the UK Data
Archive catalogue
Format Name Size in Kilobytes Description
PDF guide.pdf 3557 User guide
PDF method.pdf 509 Methodology
PDF source.pdf 41 Example of sources
HTML UKDA Study 4177 Information.htm 18 Study information and citation
Metadata
In the context of data management, metadata are a subset
of core data documentation, which provides standardised
structured information explaining the purpose, origin, time
references, geographic location, creator, access conditions
and terms of use of a dataset. Metadata are typically used:
• for resource discovery, providing searchable information
that helps users to find existing data
• as a bibliographic record for citation
Metadata for online data catalogues or discovery portals
are often structured to international standards or schemes
such as Dublin Core, ISO 19115 for geographic
information, Data Documentation Initiative (DDI), Metadata
Encoding and Transmission Standard (METS) and General
International Standard Archival Description (ISAD(G)).
The use of standardised records in eXtensible Mark-up
Language (XML) brings key data documentation together
into a single document, creating rich and structured content
about the data. Metadata can be viewed with web browsers,
can be used for extract and analysis engines and can
enable field-specific searching. Disparate catalogues can
be shared and interactive browsing tools can be applied.
In addition, metadata can be harvested for data sharing
through the Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH).
Researchers typically create metadata records for their
data by completing a data centre’s data deposit form or by
using a metadata creation tool, like Go-Geo! GeoDoc or
MetaGenie. Providing detailed and meaningful dataset
titles, descriptions and keywords, etc., enables data centres
to create rich resource-discovery metadata for archived
datasets. This should enable more comprehensible
resource discovery and data that are easier to use.
Data centres accompany each dataset with a bibliographic
citation that data users are required to state in research
outputs to reference and acknowledge accurately the data
source used. A citation gives credit to the data source and
distributor and identifies data sources for validation.
7 EDINA (n.d.). Go-Geo! GeoDoc metadata creator tool. Available at geodoc.edina.ac.uk/geodoc/editor/start
8 e-Government Unit (2004). UK GEMINI Standard Version 1.0 - A Geo-spatial Metadata Interoperability Initiative. e-Government Unit, Cabinet Office, London.
Available at www.gigateway.org.uk/metadata/pdf/UK_GEMINI_v1.pdf
Go-Geo! GeoDoc is an online metadata creation tool that
researchers can use to create metadata records for
spatial datasets.7 The metadata records are compliant
with the UK GEMINI standard and the ISO 19115
geospatial metadata standard.8 Researchers can create
and export metadata records or publish them in the
Go-Geo! portal, a discovery portal for geospatial data.
Case study – creating metadata
8
Choice of formats and conversions
The format and software in which research data are
created and digitised usually depend on how researchers
plan to analyse data, the hardware used, the availability
of software, or can be determined by discipline-specific
standards and customs. When considering the long-term
usability of data, attention needs to be given to the most
appropriate software and data format to use.
All digital information is designed to be interpreted by
computer programs to make it understandable and is, by
nature, software dependent. All digital data may thus be
endangered by the obsolescence of the hardware and
software environment on which access to data depends.
Despite the backward compatibility of many software
packages to import data created in previous software
versions and the interoperability between competing
popular software programs, the safest option to guarantee
long-term data access is to convert data to standard
formats that most software are capable of interpreting,
and that are suitable for data interchange and
transformation. This typically means using open or
standard formats – such as OpenDocument Format (ODF),
ASCII, tab-delimited format or comma-separated values –
as opposed to proprietary ones. Some proprietary formats,
such as MS Rich Text Format and MS Excel, are widely
used and likely to be accessible for a reasonable, but not
unlimited, time.
Thus, whilst researchers use the most suitable data
formats and software according to planned analyses,
once data analysis is completed and data are prepared
to be stored, researchers should consider converting their
research data to standard, interchangeable formats, in
order to avoid being unable to use the data in the future.
Similarly for back-ups of data, standard formats should
be considered.
For long-term digital preservation, data archives hold data
in such standard formats. At the same time, data are
offered to users by conversion to current common and
user-friendly data formats and may be migrated forward
when needed.
When researchers offer data to data archives for
preservation, researchers themselves should convert data
to a preferred data preservation format, as the person who
knows the data is in the best position to ensure data
integrity during conversions. Advice should be sought on
up-to-date formats from the intended place of deposit.
When data are converted from one format to another –
through export or by using data translation software –
certain changes may occur to the data. For data held in
spreadsheets or databases, some data or internal
metadata may be lost during conversions to another format,
e.g. missing value definitions, decimal numbers or variable
labels, or data may be truncated. For textual data, editing
such as highlighting or bold text may be lost. After software
conversions, data should therefore be checked for errors or
changes that may be caused by the export process.
The Wessex Archaeology Metric Archive Project
has brought together metric animal bone data from a
range of archaeological sites in England into a single
database format.9 The dataset contains a selection of
measurements commonly taken during Wessex
Archaeology zooarchaeological analysis of animal bone
fragments found during field investigations. The dataset
was created by the researchers in MS Excel and MS
Access formats and deposited with the Archaeology Data
Service (ADS) in the same formats. ADS has preserved
the dataset in Oracle and in comma-separated values
format (CSV) and disseminates the data via both an
Oracle/Cold Fusion live interface and as downloadable
CSV files.
Case study – data formats and
conversions
9 J.Grimm (2008). WAMAP: Wessex Archaeology Metric Archive Project. Available at ads.ahds.ac.uk/catalogue/resources.html?abmap_grimm_na_2008
DATA FORMATS AND SOFTWARE
8 9
Data formats currently recommended by UK Data Archive for long-term preservation of
research data
Note that other data centres or digital archives may recommend different formats.
Type of dataPreferred formats for management
back-ups and data preservation
Other acceptable formats
for data preservation
Quantitative tabular data with
extensive metadata
a dataset with variable labels, code labels, and
defined missing values, in addition to the matrix
of data
SPSS portable format (.por)
delimited text and command (‘setup’) file
(SPSS, Stata, SAS, etc.) containing metadata
information
structured text or mark-up file containing
metadata information, e.g. DDI XML file
proprietary formats of statistical packages
e.g. SPSS (.sav), Stata (.dta)
MS Access (.mdb/.accdb)
Quantitative tabular data with
minimal metadata
a matrix of data with or without column
headings or variable names, but no other
metadata or labelling
comma-separated values (CSV) file (.csv)
tab-delimited file (.tab)
including delimited text of given character set
with SQL data definition statements where
appropriate
delimited text of given character set – only
characters not present in the data should be
used as delimiters (.txt)
widely-used formats, e.g. MS Excel (.xls/.xlsx),
MS Access (.mdb/.accdb), dBase (.dbf) and
OpenDocument Spreadsheet (.ods)
Geospatial data
vector and raster data
ESRI Shapefile (essential – .shp,.shx, .dbf,
optional – .prj, .sbx, .sbn)
geo-referenced TIFF (.tif, .tfw)
CAD data (.dwg)
tabular GIS attribute data
ESRI Geodatabase format (.mdb)
MapInfo Interchange Format (.mif) for
vector data
Keyhole Mark-up Language (KML) (.kml)
Adobe Illustrator (.ai), CAD data (.dxf or .svg)
binary formats of GIS and CAD packages
Qualitative data
textual
eXtensible Mark-up Language (XML) text
according to an appropriate Document Type
Definition (DTD) or schema (.xml)
Rich Text Format (.rtf)
plain text data, ASCII (.txt)
Hypertext Mark-up Language (HTML) (.html)
widely-used proprietary formats, e.g. MS Word
(.doc/.docx)
proprietary/software-specific formats, e.g.
NUD*IST, NVivo and ATLAS.ti
Digital image data TIFF version 6 uncompressed (.tif)
JPEG (.jpeg, .jpg)
TIFF (other versions)(.tif, .tiff)
Adobe Portable Document Format (PDF/A,
PDF) (.pdf)
RAW image format (.raw)
software-specific formats, e.g. Photoshop files
(.psd)
Digital audio dataFree Lossless Audio Codec (FLAC) (.flac)
Waveform Audio Format (WAV) (.wav)
MPEG-1 Audio Layer 3 (.mp3)
Audio Interchange File Format (AIFF) (.aif)
Digital video data JPEG 2000 (.mj2)
Documentation and scripts
Rich Text Format (.rtf)
PDF/A or PDF (.pdf)
HTML (.htm)
Open Document Text (.odt)
plain text (.txt)
widely-used proprietary formats, e.g. MS Word
(.doc/.docx) or MS Excel (.xls/ .xlsx)
XML marked-up text (.xml) according to an
appropriate DTD or schema, e.g. XHMTL 1.0
10
Transcription
Where qualitative data are collected as audio or video
recordings, such as for interviews or focus groups, ideally
they are transcribed as textual files for archiving and
sharing. Transcripts should:
• have a unique identifier
• have a document header giving brief details of the data
collection event, including date, place, interviewer name
and interviewee details
• have a uniform layout throughout the research project
• make use of speaker tags indicating the question/
answer sequence
• use pseudonyms to anonymise personal identifying
information
• have line breaks
• be page numbered
Transcription of statistical tables from historical sources into
spreadsheets requires the digital data to be as close to the
original as possible, with attention to consistency in
transcribing and avoiding the use of formatting in data files.
Data quality control
Quality control of data is an integral part of all research
and takes place at various stages, during data collection,
data entry or digitisation, and data checking. It is important
to assign clear roles and responsibilities for data quality
assurance at all stages of research.
During data collection, researchers must ensure that the
data recorded reflect the actual facts, responses,
observations or events, for example:
• if data are collected with instruments:
- calibration of instruments is essential to check the
precision, bias and/or scale of measurement
- data are validated by checking for equipment as well
as digitisation errors
Study Title: ‘Healthy diets across generations’
Depositor: K. Jones
Interviewer: K. Jones
Interview number: 12
Interview ID: Chris Smith
Date of interview: 3 May 2007
Information about interviewee
Date of birth: 6 June 1949
Gender: male
Marital status: widowed
Occupation: bricklayer
Geographic region: North-East England
KJ: Just one or two factual details first of all before we go
on to your health and that... how old are you?
CS: I’m 58 in June.
KJ: What schools did you go to? Can you remember that
far back!
CS: Oh... the last school was at Longside... aye,
ken Longside?
Sample transcript
10 Register of Ecological Models (n.d.). Register of Ecological Models. Available at ecobas.org/www-server/index.html
11 K.Rennols, (2001). Forest Model Archive. Available at cms1.gre.ac.uk/conferences/iufro/FMA/index.htm
12 N.Le Novère, B.Bornstein, A.Broicher, M.Courtot, M.Donizelli, H.Dharuri, L.Li, H.Sauro, M.Schilstra, B.Shapiro, J.L.Snoep and M.Hucka (2006) BioModels Database:
A Free, Centralized Database of Curated, Published, Quantitative Kinetic Models of Biochemical and Cellular Systems. Nucleic Acids Res., 34: D689-D691. Available at
www.ebi.ac.uk/biomodels-main/
Various initiatives aim to preserve and share
modelling software and code. In ecology, meta-databases
such as the Register of Ecological Models provide
standardised metadata and documentation for existing
ecological and environmental models and, amongst a
variety of metadata, specifies the software used to
develop the model.10 The model code is usually
accessible by contacting the model creator. The Forest
Model Archive developed at the University of Greenwich
offers data providers the opportunity to archive models,
besides providing structured documentation.11
In biology, the BioModels Database of the European
Bioinformatics Institute (EBI) is a repository for published
mathematical models of biological processes and
molecular functions.12 All models are annotated and linked
to relevant data resources. Researchers are encouraged
to deposit models written in an open source XML-based
format, the Systems Biology Markup Language (SBML),
and models are curated for long-term preservation.
Case study – preserving and
sharing models
11
- data may be verified by checking the truth of the
record with an expert or by taking multiple
measurements, observations or samples
• standardised methods and protocols can be used for
capturing observations, alongside recording forms with
clear instructions
• computer-assisted interview software can be used to
standardise interviews, verify response consistency,
route and customise questions so that only appropriate
questions are asked, confirm responses against previous
answers where appropriate and detect inadmissible
responses
The quality of data collection methods used has a
significant bearing on data quality. Documenting in detail
how data are collected increases their quality.
When data are digitised, transcribed, entered in a
database or spreadsheet or coded, quality is ensured by
adhering to standardised and consistent procedures for
data entry with clear instructions. This may include setting
up validation rules or input masks in data entry software;
using data entry screens; using controlled vocabularies,
code lists and choice lists to minimise manual data entry;
detailed labelling of variable and record names to avoid
confusion; or designing a purpose-built database structure
to organise data and data files.
During data checking, data are edited, cleaned, verified,
cross-checked and validated. Checking typically involves
both automated and manual procedures. This may include:
double-checking coding of observations or responses and
out-of-range values; checking data completeness; verifying
random samples of the digital data against the original
data; double entry of data; statistical analyses such as
frequencies, means, ranges or clustering to detect errors
and anomalous values; or peer review.
Version control
It is important to ensure that different copies or versions
of files, data held in different formats or locations, and
information that is cross-referenced between files are all
subject to version control. Checks and procedures should be
put in place to make sure that if the information in one file is
altered, the related information in other files is also updated
if necessary. It is important to keep track of which version of
a file is the most current, especially where data files are
shared between people or held in different locations.
Best practice is to:
• uniquely identify files, preferably using a systematic
naming convention
• clearly record version and status of a file, e.g. draft,
interim, final, internal
• record what changes are made to a file when a new
version is created
• record relationships between items as, in many cases,
the information contained in a single file is supported by
information held in other files, e.g. relationship between
the code and the data file it is run against, or between the
data file and the documentation or metadata that relate to
it, or between multiple tables
• track the location of all files if stored in a variety of
locations
• regularly synchronise files in different locations, e.g. using
MS SyncToy software
• maintain single master files in a suitable format to remove
version control problems associated with multiple working
versions being developed in parallel
Version control can be maintained through:
• file naming conventions, using number sequences or
dates in file names – although avoid very long file names
or using spaces and special characters in file names
• including a file history or version control table at the start
of each file, in which versions, dates, authors and details
of changes to the file are recorded
• version control facilities within software
• controlling rights to file editing
• versioning software, e.g. Subversion (SVN), or file
sharing software such as Google Docs or Amazon S3
• manual merging of entries or edits by multiple users
12
Authenticity
Digital information can be copied, altered or deleted very
easily. It is therefore important to be able to demonstrate
the authenticity of data and to prevent unauthorised access
to data that may potentially lead to unauthorised changes.
Best practice to ensure authenticity and control access
is to:
• keep a master file of data
• assign responsibility for master files where possible to an
individual member of the project team
• restrict write access to master versions to specific
members of the project team
• create a formal procedure for the destruction of master
files
• record all changes to master files
• maintain old master files in case later ones contain errors
• archive copies of master files at regular intervals
Adding value
Researchers can add significant value to their datasets
by including additional variables or parameters that widen
the possible applications. Including standard parameters
or generic derived variables in data files may substantially
increase the potential re-use value of a dataset and provide
new avenues for research. For example, geo-referencing
data may allow other researchers to more easily add
value to data and apply the data in geographical
information systems.
The Commission for Rural Communities (CRC) often use
existing survey data to undertake rural and urban
analysis of national scale data in order to analyse policies
related to deprivation.
In order to undertake this type of spatial analysis,
original postcodes need to be accessed and
retrospectively recoded according to the type of rural or
urban settlements they fall into. This can be done with the
use of products such as the National Statistics Postcode
Directory, which contain a classification of rural and urban
settlements in England.
The task of applying these geographical markers to
datasets can often be a long and sometimes unfruitful
process – sometimes the CRC have to go through this
process to just find that the data do not have a
representative rural sample frame.
If rural and urban settlement markers, such as the
Rural/Urban Definition for England and Wales were
included in datasets, this would be of great benefit to
those undertaking rural and urban analysis.13
Case study – adding value to data
13 DEFRA, ODPM, NAW, ONS and Countryside Agency (2004). Rural and Urban Area Classification 2004. Available at www.statistics.gov.uk/geography/rudn.asp
10 13
DATA STORAGE, BACK-UP AND SECURITY
Making back-ups
Making back-ups of files is an essential element of data
management. Regular back-ups protect against accidental
or malicious data loss due to:
• hardware failure
• software or media faults
• virus infection or malicious hacking
• power failure
• human errors
Backing-up involves making copies of files which can be
used to restore originals if there is loss of data. Choosing a
precise back-up procedure to adopt depends on local
circumstances, the perceived value of the data and the
levels of risk considered appropriate for the circumstances.
Where data contain personal information, care should be
taken to only create the minimal number of copies needed,
e.g. a master file and one back-up copy.
When deciding upon the best back-up procedure for data
files, researchers need to consider:
• whether to back-up particular data files or back-up the
entire system
• the frequency of back-up needed
- whether to back-up after each change to data file or
at regular intervals
- frequently used and critical data files may be backed-
up daily using an automated back-up process
• which institutional back-up policy may be in place –
data on an institutional network space may be
automatically backed-up at regular intervals
• back-up strategies for all systems where data are held,
including portable computers or devices, non-network
computers and home-based computers if applicable
• whether to carry out incremental or differential back-up
- incremental back-up consists of first making a copy of
all relevant files, then making incremental back-ups of
the files which have altered since the last back-up;
removable media (CD/DVD) is recommended for this
- for differential back-ups a complete back-up is made
first, then back-ups are made of files changed or
created since the first full back-up and not just since
the last partial back-up; using fixed media such as
hard drives is recommended for this
• the choice of media on which to store back-up files
- depends on the quantity of files, type of data, and the
preferred method of backing-up
- examples include recordable CD/DVD, networked
hard drive, removable hard drive or magnetic tape
• location of back-up files – online back-up files or offline
storage on removable media or transportable hard drives
which can be physically removed to another location for
safe-keeping
• organising and clearly labelling all back-up files
• file formats
- back-ups of master copies should ideally be in formats
that are suitable for long-term digital preservation,
i.e. open as opposed to proprietary formats
• verifying and validating back-up files regularly
- fully restoring them to another location and comparing
them with the originals
- checking back-up copies for completeness and
integrity, for example by checking the MD5 checksum
value, file size and date
• not overwriting old back-ups with new ones
Data storage
The storage of any digital research data should be based
on two principles:
• digital storage media are inherently unreliable unless they
are stored appropriately
• all file formats and physical storage media will ultimately
become obsolete
Alongside a back-up strategy, a data storage strategy
should be in place. Media currently available for storing
data files are optical media (CDs and DVDs) and magnetic
media (hard drives and tapes).
Best practice is to:
• store data in formats which meet long-term software
readability requirements; in general this means non-
proprietary formats or open standard formats (see the
data formats table)
• obtain up-to-date guidance since changes to formats and
media may occur rapidly
• copy or migrate data files to new media between two and
five years after they were first created, since optical media
and magnetic media are subject to physical degradation
14
• check the data integrity of stored data files at
regular intervals
• ensure that any storage strategy, even for a short-term
project, involves at least two different forms of storage,
e.g. on hard drive and on CD
• create digital versions of paper documentation in PDF/A
format for long-term preservation and storage
• organise stored data well, ensuring they are easily
located and physically accessible
• ensure that areas and rooms designated for storage of
digital or non-digital data are suitable for the purpose, are
structurally sound and free from the risk of flood and fire
Note that optical and magnetic media are vulnerable to
poor handling and changes in temperature, relative
humidity, air quality and lighting conditions. The National
Preservation Office has published guidelines on caring for
CDs and DVDs.14
Non-digital printed materials and photographs are equally
subject to degradation, for example from sunlight, from acid
in paper or from acid in sweat on the skin.
Data security
Data security is the protection of data from unauthorised
access, use, change, disclosure and destruction; as well as
the prevention of unwanted changes that can affect the
integrity of data. Ensuring data security requires paying
attention to physical security, network security and security
of computer systems and files.
Data security may be needed to protect intellectual
property rights, commercial interests, or to keep sensitive
information safe. Where the safeguarding of personal data
is involved, data security is based on national legislation
(Data Protection Act 1998) which dictates that personal
data should only be accessible to authorised persons.
Personal data may also exist in non-digital format, for
example as patient records, completed consent forms or
interview cover sheets. These should be protected in the
same way as digital files.
Data which contain personal information should be
treated with higher levels of security than data which do
not. Data security arrangements need to be proportionate
to the nature of the data and the risks involved. Data
security can be made easier by separating data content
according to security needs. For example, where data
relate to human subjects, personal information such as
names and addresses can be removed from data files
and stored separately.
Physical data security requires:
• appropriate or restricted access to rooms and buildings
where data, computers or media are held
• logging the removal of, and access to, media or hardcopy
material in store rooms
• transporting sensitive data only under certain exceptional
circumstances, even for repair purposes, e.g. giving a
failed hard drive containing sensitive data to a computer
manufacturer may cause a breach of security
Network security means:
• not storing confidential data such as those containing
names or addresses on servers or computers connected
to an external network, particularly servers that host
internet services
14 L.Finch & J.Webster (2008). Caring for CDs and DVDs. NPO Preservation Guidance. Preservation in Practice Series. London,
National Preservation Office. Available at www.bl.uk/npo/pdf/cd.pdf
A research team carrying out coral reef research
collects field data using handheld Personal Digital
Assistants (PDAs). Digital data are transmitted daily to
the institution’s network drive, where they are held in
password protected files. All data files are identified by
an individual version number and creation date. Version
information (version numbers and notes detailing
differences between versions) is stored in a spreadsheet,
also on the network drive. The institution’s network
drive is fully backed-up onto Ultrium LTO2 data tapes.
Incremental back-ups are made daily Monday to
Thursday; full server back-ups are made over Friday
/Saturday/Sunday. Tapes are securely stored in a
separate building. Upon completion of the research the
datasets are deposited in the institution’s digital repository.
Case study – data back-up and storage
15
Security of computer systems and files may include:
• locking computer systems with a password and installing
a firewall system
• protecting from viruses and malicious code through
regularly updated virus detection software
• implementing password protection of, and access
permissions to, data files, e.g. no access, read only,
read and write or administrator-only permission
• controlling access to restricted materials with encryption
and/or password protection
• imposing confidentiality agreements for data users of
confidential data
• not sending personal or confidential data via email or
through File Transfer Protocol (FTP), but rather transmit
as encrypted data
• carrying out the destruction of data in a consistent
manner when this is needed
- paper should be shredded and computer files
permanently deleted from all systems
- deleting all files and reformatting a hard drive will not
prevent the possible recovery of data that have
previously been on that hard drive
- specialist advice may need to be sought where
needed, CD/DVD shredders can be used and hard
drives removed from their casings and disposed of
securely
It is worth remembering that file sharing services such as
Google Docs are not necessarily permanent.
The risk of security breaches and disclosure of confidential
data can also be removed or reduced by:
• anonymising digital data by removing identifiers or
aggregating data (see section on Anonymising data)
• separating disclosable from non-disclosable data
by obscuring, removing or hiding individual fields, records
or data files
Encrypting data for transmission Data encryption will maintain data security during
transmission. After testing a number of software
applications for encrypting data to enable secure data
transmission from government departments to the UK Data
Archive, the UK Data Archive recommends the use of
Pretty Good Privacy (PGP), an industry-standard
encryption technology. Available supporting encryption
software can be open source, e.g. GnuPG, or commercial,
e.g. PGP.
Encryption requires the creation of a Public and Private
Key pair and passphrase. The Private PGP Key and
passphrase are used to digitally sign each encrypted file,
and thus allow the recipient to validate the sender’s identity.
The recipient’s Public PGP Key is installed by the sender in
order to encrypt files so that only the authorised recipient
can decrypt them.
In February 2008 the British Library (BL) received
the recorded output of the Survey of Anglo-Welsh Dialects
(SAWD), carried out by University College, Swansea,
between 1969 and 1995. This survey recorded the English
spoken in Wales by interviewing and tape-recording
elderly speakers on topics including the farm and farming,
the house and housekeeping, nature, animals, social
activities and the weather. The collection was deposited in
the form of 503 digital audio files, which were accessioned
as .wav files in the BL’s Digital Library. Digital clones of all
files are held at the Archive of Welsh English, alongside
the original master recordings on 151 audio cassettes,
from which the digital copies were created.
The BL’s Digital Library is mirrored on four sites – at
Boston Spa, St Pancras, Aberystwyth and a ‘dark’ archive
which is provided by a third party. Each of these servers
has inbuilt integrity checks. The BL makes available
access copies for users, in the form of .mp3 audio files, in
the British Library Reading Rooms via the Soundserver
system. A small set of audio extracts from the SAWD
recordings are also available online on the BL’s Accents
and Dialects web site, Sounds Familiar.
Case study – data back-up, storage
and security
When research involves obtaining data from people, e.g.
in social, anthropological or medical research, researchers
are expected to maintain high ethical standards. Ethical
guidelines are typically issued by professional bodies,
institutions and funding organisations. Research will usually
require obtaining informed consent for people to participate
in research and for use of the information collected. It is
essential that consent also takes into account long-term
use of data, such as preserving and sharing data. Without
consent for data sharing, opportunities for sharing research
data with other researchers can be jeopardised.
Personal, confidential and sensitive data
At times data obtained from people may hold sensitive or
confidential information. This does not mean that all data
obtained by research with participants are confidential.
Personal data are defined in the Data Protection Act
1998 as data which relate to a living individual who can
be identified from those data, or from those data and
other information which is in the possession of, or is
likely to come into the possession of, the data controller
(e.g. researcher). This includes any expression of opinion
about the individual.
Confidential data are data that:
• can be connected to the person providing them or
that could lead to the identification of a person referred
to (names, addresses, occupation, photographs)
• are given in confidence, or data agreed to be kept
confidential (secret) between two parties, that are not
in the public domain
• are conditioned by factors such as ethical guidelines,
legal requirements or research-specific consent
agreements
Sensitive personal data are defined in the Data Protection
Act 1998 as data that may incriminate a participant or third
party, such as a person’s race, ethnic origin, political
opinion, religious beliefs, trade union membership, physical
or mental health, sexual orientation, criminal proceedings
or convictions.
Strategies for dealing with confidentiality depend upon
the nature of the research, but are essentially informed
by a researcher’s ethical obligations towards participants
and society and by legislation such as the Data Protection
Act 1998.
Legislation that may impact on the sharing of
confidential data:15
• duty of confidentiality
• Data Protection Act 1998
• Freedom of Information Act 2000
• Human Rights Act 1998
• Statistics and Registration Services Act 2007
• Environmental Information Regulations 2004
Sensitive and confidential data can be shared ethically if
researchers pay attention, from the planning stages of
research, to three important aspects:
• when gaining informed consent, include consent for
data sharing
• where needed, protect people’s identities by
anonymising data
• consider access restrictions to data
These measures should be considered jointly and never
in isolation. The same measures form part of good
research practice and data management, even if data
sharing is not envisioned.
1615 UK Data Archive (2009). Research ethics and legislation relevant to data sharing. Available at www.data-archive.ac.uk/sharing/legal.asp
ETHICS, CONSENT AND CONFIDENTIALITY
Thank you very much for agreeing to participate in this survey.
The information provided by you in this questionnaire will be used for research purposes. It will not be used in a
manner which would allow identification of your individual responses.
Anonymised research data will be archived at ………. in order to make them available to other researchers in
line with current data sharing practices.
I have read and understood the project information sheet dated DD/MM/YYYY. ❑
I have been given the opportunity to ask questions about the project. ❑
I agree to take part in the project. Taking part in the project will include being interviewed and audio
recorded [other forms of participation can be listed]. ❑
I understand that my taking part is voluntary; I can withdraw from the study at any time and I will not
be asked any questions about why I no longer want to take part. ❑
Select only one of the next two options:I would like my name used where what I have said or written as part of this study will be used in
reports, publications and other research outputs so that anything I have contributed to this project
can be recognised. ❑
I do not want my name used in this project. ❑
I understand my personal details such as phone number and address will not be revealed to people
outside the project. ❑
I understand that my words may be quoted in publications, reports, web pages, and other research
outputs but my name will not be used unless I requested it above. ❑
I agree for the data I provided to be archived at ………. [More detail can be provided here so thatdecisions can be made separately about audio, transcripts, etc.] ❑
I understand that other researchers will have access to this data only if they agree to preserve the
confidentiality of that data and if they agree to the terms I have specified in this form. ❑
I understand that other researchers may use my words in publications, reports, web pages, and
other research outputs according to the terms I have specified in this form. ❑
I agree to assign the copyright I hold in any materials related to this project to [name of researcher]. ❑
________________________ ________________ ________
Name of Participant Signature Date
________________________ ________________ ________
Researcher Signature Date
Contact details for further information: Names, phone, email addresses, etc.
Sample consent statement for quantitative surveys
Sample extensive consent form for interviews
17
Informed consent and data sharing
It is essential that gaining consent takes into account any
future uses of data, such as the sharing, preservation and
long-term use of research data. Researchers should:
• inform participants how research data will be stored,
preserved and used in the long-term
• inform participants how confidentiality will be maintained,
e.g. by anonymising data
• obtain informed consent (written or verbal) for
data sharing
To ensure that consent is informed, consent must be
freely given with sufficient information provided on all
aspects of participation and data use. There must be active
communication between the parties. Consent must never
be inferred from a non-response to a communication such
as a letter.
Written or verbal consent?
Whether informed consent is obtained in writing through
a detailed consent form, by means of an informative
statement, or verbally, depends on the nature of the
research, the kind of data gathered, the data format and
how the data will be used.
• For detailed interviews or research where personal,
sensitive or confidential data are gathered, the use of
written consent forms is recommended to assure
compliance with the Data Protection Act and with ethical
requirements. Written consent documentation typically
includes an information sheet and consent form signed
by the participant.
• For surveys or informal interviews, where no personal
data are gathered or personal identifiers are removed
from the data, obtaining written consent may not be
required. At a minimum an information sheet should be
provided to participants detailing the nature and scope of
the study, the identity of the researcher(s) and what will
happen to the data collected (including any data sharing).
• If data are collected verbally through audio or video
recordings, verbal consent agreements can be recorded
together with the data.
• For audio-visual data where the identity of people
may be disclosed from the data, it may be important that
informed consent is obtained to use the data unaltered
for research purposes, sharing and preservation.
Voice alteration or image blurring are usually labour
and cost intensive and may decrease the research
potential of data.
Sample consent forms are available from the UK
Data Archive.16
There is a potential tension between data sharing
and data protection. Data archives work to increase
availability of, and access to, research data, while the
primary purpose of Research Ethics Committees (RECs)
is to ensure ethical conduct in research and to protect the
safety, rights and well being of research participants.
The need to protect personal data and preserve
confidentiality — where explicitly required — cannot be
overstated. This does not mean, however, that all
research data obtained from research with people should
be kept confidential, cannot be shared or, even worse,
are destroyed. It is important to distinguish between
personal or sensitive data collected in research, and
research data in general. Personal data should not be
disclosed, unless consent has been given for disclosure.
Identifiable information may be excluded from data
sharing. A REC should, however, not object to the sharing
of research data in general. If research data contain
sensitive or confidential information, then the sharing of
such data must be considered carefully, but should not be
dismissed as being impossible.
18 16 UK Data Archive (2009). Consent forms. Available at www.data-archive.ac.uk/sharing/consentforms.asp
Research Ethics Committees and
data sharing
20 19
One-off or process consent?
Discussing and obtaining consent for:
• participation in research
• the use of the information gathered for analyses,
publications and outputs
• data sharing beyond the research
can be a one-off occurrence or an ongoing process.
One-off consent is simple, practical, avoids repeated
requests to participants, and meets the formal requirements
of most Research Ethics Committees. However, it may
place too much emphasis on ‘ticking boxes’.
If consent is considered throughout the research
process, it assures active informed consent from
participants. Thus, consent for participation in research,
for data use and for data sharing can be considered at
different stages of the research, giving participants a
clearer view of what participating in the research involves
and what the data to be shared consist of. It may, however,
be too repetitive and annoying for some participants.
Special consent considerations are needed for:17
• medical research
• research with children and young adults
• research with people with learning difficulties
• research within organisations or the workplace
• research into crime
• internet research
The Biological Records Centre (BRC) is the national
custodian of data on the distribution of wildlife in the
British Isles.18 Data are provided by volunteers,
researchers and organisations. BRC disseminates data
for environmental decision-making, education and
research. Data whose publication could present a
significant threat to a species or habitat (e.g. nesting
location of birds of prey) will be treated as confidential.
The BRC provides access to the data it holds via the
National Biodiversity Network Gateway. Standard access
controls are as follows:
• public access to view and download all records at a
minimum 10 km2 level of resolution, and at higher
resolution if the data provider agrees
• registered users have access to view and download
all except confidential records at the 1 km2 level
of resolution
• conservation organisations have access to view and
download all except confidential records at full
resolution with attributes
• conservation officers in statutory conservation agencies
have access to view and download all records,
including confidential records at full resolution with
attributes
• records that have been signified as confidential
by a data provider will not be made available to the
conservation agencies without the consent of the
data provider
Case study – sharing confidential data
17 UK Data Archive (2009). Special cases of consent. Available at www.data-archive.ac.uk/sharing/consentspecial.asp
18 Biological Records Centre (n.d.). Biological Records Centre. Available at www.brc.ac.uk
Anonymising data
Before data obtained from research with people can be
disseminated, published or shared with other researchers,
they may need to be anonymised so that individuals,
organisations or businesses cannot be identified from the
data.19 Anonymisation may be needed for ethical reasons to
protect people’s identities, for legal reasons to not disclose
personal data, or for commercial reasons. Personal data
should not be disclosed from research information, unless a
respondent has given specific consent to do so. In some
forms of research, for example where oral histories are
recorded or in anthropological research, it is customary to
publish and share the names of people studied, for which
they have given their consent.
Quantitative datasets may be anonymised by:
• removing direct identifiers, e.g. name or address
• aggregating or reducing the precision of a variable,
e.g. replacing the date of birth by age groups or
replacing full postcodes with postcode sectors
• generalising the meaning of a detailed text variable,
e.g. replacing a doctor’s detailed area of medical
expertise with ‘an area of medical speciality’
• restricting the upper or lower ranges of a variable to
hide outliers, e.g. top-coding salaries
Special attention may be needed for:
• relational data, where relations between variables in
related datasets can disclose identities
• geo-referenced data, where identifying spatial references
such as point co-ordinates also have a geographical value
Simply removing spatial references prevents disclosure,
but it also means that all geographical, locational and
related information is lost. A better option may be to keep
spatial references intact and to impose access restrictions
on the data instead. As an alternative, point co-ordinates
may be replaced by larger, non-disclosing geographical
areas or by meaningful alternative variables that typify the
geographical position.
When anonymising qualitative material such as
transcriptions of textual data, identifiers should not be
crudely removed or aggregated, as this can distort the
data or even make them unusable. Instead pseudonyms,
replacement terms or vaguer descriptors should be used.
The objective should be to achieve a reasonable level of
anonymisation, avoiding unrealistic or overly harsh editing,
whilst maintaining maximum content. Procedures to
anonymise data should always be considered alongside
obtaining informed consent for data sharing.
Best practice for qualitative data is to:
• plan anonymisation at the time of transcription or initial
write up
• use pseudonyms or replacements
• retain unedited versions of data for use within the
research team and for preservation
• create an anonymisation log of all replacements,
aggregations or removals made; care should be taken to
store such a log separately from the anonymised data files
• identify replacements in a meaningful way,
e.g. with [brackets]
Digital manipulation of audio and image files can be
used to remove personal identifiers. However, techniques
such as voice alteration and image blurring are labour-
intensive and expensive to apply to large quantities of data
and are likely to damage the research potential of the data.
If confidentiality of audio-visual data is an issue, it is better
to obtain the participant’s consent to use and share the
data unaltered.
A person’s identity can be disclosed from:
• direct identifiers, e.g. name, address, postcode
information or telephone number
• indirect identifiers that, when linked with other
publicly available information sources, could identify
someone, e.g. information on workplace, occupation or
exceptional values of characteristics like salary or age
20 19 UK Data Archive (2009). Anonymising research data. Available at www.data-archive.ac.uk/sharing/anonymise.asp
Access control
Under certain circumstances, sensitive and confidential
data can also be safeguarded by regulating or restricting
access to and use of such data, while at the same time
enabling data sharing for research and educational
purposes.
Data held at data centres and archives are not generally in
the public domain. Their use is restricted to specific
purposes after user registration. Users sign an End User
Licence in which they agree to certain conditions, i.e. not to
use data for commercial purposes or identify any potentially
identifiable individuals.
Data centres may impose stricter access regulations for
confidential data, such as:
• providing access to approved researchers only
• requiring data access authorisation from the data owner
prior to release
• placing confidential data under embargo for a given
period of time until confidentiality is no longer pertinent
• providing secure access to data, which enables analysis
of confidential data but excludes access to the data or the
ability to download the data
Mixed levels of access regulations may be put in place for
some datasets, combining regulated access to confidential
data with user access to non-confidential data.
Data centres typically liaise with the researchers who own
the data in selecting the most suitable type of access for
data. Access regulations should always be proportionate to
the kind of data and confidentiality involved.
Open access
The digital revolution has caused a strong drive towards
open access of information, with the internet making
information sharing fast, easy, powerful and empowering.
Scholarly publishing has seen a strong move towards open
access to increase the impact of research, with e-journals,
open access journals and copyright policies enabling the
deposit of outputs in open access repositories. The same
open access movement also steers towards more open
access to the underlying data and evidence on which
research publications are based. More and more journals
require for data that underpin research findings to be
published in open access repositories before manuscripts
are submitted.
22 21
UK Biobank aims to collect medical and genetic data from
500,000 middle-aged people across the UK in order to
create a research resource to study the prevention and
treatment of serious diseases. Stringent security,
confidentiality and anonymisation measures are in place.20
Biobank holds personal data on recruited patients, their
medical records and blood, urine and genetic samples;
with data made available to approved researchers.
Data or samples provided to researchers never include
personal identifying details.
All information and samples are stored anonymously
by removing any identifying information. This identifying
information is encrypted and stored separately in a
restricted access database that is controlled by senior UK
Biobank staff. Identifying data and samples are only linked
using a code that has no external meaning. Only a few
people within UK Biobank have access to the key to the
code for re-linking participants’ identifying information with
data and samples. All staff sign confidentiality agreements
as part of their employment contracts.
Case study – data security and
anonymisation
20 UK Biobank (2007). UK Biobank ethics and governance framework. Available at www.ukbiobank.ac.uk/docs/EGFlatestJan20082.pdf
22
a
The Secure Data Service is a secure environment to provide researchers with controlled and restricted access to disclosive
micro data.21 Its operation is legally framed by the 2007 Statistics Act which makes access to the confidential data for
statistical purposes possible.
The technical model shares many similarities with the ONS Virtual Micro Laboratory and the NORC Secure Data Enclave.22
It is based around a Citrix infrastructure which turns the end user’s computer into a remote terminal giving access to data,
to statistical software and to collaborative spaces on a central secure server held within the UK Data Archive.
It is secure because all data processing is carried out on a central secure server, which processes all requests centrally and
returns information about the results. No data travels over the network. No data can be imported or exported by users, except
the statistical results — sent from the central server to the remote location by an encrypted email after the final outputs are
checked against statistical disclosure controls.
The clearing-house mechanism established following the Convention on Biological Diversity to promote information sharing,
has resulted in an exponential increase in openly accessible biodiversity and ecosystem data since 1992. The Forest Spatial
Information Catalogue is a web-based portal, developed by the Center for International Forestry Research (CIFOR), for public
access to spatial data and maps.23
The catalogue holds satellite images, aerial photographs, land usage and forest cover maps, maps of protected areas,
agricultural and demographic atlases and forest boundaries. For example, forest cover maps for the entire world, produced
by the World Conservation Monitoring Centre in 1997 can be downloaded freely as digital vector data.
The Global Biodiversity Information Framework (GBIF) strives to make the world’s biodiversity data accessible everywhere in
the world.24 The framework holds millions of species occurrence records based on specimens and observations, scientific and
common names and classifications of living organisms and map references for species records. Data are contributed by
numerous international data providers. Geo-referenced records can be mapped to Google Earth.
Case study – access restrictions vs. open access
21 UK Data Archive (2009). Secure Data Service. Available at securedata.ukda.ac.uk/
22 Office for National Statistics (n.d.). Virtual Microdata Laboratory. Available at www.ons.gov.uk/about/who-we-are/our-services/unpublished-
data/business-data/vml/index.html
National Opinion Research Center (n.d.). NORC Data Enclave. Available at www.norc.org/DataEnclave
23 CIFOR (2005). Forest Spatial Information Catalogue. Available at gislab.cifor.cgiar.org/fsic/goHome.do
24 GBIF (n.d.). Global Biodiversity Information Framework Data Portal. Available at data.gbif.org/welcome.htm
COPYRIGHT
Researchers ‘creating’ data hold copyright over such
data. Copyright is an intellectual property right that
protects the owner of a work from its unauthorised
copying. Most research materials including spreadsheets,
publications, reports and computer programmes, fall under
literary work and are therefore protected by copyright.
Facts, however, cannot be copyrighted.
If information is structured in a database, the structure
acquires a database right, alongside the copyright in the
content of the database. A database may be protected by
both copyright and database right. For database right to
apply, the database must be the result of substantial
intellectual investment in obtaining, verifying or presenting
the content.
In the case of interviews, the interviewee holds the
copyright in the spoken word. If a transcription is a
substantial reproduction of the words spoken, the speaker
will own copyright in the words and the transcriber will
have separate copyright of the transcription.
In the case of collaborative research or derived data,
copyright may be held jointly by various researchers or
institutions. Copyright should be assigned correctly,
especially if datasets have been created from a variety
of sources; for example those which have been bought
or ‘lent’ by other researchers.
In academia, in theory the employer is the first owner
of the copyright in a work made during the course of the
employee’s employment. Many academic institutions,
however, waive copyright in research materials, data and
publications and give ownership to the researchers.
Researchers should check with their institutions if their
institution retains copyright or waives it.
When data are archived or shared, the researcher or
data creator keeps the copyright over data. A data archive
cannot effectively archive data unless all the rights holders
are identified and give their permission for the data to be
archived.
Secondary users of data must obtain copyright clearance
from the rights holder before data can be reproduced.
Data can be copied for non-commercial teaching or
research purposes without infringing copyright, under
the fair dealing concept, providing that the ownership of
the data is acknowledged to the copyright holder. An
acknowledgement should give credit to the data source
used, the data distributor and the copyright holder. Data
centres typically specify how data use should be
acknowledged and cited, either within the metadata record
for a dataset or in a data use licence.
Some researchers may have come across the concept
of Creative Commons licences which, allow creators to
communicate the rights which they wish to keep and the
rights which they wish to waive in order for other people
to make re-use of their intellectual properties more
straightforward. For some datasets Creative Commons
licences may be appropriate, but for others it is not.
Creative Commons itself does not recommend using
Creative Commons licences for informational databases,
rather to use the Science Commons Open Access
Data Protocol.25
Scenario: A researcher has collated articles about the
Prime Minister from The Guardian over the past ten
years, using the LexisNexis database to source articles.
They are then transcribed/copied by the researcher into
a database so that content analysis can be applied.
The researcher offers a copy of the database with the
original transcribed text to a data centre.
Rights Issue: Researchers cannot share either of
these data sources as they do not have copyright in the
original material. A data centre cannot accept these data
as to do so would be breach of copyright. The rights
holders, in this case The Guardian and LexisNexis, would
need to provide consent for archiving.
Case study – copyright of media sources
2325 Science Commons (n.d.). Protocol for Implementing Open Access Data. Available at sciencecommons.org/projects/publishing/open-access-data-protocol/
24
Scenario: A researcher has used the National Diet and Nutrition Survey (NDNS) data, obtained via the UKDA. NDNS data are
Crown Copyright. The researcher has processed the NDNS data (filtered, integrated and aggregated data across variables,
while maintaining individual records) and used the processed data to model food chain risks. The researcher would like to
archive the processed data that were used as input data for the modelling, as well as the modelling code, at the UK Data
Archive.
Rights Issue: There is joint copyright over the processed data, shared between the researcher and the Crown (holding
copyright over the NDNS data). The researcher should declare this joint copyright for the modelling input data and requires no
further permission from the Crown. The UK Data Archive End User Licence, which the researcher signed when obtaining the
NDNS data from the UK Data Archive, specifically states “offer for deposit any new data collections derived from the data
supplied or created by the combination of the data supplied with other data”. Thus the UK Data Archive can archive the
processed data with a joint copyright declaration.
Scenario: A researcher has interviewed five retired cabinet ministers about their careers, producing audio recordings and full
transcripts. The researcher then analyses the data and offers them to a data centre for preserving. However the researcher
did not get signed copyright transfers for the interviewees’ words.
Rights Issue: In this case it would be problematic for a data centre to accept the data. Large extracts of the data cannot be
quoted by secondary users. To do so would breach the interviewees’ copyright over their words. This is equally a problem for
the primary researcher. The researcher should have asked for transfer of copyright or a licence to use the data obtained
through interviews, as the possibility exists that the interviewee may at some point wish to assert the right over their words,
e.g. when publishing memoirs.
Scenario: A researcher subscribes to access spatial AgCensus data from EDINA. These data are then integrated with data
collected by the researcher. As part of the ESRC award contract the data has to be offered for archiving at the UK Data
Archive. Can such integrated data be offered?
Rights Issue: The subscription agreement on accessing AgCensus data states that data may not be transferred to any other
person or body without prior written permission from EDINA. Therefore, the UK Data Archive cannot accept the integrated
data, unless the researcher obtains permission from EDINA. The researcher’s partial data, with the AgCensus data removed,
can be archived. Secondary users could then re-combine these data with the AgCensus data, if they were to obtain their own
AgCensus subscription.
Case study – copyright of licensed data
Case study – copyright of interviews with ‘elites’
Case study – data obtained from UK Data Archive
Examples of data research centres
Antarctic Environmental Data Centre
Archaeology Data Service
Biomedical Informatics Research Network Data Repository
British Atmospheric Data Centre
British Library
British Oceanographic Data Centre
Cambridge Crystallographic Data Centre
Economic and Social Data Service
Environmental Information Data Centre
European Bioinformatics Institute
Geospatial Repository for Academic Deposit and Extraction
History Data Service
Infrared Space Observatory
National Biodiversity Network
National Geoscience Data Centre
NERC Earth Observation Data Centre
NERC Environmental Bioinformatics Centre
Petrological Database of the Ocean Floor
Publishing Network for Geoscientific and
Environmental Data (PANGAEA)
Scran
The Oxford Text Archive
UK Data Archive
UK Solar System Data Centre
Visual Arts Data Service
www.data-archive.ac.uk/sharing
UK Data ArchiveUniversity of EssexWivenhoe ParkColchesterCO4 3SQEmail: [email protected]: +44 (0)1206 872974
www.dataarchive.ac.uk/sharing