managing and sharing data - university of oxford...efficiency of the data management during...

MANAGING AND SHARING DATA

a best practice guide for researchers

Second Edition

Research Laboratory

© 2009 University of Essex

First published 2009 Second edition with minor revisions 2009

Authors: Veerle Van den Eynden, Louise Corti, Matthew Woollard and Libby Bishop.

All references available 18 September 2009.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in anyform or by any means, electronic, mechanical, photocopying, recording or otherwise, without written permission fromthe publisher.

Published by: UK Data Archive University of Essex Wivenhoe Park Colchester Essex CO4 3SQ

ISBN: 1-904059-66-X

Printed by University of Essex Printing Services

The scientific process is enhanced by managing and sharing research data. Good data management practice

allows reliable verification of results and permits new and innovative research built on existing information.

This is important if the full value of public investment in research is to be realised.

JISC considers it a priority to promote and support good data management and sharing for the benefit of

UK Higher Education and Research. Through the Managing Research Data Programme, JISC is targeting a

number of key areas: helping higher education institutions plan their data management practice; piloting the

development of essential data management infrastructure; and promoting the acquisition of appropriate skills,

among academics and research support staff in Universities. JISC also funds the Digital Curation Centre,

which provides internationally recognised expertise in this area, as well as support and guidance for UK HE.

JISC recognises the UK Data Archive as a key partner and stakeholder in these strategic objectives.

Managing and Sharing Data: a best practice guide for researchers, is a timely and important publication,

and one which JISC wholeheartedly endorses.

Simon Hodson, JISC

September 2009

Endorsed by:

FOREWORD 2

DATA LIFECYCLE 3

SHARING DATA – WHY AND HOW 4

Why share research data? 4

How to share data 6

DATA DOCUMENTATION AND METADATA 6

Data documentation 6

Metadata 7

DATA FORMATS AND SOFTWARE 8

Choice of formats and conversions 8

Transcription 10

Data quality control 10

Version control 11

Authenticity 12

Adding value 12

DATA STORAGE, BACK-UP AND SECURITY 13

Making back-ups 13

Data storage 13

Data security 14

Encrypting data for transmission 15

ETHICS, CONSENT AND CONFIDENTIALITY 16

Personal, confidential and sensitive data 16

Informed consent and data sharing 18

Written or verbal consent? 18

One-off or process consent? 19

Anonymising data 20

Access control 21

Open access 21

COPYRIGHT 23

CONTENTS

2

Good data management is the foundation for good

research. If data are properly organised, preserved and

well documented, and their accuracy, validity and integrity

is controlled at all times, the result is high quality data,

efficient research, outputs based on solid evidence and

the saving of time and resources. Researchers benefit

greatly from properly managing their research data.

Data management should be planned from the start

of research. If it becomes part of standard research

practice, then it need not necessarily incur much

additional time or costs.

When it comes to sharing research data, good management

is essential to ensure that data can be preserved and

remain accessible in the long-term, so they can be re-used

and understood by other researchers. When managed

and preserved properly, research data can be successfully

used for future scientific and educational purposes, thus

maximising the investment made in generating the data and

increasing the visibility of the research.

Data management primarily occurs within the lifecycle

of a research project and is ideally carried out by all

members of the research team. Digital preservation,

which enables long-term data sharing, is often carried

out by a specialised data archive or centre. The value

of the data to be preserved depends on the quality and

efficiency of the data management during research.

The data management information provided in this

booklet is designed to help researchers and data

managers across a wide range of research disciplines

and research environments make sure that research

data are of the highest quality and have the greatest

potential for long-term use.

It is recognised that different types of data created and

managed across the research discipline spectrum may

require certain discipline-specific approaches to data

managing and sharing; and that data centres may differ

in their approach to specific data management and

preservation issues. This guidance has been written for

researchers spanning the natural and social sciences

and humanities.

The UK Data Archive thanks data management

experts from the National Environmental Research

Council (NERC), the NERC Environmental Bioinformatics

Centre (NEBC), the Environmental Information Data Centre

(EIDC) of the Centre for Ecology and Hydrology (CEH),

the British Library (BL), the Research Information Network

(RIN), the Archaeology Data Service (ADS), the History

Data Service (HDS), the Economic and Social Data Service

(ESDS), the London School of Economics Research

Laboratory, the Wellcome Trust and the Commission

for Rural Communities for reviewing the guidance and

providing valuable comments and case studies.

Expertise and support for producing this guidance

has been provided by the Data Support Service of the

interdisciplinary Rural Economy and Land Use (RELU)

programme, funded by the Economic and Social Research

Council (ESRC), Natural Environment Research Council

(NERC) and Biotechnology and Biological Sciences

Research Council (BBSRC); as well as by the long-term

expertise of the UK Data Archive. This second edition has

been co-funded by the Joint Information Systems

Committee (JISC).

This printed guide is complemented by detailed and

practical online information, available from the UK Data

Archive web site.

The UK Data Archive also provides training and

workshops on data management and sharing, including

bespoke advice via [email protected] or

+44 (0)1206 872974.

FOREWORD

www.data-archive.ac.uk/sharing

DATA LIFECYCLE

Data creation

• research design

• data management planning

• data collection

(surveying, experimentation, measuring etc.)

• data entry or digitisation

• data checking and cleaning

Data analysis

• analysis

• derived data creation

• creation of data documentation

End of research

• research outputs

• preparing data for preservation

Preservation of data

• storage of data

• migration to suitable format/medium

• metadata creation

Re-use of data

• by same researcher

• by other researchers

3

Distribution/publication

of data

Why share research data?

Research data are a valuable resource, usually requiring

much time and money to be produced. Many datasets

have a significant value beyond the original research.

Sharing research data:

• encourages scientific enquiry and debate

• enables scrutiny of research outcomes

• facilitates research beyond the scope of the

original research

• leads to new collaborations between data users and

data creators

• increases the impact and visibility of research

• reduces the cost of duplicating data collection

• provides important resources for education and training

• encourages the improvement and validation of

research methods

• promotes the research that created the data and

its outcomes

• can provide a direct credit to the researcher as a

research output in its own right

The ease with which digital data can be stored,

disseminated and made accessible to secondary users

via the internet means that many institutions embrace

the sharing of research data to increase the impact and

visibility of their research.

Research funders increasingly follow guidance from the

Organisation for Economic Co-operation and Development

(OECD) that publicly funded research data should be

openly available to the scientific community to the

maximum extent possible.1 They have adopted data

sharing policies and encourage or oblige researchers to

share research datasets and outputs. Data sharing policies

allow researchers exclusive data use for a reasonable

time period in which to publish the results of the data.

In the UK, the Economic and Social Research Council

(ESRC), the Natural Environment Research Council

(NERC) and the British Academy contractually require

researchers to offer all research data resulting from their

grants to designated data centres – UK Data Archive

and NERC data centres. Also the Biotechnology and

Biological Sciences Research Council (BBSRC), the

Medical Research Council (MRC) and the Wellcome

Trust have data policies which encourage researchers

to share their research data in a timely manner, with as

few restrictions as possible.

Journals increasingly require data that form the basis for

publications to be shared or deposited within an accessible

database or repository.

SHARING DATA – WHY AND HOW

4

1 Organisation for Economic Co-operation and Development (2007). OECD Principles and Guidelines for Access to Research Data from Public Funding.

Available at www.oecd.org/dataoecd/9/61/38500813.pdf

2 Nature Publishing Group (2009). Nature journals’ policy on availability of materials and data. Available at www.nature.com/authors/editorial_policies/availability.html

Nature journals have a policy that requires authors to

make data and materials available to readers, as a

condition of publication, preferably via public repositories.2

Appropriate discipline-specific repositories are suggested.

Specifications regarding data standards, compliance or

formats may also be provided.

For example, for research reporting on small molecule

crystal structures, authors should submit such structures

to the Cambridge Structural Database (CSD) as a

Crystallographic Information File, a standard file structure

for the archiving and distribution of crystallographic

information. After publication of the manuscript, deposited

structures are included in the CSD, from where bona fide

researchers can retrieve them for free. CSD has similar

deposition agreements with many other journals.

Similarly the Publishing Network for Geoscientific and

Environmental Data (PANGAEA) is an open access

repository for various journals. By giving each deposited

dataset a Digital Object Identifier (DOI), a deposited

dataset acquires a unique and persistent identifier, and

the underlying data can be directly connected to the

corresponding article.

Case study – journals’ data

sharing policies

5

3 RELU Data Support Service (2005). Project Communication and Data Management Plan. Available at www.data-archive.ac.uk/relu/DMP.doc

4 Wellcome Trust (2009). Policy on data management and sharing. Available at www.wellcome.ac.uk/About-us/Policy/Policy-and-position-

statements/WTX035043.htm

5 Digital Curation Centre (2009). Data Management Plan Content Checklist. Available at www.dcc.ac.uk/

The Rural Economy and Land Use (RELU) programme,

drawing on best practice in data management and sharing

across three research councils (ESRC, NERC and

BBSRC), requires all funded projects to develop and

implement a data management plan to ensure that data

are well managed throughout the duration of a research

project.3 In a data management plan researchers describe:

• the need for access to existing data sources

• data planned to be produced by the research project

• planned quality assurance and back-up procedures

• plans for management and archiving of collected data

• expected difficulties in making data available for

secondary research and measures to overcome

such difficulties

• who holds copyright and intellectual property rights of

the data

• data management and storage roles and responsibilities

within the research team

As part of its policy on data management and sharing,

the Wellcome Trust expects applicants to supply a data

management and sharing plan if the goal of the proposal

is to create or develop a research resource for the benefit

of the research community; or if the proposal involves

the generation of a significant quantity of data that could

potentially be shared for added benefit.4 The plan

should outline:

• how the data will be made available to the wider

community

• the proposed timeframe of data sharing

• data quality and standards

• the use of public data repositories (which the Trust

expects)

• intellectual property of the data

• protection of research participants and possible

limitations on data sharing

• long-term preservation and sustainability strategy

The Digital Curation Centre is in the process of

developing a general data management plan checklist

and template for use by researchers to fulfil research

funders’ requirements for data management planning.5

How to share data

Research data can be shared:

• by depositing data with a specialist data centre, data

archive or data bank

• by submitting data to a journal

• by depositing data in an institutional repository

• online via a project or institutional web site

• informally between researchers on a peer-to-peer basis

Sharing data by any of these means have their advantages

and disadvantages: data centres may not be able to accept

all data submitted to them; institutional repositories may not

afford adequate provision for the long-term maintenance of,

and support for, research data; and web sites may be

more ephemeral.

Exact approaches to data sharing may vary according to

different research environments and disciplines, due to the

varying nature of data types and their characteristics.

Depositing data with a specialist data centre has

additional advantages:

• helping to ensure that data meet set quality thresholds

• long-term preservation in standardised accessible data

formats, converting formats when needed due to software

upgrades or changes

• regular data back-ups

• online resource discovery of data through data catalogues

• licensing arrangements to acknowledge data rights

• standardised citation mechanism to acknowledge

data ownership

• promotion and dissemination of data to many users

• monitoring of the secondary usage of data

• safe-keeping of research data in a secure environment,

with ability to control access when needed

• management of access and user queries on behalf of the

data owner

Data centres may apply certain criteria to evaluate and

select datasets for preservation.

Data management plans

DATA DOCUMENTATION AND METADATA

A crucial part of making data user-friendly, shareable

and with long-lasting usability is ensuring that they can

be understood, interpreted and used at all times. This

requires clear and detailed data description and data

documentation.

Data documentation

Comprehensive data documentation is easiest when begun

at the onset of a project and continued throughout the

research process. It should be considered as part of best

practice in terms of creating, organising and managing data.

Data documentation explains how data were created or

digitised, what data mean, what their content and structure

is, and any manipulations that may have taken place. It

ensures that data can be understood during research

projects, that researchers continue to understand data in

the longer term and that re-users of data are able to

interpret the data. Good documentation is also vital for

successful data preservation.

Good data documentation includes information on:

• the context of data collection: project history, aims,

objectives and hypotheses

• data collection methods: data collection protocol,

sampling design, instruments, hardware and software

used, data scale and resolution, temporal coverage and

geographic coverage

• dataset structure of data files, cases, relationships

between files

• data sources used

• data validation, checking, proofing, cleaning and other

quality assurance procedures carried out

• modifications made to data over time since their original

creation and identification of different versions of datasets

• information on data confidentiality, access and use

conditions, where applicable

At data-level, datasets should also be documented with:

• names, labels and descriptions for variables, records and

their values

• explanation of codes and classification schemes used

• codes of, and reasons for, missing values

• derived data created after collection, with code, algorithm

or command file used to create them

• weighting and grossing variables created

• data listing with descriptions for cases, individuals or

items studied

Variable-level descriptions may be embedded within a

dataset itself as metadata. Other documentation may be

contained in user guides, reports, publications, working

papers and laboratory books.

6

The Stockholm Environmental Institute has created an

integrated spatial dataset, Social and Environmental

Conditions in Rural Areas (SECRA), containing socio-

economic and environmental characteristics of all rural

Census 2001 Super Output Areas.6 This dataset is

available online as an MS Access database and can be

downloaded together with accompanying metadata and

documentation files that clearly describe the data.

The dataset is organised in four data themes: natural

and constructed features; qualities of people and place;

living and working; and political and economic context.

The dataset is documented in detail by:

• four data list files providing descriptions of all variables

within each data theme

• four metadata files describing how each variable was

constructed and calculated, with references to the data

sources used

• a detailed report describing the rationale for the

dataset, the relevance of variables to rural conditions,

the methodology used, an overview of variables

included per data theme and examples of how the

dataset may be used

All data files and metadata files are clearly labelled and

coded according to the four data themes.

Case study – documenting data

6 M.Huby, S.Cinderby and A.Owen (2005). Social and Environmental Conditions in Rural Areas (SECRA). Available at www.sei.se/relu/secra/

7

Example of online study-level documentation available for a dataset in the UK Data

Archive catalogue

Format Name Size in Kilobytes Description

PDF guide.pdf 3557 User guide

PDF method.pdf 509 Methodology

PDF source.pdf 41 Example of sources

HTML UKDA Study 4177 Information.htm 18 Study information and citation

Metadata

In the context of data management, metadata are a subset

of core data documentation, which provides standardised

structured information explaining the purpose, origin, time

references, geographic location, creator, access conditions

and terms of use of a dataset. Metadata are typically used:

• for resource discovery, providing searchable information

that helps users to find existing data

• as a bibliographic record for citation

Metadata for online data catalogues or discovery portals

are often structured to international standards or schemes

such as Dublin Core, ISO 19115 for geographic

information, Data Documentation Initiative (DDI), Metadata

Encoding and Transmission Standard (METS) and General

International Standard Archival Description (ISAD(G)).

The use of standardised records in eXtensible Mark-up

Language (XML) brings key data documentation together

into a single document, creating rich and structured content

about the data. Metadata can be viewed with web browsers,

can be used for extract and analysis engines and can

enable field-specific searching. Disparate catalogues can

be shared and interactive browsing tools can be applied.

In addition, metadata can be harvested for data sharing

through the Open Archives Initiative Protocol for Metadata

Harvesting (OAI-PMH).

Researchers typically create metadata records for their

data by completing a data centre’s data deposit form or by

using a metadata creation tool, like Go-Geo! GeoDoc or

MetaGenie. Providing detailed and meaningful dataset

titles, descriptions and keywords, etc., enables data centres

to create rich resource-discovery metadata for archived

datasets. This should enable more comprehensible

resource discovery and data that are easier to use.

Data centres accompany each dataset with a bibliographic

citation that data users are required to state in research

outputs to reference and acknowledge accurately the data

source used. A citation gives credit to the data source and

distributor and identifies data sources for validation.

7 EDINA (n.d.). Go-Geo! GeoDoc metadata creator tool. Available at geodoc.edina.ac.uk/geodoc/editor/start

8 e-Government Unit (2004). UK GEMINI Standard Version 1.0 - A Geo-spatial Metadata Interoperability Initiative. e-Government Unit, Cabinet Office, London.

Available at www.gigateway.org.uk/metadata/pdf/UK_GEMINI_v1.pdf

Go-Geo! GeoDoc is an online metadata creation tool that

researchers can use to create metadata records for

spatial datasets.7 The metadata records are compliant

with the UK GEMINI standard and the ISO 19115

geospatial metadata standard.8 Researchers can create

and export metadata records or publish them in the

Go-Geo! portal, a discovery portal for geospatial data.

Case study – creating metadata

8

Choice of formats and conversions

The format and software in which research data are

created and digitised usually depend on how researchers

plan to analyse data, the hardware used, the availability

of software, or can be determined by discipline-specific

standards and customs. When considering the long-term

usability of data, attention needs to be given to the most

appropriate software and data format to use.

All digital information is designed to be interpreted by

computer programs to make it understandable and is, by

nature, software dependent. All digital data may thus be

endangered by the obsolescence of the hardware and

software environment on which access to data depends.

Despite the backward compatibility of many software

packages to import data created in previous software

versions and the interoperability between competing

popular software programs, the safest option to guarantee

long-term data access is to convert data to standard

formats that most software are capable of interpreting,

and that are suitable for data interchange and

transformation. This typically means using open or

standard formats – such as OpenDocument Format (ODF),

ASCII, tab-delimited format or comma-separated values –

as opposed to proprietary ones. Some proprietary formats,

such as MS Rich Text Format and MS Excel, are widely

used and likely to be accessible for a reasonable, but not

unlimited, time.

Thus, whilst researchers use the most suitable data

formats and software according to planned analyses,

once data analysis is completed and data are prepared

to be stored, researchers should consider converting their

research data to standard, interchangeable formats, in

order to avoid being unable to use the data in the future.

Similarly for back-ups of data, standard formats should

be considered.

For long-term digital preservation, data archives hold data

in such standard formats. At the same time, data are

offered to users by conversion to current common and

user-friendly data formats and may be migrated forward

when needed.

When researchers offer data to data archives for

preservation, researchers themselves should convert data

to a preferred data preservation format, as the person who

knows the data is in the best position to ensure data

integrity during conversions. Advice should be sought on

up-to-date formats from the intended place of deposit.

When data are converted from one format to another –

through export or by using data translation software –

certain changes may occur to the data. For data held in

spreadsheets or databases, some data or internal

metadata may be lost during conversions to another format,

e.g. missing value definitions, decimal numbers or variable

labels, or data may be truncated. For textual data, editing

such as highlighting or bold text may be lost. After software

conversions, data should therefore be checked for errors or

changes that may be caused by the export process.

The Wessex Archaeology Metric Archive Project

has brought together metric animal bone data from a

range of archaeological sites in England into a single

database format.9 The dataset contains a selection of

measurements commonly taken during Wessex

Archaeology zooarchaeological analysis of animal bone

fragments found during field investigations. The dataset

was created by the researchers in MS Excel and MS

Access formats and deposited with the Archaeology Data

Service (ADS) in the same formats. ADS has preserved

the dataset in Oracle and in comma-separated values

format (CSV) and disseminates the data via both an

Oracle/Cold Fusion live interface and as downloadable

CSV files.

Case study – data formats and

conversions

9 J.Grimm (2008). WAMAP: Wessex Archaeology Metric Archive Project. Available at ads.ahds.ac.uk/catalogue/resources.html?abmap_grimm_na_2008

DATA FORMATS AND SOFTWARE

8 9

Data formats currently recommended by UK Data Archive for long-term preservation of

research data

Note that other data centres or digital archives may recommend different formats.

Type of dataPreferred formats for management

back-ups and data preservation

Other acceptable formats

for data preservation

Quantitative tabular data with

extensive metadata

a dataset with variable labels, code labels, and

defined missing values, in addition to the matrix

of data

SPSS portable format (.por)

delimited text and command (‘setup’) file

(SPSS, Stata, SAS, etc.) containing metadata

information

structured text or mark-up file containing

metadata information, e.g. DDI XML file

proprietary formats of statistical packages

e.g. SPSS (.sav), Stata (.dta)

MS Access (.mdb/.accdb)

Quantitative tabular data with

minimal metadata

a matrix of data with or without column

headings or variable names, but no other

metadata or labelling

comma-separated values (CSV) file (.csv)

tab-delimited file (.tab)

including delimited text of given character set

with SQL data definition statements where

appropriate

delimited text of given character set – only

characters not present in the data should be

used as delimiters (.txt)

widely-used formats, e.g. MS Excel (.xls/.xlsx),

MS Access (.mdb/.accdb), dBase (.dbf) and

OpenDocument Spreadsheet (.ods)

Geospatial data

vector and raster data

ESRI Shapefile (essential – .shp,.shx, .dbf,

optional – .prj, .sbx, .sbn)

geo-referenced TIFF (.tif, .tfw)

CAD data (.dwg)

tabular GIS attribute data

ESRI Geodatabase format (.mdb)

MapInfo Interchange Format (.mif) for

vector data

Keyhole Mark-up Language (KML) (.kml)

Adobe Illustrator (.ai), CAD data (.dxf or .svg)

binary formats of GIS and CAD packages

Qualitative data

textual

eXtensible Mark-up Language (XML) text

according to an appropriate Document Type

Definition (DTD) or schema (.xml)

Rich Text Format (.rtf)

plain text data, ASCII (.txt)

Hypertext Mark-up Language (HTML) (.html)

widely-used proprietary formats, e.g. MS Word

(.doc/.docx)

proprietary/software-specific formats, e.g.

NUD*IST, NVivo and ATLAS.ti

Digital image data TIFF version 6 uncompressed (.tif)

JPEG (.jpeg, .jpg)

TIFF (other versions)(.tif, .tiff)

Adobe Portable Document Format (PDF/A,

PDF) (.pdf)

RAW image format (.raw)

software-specific formats, e.g. Photoshop files

(.psd)

Digital audio dataFree Lossless Audio Codec (FLAC) (.flac)

Waveform Audio Format (WAV) (.wav)

MPEG-1 Audio Layer 3 (.mp3)

Audio Interchange File Format (AIFF) (.aif)

Digital video data JPEG 2000 (.mj2)

Documentation and scripts

Rich Text Format (.rtf)

PDF/A or PDF (.pdf)

HTML (.htm)

Open Document Text (.odt)

plain text (.txt)

widely-used proprietary formats, e.g. MS Word

(.doc/.docx) or MS Excel (.xls/ .xlsx)

XML marked-up text (.xml) according to an

appropriate DTD or schema, e.g. XHMTL 1.0

10

Transcription

Where qualitative data are collected as audio or video

recordings, such as for interviews or focus groups, ideally

they are transcribed as textual files for archiving and

sharing. Transcripts should:

• have a unique identifier

• have a document header giving brief details of the data

collection event, including date, place, interviewer name

and interviewee details

• have a uniform layout throughout the research project

• make use of speaker tags indicating the question/

answer sequence

• use pseudonyms to anonymise personal identifying

information

• have line breaks

• be page numbered

Transcription of statistical tables from historical sources into

spreadsheets requires the digital data to be as close to the

original as possible, with attention to consistency in

transcribing and avoiding the use of formatting in data files.

Data quality control

Quality control of data is an integral part of all research

and takes place at various stages, during data collection,

data entry or digitisation, and data checking. It is important

to assign clear roles and responsibilities for data quality

assurance at all stages of research.

During data collection, researchers must ensure that the

data recorded reflect the actual facts, responses,

observations or events, for example:

• if data are collected with instruments:

- calibration of instruments is essential to check the

precision, bias and/or scale of measurement

- data are validated by checking for equipment as well

as digitisation errors

Study Title: ‘Healthy diets across generations’

Depositor: K. Jones

Interviewer: K. Jones

Interview number: 12

Interview ID: Chris Smith

Date of interview: 3 May 2007

Information about interviewee

Date of birth: 6 June 1949

Gender: male

Marital status: widowed

Occupation: bricklayer

Geographic region: North-East England

KJ: Just one or two factual details first of all before we go

on to your health and that... how old are you?

CS: I’m 58 in June.

KJ: What schools did you go to? Can you remember that

far back!

CS: Oh... the last school was at Longside... aye,

ken Longside?

Sample transcript

10 Register of Ecological Models (n.d.). Register of Ecological Models. Available at ecobas.org/www-server/index.html

11 K.Rennols, (2001). Forest Model Archive. Available at cms1.gre.ac.uk/conferences/iufro/FMA/index.htm

12 N.Le Novère, B.Bornstein, A.Broicher, M.Courtot, M.Donizelli, H.Dharuri, L.Li, H.Sauro, M.Schilstra, B.Shapiro, J.L.Snoep and M.Hucka (2006) BioModels Database:

A Free, Centralized Database of Curated, Published, Quantitative Kinetic Models of Biochemical and Cellular Systems. Nucleic Acids Res., 34: D689-D691. Available at

www.ebi.ac.uk/biomodels-main/

Various initiatives aim to preserve and share

modelling software and code. In ecology, meta-databases

such as the Register of Ecological Models provide

standardised metadata and documentation for existing

ecological and environmental models and, amongst a

variety of metadata, specifies the software used to

develop the model.10 The model code is usually

accessible by contacting the model creator. The Forest

Model Archive developed at the University of Greenwich

offers data providers the opportunity to archive models,

besides providing structured documentation.11

In biology, the BioModels Database of the European

Bioinformatics Institute (EBI) is a repository for published

mathematical models of biological processes and

molecular functions.12 All models are annotated and linked

to relevant data resources. Researchers are encouraged

to deposit models written in an open source XML-based

format, the Systems Biology Markup Language (SBML),

and models are curated for long-term preservation.

Case study – preserving and

sharing models

11

- data may be verified by checking the truth of the

record with an expert or by taking multiple

measurements, observations or samples

• standardised methods and protocols can be used for

capturing observations, alongside recording forms with

clear instructions

• computer-assisted interview software can be used to

standardise interviews, verify response consistency,

route and customise questions so that only appropriate

questions are asked, confirm responses against previous

answers where appropriate and detect inadmissible

responses

The quality of data collection methods used has a

significant bearing on data quality. Documenting in detail

how data are collected increases their quality.

When data are digitised, transcribed, entered in a

database or spreadsheet or coded, quality is ensured by

adhering to standardised and consistent procedures for

data entry with clear instructions. This may include setting

up validation rules or input masks in data entry software;

using data entry screens; using controlled vocabularies,

code lists and choice lists to minimise manual data entry;

detailed labelling of variable and record names to avoid

confusion; or designing a purpose-built database structure

to organise data and data files.

During data checking, data are edited, cleaned, verified,

cross-checked and validated. Checking typically involves

both automated and manual procedures. This may include:

double-checking coding of observations or responses and

out-of-range values; checking data completeness; verifying

random samples of the digital data against the original

data; double entry of data; statistical analyses such as

frequencies, means, ranges or clustering to detect errors

and anomalous values; or peer review.

Version control

It is important to ensure that different copies or versions

of files, data held in different formats or locations, and

information that is cross-referenced between files are all

subject to version control. Checks and procedures should be

put in place to make sure that if the information in one file is

altered, the related information in other files is also updated

if necessary. It is important to keep track of which version of

a file is the most current, especially where data files are

shared between people or held in different locations.

Best practice is to:

• uniquely identify files, preferably using a systematic

naming convention

• clearly record version and status of a file, e.g. draft,

interim, final, internal

• record what changes are made to a file when a new

version is created

• record relationships between items as, in many cases,

the information contained in a single file is supported by

information held in other files, e.g. relationship between

the code and the data file it is run against, or between the

data file and the documentation or metadata that relate to

it, or between multiple tables

• track the location of all files if stored in a variety of

locations

• regularly synchronise files in different locations, e.g. using

MS SyncToy software

• maintain single master files in a suitable format to remove

version control problems associated with multiple working

versions being developed in parallel

Version control can be maintained through:

• file naming conventions, using number sequences or

dates in file names – although avoid very long file names

or using spaces and special characters in file names

• including a file history or version control table at the start

of each file, in which versions, dates, authors and details

of changes to the file are recorded

• version control facilities within software

• controlling rights to file editing

• versioning software, e.g. Subversion (SVN), or file

sharing software such as Google Docs or Amazon S3

• manual merging of entries or edits by multiple users

12

Authenticity

Digital information can be copied, altered or deleted very

easily. It is therefore important to be able to demonstrate

the authenticity of data and to prevent unauthorised access

to data that may potentially lead to unauthorised changes.

Best practice to ensure authenticity and control access

is to:

• keep a master file of data

• assign responsibility for master files where possible to an

individual member of the project team

• restrict write access to master versions to specific

members of the project team

• create a formal procedure for the destruction of master

files

• record all changes to master files

• maintain old master files in case later ones contain errors

• archive copies of master files at regular intervals

Adding value

Researchers can add significant value to their datasets

by including additional variables or parameters that widen

the possible applications. Including standard parameters

or generic derived variables in data files may substantially

increase the potential re-use value of a dataset and provide

new avenues for research. For example, geo-referencing

data may allow other researchers to more easily add

value to data and apply the data in geographical

information systems.

The Commission for Rural Communities (CRC) often use

existing survey data to undertake rural and urban

analysis of national scale data in order to analyse policies

related to deprivation.

In order to undertake this type of spatial analysis,

original postcodes need to be accessed and

retrospectively recoded according to the type of rural or

urban settlements they fall into. This can be done with the

use of products such as the National Statistics Postcode

Directory, which contain a classification of rural and urban

settlements in England.

The task of applying these geographical markers to

datasets can often be a long and sometimes unfruitful

process – sometimes the CRC have to go through this

process to just find that the data do not have a

representative rural sample frame.

If rural and urban settlement markers, such as the

Rural/Urban Definition for England and Wales were

included in datasets, this would be of great benefit to

those undertaking rural and urban analysis.13

Case study – adding value to data

13 DEFRA, ODPM, NAW, ONS and Countryside Agency (2004). Rural and Urban Area Classification 2004. Available at www.statistics.gov.uk/geography/rudn.asp

10 13

DATA STORAGE, BACK-UP AND SECURITY

Making back-ups

Making back-ups of files is an essential element of data

management. Regular back-ups protect against accidental

or malicious data loss due to:

• hardware failure

• software or media faults

• virus infection or malicious hacking

• power failure

• human errors

Backing-up involves making copies of files which can be

used to restore originals if there is loss of data. Choosing a

precise back-up procedure to adopt depends on local

circumstances, the perceived value of the data and the

levels of risk considered appropriate for the circumstances.

Where data contain personal information, care should be

taken to only create the minimal number of copies needed,

e.g. a master file and one back-up copy.

When deciding upon the best back-up procedure for data

files, researchers need to consider:

• whether to back-up particular data files or back-up the

entire system

• the frequency of back-up needed

- whether to back-up after each change to data file or

at regular intervals

- frequently used and critical data files may be backed-

up daily using an automated back-up process

• which institutional back-up policy may be in place –

data on an institutional network space may be

automatically backed-up at regular intervals

• back-up strategies for all systems where data are held,

including portable computers or devices, non-network

computers and home-based computers if applicable

• whether to carry out incremental or differential back-up

- incremental back-up consists of first making a copy of

all relevant files, then making incremental back-ups of

the files which have altered since the last back-up;

removable media (CD/DVD) is recommended for this

- for differential back-ups a complete back-up is made

first, then back-ups are made of files changed or

created since the first full back-up and not just since

the last partial back-up; using fixed media such as

hard drives is recommended for this

• the choice of media on which to store back-up files

- depends on the quantity of files, type of data, and the

preferred method of backing-up

- examples include recordable CD/DVD, networked

hard drive, removable hard drive or magnetic tape

• location of back-up files – online back-up files or offline

storage on removable media or transportable hard drives

which can be physically removed to another location for

safe-keeping

• organising and clearly labelling all back-up files

• file formats

- back-ups of master copies should ideally be in formats

that are suitable for long-term digital preservation,

i.e. open as opposed to proprietary formats

• verifying and validating back-up files regularly

- fully restoring them to another location and comparing

them with the originals

- checking back-up copies for completeness and

integrity, for example by checking the MD5 checksum

value, file size and date

• not overwriting old back-ups with new ones

Data storage

The storage of any digital research data should be based

on two principles:

• digital storage media are inherently unreliable unless they

are stored appropriately

• all file formats and physical storage media will ultimately

become obsolete

Alongside a back-up strategy, a data storage strategy

should be in place. Media currently available for storing

data files are optical media (CDs and DVDs) and magnetic

media (hard drives and tapes).

Best practice is to:

• store data in formats which meet long-term software

readability requirements; in general this means non-

proprietary formats or open standard formats (see the

data formats table)

• obtain up-to-date guidance since changes to formats and

media may occur rapidly

• copy or migrate data files to new media between two and

five years after they were first created, since optical media

and magnetic media are subject to physical degradation

14

• check the data integrity of stored data files at

regular intervals

• ensure that any storage strategy, even for a short-term

project, involves at least two different forms of storage,

e.g. on hard drive and on CD

• create digital versions of paper documentation in PDF/A

format for long-term preservation and storage

• organise stored data well, ensuring they are easily

located and physically accessible

• ensure that areas and rooms designated for storage of

digital or non-digital data are suitable for the purpose, are

structurally sound and free from the risk of flood and fire

Note that optical and magnetic media are vulnerable to

poor handling and changes in temperature, relative

humidity, air quality and lighting conditions. The National

Preservation Office has published guidelines on caring for

CDs and DVDs.14

Non-digital printed materials and photographs are equally

subject to degradation, for example from sunlight, from acid

in paper or from acid in sweat on the skin.

Data security

Data security is the protection of data from unauthorised

access, use, change, disclosure and destruction; as well as

the prevention of unwanted changes that can affect the

integrity of data. Ensuring data security requires paying

attention to physical security, network security and security

of computer systems and files.

Data security may be needed to protect intellectual

property rights, commercial interests, or to keep sensitive

information safe. Where the safeguarding of personal data

is involved, data security is based on national legislation

(Data Protection Act 1998) which dictates that personal

data should only be accessible to authorised persons.

Personal data may also exist in non-digital format, for

example as patient records, completed consent forms or

interview cover sheets. These should be protected in the

same way as digital files.

Data which contain personal information should be

treated with higher levels of security than data which do

not. Data security arrangements need to be proportionate

to the nature of the data and the risks involved. Data

security can be made easier by separating data content

according to security needs. For example, where data

relate to human subjects, personal information such as

names and addresses can be removed from data files

and stored separately.

Physical data security requires:

• appropriate or restricted access to rooms and buildings

where data, computers or media are held

• logging the removal of, and access to, media or hardcopy

material in store rooms

• transporting sensitive data only under certain exceptional

circumstances, even for repair purposes, e.g. giving a

failed hard drive containing sensitive data to a computer

manufacturer may cause a breach of security

Network security means:

• not storing confidential data such as those containing

names or addresses on servers or computers connected

to an external network, particularly servers that host

internet services

14 L.Finch & J.Webster (2008). Caring for CDs and DVDs. NPO Preservation Guidance. Preservation in Practice Series. London,

National Preservation Office. Available at www.bl.uk/npo/pdf/cd.pdf

A research team carrying out coral reef research

collects field data using handheld Personal Digital

Assistants (PDAs). Digital data are transmitted daily to

the institution’s network drive, where they are held in

password protected files. All data files are identified by

an individual version number and creation date. Version

information (version numbers and notes detailing

differences between versions) is stored in a spreadsheet,

also on the network drive. The institution’s network

drive is fully backed-up onto Ultrium LTO2 data tapes.

Incremental back-ups are made daily Monday to

Thursday; full server back-ups are made over Friday

/Saturday/Sunday. Tapes are securely stored in a

separate building. Upon completion of the research the

datasets are deposited in the institution’s digital repository.

Case study – data back-up and storage

15

Security of computer systems and files may include:

• locking computer systems with a password and installing

a firewall system

• protecting from viruses and malicious code through

regularly updated virus detection software

• implementing password protection of, and access

permissions to, data files, e.g. no access, read only,

read and write or administrator-only permission

• controlling access to restricted materials with encryption

and/or password protection

• imposing confidentiality agreements for data users of

confidential data

• not sending personal or confidential data via email or

through File Transfer Protocol (FTP), but rather transmit

as encrypted data

• carrying out the destruction of data in a consistent

manner when this is needed

- paper should be shredded and computer files

permanently deleted from all systems

- deleting all files and reformatting a hard drive will not

prevent the possible recovery of data that have

previously been on that hard drive

- specialist advice may need to be sought where

needed, CD/DVD shredders can be used and hard

drives removed from their casings and disposed of

securely

It is worth remembering that file sharing services such as

Google Docs are not necessarily permanent.

The risk of security breaches and disclosure of confidential

data can also be removed or reduced by:

• anonymising digital data by removing identifiers or

aggregating data (see section on Anonymising data)

• separating disclosable from non-disclosable data

by obscuring, removing or hiding individual fields, records

or data files

Encrypting data for transmission Data encryption will maintain data security during

transmission. After testing a number of software

applications for encrypting data to enable secure data

transmission from government departments to the UK Data

Archive, the UK Data Archive recommends the use of

Pretty Good Privacy (PGP), an industry-standard

encryption technology. Available supporting encryption

software can be open source, e.g. GnuPG, or commercial,

e.g. PGP.

Encryption requires the creation of a Public and Private

Key pair and passphrase. The Private PGP Key and

passphrase are used to digitally sign each encrypted file,

and thus allow the recipient to validate the sender’s identity.

The recipient’s Public PGP Key is installed by the sender in

order to encrypt files so that only the authorised recipient

can decrypt them.

In February 2008 the British Library (BL) received

the recorded output of the Survey of Anglo-Welsh Dialects

(SAWD), carried out by University College, Swansea,

between 1969 and 1995. This survey recorded the English

spoken in Wales by interviewing and tape-recording

elderly speakers on topics including the farm and farming,

the house and housekeeping, nature, animals, social

activities and the weather. The collection was deposited in

the form of 503 digital audio files, which were accessioned

as .wav files in the BL’s Digital Library. Digital clones of all

files are held at the Archive of Welsh English, alongside

the original master recordings on 151 audio cassettes,

from which the digital copies were created.

The BL’s Digital Library is mirrored on four sites – at

Boston Spa, St Pancras, Aberystwyth and a ‘dark’ archive

which is provided by a third party. Each of these servers

has inbuilt integrity checks. The BL makes available

access copies for users, in the form of .mp3 audio files, in

the British Library Reading Rooms via the Soundserver

system. A small set of audio extracts from the SAWD

recordings are also available online on the BL’s Accents

and Dialects web site, Sounds Familiar.

Case study – data back-up, storage

and security

When research involves obtaining data from people, e.g.

in social, anthropological or medical research, researchers

are expected to maintain high ethical standards. Ethical

guidelines are typically issued by professional bodies,

institutions and funding organisations. Research will usually

require obtaining informed consent for people to participate

in research and for use of the information collected. It is

essential that consent also takes into account long-term

use of data, such as preserving and sharing data. Without

consent for data sharing, opportunities for sharing research

data with other researchers can be jeopardised.

Personal, confidential and sensitive data

At times data obtained from people may hold sensitive or

confidential information. This does not mean that all data

obtained by research with participants are confidential.

Personal data are defined in the Data Protection Act

1998 as data which relate to a living individual who can

be identified from those data, or from those data and

other information which is in the possession of, or is

likely to come into the possession of, the data controller

(e.g. researcher). This includes any expression of opinion

about the individual.

Confidential data are data that:

• can be connected to the person providing them or

that could lead to the identification of a person referred

to (names, addresses, occupation, photographs)

• are given in confidence, or data agreed to be kept

confidential (secret) between two parties, that are not

in the public domain

• are conditioned by factors such as ethical guidelines,

legal requirements or research-specific consent

agreements

Sensitive personal data are defined in the Data Protection

Act 1998 as data that may incriminate a participant or third

party, such as a person’s race, ethnic origin, political

opinion, religious beliefs, trade union membership, physical

or mental health, sexual orientation, criminal proceedings

or convictions.

Strategies for dealing with confidentiality depend upon

the nature of the research, but are essentially informed

by a researcher’s ethical obligations towards participants

and society and by legislation such as the Data Protection

Act 1998.

Legislation that may impact on the sharing of

confidential data:15

• duty of confidentiality

• Data Protection Act 1998

• Freedom of Information Act 2000

• Human Rights Act 1998

• Statistics and Registration Services Act 2007

• Environmental Information Regulations 2004

Sensitive and confidential data can be shared ethically if

researchers pay attention, from the planning stages of

research, to three important aspects:

• when gaining informed consent, include consent for

data sharing

• where needed, protect people’s identities by

anonymising data

• consider access restrictions to data

These measures should be considered jointly and never

in isolation. The same measures form part of good

research practice and data management, even if data

sharing is not envisioned.

1615 UK Data Archive (2009). Research ethics and legislation relevant to data sharing. Available at www.data-archive.ac.uk/sharing/legal.asp

ETHICS, CONSENT AND CONFIDENTIALITY

Thank you very much for agreeing to participate in this survey.

The information provided by you in this questionnaire will be used for research purposes. It will not be used in a

manner which would allow identification of your individual responses.

Anonymised research data will be archived at ………. in order to make them available to other researchers in

line with current data sharing practices.

I have read and understood the project information sheet dated DD/MM/YYYY. ❑

I have been given the opportunity to ask questions about the project. ❑

I agree to take part in the project. Taking part in the project will include being interviewed and audio

recorded [other forms of participation can be listed]. ❑

I understand that my taking part is voluntary; I can withdraw from the study at any time and I will not

be asked any questions about why I no longer want to take part. ❑

Select only one of the next two options:I would like my name used where what I have said or written as part of this study will be used in

reports, publications and other research outputs so that anything I have contributed to this project

can be recognised. ❑

I do not want my name used in this project. ❑

I understand my personal details such as phone number and address will not be revealed to people

outside the project. ❑

I understand that my words may be quoted in publications, reports, web pages, and other research

outputs but my name will not be used unless I requested it above. ❑

I agree for the data I provided to be archived at ………. [More detail can be provided here so thatdecisions can be made separately about audio, transcripts, etc.] ❑

I understand that other researchers will have access to this data only if they agree to preserve the

confidentiality of that data and if they agree to the terms I have specified in this form. ❑

I understand that other researchers may use my words in publications, reports, web pages, and

other research outputs according to the terms I have specified in this form. ❑

I agree to assign the copyright I hold in any materials related to this project to [name of researcher]. ❑

________________________ ________________ ________

Name of Participant Signature Date

________________________ ________________ ________

Researcher Signature Date

Contact details for further information: Names, phone, email addresses, etc.

Sample consent statement for quantitative surveys

Sample extensive consent form for interviews

17

Informed consent and data sharing

It is essential that gaining consent takes into account any

future uses of data, such as the sharing, preservation and

long-term use of research data. Researchers should:

• inform participants how research data will be stored,

preserved and used in the long-term

• inform participants how confidentiality will be maintained,

e.g. by anonymising data

• obtain informed consent (written or verbal) for

data sharing

To ensure that consent is informed, consent must be

freely given with sufficient information provided on all

aspects of participation and data use. There must be active

communication between the parties. Consent must never

be inferred from a non-response to a communication such

as a letter.

Written or verbal consent?

Whether informed consent is obtained in writing through

a detailed consent form, by means of an informative

statement, or verbally, depends on the nature of the

research, the kind of data gathered, the data format and

how the data will be used.

• For detailed interviews or research where personal,

sensitive or confidential data are gathered, the use of

written consent forms is recommended to assure

compliance with the Data Protection Act and with ethical

requirements. Written consent documentation typically

includes an information sheet and consent form signed

by the participant.

• For surveys or informal interviews, where no personal

data are gathered or personal identifiers are removed

from the data, obtaining written consent may not be

required. At a minimum an information sheet should be

provided to participants detailing the nature and scope of

the study, the identity of the researcher(s) and what will

happen to the data collected (including any data sharing).

• If data are collected verbally through audio or video

recordings, verbal consent agreements can be recorded

together with the data.

• For audio-visual data where the identity of people

may be disclosed from the data, it may be important that

informed consent is obtained to use the data unaltered

for research purposes, sharing and preservation.

Voice alteration or image blurring are usually labour

and cost intensive and may decrease the research

potential of data.

Sample consent forms are available from the UK

Data Archive.16

There is a potential tension between data sharing

and data protection. Data archives work to increase

availability of, and access to, research data, while the

primary purpose of Research Ethics Committees (RECs)

is to ensure ethical conduct in research and to protect the

safety, rights and well being of research participants.

The need to protect personal data and preserve

confidentiality — where explicitly required — cannot be

overstated. This does not mean, however, that all

research data obtained from research with people should

be kept confidential, cannot be shared or, even worse,

are destroyed. It is important to distinguish between

personal or sensitive data collected in research, and

research data in general. Personal data should not be

disclosed, unless consent has been given for disclosure.

Identifiable information may be excluded from data

sharing. A REC should, however, not object to the sharing

of research data in general. If research data contain

sensitive or confidential information, then the sharing of

such data must be considered carefully, but should not be

dismissed as being impossible.

18 16 UK Data Archive (2009). Consent forms. Available at www.data-archive.ac.uk/sharing/consentforms.asp

Research Ethics Committees and

data sharing

20 19

One-off or process consent?

Discussing and obtaining consent for:

• participation in research

• the use of the information gathered for analyses,

publications and outputs

• data sharing beyond the research

can be a one-off occurrence or an ongoing process.

One-off consent is simple, practical, avoids repeated

requests to participants, and meets the formal requirements

of most Research Ethics Committees. However, it may

place too much emphasis on ‘ticking boxes’.

If consent is considered throughout the research

process, it assures active informed consent from

participants. Thus, consent for participation in research,

for data use and for data sharing can be considered at

different stages of the research, giving participants a

clearer view of what participating in the research involves

and what the data to be shared consist of. It may, however,

be too repetitive and annoying for some participants.

Special consent considerations are needed for:17

• medical research

• research with children and young adults

• research with people with learning difficulties

• research within organisations or the workplace

• research into crime

• internet research

The Biological Records Centre (BRC) is the national

custodian of data on the distribution of wildlife in the

British Isles.18 Data are provided by volunteers,

researchers and organisations. BRC disseminates data

for environmental decision-making, education and

research. Data whose publication could present a

significant threat to a species or habitat (e.g. nesting

location of birds of prey) will be treated as confidential.

The BRC provides access to the data it holds via the

National Biodiversity Network Gateway. Standard access

controls are as follows:

• public access to view and download all records at a

minimum 10 km2 level of resolution, and at higher

resolution if the data provider agrees

• registered users have access to view and download

all except confidential records at the 1 km2 level

of resolution

• conservation organisations have access to view and

download all except confidential records at full

resolution with attributes

• conservation officers in statutory conservation agencies

have access to view and download all records,

including confidential records at full resolution with

attributes

• records that have been signified as confidential

by a data provider will not be made available to the

conservation agencies without the consent of the

data provider

Case study – sharing confidential data

17 UK Data Archive (2009). Special cases of consent. Available at www.data-archive.ac.uk/sharing/consentspecial.asp

18 Biological Records Centre (n.d.). Biological Records Centre. Available at www.brc.ac.uk

Anonymising data

Before data obtained from research with people can be

disseminated, published or shared with other researchers,

they may need to be anonymised so that individuals,

organisations or businesses cannot be identified from the

data.19 Anonymisation may be needed for ethical reasons to

protect people’s identities, for legal reasons to not disclose

personal data, or for commercial reasons. Personal data

should not be disclosed from research information, unless a

respondent has given specific consent to do so. In some

forms of research, for example where oral histories are

recorded or in anthropological research, it is customary to

publish and share the names of people studied, for which

they have given their consent.

Quantitative datasets may be anonymised by:

• removing direct identifiers, e.g. name or address

• aggregating or reducing the precision of a variable,

e.g. replacing the date of birth by age groups or

replacing full postcodes with postcode sectors

• generalising the meaning of a detailed text variable,

e.g. replacing a doctor’s detailed area of medical

expertise with ‘an area of medical speciality’

• restricting the upper or lower ranges of a variable to

hide outliers, e.g. top-coding salaries

Special attention may be needed for:

• relational data, where relations between variables in

related datasets can disclose identities

• geo-referenced data, where identifying spatial references

such as point co-ordinates also have a geographical value

Simply removing spatial references prevents disclosure,

but it also means that all geographical, locational and

related information is lost. A better option may be to keep

spatial references intact and to impose access restrictions

on the data instead. As an alternative, point co-ordinates

may be replaced by larger, non-disclosing geographical

areas or by meaningful alternative variables that typify the

geographical position.

When anonymising qualitative material such as

transcriptions of textual data, identifiers should not be

crudely removed or aggregated, as this can distort the

data or even make them unusable. Instead pseudonyms,

replacement terms or vaguer descriptors should be used.

The objective should be to achieve a reasonable level of

anonymisation, avoiding unrealistic or overly harsh editing,

whilst maintaining maximum content. Procedures to

anonymise data should always be considered alongside

obtaining informed consent for data sharing.

Best practice for qualitative data is to:

• plan anonymisation at the time of transcription or initial

write up

• use pseudonyms or replacements

• retain unedited versions of data for use within the

research team and for preservation

• create an anonymisation log of all replacements,

aggregations or removals made; care should be taken to

store such a log separately from the anonymised data files

• identify replacements in a meaningful way,

e.g. with [brackets]

Digital manipulation of audio and image files can be

used to remove personal identifiers. However, techniques

such as voice alteration and image blurring are labour-

intensive and expensive to apply to large quantities of data

and are likely to damage the research potential of the data.

If confidentiality of audio-visual data is an issue, it is better

to obtain the participant’s consent to use and share the

data unaltered.

A person’s identity can be disclosed from:

• direct identifiers, e.g. name, address, postcode

information or telephone number

• indirect identifiers that, when linked with other

publicly available information sources, could identify

someone, e.g. information on workplace, occupation or

exceptional values of characteristics like salary or age

20 19 UK Data Archive (2009). Anonymising research data. Available at www.data-archive.ac.uk/sharing/anonymise.asp

Access control

Under certain circumstances, sensitive and confidential

data can also be safeguarded by regulating or restricting

access to and use of such data, while at the same time

enabling data sharing for research and educational

purposes.

Data held at data centres and archives are not generally in

the public domain. Their use is restricted to specific

purposes after user registration. Users sign an End User

Licence in which they agree to certain conditions, i.e. not to

use data for commercial purposes or identify any potentially

identifiable individuals.

Data centres may impose stricter access regulations for

confidential data, such as:

• providing access to approved researchers only

• requiring data access authorisation from the data owner

prior to release

• placing confidential data under embargo for a given

period of time until confidentiality is no longer pertinent

• providing secure access to data, which enables analysis

of confidential data but excludes access to the data or the

ability to download the data

Mixed levels of access regulations may be put in place for

some datasets, combining regulated access to confidential

data with user access to non-confidential data.

Data centres typically liaise with the researchers who own

the data in selecting the most suitable type of access for

data. Access regulations should always be proportionate to

the kind of data and confidentiality involved.

Open access

The digital revolution has caused a strong drive towards

open access of information, with the internet making

information sharing fast, easy, powerful and empowering.

Scholarly publishing has seen a strong move towards open

access to increase the impact of research, with e-journals,

open access journals and copyright policies enabling the

deposit of outputs in open access repositories. The same

open access movement also steers towards more open

access to the underlying data and evidence on which

research publications are based. More and more journals

require for data that underpin research findings to be

published in open access repositories before manuscripts

are submitted.

22 21

UK Biobank aims to collect medical and genetic data from

500,000 middle-aged people across the UK in order to

create a research resource to study the prevention and

treatment of serious diseases. Stringent security,

confidentiality and anonymisation measures are in place.20

Biobank holds personal data on recruited patients, their

medical records and blood, urine and genetic samples;

with data made available to approved researchers.

Data or samples provided to researchers never include

personal identifying details.

All information and samples are stored anonymously

by removing any identifying information. This identifying

information is encrypted and stored separately in a

restricted access database that is controlled by senior UK

Biobank staff. Identifying data and samples are only linked

using a code that has no external meaning. Only a few

people within UK Biobank have access to the key to the

code for re-linking participants’ identifying information with

data and samples. All staff sign confidentiality agreements

as part of their employment contracts.

Case study – data security and

anonymisation

20 UK Biobank (2007). UK Biobank ethics and governance framework. Available at www.ukbiobank.ac.uk/docs/EGFlatestJan20082.pdf

22

a

The Secure Data Service is a secure environment to provide researchers with controlled and restricted access to disclosive

micro data.21 Its operation is legally framed by the 2007 Statistics Act which makes access to the confidential data for

statistical purposes possible.

The technical model shares many similarities with the ONS Virtual Micro Laboratory and the NORC Secure Data Enclave.22

It is based around a Citrix infrastructure which turns the end user’s computer into a remote terminal giving access to data,

to statistical software and to collaborative spaces on a central secure server held within the UK Data Archive.

It is secure because all data processing is carried out on a central secure server, which processes all requests centrally and

returns information about the results. No data travels over the network. No data can be imported or exported by users, except

the statistical results — sent from the central server to the remote location by an encrypted email after the final outputs are

checked against statistical disclosure controls.

The clearing-house mechanism established following the Convention on Biological Diversity to promote information sharing,

has resulted in an exponential increase in openly accessible biodiversity and ecosystem data since 1992. The Forest Spatial

Information Catalogue is a web-based portal, developed by the Center for International Forestry Research (CIFOR), for public

access to spatial data and maps.23

The catalogue holds satellite images, aerial photographs, land usage and forest cover maps, maps of protected areas,

agricultural and demographic atlases and forest boundaries. For example, forest cover maps for the entire world, produced

by the World Conservation Monitoring Centre in 1997 can be downloaded freely as digital vector data.

The Global Biodiversity Information Framework (GBIF) strives to make the world’s biodiversity data accessible everywhere in

the world.24 The framework holds millions of species occurrence records based on specimens and observations, scientific and

common names and classifications of living organisms and map references for species records. Data are contributed by

numerous international data providers. Geo-referenced records can be mapped to Google Earth.

Case study – access restrictions vs. open access

21 UK Data Archive (2009). Secure Data Service. Available at securedata.ukda.ac.uk/

22 Office for National Statistics (n.d.). Virtual Microdata Laboratory. Available at www.ons.gov.uk/about/who-we-are/our-services/unpublished-

data/business-data/vml/index.html

National Opinion Research Center (n.d.). NORC Data Enclave. Available at www.norc.org/DataEnclave

23 CIFOR (2005). Forest Spatial Information Catalogue. Available at gislab.cifor.cgiar.org/fsic/goHome.do

24 GBIF (n.d.). Global Biodiversity Information Framework Data Portal. Available at data.gbif.org/welcome.htm

COPYRIGHT

Researchers ‘creating’ data hold copyright over such

data. Copyright is an intellectual property right that

protects the owner of a work from its unauthorised

copying. Most research materials including spreadsheets,

publications, reports and computer programmes, fall under

literary work and are therefore protected by copyright.

Facts, however, cannot be copyrighted.

If information is structured in a database, the structure

acquires a database right, alongside the copyright in the

content of the database. A database may be protected by

both copyright and database right. For database right to

apply, the database must be the result of substantial

intellectual investment in obtaining, verifying or presenting

the content.

In the case of interviews, the interviewee holds the

copyright in the spoken word. If a transcription is a

substantial reproduction of the words spoken, the speaker

will own copyright in the words and the transcriber will

have separate copyright of the transcription.

In the case of collaborative research or derived data,

copyright may be held jointly by various researchers or

institutions. Copyright should be assigned correctly,

especially if datasets have been created from a variety

of sources; for example those which have been bought

or ‘lent’ by other researchers.

In academia, in theory the employer is the first owner

of the copyright in a work made during the course of the

employee’s employment. Many academic institutions,

however, waive copyright in research materials, data and

publications and give ownership to the researchers.

Researchers should check with their institutions if their

institution retains copyright or waives it.

When data are archived or shared, the researcher or

data creator keeps the copyright over data. A data archive

cannot effectively archive data unless all the rights holders

are identified and give their permission for the data to be

archived.

Secondary users of data must obtain copyright clearance

from the rights holder before data can be reproduced.

Data can be copied for non-commercial teaching or

research purposes without infringing copyright, under

the fair dealing concept, providing that the ownership of

the data is acknowledged to the copyright holder. An

acknowledgement should give credit to the data source

used, the data distributor and the copyright holder. Data

centres typically specify how data use should be

acknowledged and cited, either within the metadata record

for a dataset or in a data use licence.

Some researchers may have come across the concept

of Creative Commons licences which, allow creators to

communicate the rights which they wish to keep and the

rights which they wish to waive in order for other people

to make re-use of their intellectual properties more

straightforward. For some datasets Creative Commons

licences may be appropriate, but for others it is not.

Creative Commons itself does not recommend using

Creative Commons licences for informational databases,

rather to use the Science Commons Open Access

Data Protocol.25

Scenario: A researcher has collated articles about the

Prime Minister from The Guardian over the past ten

years, using the LexisNexis database to source articles.

They are then transcribed/copied by the researcher into

a database so that content analysis can be applied.

The researcher offers a copy of the database with the

original transcribed text to a data centre.

Rights Issue: Researchers cannot share either of

these data sources as they do not have copyright in the

original material. A data centre cannot accept these data

as to do so would be breach of copyright. The rights

holders, in this case The Guardian and LexisNexis, would

need to provide consent for archiving.

Case study – copyright of media sources

2325 Science Commons (n.d.). Protocol for Implementing Open Access Data. Available at sciencecommons.org/projects/publishing/open-access-data-protocol/

24

Scenario: A researcher has used the National Diet and Nutrition Survey (NDNS) data, obtained via the UKDA. NDNS data are

Crown Copyright. The researcher has processed the NDNS data (filtered, integrated and aggregated data across variables,

while maintaining individual records) and used the processed data to model food chain risks. The researcher would like to

archive the processed data that were used as input data for the modelling, as well as the modelling code, at the UK Data

Archive.

Rights Issue: There is joint copyright over the processed data, shared between the researcher and the Crown (holding

copyright over the NDNS data). The researcher should declare this joint copyright for the modelling input data and requires no

further permission from the Crown. The UK Data Archive End User Licence, which the researcher signed when obtaining the

NDNS data from the UK Data Archive, specifically states “offer for deposit any new data collections derived from the data

supplied or created by the combination of the data supplied with other data”. Thus the UK Data Archive can archive the

processed data with a joint copyright declaration.

Scenario: A researcher has interviewed five retired cabinet ministers about their careers, producing audio recordings and full

transcripts. The researcher then analyses the data and offers them to a data centre for preserving. However the researcher

did not get signed copyright transfers for the interviewees’ words.

Rights Issue: In this case it would be problematic for a data centre to accept the data. Large extracts of the data cannot be

quoted by secondary users. To do so would breach the interviewees’ copyright over their words. This is equally a problem for

the primary researcher. The researcher should have asked for transfer of copyright or a licence to use the data obtained

through interviews, as the possibility exists that the interviewee may at some point wish to assert the right over their words,

e.g. when publishing memoirs.

Scenario: A researcher subscribes to access spatial AgCensus data from EDINA. These data are then integrated with data

collected by the researcher. As part of the ESRC award contract the data has to be offered for archiving at the UK Data

Archive. Can such integrated data be offered?

Rights Issue: The subscription agreement on accessing AgCensus data states that data may not be transferred to any other

person or body without prior written permission from EDINA. Therefore, the UK Data Archive cannot accept the integrated

data, unless the researcher obtains permission from EDINA. The researcher’s partial data, with the AgCensus data removed,

can be archived. Secondary users could then re-combine these data with the AgCensus data, if they were to obtain their own

AgCensus subscription.

Case study – copyright of licensed data

Case study – copyright of interviews with ‘elites’

Case study – data obtained from UK Data Archive

Examples of data research centres

Antarctic Environmental Data Centre

Archaeology Data Service

Biomedical Informatics Research Network Data Repository

British Atmospheric Data Centre

British Library

British Oceanographic Data Centre

Cambridge Crystallographic Data Centre

Economic and Social Data Service

Environmental Information Data Centre

European Bioinformatics Institute

Geospatial Repository for Academic Deposit and Extraction

History Data Service

Infrared Space Observatory

National Biodiversity Network

National Geoscience Data Centre

NERC Earth Observation Data Centre

NERC Environmental Bioinformatics Centre

Petrological Database of the Ocean Floor

Publishing Network for Geoscientific and

Environmental Data (PANGAEA)

Scran

The Oxford Text Archive

UK Data Archive

UK Solar System Data Centre

Visual Arts Data Service

www.data-archive.ac.uk/sharing

UK Data ArchiveUniversity of EssexWivenhoe ParkColchesterCO4 3SQEmail: [email protected]: +44 (0)1206 872974

www.dataarchive.ac.uk/sharing

managing and sharing data - university of oxford...efficiency of the data management during...

Documents