scientific data management

Post on 19-May-2015


Data management

Responsible Conduct of Research Seminar Series

UC Berkeley

April 16, 2012

Who are you?

Jeffery Loo, PhD

“Flying books”
Installation by J. Ignacio Diaz de Rabago

UC Berkeley Library

NSF data management plan

Requirement as of January 18, 2011

Your plans to organize, store, and share data

http://www.nsf.gov/bfa/dias/policy/dmp.jsp

“My Data Management Plan – a satire”

Dr. C. Titus Brown
Assistant Professor
Michigan State University

Source

Dear NSF,

I am happy to respond to your request for a 2-page Data Management Plan.

First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.

Now, as to my actual data management plan, here is how I plan to deal with research data in the future.

I will store all data on at least one, and possibly up to 50, hard drives in my lab.

The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.

Backups will rarely, if ever, be done.

When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.

[….]

Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.

Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.

Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.

sincerely yours,

--titus

(representing every computational scientist in the world.)

Data challenges

Distributed, uncoordinated effort

Concerns about data re-use

Data management may be ad lib

“Can’t you ever relax?”

Informal data management practices

Lots to do!

Ensure long-term access

Facilitate sharing

Prepare for future re-use

Data activities in the research workflow

Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html

Lots of different research products

Models and computational simulations
Images, photographs, audio, and video
Instrument readings
Maps
Software
Artifacts and samples
Physical collections
And more …

Goal for this lunch hour

Review “first steps” in data management

Saving data
Describing/documenting data
Sharing data
Data management planning
Data ethics

Common sense versus common practice

Saving data

Hall of fame anecdote

http://www.youtube.com/watch?v=J6HtRWyiL98

Where do you store data safely?

Traditional storage is not always sufficient:
Personal computers
Departmental/university servers

Two additional types of storage:
Archives and repositories
Cloud storage (storing files in an online site)

Archives and repositories

Special types of online storage sites

Long-term storage, management, and preservation

Search, download, and analytic functionalities

Institutional archives and repositories

Merritt
http://merritt.cdlib.org/

Data repository management services at UCB
http://ist.berkeley.edu/ds

Public archive and repository

Long-term access, open to the public

GenBank
http://www.ncbi.nlm.nih.gov/genbank/

3rd party cloud storage

Amazon S3

Google Docs

Dropbox

Beware of posting sensitive data/files

Deciding on storage

Consider:
Permanence
Oversight
Security

Save for long-term access

Recommended file formats
• Non-proprietary
• Uncompressed and unencrypted (okay to encrypt sensitive data)
• Common usage by your research community
• Standard representation (e.g., ASCII text, Unicode)

Backup 3 copies

1. Original master
2. Local external storage
3. Remote external storage

UC Berkeley IST backup services

3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox)

Email a copy to yourself
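The three-copy rule can be sketched in a few lines of Python. This is only an illustration, assuming the local and remote external storage both appear as ordinary directories (for example a mounted drive and a synced folder). The function names `sha256_of` and `back_up` are hypothetical, not part of any service listed on the slide.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, backup_dirs: List[Path]) -> List[Path]:
    """Copy the master file into each backup directory and verify each copy."""
    expected = sha256_of(master)
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / master.name
        shutil.copy2(master, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Backup at {target} does not match the master")
        copies.append(target)
    return copies
```

Calling `back_up(Path("readings.csv"), [Path("/mnt/external"), Path("/mnt/remote")])` with your own paths would leave the original master plus two verified copies. The checksum comparison is there because a backup that silently differs from the master is not a backup.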

Describing and documenting data (metadata)

What countries have a five-pointed star on their national flag?

DOI: 10.1126/science.1207745

“outsourcing” our memory

“we don’t remember information as well, when we expect to find it on a computer later”

If we outsource our memory to computers …

We need good organization structures to:
Find data from the past quickly and completely
Understand data from the past

It helps to:
Document and describe data
“Assign metadata”

What do you document?

Descriptive metadata elements

Administrative metadata elements

Structural metadata elements

Title
Creator or contact
Date
Experimental conditions
Methodology
Version

Dictionary or codebook to explain the data variables

Tools and software needed for processing or visualizing the data

File formats

File names

How to record metadata

write metadata

save as readme.txt

store in file folder with data

Option 1
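The readme.txt option can be sketched in Python as follows. The field names come from the metadata elements listed earlier in the deck; the function name `write_readme` and the one-field-per-line layout are assumptions made for illustration, not a required format.

```python
from pathlib import Path


def write_readme(data_dir: Path, metadata: dict) -> Path:
    """Write metadata as 'Field: value' lines to readme.txt inside data_dir."""
    data_dir.mkdir(parents=True, exist_ok=True)
    readme = data_dir / "readme.txt"
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


# Placeholder values only; replace them with details of your own project.
EXAMPLE_METADATA = {
    "Title": "Effect of salt on ice cream production efficiency",
    "Creator or contact": "your name and email",
    "Date": "YYYY-MM-DD",
    "Experimental conditions": "describe conditions here",
    "Methodology": "describe methods here",
    "Version": "1",
}
```

Calling `write_readme(Path("my-project/raw"), EXAMPLE_METADATA)` stores the description in the same folder as the data, so the two travel together when the folder is copied or archived.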

Metadata form/file in an archive/repository

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Option 2

Annotate

<title>Effect of salt on ice cream production efficiency</title> <temperature>0</temperature>

XML, a popular system for annotating data
http://www.w3schools.com/xml/

Option 3
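The XML annotation option can also be generated by a script rather than typed by hand. Here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `title` and `temperature` elements are the ones shown on the slide; the enclosing `dataset` root element and the function name `annotate` are assumptions added so that the output is a well-formed XML document.

```python
import xml.etree.ElementTree as ET


def annotate(title: str, temperature: int) -> str:
    """Wrap two metadata values in XML elements under a single root."""
    root = ET.Element("dataset")  # "dataset" is an assumed root element name
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "temperature").text = str(temperature)
    return ET.tostring(root, encoding="unicode")


def read_title(xml_text: str) -> str:
    """Parse the annotation back and return the title text."""
    return ET.fromstring(xml_text).find("title").text
```

Because the tags name each value, a later reader (or program) can pull out the title or temperature without guessing which number means what.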

Assign descriptive names

Descriptive file names

Descriptive folder names 

Consider these elements:
• project title
• experimental conditions and group
• trial numbers
• file version number indicating data modifications
• date or time stamps
• author initials

Instead of data1.csv, use a name like 75-celsius-trial_control_ver002.csv

Data > 1 > raw    >> part A    >> 110904 > readings

Project-title > Trial 1    >> Experimental    >> Control > Trial 2 > Trial 3
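Descriptive file names are easier to keep consistent when a small helper builds them. The sketch below reproduces the naming pattern of the slide's example file, 75-celsius-trial_control_ver002.csv. The helper names `slug` and `data_file_name` and the choice of separators are assumptions; adapt the pattern to whichever elements matter for your project.

```python
import re


def slug(text: str) -> str:
    """Lower-case text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def data_file_name(condition: str, group: str, version: int,
                   extension: str = "csv") -> str:
    """Build '<condition>-trial_<group>_ver<NNN>.<extension>'."""
    # Zero-padding the version keeps files sorted correctly in a folder listing.
    return f"{slug(condition)}-trial_{slug(group)}_ver{version:03d}.{extension}"
```

For example, `data_file_name("75 Celsius", "control", 2)` returns `75-celsius-trial_control_ver002.csv`.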

Australia

Brazil

Cape Verde

Ethiopia

United States of America

Sharing data

Historic data sharing

Anagrams to secure discoveries
Versus the “open science revolution” of journals today

Galileo Newton Huygens Hooke

Open science
Share research data, products, and communications openly

Potential benefits
Protects unique data that cannot be readily replicated
Reinforces open scientific inquiry
Encourages diversity of analysis and opinion
Promotes new lines of research
Makes possible the testing of new or alternative hypotheses and methods of analysis
Supports studies on data collection methods and measurement
Facilitates the education of new researchers
Enables the exploration of topics not envisioned by the initial investigators
Permits the creation of new datasets when data from multiple sources are combined
Provides content for scientific education

Data sharing examples

Crystal structure of M-PMV retroviral protease

Private sector too!

Cross-sector data sharing for Alzheimer’s research
http://www.adni-info.org

(News story)

Increased citation rate

Funding agency policies

NIH Data Sharing Policy

NSF Data Sharing Policy
Data management plan for grant applications

Journal expectations

Data sharing as a term of publication

How do I share?

Personal sharing

Share-upon-request
Email me for a copy!

Self-archive
Download from my personal website!

Journal publishing

Institutional archive or repository

UC3 Merritt repository

Public archive or repository

The Ancient Agora of Athens

Ideal characteristics
Popular with national/global coverage
Specific to your discipline
Offers long-term preservation

Find an archive/repository
Ask colleagues
Search http://databib.lib.purdue.edu/

Public versus institutional archives and repositories

Institutional archives/repositories
May restrict to a smaller audience
May offer greater control of your data

Public archives/repositories
Create comprehensive dataset for a larger research problem space
Domain-specific archives/repositories may provide better support

Help others find your data

Berkeley: www.berkeley.edu/mystuff/super-data.csv

file moves to

Stanford: www.stanford.edu/mystuff/super-data.csv

old URL is kaput

DOI: Digital object identifier

Resolve a DOI by visiting http://dx.doi.org/ followed by the DOI

File can move, but DOI remains the same
The DOI record stores location details
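Resolving a DOI is simply a matter of appending it to the resolver address. Here is a minimal Python sketch that builds the link without making any network request. The function name `doi_url` is hypothetical, and stripping an optional `doi:` prefix is an added convenience, not part of the DOI system itself.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address given on the slide


def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI, tolerating a leading 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return RESOLVER + quote(doi, safe="/")
```

For example, `doi_url("10.1126/science.1207745")` returns `http://dx.doi.org/10.1126/science.1207745`. The link stays valid even if the publisher moves the file, because only the DOI record needs updating.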

Try permanent identifiers

Generate permanent identifiers

request your free account, by emailing data-consult@lists.berkeley.edu

http://n2t.net/ezid
Subscription through the UCB Library

Final tips for sharing

Be selective

Recognize restrictions (privacy and confidentiality)

Online services for sharing among your team
Research Hub
3rd party services

Data management planning

What is a data management plan?

A plan for organizing, storing, and sharing data

Planning is associated with greater self-control for:
exercise
medical adherence
self-health exams
sunscreen use
schoolwork
refraining from a negative behavior

Source: Townsend and Liu, 2012

Perhaps planning helps for data management

Why have a plan?

Prepare for efficient and quality data collection that is safe and shareable

NSF and NIH requirements

requirements

Data management plan
≤ 2 pages
describes how data will be managed, disseminated, and shared

Plan undergoes peer review

Writing an NSF data management plan

Specific requirements vary by NSF divisions

In general, describe:
Types of research data and materials produced
Standards for data format, content, and metadata
Policies for access and sharing
Policies for re-use, re-distribution, and derivatives
Plans for archiving and preserving

You can explain why data will not be shared

Examples 1 and 2

NIH requirements

Timely data sharing encouraged

If requesting ≥ $500k per year, a plan is required

Describe how data will be shared or why sharing is not possible

In the final progress report, describe data sharing actions taken

Writing an NIH data sharing plan

A brief paragraph

Suggested topics
Schedule for sharing
Format of the data
Documentation of the data
Analytic tools provided
Data-sharing agreements (criteria and conditions)
Mode of data sharing

there was a

beautiful scientist

NIH plan example 1

The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects.

Therefore, we are not planning to share the data.

NIH plan example 2

This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years.

Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/

User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource.

Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.

Library guidance

Guides, templates, examples
http://www.lib.berkeley.edu/sciences/data/guide

Online service for building data plans

https://dmp.cdlib.org/

Step-by-step instructions for meeting funding agency requirements

Data ethics

Study by Martinson et al., 2005
Source: doi:10.1038/435737a

Motivated by increasing pressure to publish papers and win grants?

3247 respondents

0.3% admitted to falsification or “cooking” research data

About 1 in 3 confessed to committing at least one of 10 serious misbehaviors

Citing data

Prevent distortions and manipulations

Keep raw original data

Log all changes made
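One lightweight way to log changes is an append-only text file that records who changed what, when, and what the file looked like afterwards. The Python sketch below is an illustration only: the function name `log_change`, the tab-separated layout, and the use of a SHA-256 checksum as a fingerprint are all assumptions, and a version control system would serve the same purpose.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def log_change(data_file: Path, log_file: Path, author: str, note: str) -> str:
    """Append one line describing the current state of data_file to log_file."""
    # The checksum fingerprints the file contents at the time of logging.
    checksum = hashlib.sha256(data_file.read_bytes()).hexdigest()
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"{stamp}\t{author}\t{data_file.name}\t{checksum}\t{note}\n"
    with log_file.open("a", encoding="utf-8") as handle:  # "a" never overwrites
        handle.write(entry)
    return entry
```

Run it on a working copy after every modification, and leave the raw original file untouched. Logging the raw file once at the start gives a checksum that can later show the original was never altered.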

Data licensing

Restrictions on data use, for example

No for-profit use
No re-sharing
Give attribution

Check for license/terms of use

Stay current with data requirements

Review for changes to policies by:
Funding agencies
University regulations
Federal and state governments

Haiku summary

Data is precious
Safely store and share widely
Good for your career
